What is a “linear” regression model?

The linearity assumption behind linear regression models, and why it matters

Shota Horii
Towards Data Science


The linear regression model is one of the simplest models in machine learning and statistics. That said, the linearity assumption behind linear regression is often misunderstood.

For example, the following two models are both linear regression models, even though the line on the right side doesn’t look linear.

Figure 1. Two different linear regression models for a data set

If that surprises you, this article is for you. In this article, I explain what kind of linearity is assumed behind linear regression models, and why that linearity matters.

To answer those questions, let’s see how linear regression works step by step through two simple examples.

Example 1: The simplest model

Let’s start with the simplest example. Given the following training data of three (x, y) pairs: (2, 4), (5, 1), (8, 9), we want to find a function modelling the relationship between the target variable y and the input variable x.

Figure 2. A training data set to use in this article

Our first model is the simplest one possible, shown below.
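$$f(x) = ax + b$$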

So we try to model the relationship between x and y with this very simple linear function. A key point here is that this function is linear not only in the input variable x, but also in the parameters a and b.

Now, our goal is to determine the values of the parameters a and b that fit the training data best.
This can be done by measuring the misfit between the actual target values y and the model outputs f(x) for each input x, and minimising that misfit. This misfit (the value to minimise) is called the error function.

There are many different choices of error function, but one of the simplest is RSS, the residual sum of squares: the sum of the squared errors between the model output f(x) for each data point x and the corresponding target value y.
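Written out, over n training pairs, the RSS is:

$$\mathrm{RSS} = \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2$$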

With the concept of an error function, we can rephrase “determine the parameters a, b that fit the training data best” as “determine the parameters a, b that minimise the error function”.

Let’s calculate the error function on our training data.
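Substituting f(x) = ax + b and our three pairs (2, 4), (5, 1), (8, 9) into the RSS gives:

$$\mathrm{RSS}(a, b) = \big(4 - (2a + b)\big)^2 + \big(1 - (5a + b)\big)^2 + \big(9 - (8a + b)\big)^2$$

which expands to:

$$\mathrm{RSS}(a, b) = 93a^2 + 3b^2 + 30ab - 170a - 28b + 98$$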

Okay, so the equation above is the error function we want to minimise. But how can we find the values of the parameters a, b that minimise this function? To get an idea, let’s visualise it.

Figure 3. Error function of the first model

As you can instinctively guess from the 3D graph above, this function is convex. Optimisation (finding the minimum) of a convex function is much simpler than general mathematical optimisation, because any local minimum of a convex function is also its global minimum. (A very simple intuition: a convex function has only one minimum point, like the shape of a “U”.) Thanks to this property of convex functions, the minimising parameters can be found by simply setting the partial derivatives of the error function to zero and solving the resulting equations:
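$$\frac{\partial \mathrm{RSS}}{\partial a} = 0, \qquad \frac{\partial \mathrm{RSS}}{\partial b} = 0$$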

Let’s solve our case.
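Taking the partial derivatives of the expanded RSS above and setting them to zero gives two linear equations in a and b:

$$\frac{\partial \mathrm{RSS}}{\partial a} = 186a + 30b - 170 = 0 \;\Rightarrow\; 93a + 15b = 85$$

$$\frac{\partial \mathrm{RSS}}{\partial b} = 30a + 6b - 28 = 0 \;\Rightarrow\; 15a + 3b = 14$$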

By solving the equations above, we obtain a = 5/6, b = 1/2. So our first model (the one minimising the RSS) is f(x) = (5/6)x + 1/2, shown below.

Figure 4. The first model
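As a quick sanity check, here is a minimal Python sketch (assuming NumPy is available; the variable names are mine, not from the derivation above) that reproduces the same fit with a generic least-squares solver:

```python
import numpy as np

# Training data: the three (x, y) pairs from Figure 2
x = np.array([2.0, 5.0, 8.0])
y = np.array([4.0, 1.0, 9.0])

# Design matrix for f(x) = a*x + b: one column for x, one for the intercept
X = np.column_stack([x, np.ones_like(x)])

# Least-squares solution, i.e. the (a, b) minimising the RSS
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(a, b)  # ~0.8333 and ~0.5, i.e. a = 5/6, b = 1/2
```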

Example 2: A simple curvy model

Now, for the same data points, let’s consider another model, shown below.
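$$f(x) = ax^2 + b$$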

As you can see, this is no longer a linear function of the input variable x. However, it is still a linear function of the parameters a and b.

Let’s see how this change affects the model-fitting procedure. We’ll use the same error function as in the previous example: RSS.
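With f(x) = ax² + b, the inputs enter the error function through x², so the squared inputs 4, 25 and 64 take the place of 2, 5 and 8:

$$\mathrm{RSS}(a, b) = \big(4 - (4a + b)\big)^2 + \big(1 - (25a + b)\big)^2 + \big(9 - (64a + b)\big)^2$$

which expands to:

$$\mathrm{RSS}(a, b) = 4737a^2 + 3b^2 + 186ab - 1234a - 28b + 98$$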

As seen above, the equation looks very similar to the previous one. (The values of the coefficients are different, but the form of the equation is the same.) The visualisation is below.

Figure 5. Error function of the second model

The shape also looks similar, and this is still a convex function. The secret here is that when we calculate the errors on the training data, the input variables are given as concrete values (for example, the values of x² are given as 2², 5² and 8² for our data set (2,4), (5,1), (8,9)). So no matter how complicated the form of the input variables is (e.g. x, x², sin(x), log(x), etc.), they appear as mere constants in the error function, leaving the parameters a, b as the only unknowns.

Since the error function of the second model is also convex, we can find the optimal parameters by exactly the same procedure as in the previous example.
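Setting the partial derivatives of this RSS to zero gives two linear equations once more:

$$\frac{\partial \mathrm{RSS}}{\partial a} = 9474a + 186b - 1234 = 0 \;\Rightarrow\; 4737a + 93b = 617$$

$$\frac{\partial \mathrm{RSS}}{\partial b} = 186a + 6b - 28 = 0 \;\Rightarrow\; 93a + 3b = 14$$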

By solving the equations above, we obtain a = 61/618, b = 331/206. So our second model is f(x) = (61/618)x² + 331/206, shown below.

Figure 6. The second model
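The same Python sketch as before verifies this fit; the only change is the feature column of the design matrix, from x to x²:

```python
import numpy as np

x = np.array([2.0, 5.0, 8.0])
y = np.array([4.0, 1.0, 9.0])

# Design matrix for f(x) = a*x**2 + b: the non-linear feature x**2 enters
# as plain numbers (4, 25, 64), so the problem stays linear in (a, b)
X = np.column_stack([x**2, np.ones_like(x)])

(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(a, b)  # ~0.0987 and ~1.6068, i.e. a = 61/618, b = 331/206
```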

Conclusion: Linearity behind linear regression models

The two examples above are solved by exactly the same (and very simple) procedure, even though one model is linear in the input variable x and the other is non-linear in x. The common characteristic of the two models is that both functions are linear in the parameters a, b. This is the linearity assumed behind linear regression models, and it is the key to their mathematical simplicity.

We have only seen two very simple models above, but in general, a model’s linearity in its parameters guarantees that its RSS is always a convex function. This is why we can obtain the optimal parameters by setting the partial derivatives to zero and solving simple linear equations. And that is why the linearity matters.
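To sketch the general argument (in matrix notation, which the examples above did not need): write any model that is linear in its parameters as f(x) = wᵀφ(x), where φ may contain arbitrary non-linear features such as x², sin(x) or log(x), and stack the feature vectors of the training inputs into a matrix Φ. Then:

$$\mathrm{RSS}(\mathbf{w}) = \lVert \mathbf{y} - \Phi \mathbf{w} \rVert^2, \qquad \nabla^2_{\mathbf{w}} \, \mathrm{RSS} = 2\, \Phi^{\top} \Phi$$

Since ΦᵀΦ is positive semi-definite for any feature matrix Φ, the RSS is convex in w regardless of how non-linear the features are.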
