
How to choose the best Linear Regression model – A comprehensive guide for beginners

A gentle introduction to popular evaluation methods (R², MAE, MSE) as well as Akaike's information criterion (AIC) using real life…

Image by Element 5 Digital at Unsplash

If you are a beginner in Data Science or statistics with some background on linear regression and are looking for ways to evaluate your models, then this guide might be for you.

This article will discuss the following metrics for choosing the ‘best’ linear regression model: R-Squared (R²), Mean Absolute Error (MAE), Mean Squared Error (MSE), Root-Mean Square Error (RMSE), Akaike Information Criterion (AIC), and corrected variants of these that account for bias. A knowledge of linear regression will be assumed. I hope you enjoy reading this article, find it useful and learn something new 🙂


R-Squared (R²)

y = dependent variable values, y_hat = predicted values from model, y_bar = the mean of y
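
The equation itself appeared as an image in the original article; written out in its standard form, consistent with the caption's symbols, it is:

$$
R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
$$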

The R² value, also known as the coefficient of determination, tells us how well the predicted data, denoted by y_hat, explain the actual data, denoted by y. In other words, it represents the strength of the fit; however, it does not say anything about the model itself – it does not tell you whether the model is good, whether the data you’ve chosen is biased, or even whether you’ve chosen the correct modelling method¹. I will show this using examples below.

The R² value ranges from 0 to 1, with higher values denoting a strong fit, and lower values denoting a weak fit. Typically, it’s agreed that:

R² < 0.5 → Weak fit

0.5 ≤ R² ≤ 0.8 → Moderate fit

R² > 0.8 → Strong fit

Note: it is theoretically possible to have R² < 0; however, such values arise only for visibly terrible fits and as such will not be discussed in this article.

You might be thinking, if R² does not represent how good the model is, then what does ‘strength of fit’ even mean? It means that, on average, your predicted values (y_hat) do not deviate much from your actual data (y). The examples below will illustrate this.
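
As a minimal sketch of how this number comes about in practice (the data below is made up purely for illustration):

```python
import numpy as np

# Made-up noisy straight-line data, purely for illustration
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 5 + rng.normal(scale=2.0, size=x.size)

# Fit a straight line and compute the predicted values y_hat
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R² = 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print(f"R² = {1 - ss_res / ss_tot:.3f}")
```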

(Left) a quadratic model predicted linearly, (Right) a linear model predicted linearly

Both models above have predicted lines that give a ‘strong’ fit, in that they have high R² values and capture the small deviation of the actual data points from the fitted line. However, it is clear that, despite the left model having a higher R² value, the right one is the better model. In fact, the leftmost model is terrible, as it fails to capture the curvature of the data. Therefore, a high R² does not mean that the fit is good or appropriate; it simply means that the deviation of the actual points from the fitted points is, on average, small.

Sometimes, a model may have a low R² value, but in fact be a good model for the data. Consider the following examples:

(Left) a sinusoidal curve fit with a straight line, (Middle) a straight line with small noise fit with a purposely skewed straight line, (Right) a straight line with large noise fit with the correct line

Like the previous example, the model on the left is a terrible fit but has a moderate ‘strength of fit’, so compared with the model on the right, one may think, solely based on the R² values, that the leftmost model is better. This is wrong, as shown by the graphs. What about the middle model? It has an R² three times that of the model on the right and, visually, does not seem to be completely off the mark. So should one conclude that the middle model is far better than the right model?

Wrong.

The data points in the middle and right models are based on the same line, y = x + e, where e is a randomly generated error from a normal distribution. The only difference between them is that the error magnitudes are amplified in the rightmost graph. The middle model is worse than the one on the right because I purposely skewed it, so that its equation is something like this:

y_hat = 1.25x − 25

The model on the right, however, is correct: it is y_hat = x, exactly the same as the line from which the data points were generated. To confirm this visually, you can see the skew in the middle model, whereas the right model sits dead centre in the data points, exactly as you would expect. Therefore, a model with a low R² can still correctly capture the shape of the data, but suffer from large variance in the data.

Despite this, if the nature of the problem is to predict values, then the middle model might perform better due to the lower variation in its data points; however, this does not necessarily make it a better model.

R² Summary

The R² metric gives an indication of how well a model fits your data, but it is unable to tell you whether your model is good or not.

Pros:

  1. Gives an indication as to how good the fit of the model is

Cons:

  1. Adding predictors to the model can increase the value of R² due to chance, making the results misleading (see Adjusted R²)
  2. Adding predictors to the model can cause ‘overfitting’, where the model tries to predict the ‘noise’ in the data. This decreases its ability to perform better on ‘new’ data that it hasn’t seen before².
  3. R² does not have a meaning for non-linear models

Adjusted R² (Adj. R²)

n = the number of data points in the sample, k = the number of variables in the model, excluding the constant term (the intercept)
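
Again, the original equation was an image; the standard adjusted R² formula, matching the caption's n and k, is:

$$
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
$$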

As mentioned previously, adding predictors to a model will cause R² to increase even if the model’s performance doesn’t improve. A solution to this is to use the adjusted R², instead of R², as a measure of how the model is performing.

As seen from the equation above, there are two extra variables: n and k. The former represents the number of data points in the sample, whereas the latter represents the number of variables in the model, excluding the constant term.

For example, if your model is of the form:

y_hat = a0 + a1x1 + a2x2

Then you have k = 2, since you have two predictors, x1 and x2.
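
A minimal sketch of the adjustment in code, with illustrative (made-up) values for R², n and k:

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R²: penalises R² for the number of predictors k (intercept excluded)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# The same R² is far less impressive when there are few points and many predictors
print(adjusted_r_squared(0.85, n=100, k=2))  # ≈ 0.847
print(adjusted_r_squared(0.85, n=15, k=7))   # ≈ 0.70
```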

So why is the adjusted R-squared better than R-squared?

Consider the following two models:

y_hat = x

y_hat = a0 + a1x + a2x² + a3x³ + a4x⁴ + a5x⁵ + a6x⁶ + a7x⁷

The same data, generated from y = x + e, was fitted with both models; the outcome is shown below:

As shown, the R² of the left model (which has more terms) is higher than that of the right model, which would suggest that it is a better model. We know this not to be true, since the data is built upon y=x+e.

When we examine the adjusted R² values, we see that the one for the rightmost model has remained more or less the same, whereas that of the leftmost model has dropped significantly, showing the impact that increasing the number of terms can have on the R² value. In this particular instance, one might still choose the leftmost model, since even after accounting for the extra terms it has a higher adjusted R². We know that this conclusion is false; the higher value may simply be a result of the random errors. We also know from the first part of this article that a higher R² does not mean that the model is better!

We can examine this further by applying the same fitted models to a ‘new dataset’, so that the ‘training bias’ is removed.

We see here that the linear model has a significantly better fit than the polynomial model (left), with R² and adjusted R² values comparable to those for the previous dataset. The polynomial model, however, which only performed well because it ‘fit’ the errors and noise, performs terribly, with an even larger drop in R² once adjusted for the number of variables.
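
A small sketch of this experiment with my own synthetic data (the exact numbers will differ from the figures above):

```python
import numpy as np

def r2(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def adj_r2(r2_value, n, k):
    return 1 - (1 - r2_value) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(1)
x_train = np.linspace(0, 10, 30)
y_train = x_train + rng.normal(scale=1.0, size=x_train.size)

# Fit a straight line and a deliberately over-complex degree-7 polynomial
models = {"linear": np.polyfit(x_train, y_train, 1),
          "poly7": np.polyfit(x_train, y_train, 7)}

# A 'new dataset' drawn from the same process, to remove the training bias
x_test = np.linspace(0, 10, 30)
y_test = x_test + rng.normal(scale=1.0, size=x_test.size)

for name, coeffs in models.items():
    k = len(coeffs) - 1  # number of predictors, excluding the intercept
    for split, x, y in [("train", x_train, y_train), ("test", x_test, y_test)]:
        score = r2(y, np.polyval(coeffs, x))
        print(f"{name:6s} {split:5s}  R² = {score:.3f}  adj. R² = {adj_r2(score, y.size, k):.3f}")
```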

As always, with both R² and Adj. R², it’s good practice to sketch the resulting models to visually inspect that the result makes sense, and in cases where the result does not make sense, adding extra data points or using a different ‘test’ dataset might provide more insight.

Adjusted R² Summary

The adjusted R² improves upon the R² by giving insight into whether a model’s R² value reflects how good the fit is, or rather the model’s complexity.

Pros:

  1. Gives more insight into the issue of overfitting
  2. Decreases the effect of randomness on the value of R² (i.e. if it’s high because of randomness, Adjusted R² will reflect that)

Cons:

  1. Still has the other problems associated with R²

Mean Absolute Error (MAE)

n = number of points, y = actual point, y_hat = predicted point
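
Written out, the standard MAE formula matching the caption's symbols is:

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
$$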

The MAE is the sum of all the error magnitudes divided by the number of points – essentially the average absolute error.

Therefore, the lower the MAE, the less error in your model.

Mean Squared Error (MSE)

n = number of points, y = actual point, y_hat = predicted point
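
Likewise, the standard MSE formula is:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$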

The MSE is the sum of the squares of all the errors divided by the number of points. Note that, since each error is squared, the MSE cannot be directly compared to the MAE: it is expressed in squared units of the target rather than the original units.

Thus, as with MAE, the lower the MSE, the less error in the model.

Root Mean Squared Error (RMSE)

n = number of points, y = actual point, y_hat = predicted point
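
And the standard RMSE formula is:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$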

RMSE is the square root of the MSE. In a way this is a more useful metric: since both MAE and RMSE are now in the same units as the target, they can be compared with each other.

As with both MAE and MSE, lower RMSE → lower error.

So, what is this like in practice?

I have two examples here.

The first is very simple: I’ve created a line y_hat = 2x + 5, and one with noise, so y = 2x + 5 + e.

Here we see that the MAE and RMSE are very close to each other, both indicating that the model has a fairly low error (remember: the lower the MAE or RMSE, the less error!).

So you may be asking: what is the difference between MAE and RMSE? Why is the MAE lower?

There is an answer for this.

When we look at the equations for MAE and RMSE, we notice that RMSE contains a squared term: large errors get squared, which inflates the value of RMSE. As such, we can conclude that RMSE is better at capturing large errors in the data, whereas MAE simply gives the average error. Since the squaring happens before the average is taken, the RMSE will always be at least as large as the MAE.
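
A minimal sketch of this difference, using made-up residuals:

```python
import numpy as np

def mae(errors):
    return np.mean(np.abs(errors))

def rmse(errors):
    return np.sqrt(np.mean(np.square(errors)))

small   = np.array([0.5, -0.5, 0.5, -0.5, 0.5])   # uniformly small errors
outlier = np.array([0.5, -0.5, 0.5, -0.5, 10.0])  # same errors, plus one large one

print(mae(small), rmse(small))      # 0.5, 0.5   -> equal when all errors have the same size
print(mae(outlier), rmse(outlier))  # 2.4, ~4.49 -> the single large error inflates RMSE far more
```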

To see this in an example, consider this:

The orange line represents the equation y_hat = 2x + 5 that I described before; the ‘y’, however, is now of the form:

y = 2x + 5 + sin(x)·exp(x/20) + e

where exp() represents the exponential function (hence the increasing deviation of the points).

As you can see, the RMSE is almost twice the MAE value, because it has captured the ‘largeness’ of the errors (particularly those from x = 80 onwards).
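
Here is a sketch of how this kind of data can be generated and scored; the noise scale is my own assumption, so the exact numbers will differ from the figure:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.arange(0, 100, 1.0)

y_hat = 2 * x + 5  # the fitted line
y = 2 * x + 5 + np.sin(x) * np.exp(x / 20) + rng.normal(scale=5.0, size=x.size)

errors = y - y_hat
print("MAE :", np.mean(np.abs(errors)))
print("RMSE:", np.sqrt(np.mean(errors ** 2)))  # inflated by the large deviations at high x
```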

So you may be thinking: isn’t it better to always use RMSE?

No.

MAE does have some advantages.

For one, we may want to treat small errors the same as large errors. For instance, suppose that you are fitting data that generally has no large errors, except for one large anomalous data point. If you choose your Linear Regression model based on the minimum RMSE, your model may overfit, since it would be trying to capture the anomaly.

In such an instance, given that your data is generally uniform with little to no visibly large errors, choosing the regression model with the lowest MAE might be more appropriate.

In addition to this, comparing RMSE for models of different sample size becomes a bit problematic and inconsistent³.

Summary of MAE, MSE and RMSE:

MAE is the average absolute error in the fit; MSE is the average of the squared errors; RMSE is the square root of MSE and is used for comparative purposes. RMSE penalises large errors.

Pros of MAE / RMSE:

  1. Both capture the ‘error’ in the model
  2. MAE is a ‘true’ average in that it’s a measure of the average error; RMSE is a bit more nuanced, since it is skewed by factors such as error magnitude

Cons of MAE / RMSE:

  1. MAE does not pick up on very large errors; RMSE does pick up on large errors, so it is sensitive to outliers that you may not want to capture
  2. Both tend to increase as the model complexity increases (i.e. susceptible to overfitting), similar to how R² increases with complexity

Note that there are also corrected variants of these, for example MSEc, where the c stands for corrected. The equation differs only in that the sum of squared errors is divided by n − k − 1 rather than n, where k is the number of predictors (excluding the intercept). This is analogous to the adjusted R², in that it punishes the model for its complexity.
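
Written out, assuming the usual degrees-of-freedom convention (the original expression is not reproduced in the article), this is:

$$
\mathrm{MSE}_c = \frac{1}{n - k - 1}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$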


Akaike’s information criterion (AIC)

k = the number of model parameters (including the intercept!), and L = the maximum value of the likelihood function
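
The standard form consistent with this caption is:

$$
\mathrm{AIC} = 2k - 2\ln(\hat{L})
$$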

The AIC is a bit more difficult to explain: it is a measure of both how well the data fit the model and how complex the model is, so in a way it is a mixture of the R² and the adjusted R². What it does is penalise a model for its complexity, but reward it for how well it fits the data.

In the examples below, the AIC values happen to be negative; in general, AIC can be positive or negative, and what matters is the comparison between models.

Essentially, the lower the AIC, the better the model, both in how it fits the data and in how it avoids overfitting⁴ (remember, complexity → overfitting, so if AIC penalises complexity, then it penalises overfitting).
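
As a minimal sketch of how these numbers can be obtained in practice (using statsmodels, with made-up data similar to the earlier y = x + e example):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40)
y = x + rng.normal(scale=1.0, size=x.size)

# Simple linear model: y ~ 1 + x
lin = sm.OLS(y, sm.add_constant(x)).fit()

# Deliberately over-complex model: y ~ 1 + x + x² + ... + x⁷
X_poly = sm.add_constant(np.column_stack([x ** p for p in range(1, 8)]))
poly = sm.OLS(y, X_poly).fit()

for name, res in [("linear", lin), ("poly7", poly)]:
    print(f"{name:6s}  R² = {res.rsquared:.3f}  adj. R² = {res.rsquared_adj:.3f}  AIC = {res.aic:.1f}")
```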

Let’s look at an example.

Recall the example that we used for adjusted R², where we had an insanely complex model fitting a straight line with noise.

I re-ran the model, this time adding an AIC score as well: let’s see the results.

We can see multiple things here, which will be good revision too.

The R² of the left graph is higher than that of the right, but we know that the rightmost is correct. This is a symptom of R² getting bigger for more complex models.

In this instance, our adjusted R² shows that the simple model is better (remember that, due to randomness, this will not always be the case, but we can still get an indication of how good the model is by measuring the difference between R² and adjusted R² – the simpler model will suffer a smaller drop).

We also have the AIC to aid us: the lower the AIC, the better the fit and the less the overfitting. So from the AIC alone, we can conclude that the simpler model is better (that said, remember to always sketch your plots and try to reason about them – don’t trust the numbers alone!).

Let’s see how the models behave for test data:

As predicted, the R² of the more complex model is higher. Here we notice that its adjusted R² is higher as well. We also have our wonderful AIC, which has once again shown that the simpler model is better.

AIC Summary:

The lower the AIC, the better the model is in terms of its fit and avoidance of overfit.

Pros:

  1. AIC is a good indicator of the quality of a model, since it accounts for both the quality of the fit and how little the model overfits

Cons:

  1. Mathematically, AIC is only valid asymptotically, i.e. for an infinitely large dataset. In practice, the error is negligible for very large sample sizes; for smaller samples, a correction factor must be added (the corrected criterion, AICc, sketched below).
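
One widely used form of that small-sample correction (my addition, not from the original article) is:

$$
\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1}
$$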

Conclusion:

I hope that you’ve learnt about the different metrics, their use cases and how they can be deceptive.

I want to end this article by showing you a real life example.

For this project, I was trying to predict the Maintenance Costs per unit length of a bridge (Y), as a function of its Age (X1) and Length (X2).

A graph showing the maintenance costs per unit length of bridges vs. the length

I’d come up with a number of different models, some of which were very complex.

Note: some of the mathematical expressions are wrong. X1 represents Age, and X2 represents Length

As you can see, a number of different things are involved here, but mostly we see that the models have very similar metrics. This is where using a combination of different metrics comes in handy, which is why it’s good to know them all, or as many as possible.

At the end, it was deemed that the worst model is the ‘quadratic’ type because it has the highest AIC and the lowest R² adjusted.

The best model was deemed to be the ‘linear’ model, because it has the lowest AIC and a fairly high adjusted R² (in fact, it is within 1% of that of model ‘poly31’, which has the highest adjusted R²).

Note to reader:

I hope that you have enjoyed reading this article, and that you have a better understanding of the metrics described. As this is my first article, I would appreciate it if you could give me feedback: what was good? what was bad? What’s missing? What could I have explained differently?

Thank you very much, and I hope that you continue your journey of learning 🙂

Sources:

  1. https://www.investopedia.com/terms/r/r-squared.asp#:~:text=R-squared%20(R2),variables%20in%20a%20regression%20model.
  2. https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/#:~:text=Overfitting%20refers%20to%20a%20model%20that%20models%20the%20training%20data%20too%20well.&text=This%20means%20that%20the%20noise,the%20models%20ability%20to%20generalize.
  3. https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d
  4. https://towardsdatascience.com/the-akaike-information-criterion-c20c8fd832f2

All images provided by author unless stated otherwise.

