Evaluation metrics & Model Selection in Linear Regression

Demystifying the most common metrics and model selection methods

NVS Yashwanth
Towards Data Science
8 min read · Oct 7, 2020


In this article, we shall go over the most common evaluation metrics in Linear Regression and also model selection strategies.

Residual plots — Before evaluation of a model

We know that linear regression tries to fit a line that produces the smallest difference between the predicted and actual values, and that these differences should also be unbiased. This difference, or error, is known as the residual. (Unbiased means there is no systematic pattern in the distribution of the residuals.)

Residual = actual value − predicted value

e = y − ŷ

It is important to note that, before assessing or evaluating our model with evaluation metrics like R-squared, we must make use of residual plots.

Residual plots expose a biased model better than any other evaluation metric. If your residual plots look normal, go ahead and evaluate your model with various metrics.

Residual plots show the residual values on the y-axis and predicted values on the x-axis. If your model is biased you cannot trust the results.

In a residual plot, the errors corresponding to the predicted values must be randomly distributed. If there are any signs of a systematic pattern, your model is biased.

But what does it mean for errors to be randomly distributed?
One of the assumptions of a linear regression model is that the errors are normally distributed with mean zero. In practice, this means you should make sure your residuals are scattered around zero across the entire range of predicted values. If the residuals are evenly scattered, your model may perform well.
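As a quick sketch of this check (synthetic data, NumPy only; the variable names are mine, not from the article), we can fit a line and inspect the residuals. When the fit includes an intercept, least squares forces the residuals to average to zero; the thing to look for visually is whether they also show no pattern across the predicted values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.shape)  # linear signal + noise

slope, intercept = np.polyfit(x, y, deg=1)   # least-squares line fit
y_hat = slope * x + intercept
residuals = y - y_hat                        # e = y - y_hat

# For an unbiased model the residuals scatter around zero with no pattern;
# plotting residuals (y-axis) against y_hat (x-axis) makes this visible, e.g.
#   import matplotlib.pyplot as plt
#   plt.scatter(y_hat, residuals); plt.axhline(0)
print(abs(residuals.mean()) < 1e-8)
```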

Evaluation metrics for a linear regression model

Evaluation metrics are a measure of how good a model performs and how well it approximates the relationship. Let us look at MSE, MAE, R-squared, Adjusted R-squared, and RMSE.

Mean Squared Error (MSE)

The most common metric for regression tasks is MSE: the average of the squared differences between the predicted and actual values. Since it is differentiable and has a convex shape, it is easy to optimize.

Mean squared error: MSE = (1/n) Σ (yᵢ − ŷᵢ)²

MSE penalizes large errors.
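This is straightforward to compute directly; a minimal NumPy sketch (the helper name `mse` is mine, not from the article):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# Squared errors are 1, 0, 4, so the mean is 5/3.
print(mse([3, 5, 2], [2, 5, 4]))
```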

Mean Absolute Error (MAE)

This is simply the average of the absolute differences between the target values and the values predicted by the model. Because it does not penalize large errors heavily, it may not be preferred in cases where outliers are prominent and should be taken seriously.

Mean absolute error: MAE = (1/n) Σ |yᵢ − ŷᵢ|

MAE does not penalize large errors.
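The contrast with MSE shows up clearly with an outlier; a small NumPy sketch (synthetic values, helper name `mae` is mine):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: the average absolute residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

# One large outlier error (96) inflates MSE far more than MAE.
y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.0, 2.0, 3.0, 4.0])
print(mae(y_true, y_pred))               # absolute errors 0, 0, 0, 96 -> 24.0
print(np.mean((y_true - y_pred) ** 2))   # squared error 96^2 / 4 -> 2304.0
```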

R-squared or Coefficient of Determination

This metric represents the proportion of the variance of the dependent variable explained by the independent variables of the model. It measures the strength of the relationship between your model and the dependent variable.

To understand what R-squared really represents, let us consider the following case, where we measure the error of the model with and without knowledge of the independent variables.

Calculating regression error
When we know the values of the independent variables, we can calculate the regression error.

We know that the residual is the difference between the actual and predicted value. Thus, the RSS (residual sum of squares) can be calculated as follows.

Residual sum of squares: RSS = Σ (yᵢ − ŷᵢ)²

Calculating the total squared error
Consider the case where we don't know the values of the independent variables; we only have the y values. With these, we calculate the mean of the y values, which can be represented as a horizontal line. We then calculate the sum of squared differences between the mean y value and every other y value.

The total variation in y is the sum of squared distances between every point and the arithmetic mean of the y values. This is termed the TSS (total sum of squares).

Total variation in y, or TSS: TSS = Σ (yᵢ − ȳ)²

Calculating the coefficient of determination with RSS & TSS
So we want to find the percentage of the total variation of y that is described by the independent variables X. If we know the percentage of the total variation of y that is not described by the regression line, we can subtract it from 1 to get the coefficient of determination, or R-squared.

Coefficient of determination: R² = 1 − RSS/TSS

If the data points are very close to the regression line, then the model accounts for a good amount of variance, thus resulting in a high R² value.

However, do not let the R² value fool you. A good model can have a low R² value, and a biased model can have a high R² value. That is why you should make use of residual plots.

To summarize, the ratio of the residual error (RSS) to the total error (TSS) tells you how much of the total error remains in your regression model. Subtracting that ratio from 1 gives how much of the error you removed using the regression analysis. That is R-squared.
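Computing R² from RSS and TSS takes only a few lines; a NumPy sketch (the helper name `r_squared` is mine) that also confirms the two boundary cases discussed below:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - RSS / TSS."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rss = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return 1.0 - rss / tss

y = np.array([1.0, 2.0, 3.0, 4.0])
# Predicting the mean of y for every point gives R^2 = 0;
# a perfect fit (predictions equal actuals) gives R^2 = 1.
print(r_squared(y, np.full_like(y, y.mean())))
print(r_squared(y, y))
```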

If R² is high (close to 1), then the model explains most of the variance of the dependent variable.

If R² is very low, then the model does not represent the variance of the dependent variable and regression is no better than taking the mean value, i.e. you are not using any information from the other variables.

A negative R² means you are doing worse than predicting the mean value. This happens when the predictors explain the dependent variable so poorly that RSS exceeds TSS.

Thus R² measures how scattered the data points are about the regression line.

In practice, you will almost never see a model with an R² of exactly 1. That would mean all predicted values equal the actual values, i.e. every point falls on the regression line.

Root Mean Squared Error (RMSE)

This is the square root of the average of the squared difference of the predicted and actual value.

R-squared is often easier to interpret than RMSE when comparing models, because R-squared is a relative measure of fit while RMSE is an absolute one: it is expressed in the units of the dependent variable and is not normalized.

Basically, RMSE is just the root of the average of squared residuals. We know that residuals are a measure of how distant the points are from the regression line. Thus, RMSE measures the scatter of these residuals.

Root mean squared error: RMSE = √((1/n) Σ (yᵢ − ŷᵢ)²)

RMSE penalizes large errors.
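Since RMSE is just the square root of MSE, a minimal NumPy sketch (helper name `rmse` is mine) reuses the same residuals:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared residual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Squared errors 1, 0, 4 give MSE = 5/3, so RMSE = sqrt(5/3) ~ 1.29,
# in the same units as y (unlike MSE, which is in squared units).
print(rmse([3, 5, 2], [2, 5, 4]))
```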

Model selection & Subset Regression

Let me make it clear that a model developed using all of the predictor (regressor) variables is termed a full model. If you drop one or more regressor variables, the result is a subset model.

The general idea behind subset regression is to find which does better: the subset model or the full model.

We select the subset of predictors that does best among all the candidate subsets, i.e. the one with the largest adjusted R² or the smallest MSE.

However, R² is never used for comparing the models as the value of R² increases with the increase in the number of predictors (even if these predictors do not add any value to the model).

Reason for model selection
We set out to select the best subset of predictors that explain the data well.
A simpler model that adequately explains the relationship is always a better option due to the reduced complexity. The addition of unnecessary regressor variables will add noise.

We will now look at the most common criteria and strategies for comparing and selecting the best models.

Adjusted R-squared — selection criterion

The main difference between adjusted R-squared and R-squared is that R-squared accounts for the variance of the dependent variable explained by every independent variable in the model, while adjusted R-squared measures only the variation explained by the independent variables that actually affect the dependent variable.

Adjusted R-squared: adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1)

In the equation above, n is the number of data points while k is the number of variables in your model, excluding the constant.

R² tends to increase with the number of independent variables, which can be misleading. The adjusted R-squared therefore penalizes the model for adding further independent variables (k in the equation) that do not improve the fit.
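The penalty can be seen numerically; a sketch of the formula above (the numbers are made up for illustration):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the number of data points and k the number of
    predictors, excluding the constant."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Adding a useless predictor (k: 2 -> 3) with no gain in R^2
# lowers the adjusted R^2.
print(adjusted_r2(0.90, n=50, k=2))   # ~0.8957
print(adjusted_r2(0.90, n=50, k=3))   # ~0.8935
```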

Mallow’s Cp — selection criterion

Mallow’s Cp measures the usefulness of the model. It tries to compute the mean squared prediction error.

Mallow’s Cp statistic: Cp = RSSₚ/MSEₖ − n + 2p

Here p is the number of regressors in the subset, RSSₚ is the RSS of the subset model with p regressors, MSEₖ is the MSE of the full model with k predictors, and n is the sample size. This is useful when n ≫ k > p.

Mallow’s Cp compares the full model with a subset model. If Cp is almost equal to p (smaller the better), then the subset model is an appropriate choice.

One can plot Cp vs p for every subset model to find out the candidate model.
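A small sketch of the statistic (note that conventions differ on whether p counts the intercept; the function name and numbers are mine). A useful sanity check: for the full model itself, RSSₚ/MSEₖ = n − p, so Cp comes out equal to p.

```python
def mallows_cp(rss_p, mse_full, n, p):
    """Mallow's Cp = RSS_p / MSE_full - n + 2p for a subset model
    with p regressors, given the full model's MSE."""
    return rss_p / mse_full - n + 2 * p

# Sanity check with made-up numbers: for the full model,
# RSS_full = MSE_full * (n - p), so Cp equals p.
n, p_full, mse_full = 50, 5, 2.0
rss_full = mse_full * (n - p_full)
print(mallows_cp(rss_full, mse_full, n, p_full))  # 5.0
```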

Exhaustive and Best subset searching

The exhaustive search looks at all possible models. If there are k regressors, there are 2ᵏ possible models, so this is a very slow process.

The best subset strategy simplifies the search by finding, for every subset size p, the model that minimizes RSS.
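A minimal sketch of best-subset search over a small synthetic problem (NumPy least squares; all names and data are mine, for illustration only). Here only two of the four columns actually drive y, and the size-2 subset with the smallest RSS recovers them:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 4
X = rng.normal(size=(n, k))
# Only columns 0 and 2 truly matter in this synthetic example.
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(0, 0.5, size=n)

def rss_of(cols):
    """RSS of the least-squares fit using the given columns (plus intercept)."""
    A = np.column_stack([np.ones(n), X[:, list(cols)]])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    return np.sum((y - A @ coef) ** 2)

# For each subset size p, keep the subset of columns with the smallest RSS.
best = {p: min(combinations(range(k), p), key=rss_of) for p in range(1, k + 1)}
print(best[2])  # the informative columns (0, 2)
```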

Stepwise Regression

Stepwise Regression is faster than Exhaustive and Best subset searching. It is an iterative procedure to choose the best model.
Stepwise regression is classified into backward and forward selection.
Backward selection starts with the full model; we then remove regressor variables step by step, keeping the model with the least RSS, largest adjusted R², or least MSE. The variables to drop are the ones with high p-values. It is, however, important to note that you cannot drop just one of the levels of a categorical variable. Doing so would result in a biased model; you either drop all levels of the categorical variable or none.
Forward selection starts with a null model; we then add regressor variables step by step until we can no longer improve the error performance of the model. We usually pick the model with the highest adjusted R².
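A greedy forward-selection sketch using adjusted R² as the stopping criterion (synthetic data; all names are mine, and real toolkits such as statsmodels or scikit-learn offer more robust implementations). Here columns 1 and 3 carry the signal, so they are picked first:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 80, 5
X = rng.normal(size=(n, k))
y = 2.0 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(0, 0.5, size=n)

def adj_r2(cols):
    """Adjusted R^2 of the least-squares fit on the given columns."""
    A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    return 1.0 - (1.0 - r2) * (n - 1) / (n - len(cols) - 1)

# Start from the null model; greedily add the regressor that most
# improves adjusted R^2, stopping when no addition helps.
selected, current = [], -np.inf
while True:
    candidates = [c for c in range(k) if c not in selected]
    if not candidates:
        break
    best_c = max(candidates, key=lambda c: adj_r2(selected + [c]))
    score = adj_r2(selected + [best_c])
    if score <= current:   # no improvement: stop
        break
    selected.append(best_c)
    current = score

print(selected[:2])  # the informative columns, strongest first
```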

Conclusion

We discussed the most common evaluation metrics used in linear regression, including those suited to multiple linear regression, along with model selection strategies. Having gone over their use cases, I hope you now understand their underlying meaning. See you at the next one. Cheers.

Thank you
