Bias, Variance, and Regularization in Linear Regression: Lasso, Ridge, and Elastic Net — Differences and uses

Anthony Schams
Towards Data Science
8 min read · Aug 22, 2019


Photo by pan xiaozhen on Unsplash

Regression is an incredibly popular and common machine learning technique. Often the starting point for learning machine learning, linear regression is an intuitive algorithm for easy-to-understand problems. Whenever you’re trying to predict a continuous variable (a variable that can take any value in some numeric range), linear regression and its relatives are strong options, and are almost always the best place to start.

Linear Regression

This blog assumes a functional knowledge of ordinary least squares (OLS) linear regression. You can read more about OLS linear regression here, here, or here.

Bias-Variance Tradeoff

From Seema Singh

A big part of building the best models in machine learning deals with the bias-variance tradeoff. Bias refers to how systematically off the model’s predictions are: a very simple model that makes a lot of mistakes is said to have high bias, while a very complicated model that does well on its training data is said to have low bias. Negatively correlated with bias is the variance of a model, which describes how much a prediction could potentially change if one of the predictors changes slightly. In the simple model mentioned above, the simplicity of the model makes its predictions change slowly with the predictor values, so it has low variance. On the other hand, our complicated, low-bias model likely fits the training data very closely, so its predictions vary wildly when predictor values change slightly. This model therefore has high variance, and it will not generalize to new/unseen data well.

The low-bias/high-variance model exhibits what is called overfitting, in which the model has too many terms and explains random noise in the data on top of the overall trend. This causes it to perform poorly on data the model has not seen before. The high-bias/low-variance model exhibits what is called underfitting, in which the model is too simple/has too few terms to properly describe the trend seen in the data. Again, the model will struggle on new data. Neither of these model types is ideal; we would like to reach some middle ground where we have the proper number of terms to describe the trend without fitting to the noise. We therefore need some sort of feature selection, in which predictors with no relationship to the dependent variable are not influential in the final model.

Image from Sydney Firmin

The bias-variance tradeoff is visualized above. The total error of the model is composed of three terms: the (bias)², the variance, and an irreducible error term. As we can see in the graph, our optimal solution in which total error is minimized is at some intermediate model complexity, where neither bias nor variance is high.
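For those who prefer a formula, the curve above reflects the standard decomposition of expected squared error (the notation below is my own, not taken from the figure):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} \;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}} \;+\; \underbrace{\sigma^2}_{\text{irreducible error}}$$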

Weaknesses of OLS Linear Regression

Linear regression finds the coefficient values that minimize the residual sum of squares (RSS), which is equivalent to maximizing R². But this may not be the best model: OLS assigns a coefficient to every predictor provided, including terms with little predictive power. The result is a high-variance, low-bias model. We therefore have the potential to improve our model by trading some of that variance for bias to reduce our overall error. This trade comes in the form of regularization, in which we modify our cost function to restrict the values of our coefficients, exchanging excess variance for some bias and potentially reducing our overall error.
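To make that concrete, OLS picks its coefficients by minimizing the residual sum of squares; every regularized variant below keeps this first term and simply adds a penalty to it (notation my own):

$$\hat{\beta}^{\text{OLS}} = \operatorname*{arg\,min}_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$$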

Lasso

The Lasso cost function, from Wikipedia

Lasso (sometimes stylized as LASSO or lasso) adds an additional term to the cost function: the sum of the absolute values of the coefficients (the L1 norm) multiplied by a constant lambda. This additional term penalizes the model for keeping coefficients that do not explain a sufficient amount of variance in the data, and it tends to set the coefficients of the weak predictors mentioned above to exactly 0. This makes Lasso useful for feature selection.
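Written out in the same notation as above (a standard formulation, not copied from the figure), the Lasso objective is:

$$\hat{\beta}^{\text{lasso}} = \operatorname*{arg\,min}_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|$$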

Lasso, however, struggles with some types of data. If the number of predictors (p) is greater than the number of observations (n), Lasso will pick at most n predictors as non-zero, even if all predictors are relevant. Lasso will also struggle with collinear features (features that are strongly related/correlated), in which case it will select only one predictor to represent the full suite of correlated predictors. That selection is effectively arbitrary, which is bad for reproducibility and interpretation.

It is important to note that if lambda=0, we effectively have no regularization and we will get the OLS solution. As lambda tends to infinity, the coefficients will tend towards 0 and the model will be just a constant function.
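To see this shrinkage in action, here is a small sketch (my own illustrative snippet, not the code from the example later in this post) that fits sklearn’s Lasso on synthetic data for increasing values of alpha, sklearn’s name for lambda, and counts how many coefficients survive:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: 100 observations, 10 predictors, only 4 informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # regularization assumes comparable feature scales

# Fit Lasso for increasing penalty strength and count surviving coefficients
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    lasso = Lasso(alpha=alpha).fit(X, y)
    n_nonzero = np.sum(lasso.coef_ != 0)
    print(f"alpha={alpha:>6}: {n_nonzero} non-zero coefficients")
```

With these (arbitrary) settings, the count of non-zero coefficients should fall toward zero as alpha grows, mirroring the lambda-to-infinity behavior described above.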

Ridge Regression

The ridge regression cost function. Thanks to Kyoosik Kim

Ridge regression also adds an additional term to the cost function, but instead sums the squares of the coefficient values (the squared L2 norm) and multiplies it by some constant lambda. Compared to Lasso, this regularization term will shrink the values of the coefficients but is unable to force a coefficient to exactly 0. This limits ridge regression’s usefulness for feature selection. However, when p > n, it is capable of keeping more than n relevant predictors, unlike Lasso. It will also select groups of collinear features together, which has been dubbed the ‘grouping effect.’
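In the same notation as before (again my own, not the figure’s), the ridge objective swaps the absolute values for squares:

$$\hat{\beta}^{\text{ridge}} = \operatorname*{arg\,min}_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$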

Much like with Lasso, we can vary lambda to get models with different levels of regularization, with lambda=0 corresponding to OLS and lambda approaching infinity corresponding to a constant function.

Interestingly, analysis of both Lasso and ridge regression has shown that neither technique is consistently better than the other; one must try both on the problem at hand to determine which to use (Zou & Hastie, 2005).

Elastic Net

The Elastic Net cost function. Thanks to Wikipedia

Elastic Net includes both L1 and L2 norm regularization terms, giving us the benefits of both Lasso and Ridge regression. It has been found to have better predictive power than Lasso while still performing feature selection. We therefore get the best of both worlds: the feature selection of Lasso combined with the feature-group selection of Ridge.
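In the running notation, the Elastic Net objective simply carries both penalty terms, each with its own constant:

$$\hat{\beta}^{\text{enet}} = \operatorname*{arg\,min}_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=1}^{p}\beta_j^2$$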

Elastic Net comes with the additional overhead of determining the two lambda values for optimal solutions.
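In sklearn the two lambdas are reparameterized as a single overall strength (alpha) and a mixing ratio (l1_ratio), and ElasticNetCV can cross-validate both. A minimal sketch on synthetic data of my own choosing (not the Boston example below):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=15.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Cross-validate over both the overall penalty strength and the L1/L2 mix
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
                       n_alphas=100, cv=5).fit(X, y)

print("best alpha:   ", enet_cv.alpha_)
print("best l1_ratio:", enet_cv.l1_ratio_)
```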

Quick Example

Using the Boston Housing dataset available in sklearn, we will examine the results of all four of our algorithms. I scaled the data and added 5 extra ‘features’ of pure random noise to test each algorithm’s ability to filter out irrelevant information. I will not do any parameter tuning; I will just implement these algorithms out of the box. You can see the default parameters in sklearn’s documentation. (Linear Regression, Lasso, Ridge, and Elastic Net.) My code was largely adapted from this post by Jayesh Bapu Ahire, and can be found on my GitHub here.
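The linked notebook is the authoritative version; the sketch below is my rough reconstruction of the setup described above (scaling, five random-noise columns, out-of-the-box models), and it assumes an older sklearn release in which load_boston is still available:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # available in sklearn versions prior to 1.2
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the Boston Housing data and append 5 columns of pure random noise
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
rng = np.random.RandomState(42)
for i in range(5):
    X[f"NOISE_{i}"] = rng.normal(size=len(X))
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale features so the penalty treats every coefficient comparably
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit all four models with default parameters (no tuning)
models = {"OLS": LinearRegression(), "Lasso": Lasso(),
          "Ridge": Ridge(), "Elastic Net": ElasticNet()}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name:<12} test MSE: {mse:.2f}")
```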

Coefficients

Linear Regression Coefficients

We can see that linear regression assigned non-zero values to all 5 of our noise features, despite none of them having any predictive power. Interestingly, these noise features have coefficients with magnitudes similar to some of the real features in the dataset.

Lasso Coefficients

As we hoped, Lasso did a good job of reducing all 5 of our noise features to 0, as well as many of the real features in the dataset. This is indeed a much simpler model than the one given by linear regression.

Ridge Regression Coefficients

Ridge regression makes a similar mistake to unregularized linear regression, assigning coefficient values to our noise features. We do, however, see that some features have very small coefficients.

Elastic Net Coefficients

Much like Lasso, Elastic Net makes the coefficients of several features 0. It however does not make as many coefficients 0 as Lasso does.

Model Performance

Mean Squared Error of the different models

For the example provided, ridge regression was the best model according to MSE. This might seem counter-intuitive, but it is important to remember that the ridge regression model traded some variance for bias, which ultimately led to a smaller overall error. The Lasso and Elastic Net models traded away variance for a significant amount of bias, and we see that their error increased as a result.

Interestingly, Lasso and Elastic Net had a higher MSE than linear regression. But does that mean these models are unequivocally worse? I would argue not, as the Lasso and Elastic Net models also performed feature selection, which gives us better interpretability. A coefficient is interpreted as the change in the dependent variable for a one-unit increase in that predictor, with all other predictors held constant. In complex models with many correlated terms, the assumption of holding all other predictors constant cannot reasonably be met.

Ultimately, which model to use depends on the goal of the analysis in the first place. Are we looking for the best predictions? Then ridge regression appears best here. Are we looking for interpretability, for a better understanding of the underlying data? Then Elastic Net may be the way to go. Keep in mind, I did no parameter tuning; these algorithms all have parameters that can be adjusted to improve the model depending on the goals of the analysis. It is our job as data science practitioners to define those goals (before the analysis starts) to help guide us to the best solution.

Conclusions

  1. The bias-variance tradeoff is a tradeoff between a complicated and simple model, in which an intermediate complexity is likely best.
  2. Lasso, Ridge Regression, and Elastic Net are modifications of ordinary least squares linear regression, which use additional penalty terms in the cost function to keep coefficient values small and simplify the model.
  3. Lasso is useful for feature selection, when our dataset has features with poor predictive power.
  4. Ridge regression is useful for the grouping effect, in which collinear features can be selected together.
  5. Elastic Net combines Lasso and ridge regression, potentially leading to a model that is both simple and predictive.
