When training a machine learning model, it can happen that the model performs accurately on the training set but poorly on the test data.
In this case, we have a problem with overfitting: overfitting occurs when our machine learning model tries to fit every single data point in the given dataset, more closely than it should. Because of this, the model starts capturing the noise and inaccurate values present in the data, which reduces its efficiency and accuracy on new data.
There are several methods to counter overfitting; in the case of Linear Regression, one of them is to use one of the two regularized methods often called L1 and L2, and we are going to understand them in this article.
1. Some Regression Notions
Let’s start by understanding the basics of regression. As Wikipedia says:
regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome’ or ‘response’ variable) and one or more independent variables (often called ‘predictors’, ‘covariates’, ‘explanatory variables’ or ‘features’).
The linear regression formula is:

$$Y_i = \alpha + \beta X_i + \epsilon_i$$
Where "Yi" is (the vector of )the dependent variable (also, called "response"), while "X" is (the vector of) the independent variables (also, called "features"). Alpha and Beta are coefficients and "the game" of regression relies all on finding the "best parameters". With the "best parameters" we can find the "best line" that "best fits" the given data so that we can estimate future outcome values when we’ll give future inputs (new features values).
I want to stress the fact that X and Y are vectors because in Machine Learning we almost always work with multiple features, so in the case of linear regression we cannot simply plot a line between X and Y, as we did in high school (or at university) when we had just "one X" (one independent variable). In these cases, all the features contribute to the outcome in some way, so we cannot draw a single two-dimensional graph: it would have to be a multivariable plot (it can be done, but it is much more complicated).
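To make this concrete, here is a minimal sketch of what "finding the best parameters" looks like in practice, using scikit-learn on a small synthetic dataset (the data, the random seed, and the coefficient values are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples, 3 features (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# Fit ordinary least squares: finds the "best" alpha (intercept) and betas (coefficients)
model = LinearRegression().fit(X, y)
print(model.intercept_)   # estimated alpha
print(model.coef_)        # estimated betas, one per feature
```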
When overfitting occurs in linear regression, we can try to regularize our linear model. Regularization is the most widely used technique to penalize complex models in machine learning: it avoids overfitting by penalizing regression coefficients that have high values. More specifically, it shrinks the parameters and thus simplifies the model; its aim is to reduce the variance of the model without a substantial increase in the bias.
In practice, in the regularized models (L1 and L2) we add a penalty term to the so-called "cost function" (or "loss function") of our linear model, which is a measure of "how wrong" the model is in terms of its ability to estimate the relationship between X and Y. The "type" of penalty is what differentiates L1 from L2.
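To make the difference concrete, a common way to write the two regularized objectives is the following (using lambda for the regularization strength; the exact scaling constants vary between textbooks and libraries):

$$\text{L1 (Lasso):}\quad \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$

$$\text{L2 (Ridge):}\quad \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^{2}$$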
2. L1 Regularization, or Lasso Regularization
Lasso (Least Absolute Shrinkage and Selection Operator) regression performs an L1 regularization, which adds a penalty equal to the absolute value of the magnitude of the coefficients, as we can see in the image above in the blue rectangle (lambda is the regularization parameter). This type of regularization uses shrinkage, which is where data values are shrunk towards a central point, such as the mean.
This type of regularization can result in sparse models with few coefficients: some coefficients, in fact, can become exactly zero, and the corresponding features are eliminated from the model. This means that this type of model also performs feature selection, and it is a good choice when we work with a high number of features, since it simplifies the model.
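As a small illustrative sketch (the synthetic data and the alpha value are assumptions, not taken from the article), this is how the sparsity shows up with scikit-learn's Lasso:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, but only 2 actually drive the response (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(scale=0.5, size=200)

# alpha is the regularization strength (the "lambda" in the formula above)
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)
# Most coefficients come out exactly 0.0: those features are effectively removed from the model
```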
If we look at the image at the top of this article, the absolute-value penalty can be graphically represented as a (rotated) square, while the elliptical contours represent the cost function. If the cost function (the ellipse) "hits" one of the corners of the (rotated) square, then the coefficient corresponding to that axis is shrunk to zero and the related feature is eliminated.
One problem of Lasso Regression is multicollinearity: if there are two or more highly correlated variables, Lasso regression selects one of them essentially at random, which is not good for the interpretation of our model. To avoid that, I advise you to plot a correlation matrix, find any highly correlated features, and delete one of each pair (e.g., if feature_1 and feature_2 are highly correlated, you can decide to delete feature_2, since highly correlated variables carry essentially the same information about the final solution), as in the sketch below.
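A minimal sketch of this check with pandas (the feature names and the data are hypothetical, chosen so that feature_2 is almost a copy of feature_1):

```python
import numpy as np
import pandas as pd

# Hypothetical feature matrix; feature_2 is built to be nearly identical to feature_1
rng = np.random.default_rng(1)
f1 = rng.normal(size=100)
df = pd.DataFrame({
    "feature_1": f1,
    "feature_2": f1 + rng.normal(scale=0.01, size=100),  # highly correlated with feature_1
    "feature_3": rng.normal(size=100),
})

corr = df.corr()  # Pearson correlation matrix
print(corr)

# Keep only one feature from each highly correlated pair, e.g. drop feature_2
df_reduced = df.drop(columns=["feature_2"])
```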
3. L2 Regularization, or Ridge Regularization

Ridge Regression is a method of estimating the coefficients of multiple-regression models in scenarios where the independent variables are highly correlated.
This model adds a penalty equal to the squared value of the magnitude of the coefficients and, in fact, if we look at the first image of this article, the geometric representation of the penalty in this case is a circle.
Unfortunately, this model does not perform feature selection: it decreases the complexity of the model but does not reduce the number of independent variables, since it never shrinks the coefficients all the way to zero. This means that the final model will include all the independent variables. Since Ridge is typically used when the features are highly correlated, here (even more than with the Lasso model) it is important to study the features with a correlation matrix and decide which ones to delete from the study you are performing.
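For comparison with the Lasso sketch above (same caveats: synthetic data and an illustrative alpha value), here is how Ridge shrinks the coefficients without zeroing them out:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two nearly identical (highly correlated) features plus an unrelated one (illustrative)
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + rng.normal(scale=0.01, size=200), rng.normal(size=200)])
y = 4.0 * x1 + rng.normal(scale=0.5, size=200)

# alpha is the regularization strength; larger alpha -> stronger shrinkage
ridge = Ridge(alpha=1.0).fit(X, y)
print(ridge.coef_)
# All coefficients stay non-zero (shrunk, not eliminated), unlike Lasso
```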
Conclusions
As we have seen, regularization should be applied when our model suffers from overfitting.
With respect to the Linear Regression model, it is better to use Lasso regularization when we have a lot of features, since it also performs feature selection; if we have highly correlated features, it is better to use the Ridge model.
Finally, if you have doubts about the difference between correlation and regression, you can read my clarifying article on this topic here.