Finding the right model complexity using Regularization

Understanding the bias-variance trade-off and how to balance bias and variance using shrinkage regularization techniques.

Sufyan Khot
Towards Data Science


In a standard linear regression model, we try to predict a response variable by fitting a linear equation on the predictors. With two predictors, the equation is as follows:

Y = β0 + β1X1 + β2X2 + ε

In the above equation, we are trying to predict Y using the two predictors X1 and X2, where β1 and β2 are the coefficients that estimate the effect of each predictor on the response variable Y, and β0 is the intercept. Here ε is the irreducible error, which is not in our control.
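
As a small illustration, the sketch below fits such a two-predictor model with scikit-learn's LinearRegression; the synthetic data and the true coefficient values in it are assumptions chosen purely for demonstration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                               # two predictors X1 and X2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)   # Y with irreducible noise ε

model = LinearRegression().fit(X, y)
print(model.intercept_)  # estimate of β0
print(model.coef_)       # estimates of β1 and β2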

When we fit the above equation on a set of training data points with a single predictor/feature, we get a model that can be visualized as the line below.

We can see that the above equation fits a line that approximates the data points in space. As far as the training data is concerned, this line is not ideal, as it does not perfectly fit each and every point. There is a distance between each point and the fitted line, which we call a residual.

In the above image, the residuals are visualized as the vertical distances between the actual data points and the corresponding points on the fitted line. Thus, we can represent a residual using the following equation:

ei = yi − ŷi

In the above equation, we take the difference between the actual value yi and the value ŷi predicted by the fitted line. Whenever we fit a linear model, we try to minimize the sum of the squared residuals of all the points. This sum is called the residual sum of squares (RSS):

RSS = Σ(yi − ŷi)²
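
As a minimal sketch, the residuals and the RSS can be computed directly with NumPy; the actual and predicted values below are made-up numbers used only to illustrate the formula.

import numpy as np

y_actual = np.array([3.1, 4.9, 7.2, 8.8])  # observed values yi (made up)
y_pred = np.array([3.0, 5.0, 7.0, 9.0])    # values on the fitted line ŷi (made up)

residuals = y_actual - y_pred              # ei = yi − ŷi
rss = np.sum(residuals ** 2)               # residual sum of squares
print(rss)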

We square the distances so that positive and negative distance values are treated the same way. The graph below represents a model that minimizes the residuals of all the data points.

The above model fits a curve that passes through every point. The RSS of this model is equal to 0 and the model has high complexity. Models with high complexity try to capture each and every variation in the data points. Such models are said to have high variance.

Models with high variance on the training data tend to perform poorly on test data: because the model tries to catch every variation in the training data, it also fits outliers, random noise and high leverage points, which are rare and might not be present in the test data. In such a case, we say that the model is over-fitting the training data.

Hence, in order to improve the performance of a model on test data, we reduce the complexity of the model and thereby its variance on the training data set. When we reduce the variance of the model, we introduce some error into the model. This error is called bias.

It can be seen from the line above that the complexity of the model is reduced: at the cost of some increased bias, the model no longer tries to catch each and every data point. Such a slightly less complex model will perform better on test data than the high-variance model we saw earlier.

Thus, reducing variance increases the bias in our model and vice versa. This is called the bias-variance trade-off, and it can be summarized in the graph below.

We can see above that as the model complexity increases, the variance of the model increases and the squared bias decreases. The test error also decreases as the complexity increases, but only up to a certain point, after which it starts increasing again. We have to find this optimal point while fitting a model.

In other words, we have to find the right amount of variance while fitting a model and make sure that the variance is not too high.
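
One rough way to see this trade-off numerically is to grow the model complexity step by step and compare train and test error, for instance by increasing the degree of a polynomial fit; the data-generating function, noise level and degrees below are arbitrary assumptions chosen for illustration.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)      # assumed noisy sine data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in [1, 3, 10, 15]:                                # increasing model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(degree, round(train_err, 3), round(test_err, 3))   # train error keeps falling, test error eventually rises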

Regularization techniques for reducing variance

Regularization techniques are essentially used to reduce the variance in a model and avoid the problem of over-fitting.

One approach to reducing the variance of the model is to shrink the coefficient estimates towards zero. We estimate the coefficient βi to capture the effect of a predictor/feature Xi on the response variable Y. Shrinking the value of βi towards zero therefore dampens the estimated effect of that feature on the response variable and makes the model less complex. We have two regularization techniques that use this idea:

  1. Ridge regularization (L2)
  2. Lasso regularization (L1)

Ridge regularization

As we want to reduce the value of a coefficient estimate βi so that the effect of a feature on the response variable is reduced, we need a model that penalizes high values of coefficient estimates.

In a regression model, residual sum of squares (RSS) is minimized. This approach is extended in ridge regression where in addition to minimizing RSS, the squares of the coefficient estimates are also minimized.

We minimize the following in ridge regression:

RSS + λ(β1² + β2² + …. + βp²)

which is nothing but:

Σ(yi − β0 − β1xi1 − β2xi2 − …. − βpxip)² + λ(β1² + β2² + …. + βp²)

Thus, in ridge regression we minimize not only the RSS but also the squares of the coefficient estimates of all features. This results in the coefficient estimates being shrunk towards zero. The extent to which the coefficient values are shrunk towards zero is controlled by the tuning parameter λ. If the value of λ is large, the model penalizes large coefficient estimates more strongly and reduces the model complexity more. Thus, the tuning parameter is used to control the model complexity.

Note that we are shrinking the coefficient values of the features only, that is, we are reducing the values of β1, β2,…., βp where p is the number of features. We are not shrinking the value of the intercept β0, which is the mean value of the response variable when all features are 0 (X1 = X2 = …. = Xp = 0).
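
A minimal ridge sketch with scikit-learn is shown below; its alpha argument plays the role of λ, and the intercept β0 is fitted but, as noted above, not penalized. The synthetic data, true coefficients and alpha values are assumptions chosen for illustration.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
beta = np.array([4.0, -3.0, 0.5, 0.0, 0.0])                  # assumed true coefficients
y = 2.0 + X @ beta + rng.normal(scale=0.5, size=100)

for alpha in [0.1, 1.0, 100.0]:                              # alpha corresponds to λ
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(ridge.coef_, 3))                   # larger λ shrinks the estimates closer to zero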

Lasso regularization

Ridge regularization successfully shrinks the coefficient values towards zero but never makes them exactly equal to zero. This becomes a problem for models with a very large number of predictors, because models that use ridge regularization with many predictors are hard to interpret. For example, if we have p predictors out of which only 3 are actually useful, ridge regularization will still produce a model containing all p predictors.

Lasso regularization is a slight modification of ridge that overcomes this issue. In lasso regularization, we try to minimize the following:

RSS + λ(|β1| + |β2| + …. + |βp|)

The only difference is that instead of the squares of the coefficient estimates, their absolute values (modulus) are penalized. In lasso regularization, some of the coefficient estimates will be exactly equal to zero, provided the value of the tuning parameter λ is sufficiently large. Thus, lasso also helps in identifying only those predictors that are useful for the model.
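
The sketch below illustrates this with scikit-learn's Lasso: as alpha (the λ here) grows, more coefficient estimates become exactly zero. The synthetic data, in which only three of six predictors matter, and the alpha values are assumptions for illustration.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
beta = np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0])             # only three useful predictors (assumed)
y = X @ beta + rng.normal(scale=0.5, size=100)

for alpha in [0.01, 0.5, 2.0]:                               # alpha corresponds to λ
    lasso = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(lasso.coef_, 3))                   # larger λ sets more coefficients exactly to zero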

Selecting the tuning parameter λ

Both ridge and lasso regularization make use of the tuning parameter λ to control the model complexity. In order to decide the best value of λ, we make use of cross validation: we build a grid of candidate values for λ, apply cross validation to the data for each value, and check which value gives the lowest error. We then fit the model with the value of λ obtained from cross validation.
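
One way to do this in practice is sketched below with scikit-learn's cross-validated estimators LassoCV and RidgeCV; the synthetic data and the candidate grid of λ values are assumptions.

import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = X @ np.array([5.0, -4.0, 3.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 2, 50)                              # grid of candidate λ values
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)            # 5-fold cross validation over the grid
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(lasso_cv.alpha_, ridge_cv.alpha_)                      # λ with the lowest cross-validation error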

Regularization for classification problems

We now know that in regression models we minimize the RSS, and when we apply regularization techniques to these models we also penalize the size of the coefficient estimates in addition to the RSS.

Just as the RSS is minimized in regression, classification models have their own loss functions that are minimized. For example, one such loss function is the (binary) cross entropy:

−[y·log(y′) + (1 − y)·log(1 − y′)]

where y is the actual value and y′ is the value predicted by the model. Apart from the loss function, the rest of the idea remains the same: shrink the coefficient estimates towards zero and reduce the model complexity. The classification loss function simply takes the place of the RSS.
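
As a sketch, scikit-learn's LogisticRegression applies this idea to the cross-entropy (log) loss; note that it parameterizes the penalty strength as C = 1/λ, so a smaller C means stronger regularization. The synthetic data and the C value below are assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

l2_model = LogisticRegression(penalty="l2", C=0.1).fit(X, y)                       # ridge-style (L2) penalty
l1_model = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)   # lasso-style (L1) penalty
print(np.round(l2_model.coef_, 3))   # coefficients shrunk towards zero
print(np.round(l1_model.coef_, 3))   # some coefficients exactly zero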

Which technique to use? Lasso or Ridge?

As for the question of which technique to use, it all depends on the data. If the response variable is driven by only a few of the predictors, lasso will outperform ridge, as it sets the remaining coefficient estimates to zero. If the response variable depends on many of the predictors, ridge will outperform lasso. We can check which technique is more suitable by using cross validation.

Setting model accuracy aside, lasso always has the advantage of built-in feature selection: because it makes some coefficient estimates exactly equal to zero, one can see which predictors are not necessary for the model.
