
From Linear Regression to Ridge Regression, the Lasso, and the Elastic Net

And why you should learn alternative regression techniques.

Figure 1: An image visualising how ordinary regression compares to the Lasso, the Ridge and the Elastic Net Regressors. Image Citation: Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net.

Introduction:

Ordinary Least Squares (‘OLS’) is one of the oldest and simplest algorithms used for regression. However, several variants have since been invented to address some of the weaknesses of plain least squares regression.

Despite their age, linear models are still very useful. In fact, they can often outperform far fancier and more sophisticated models. They are particularly useful when the number of observations is small, or when the signal-to-noise ratio is low (the inputs only weakly predict the response).

In this article, we will first review the basic formulation of regression using Linear Regression, discuss how we solve for the parameters (weights) using gradient descent, and then introduce Ridge Regression. We will then discuss the Lasso, and finally the Elastic Net. This article also belongs to my series on building Machine Learning algorithms from scratch (mostly). So far, I have covered logistic regression from scratch, deriving principal components from the singular value decomposition, and genetic algorithms.

We will use a real-world cancer dataset from a 1989 study to learn about other types of regression, shrinkage, and why plain linear regression is sometimes not sufficient.

Cancer Data:

This dataset consists of 97 observations from a real scientific study done in 1989. The data includes 8 predictors and the outcome of interest is lpsa (log prostate specific antigen).

This dataset is discussed in some detail in The Elements of Statistical Learning.

First, we load the libraries we will use and then read in the dataset.
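For concreteness, here is a minimal sketch of this step (the file name, separator, and index column are my assumptions; the prostate data is distributed with The Elements of Statistical Learning as a tab-separated file with a row id and a ‘train’ indicator column):

```python
import numpy as np
import pandas as pd

# Assumed local file name and layout of the ESL prostate data.
df = pd.read_csv('prostate.data', sep='\t', index_col=0)
print(df.head())
```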

Here is what the first few observations look like:

Figure 2: The first few observations of the prostate cancer dataset. The predictors include lcavol, lweight, age, lbph, svi, lcp, gleason and pgg45. We also have an indicator telling us if the observation belongs to the training or test set. Image from Author.

Of the 97 observations, 67 are flagged as belonging to the training set, while the remaining 30 are saved for testing once the algorithms have been trained. Note that we won’t need the ‘Id’ column or the ‘train’ column beyond defining the split, so we remove them. We also scale and centre our columns, as is often recommended before regression.

We first split the 97 observations into an initial training set of 67 observations, (x_train, y_train), and a test set of the remaining 30 observations, (x_test, y_test). We will further decompose our training set into a train/validation split later in the article. Note that our models will be evaluated on the test data, so we don’t use the test data anywhere in fitting our models. A sketch of this preprocessing and split is shown below.
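A minimal sketch of the preprocessing and split, assuming the data frame df from above and the column names and ‘train’ flag of the ESL prostate data:

```python
from sklearn.preprocessing import scale

predictors = ['lcavol', 'lweight', 'age', 'lbph', 'svi', 'lcp', 'gleason', 'pgg45']

# Centre and scale the predictor columns; the 'train' flag defines the 67/30 split.
X = scale(df[predictors].values)
y = df['lpsa'].values
is_train = (df['train'] == 'T').values    # assumed 'T'/'F' coding of the train flag

x_train, y_train = X[is_train], y[is_train]
x_test, y_test = X[~is_train], y[~is_train]
```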

Regression Setup:

First, consider a simple regression problem with N observations (rows) and p predictors (columns) that consists of:

  • N x 1 vector of outcomes, Y.
Figure 3: Our vector of outcomes for each of our N observations. Figure from Author.
  • N x (p+1) matrix of observations, X.
Figure 4: Each of the N observations is represented in a row. We also add in an additional 1 to each observation to account for the intercept or ‘bias’ term. Figure from Author.
  • (p+1) x 1 vector of Weights, W.
Figure 5: Our vector of Weights, W. Figure from Author.

To obtain our predictions, we multiply our observations, X, by our weights, W. Hence the residuals, i.e. the differences between the true outcomes and our predictions, can be represented as an N x 1 vector:

Figure 6: Our outcomes minus our estimates. Note that each estimate is obtained by multiplying the observations by the weights. The closer a prediction is to the true value, the closer to zero the corresponding row will be. Figure from Author.

The ‘perfect’ case would be for the vector in Figure 6 to be full of zeros, since that would represent a perfect fit on the training data. But that is almost never the case, and such a fit would also likely be a sign of ‘overfitting’ the model.

Cost Functions:

In order to determine how good a model is, we need some definition of ‘goodness’. In linear regression, this is almost always the mean square error (MSE). This is simply the average of the squared errors between our estimates and the true observations.

Figure 7: The red dots represent the actual observations, and the surface represents our prediction at any point (X1,X2). The lines indicate the distance from our prediction to the actual observed data. The sum of the squares of these distances defines our cost in least squares. Image Citation: This image is used with permission and appears as Figure 3.1 in Elements of Statistical Learning, II edition.

Typically, the error on a single training example is called the loss. For the whole training set, we work with the cost function, which is the average of the losses over all training examples.

To find the minimum of the cost surface, we use gradient descent, which involves taking the derivative of the cost with respect to each parameter.

When there are only two parameters, the cost surface can actually be visualized as a contour plot. In higher dimensions we can’t directly visualize the surface, but the process for finding the minimum remains the same. Gradient descent relies on the learning rate alpha, which controls the size of the step we take.

Figure 8: Gradient descent is the process of taking steps to find the minimum of the loss surface. Image Citation: https://www.researchgate.net/figure/Non-convex-optimization-We-utilize-stochastic-gradient-descent-to-find-a-local-optimum_fig1_325142728

For each epoch (iteration), we calculate the derivative of the cost function with respect to each parameter and take a step in the direction of steepest descent. This (eventually) brings us to a minimum. In reality it is not quite so simple: a learning rate that is too large can overshoot the minimum, while one that is too small converges very slowly, and on non-convex surfaces we can become trapped in a local optimum.

Figure 9: Gradient Descent. The steps for training our weights (parameters). This involves updating each weight by subtracting the derivative of the cost function with respect to that weight, multiplied by alpha (the learning rate). Figure from Author.

Now is a good time to define some helper functions that we will use later on:
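For example, here is a minimal sketch of the kind of helpers used below (the names and signatures are my own assumptions, not necessarily the author’s):

```python
def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept (bias) term."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, W):
    """Predictions are the observations multiplied by the weight vector."""
    return X @ W

def get_mse(X, y, W):
    """Mean squared error between the true outcomes and the predictions."""
    return np.mean((y - predict(X, W)) ** 2)
```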

Linear Regression:

Linear Regression is the simplest regression algorithm and was first described in 1875. The name ‘regression’ derives from a phenomenon Francis Galton noticed: while the children of very tall or very short parents were usually still taller or shorter than average, they tended to be closer to the mean height than their parents were. This was termed "regression towards the mean".

Figure 10: Galton’s Regression towards mediocrity in hereditary stature. Image Citation: https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2011.00509.x

Least Squares Regression works by simply fitting a line (or a hyperplane in more than 2 dimensions) and computing the distance from the estimates to the actual observed points. The Least Squares model is the one that minimizes the sum of squared distances between the model and the observed data.

Figure 11: Cost function for linear regression. The cost is the normalised sum of the individual loss functions. This is the Mean Square Error multiplied by a scalar (the end result is equivalent). Figure from Author.
Figure 12: Derivative of the cost function for linear regression. Figure from Author.
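Written out for reference (matching the figures above up to a scaling convention, which may differ slightly from the author’s figures):

```latex
J(W) = \frac{1}{2N}\,(Y - XW)^{\top}(Y - XW),
\qquad
\frac{\partial J}{\partial W} = -\frac{1}{N}\,X^{\top}(Y - XW)
```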

You may notice that this could make our algorithm susceptible to outliers, where a single outlying observation could greatly impact our estimate. That is true. In other words, linear regression is not robust to outliers.

Another issue is that we may fit the line to the training data too well. Suppose we have a lot of training data and many predictors, some of which are collinear. We may then obtain a line that fits the training data extremely well, but does not perform as well on the test data. This is where the alternative linear regression methods can excel: because least squares considers all the predictors, with no penalty for adding extra ones, it is susceptible to overfitting.

Because Linear Regression doesn’t require that we tune any hyperparameters, we can fit our model using the training dataset. We then evaluate the linear model on the test data set, and obtain our Mean Square Error.

Gradient Descent from Scratch:

The following code implements gradient descent from scratch, and we provide the option of adding in a regularization parameter. By default, ‘reg’ is set to zero, so this will be equivalent to gradient descent on the cost function associated with simple least squares. When reg is larger than zero, the algorithm will produce results for ridge regression.

Since we are now using a custom function, we need to add a column of ones to our matrix x_train_scaled; this column accounts for the intercept term (the entries that multiply the weight W0). We also convert our objects into NumPy arrays to make the matrix calculations easier. A sketch of such a function is shown below.
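A minimal sketch of such a routine (the function name, hyperparameter defaults, and the choice not to penalize the intercept are my assumptions; reg=0 recovers ordinary least squares, reg>0 gives ridge regression):

```python
def gradient_descent(X, y, alpha=0.1, epochs=5000, reg=0.0):
    """Gradient descent on the (optionally L2-regularized) least squares cost.

    X is assumed to already contain a leading column of ones for the intercept.
    """
    N, p = X.shape
    W = np.zeros(p)
    costs = []
    for _ in range(epochs):
        residuals = y - X @ W
        penalty = reg * np.r_[0.0, W[1:]]           # assumption: the intercept is not shrunk
        grad = (-(X.T @ residuals) + penalty) / N   # gradient of the (ridge/OLS) cost
        W -= alpha * grad
        costs.append((residuals @ residuals + reg * W[1:] @ W[1:]) / (2 * N))
    return W, costs

# Hypothetical usage: x_train is the scaled training matrix (x_train_scaled in the article).
Xtr = add_intercept(x_train)
Wlinear, costs = gradient_descent(Xtr, y_train, reg=0.0)
```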

Let us check out how the gradient descent went:

Figure 14: The cost decreases quite quickly as we continue to form better and better weights. Figure from Author.

Now let us use the weights we obtained with gradient descent to form predictions on our test data. Our built-in MSE function uses Wlinear to calculate the predictions and returns the test MSE.
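A sketch of this evaluation step, reusing the hypothetical helpers above:

```python
# Evaluate the learned weights on the held-out test set.
Xte = add_intercept(x_test)
print(get_mse(Xte, y_test, Wlinear))   # test MSE for plain linear regression
```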

Using gradient descent to obtain our weights, we obtain an MSE of 0.547 on our test data.

Ridge Regression:

Ridge regression uses a modified cost function compared to least squares. In addition to the sum of squared errors, it introduces a ‘regularization’ term, scaled by a parameter lambda, that penalizes the size of the weights.

Figure 15: Cost Function for Ridge regression. The cost is the normalized sum of the individual loss functions. This cost function penalizes the weights by a positive parameter lambda. Figure from Author.

Fortunately, the derivative of this cost function is still easy to compute and hence we can still use gradient descent.

Figure 16: The derivative of the cost function for ridge regression. Figure from Author.
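In symbols (again up to a scaling convention, and leaving aside whether the intercept is penalized, which the author’s figures may handle differently):

```latex
J_{\text{ridge}}(W) = \frac{1}{2N}\Big[(Y - XW)^{\top}(Y - XW) + \lambda\, W^{\top}W\Big],
\qquad
\frac{\partial J_{\text{ridge}}}{\partial W} = \frac{1}{N}\Big[\lambda W - X^{\top}(Y - XW)\Big]
```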

Quick Facts:

  • Ridge regression is a special case of Tikhonov regularization
  • A closed-form solution exists, since adding lambda to the diagonal of the matrix being inverted ensures it is invertible.
  • Allows for a tolerable amount of additional bias in return for a large increase in efficiency.
  • Used in Neural Networks, where it is referred to as Weight Decay.
  • Use when you have too many predictors, or when the predictors have a high degree of multicollinearity.
  • Equivalent to Ordinary Least Squares when lambda is 0.
  • Also known as L2 regularization.
  • You must scale your predictors before applying Ridge.
Figure 17: Comparison of the OLS estimates and the Ridge Regression estimates in the two-dimensional case. Notice the ridge estimates are constrained to lie within a circle centred at the origin, owing to the regularization term in the cost function. The Ridge estimates can be viewed as the point where the least squares coefficient contours first touch the circle defined by B1² + B2² ≤ t, where t corresponds to the regularization parameter lambda. Image Citation: Elements of Statistical Learning, 2nd Edition.

Because Ridge regression has a hyperparameter, lambda, we form an additional holdout set called the validation set. This is separate from the test set and allows us to tune the hyperparameter.

Choosing Lambda:

To find the ideal lambda, we calculate the MSE on the validation set for a sequence of possible lambda values. The function getRidgeLambda tries a sequence of lambda values on the holdout training set and checks the MSE on the validation set. It returns the ideal parameter lambda, which we then use to fit our whole training data. A sketch of such a search is shown below.
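A minimal sketch of this search (the name getRidgeLambda comes from the article; the validation split, the lambda grid, and the reuse of the gradient_descent sketch above are my assumptions):

```python
def getRidgeLambda(X, y, lambdas=np.arange(0.0, 20.0, 0.2), n_val=20):
    """Hold out the last n_val training rows for validation and return the lambda
    with the lowest validation MSE. X is assumed to already contain the intercept column."""
    x_tr, x_val = X[:-n_val], X[-n_val:]
    y_tr, y_val = y[:-n_val], y[-n_val:]

    best_lambda, best_mse = None, np.inf
    for lam in lambdas:
        W, _ = gradient_descent(x_tr, y_tr, reg=lam)
        val_mse = get_mse(x_val, y_val, W)
        if val_mse < best_mse:
            best_lambda, best_mse = lam, val_mse
    return best_lambda

# Hypothetical usage:
# ideal_lambda = getRidgeLambda(Xtr, y_train)
# Wridge, _ = gradient_descent(Xtr, y_train, reg=ideal_lambda)
```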

The ideal lambda is 8.8 as it results in the lowest MSE on the validation data.

Using this validation search, we obtain an ideal ‘reg’ parameter of lambda=8.8, so we use it to obtain our ridge estimates using gradient descent.

Using Ridge Regression, we get an even better MSE on the test data of 0.511. Notice our coefficients have been ‘shrunk’ when compared to the coefficients estimated in least squares.

Lasso Regression:

Lasso Regression (‘Least Absolute Shrinkage and Selection Operator’) also works with an alternative cost function:

Figure 18: The cost function for Lasso Regression. We still regularize, but using L1 regularization instead of the L2 regularization used in ridge. The derivative of this cost function has no closed form. Figure from Author.

However, the derivative of the cost function has no closed form (the L1 penalty is not differentiable where a weight equals zero), which means we can’t simply apply gradient descent. The Lasso also allows for the possibility that a coefficient is forced exactly to zero (see Figure 19), essentially making the Lasso a method of model selection as well as a regression technique.

Quick Facts:

  • Known as a method that ‘induces sparsity’.
  • Sometimes referred to as Basis Pursuit.
Figure 19: Comparison of the OLS estimates and the Lasso Regression estimates. Notice the Lasso estimates are constrained to a diamond-shaped box centred at the origin, owing to the regularization term in the cost function. The point where the ellipses first touch the bounding box gives the Lasso estimates. In the figure above they meet at a corner, which forces the corresponding coefficient (B1 in this case) to be exactly zero. Image Citation: Elements of Statistical Learning, II edition.

Since we can’t apply plain gradient descent, we use scikit-learn’s built-in Lasso implementation to calculate the weights. However, this still requires that we pick the ideal shrinkage parameter (as we did for ridge). We take the same approach we took in ridge regression and search for the ideal regularization parameter on the validation data, as sketched below.
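A minimal sketch of this search using scikit-learn’s Lasso (the alpha grid and the validation split are assumptions; note that scikit-learn calls the shrinkage parameter alpha and fits the intercept itself, so no column of ones is added here):

```python
from sklearn.linear_model import Lasso

# Hypothetical train/validation split of the training data.
n_val = 20
x_tr, x_val = x_train[:-n_val], x_train[-n_val:]
y_tr, y_val = y_train[:-n_val], y_train[-n_val:]

best_alpha, best_mse = None, np.inf
for alpha in np.arange(0.001, 0.5, 0.001):
    val_pred = Lasso(alpha=alpha).fit(x_tr, y_tr).predict(x_val)
    val_mse = np.mean((y_val - val_pred) ** 2)
    if val_mse < best_mse:
        best_alpha, best_mse = alpha, val_mse

# Refit on the full training set with the chosen alpha and evaluate on the test set.
lasso = Lasso(alpha=best_alpha).fit(x_train, y_train)
print(np.mean((y_test - lasso.predict(x_test)) ** 2))   # test MSE
print(lasso.coef_)                                       # some coefficients are exactly zero
```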

The Lasso provides an MSE of 0.482 on the test data, which is even lower than both ridge and linear regression! Moreover, the Lasso also sets some coefficients to zero, eliminating them completely from consideration.

The Elastic Net:

Finally, we come to the Elastic Net.

Figure 20: The cost function for the Elastic Net. It contains L1 and L2 loss. Figure from Author.

The Elastic Net has TWO parameters, so instead of searching for a single ideal parameter, we need to search over a grid of combinations; hence training might be a bit slow. Instead of searching for lambda1 and lambda2 directly, it is often best to search for the ideal ratio between the two parameters, together with an alpha parameter that is the sum of lambda1 and lambda2.

Quick Facts:

  • Linear, Ridge and the Lasso can all be seen as special cases of the Elastic net.
  • In 2014, it was proven that the Elastic Net can be reduced to a linear support vector machine.
  • The cost function is strongly convex whenever the L2 penalty is positive, and hence a unique minimum exists.

The Elastic Net is an extension of the Lasso that combines both L1 and L2 regularization, so we need a lambda1 for the L1 penalty and a lambda2 for the L2 penalty. As with the Lasso, the derivative has no closed form, so we need to use scikit-learn’s built-in functionality. We also need to find the ideal ratio between our two parameters, along with the alpha parameter that is the sum of lambda1 and lambda2.

Figure 21: The Elastic Net (red) is a combination of ridge regression (green) and the lasso (blue). Image Citation: https://www.researchgate.net/figure/Visualization-of-the-elastic-net-regularization-red-combining-the-L2-norm-green-of_fig6_330380054

We don’t code the Elastic Net from scratch; scikit-learn provides it.

We do, however, perform cross validation to choose the two parameters, alpha and l1_ratio. Once we have the ideal parameters, we train our algorithm on the full training set using the chosen parameters, as sketched below.
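A minimal sketch of this step using scikit-learn’s ElasticNetCV (the grids for l1_ratio and alpha, and the number of folds, are assumptions):

```python
from sklearn.linear_model import ElasticNetCV, ElasticNet

# Cross-validated grid search over l1_ratio and alpha on the training data only.
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99],
    alphas=np.arange(0.001, 1.0, 0.001),
    cv=5,
)
enet_cv.fit(x_train, y_train)

# Refit with the chosen parameters and evaluate on the held-out test set.
enet = ElasticNet(alpha=enet_cv.alpha_, l1_ratio=enet_cv.l1_ratio_).fit(x_train, y_train)
print(np.mean((y_test - enet.predict(x_test)) ** 2))   # test MSE
print(enet.coef_)
```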

Wow! Elastic Net provides an even smaller MSE (0.450) than all the other models.

Putting it together:

Finally, we have calculated the results for Least Squares, Ridge, Lasso and the Elastic Net. We have obtained weights for each of these methods, and also obtained the MSE on the original test dataset. We can summarize how these methods performed in a table.

Figure 22: The final comparison of each of our models. This table summarizes the final estimated coefficients, and the mean square error on the test set. Figure from Author.

Simple least squares performed the worst on our test data compared to all other models. Ridge regression provided similar results to least squares, but it did better on the test data and shrunk most of the parameters. Elastic Net ended up providing the best MSE on the test dataset by quite a wide margin. Elastic Net removed lcp, gleason and age and shrunk other parameters. Lasso also removed the consideration of age, lcp and gleason but performed slightly worse than Elastic Net.

Summary:

Understanding basic least squares regression is still extremely useful, but there are improved methods that should also be considered. One issue with ordinary least squares is that it does nothing to guard against overfitting. Ridge regression addresses this by shrinking the parameters. The Lasso takes this a step further by allowing coefficients to be forced exactly to zero, eliminating them from the model. Finally, the Elastic Net combines the benefits of both the Lasso and ridge.

In certain cases, we can derive exact closed-form solutions for least squares, and we can always derive one for ridge provided lambda > 0. Choosing lambda is the hard part (you should use cross validation on the training dataset to choose it). We didn’t show the closed-form solutions in this guide, partly because it is instructive to see how the solutions can be obtained from scratch, and partly because computing them requires inverting a large matrix, which becomes impractical in high dimensions, where for least squares a unique solution may not even exist.

Thank you for reading, and please send me any questions or comments!

Want to learn more?

If you liked these topics and want to learn more advanced regression techniques, check out the following topics:

Sources:

Images from Elements of Statistical Learning are used with permission. "The authors (Hastie) retain the copyright for all these figures. They can be used in academic presentations."

Code on GitHub:

Robby955/CancerData

[1] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology).

[2] Amini, A., Soleimany, A., Karaman, S., & Rus, D. (2018). Spatial uncertainty sampling for end-to-end control. Neural Information Processing Systems (NIPS).

[3] Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of urology, 141(5), 1076–1083. https://doi.org/10.1016/s0022-5347(17)41175-x

[4] Hastie, T., Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

[5] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics.

[6] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.

