
Introduction:
Ordinary Least Squares (‘OLS’) is one of the oldest and simplest algorithms used for regression. However, several variants have since been invented to address some of the weaknesses encountered when using plain least squares regression.
Despite their age, linear models are still very useful. In fact, they can often outperform fancier, more sophisticated models. They are particularly useful when there are not many observations, or when the signal-to-noise ratio is low (i.e. the inputs only weakly predict the response).
In this article, we will first review the basic formulation of regression with linear regression, discuss how to solve for the parameters (weights) using gradient descent, and then introduce Ridge regression. We will then discuss the Lasso, and finally the Elastic Net. This article also belongs to my series on building Machine Learning algorithms from scratch (mostly). So far, I have discussed logistic regression from scratch, deriving principal components from the singular value decomposition, and genetic algorithms.
We will use a real-world cancer dataset from a 1989 study to learn about other types of regression, shrinkage, and why plain linear regression is sometimes not sufficient.
Cancer Data:
This dataset consists of 97 observations from a real scientific study published in 1989. The data includes 8 predictors, and the outcome of interest is lpsa (log prostate-specific antigen).
This dataset is discussed in some detail in The Elements of Statistical Learning.
First, we load the libraries we will use and then read in the dataset.
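A minimal sketch of this step (the file name and separator are assumptions about how the data was saved locally):

```python
import numpy as np
import pandas as pd

# The file name and separator are assumptions; the prostate data is commonly
# distributed as a tab-delimited text file.
df = pd.read_csv("prostate.data", sep="\t")
print(df.head())
```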
Here is what the first few observations look like:

Of the 97 observations, 67 are flagged as belonging to the training set, while the remaining 30 are held out for testing once the algorithms are trained. We won’t need the ‘Id’ column or the ‘train’ column, so we remove them. We also scale and centre our columns, as is often recommended before regression.
This gives an initial training set (x_train, y_train) of 67 observations and a test set (x_test, y_test) of the remaining 30. We will further decompose the training set into train and validation sets later in the article. Note that our models are evaluated on the test data, so the test data is not used anywhere in fitting our models.
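A sketch of that preprocessing and split, assuming the column names (‘Id’, ‘train’, ‘lpsa’) described above and that the ‘train’ column flags training rows with "T":

```python
from sklearn.preprocessing import scale

train_mask = df["train"] == "T"                      # assumption: "T" marks training rows
X = df.drop(columns=["Id", "train", "lpsa"], errors="ignore")
y = df["lpsa"]

# Scale and centre the predictors (mean 0, standard deviation 1).
X_scaled = pd.DataFrame(scale(X), columns=X.columns, index=X.index)

x_train, y_train = X_scaled[train_mask], y[train_mask]
x_test, y_test = X_scaled[~train_mask], y[~train_mask]
print(x_train.shape, x_test.shape)                   # expected: (67, 8) and (30, 8)
```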
Regression Setup:
First, consider a simple regression problem with N observations (rows) and p predictors (columns) that consists of:
- N x 1 vector of outcomes, Y.

- N x (p+1) matrix of observations, X.

- (p+1) x 1 vector of weights, W.

To obtain our predictions, we multiply our observations, X, by our weights, W. Hence the residuals, i.e. the differences between the true outcomes and our predictions, can be represented as an N x 1 vector:
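Written out with the notation above, the residual vector is:

```latex
r = Y - \hat{Y} = Y - XW, \qquad r \in \mathbb{R}^{N \times 1}
```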

The ‘perfect’ case would be for this residual vector (Figure 6) to be full of zeros, as this would represent a perfect fit on the training data. But that is almost never the case, and it would also likely be a sign of ‘overfitting’ the model.
Cost Functions:
In order to determine how good a model is, we need some definition of ‘goodness’. In linear regression, this is almost always the mean squared error (MSE): the average of the squared differences between our estimates and the true observations.

Typically, the error on a single example is called the loss. For the whole training set, we work with the cost function, which is the average of the losses over all training examples.
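Written out (using a 1/2N scaling, a common convention that keeps the derivative tidy; the constant in front is a matter of choice), the cost is:

```latex
J(W) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^{\top} W \right)^2
     = \frac{1}{2N} (Y - XW)^{\top} (Y - XW)
```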
To find the minimum of the cost surface, we use gradient descent, which involves taking the derivative of the cost with respect to each parameter.
When there are only two parameters, the cost surface can actually be visualized as a contour plot. In higher dimensions we can’t directly visualize the surface, but the process of finding the minimum remains the same. Gradient descent relies on the learning rate alpha, which controls the size of the step we take.

For each epoch, or iteration, we calculate the derivative of the cost function with respect to each parameter and take a step in the direction of steepest descent. This (eventually) brings us to a minimum. In practice it is not so simple: a learning rate that is too large can overshoot or diverge, while one that is too small can make convergence painfully slow (and, on non-convex cost surfaces, leave us trapped in local optima).
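The update applied at every iteration is the standard gradient descent step, with learning rate alpha:

```latex
W \leftarrow W - \alpha \, \nabla_W J(W) = W - \frac{\alpha}{N} X^{\top} (XW - Y)
```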

Now is a good time to define some helper functions that we will use later on:
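The original gist is not reproduced here; a minimal sketch of the kind of helpers we need might look like this (the names add_intercept and calculate_mse are illustrative):

```python
def add_intercept(X):
    """Prepend a column of ones so the first weight acts as the intercept W0."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])


def calculate_mse(W, X, y):
    """Mean squared error of the predictions X @ W against the true outcomes y."""
    residuals = np.asarray(y, dtype=float) - X @ W
    return np.mean(residuals ** 2)
```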
Linear Regression:
Linear regression is the simplest regression algorithm and was first described in 1875. The name ‘regression’ derives from the phenomenon Francis Galton noticed of regression towards the mean: children of very tall or very short parents were usually still taller or shorter than average, but they tended to be closer to the mean height than their parents were.

Least squares regression works by fitting a line (or a hyperplane in more than two dimensions) and computing the distance from the fitted values to the actual observed points. The least squares model is the one that minimizes the sum of squared distances between the fitted line and the observed data.


You may notice that this makes the algorithm susceptible to outliers: a single outlying observation can greatly impact our estimate. That is true; in other words, linear regression is not robust to outliers.
Another issue is that we may fit the line to the training data too well. Suppose we have a lot of training data and many predictors, some of which are collinear. We may obtain a line that fits the training data extremely well, but does not perform as well on the test data. This is where the alternative linear regression methods can excel: because least squares considers all the predictors, with no penalty for adding extra ones, it is susceptible to overfitting.
Because linear regression doesn’t require us to tune any hyperparameters, we can fit the model using the training dataset. We then evaluate the linear model on the test dataset and obtain our mean squared error.
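For example, with scikit-learn (a sketch; the variable names follow the split above):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression().fit(x_train, y_train)
test_mse = mean_squared_error(y_test, lin_reg.predict(x_test))
print(f"Linear regression test MSE: {test_mse:.3f}")
```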
Gradient Descent from Scratch:
The following code implements gradient descent from scratch and includes the option of a regularization parameter. By default, ‘reg’ is set to zero, which makes it equivalent to gradient descent on the plain least squares cost function. When reg is greater than zero, the algorithm produces ridge regression estimates.
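The original gist is not shown here; a sketch of such a function, using the 1/2N-scaled cost from earlier (the default learning rate, epoch count, zero initialization, and the exact scaling of the penalty term are all illustrative choices):

```python
def gradient_descent(X, y, lr=0.1, epochs=5000, reg=0.0):
    """Gradient descent on the (1/2N-scaled) least squares cost.

    With reg == 0 this is ordinary least squares; with reg > 0 it adds an L2
    penalty and produces ridge estimates. X is assumed to already contain a
    leading column of ones, and the intercept weight W[0] is not penalized.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    W = np.zeros(p)
    costs = []
    for _ in range(epochs):
        residuals = X @ W - y                          # shape (n,)
        grad = X.T @ residuals / n                     # gradient of the data term
        penalty_grad = reg * W / n
        penalty_grad[0] = 0.0                          # do not shrink the intercept
        W -= lr * (grad + penalty_grad)
        cost = np.sum(residuals ** 2) / (2 * n) + reg * np.sum(W[1:] ** 2) / (2 * n)
        costs.append(cost)
    return W, costs
```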
Since we are now using a custom function, we need to add a column of ones to our matrix x_train_scaled; this column accounts for the intercept term (the one multiplied by the weight W0). We also convert our objects into numpy arrays to make the matrix calculations easier.
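For example, continuing with the sketches above:

```python
# x_train is already scaled and centred; add the intercept column and convert
# everything to plain numpy arrays.
x_train_gd = add_intercept(x_train)
y_train_gd = np.asarray(y_train, dtype=float)

# reg=0 gives plain least squares; the learning rate and epoch count are illustrative.
Wlinear, costs = gradient_descent(x_train_gd, y_train_gd, lr=0.1, epochs=5000, reg=0.0)
print(Wlinear)
```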
Let us check out how the gradient descent went:
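For instance, plotting the cost history returned by the sketch above:

```python
import matplotlib.pyplot as plt

plt.plot(costs)
plt.xlabel("Epoch")
plt.ylabel("Cost")
plt.title("Gradient descent on the training data")
plt.show()
```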

Now let us use the weights obtained via gradient descent to form predictions on the test data. Our MSE helper function uses Wlinear to calculate the predictions and returns the test MSE.
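With the hypothetical helpers above, that looks something like:

```python
x_test_gd = add_intercept(x_test)
test_mse_gd = calculate_mse(Wlinear, x_test_gd, y_test)
print(f"Test MSE (gradient descent weights): {test_mse_gd:.3f}")
```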
Using gradient descent to obtain our weights, we obtain an MSE of 0.547 on our test data.
Ridge Regression:
Ridge regression works with an augmented cost function compared to least squares. In addition to the sum of squared errors, ridge regression introduces a ‘regularization’ term, scaled by a parameter lambda, that penalizes the size of the weights.
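With the same 1/2N scaling used earlier (the exact constants are a convention), the ridge cost is:

```latex
J_{\text{ridge}}(W) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^{\top} W \right)^2
                    + \frac{\lambda}{2N} \sum_{j=1}^{p} W_j^2
```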

Fortunately, the derivative of this cost function is still easy to compute and hence we can still use gradient descent.
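Differentiating term by term, the gradient differs from the least squares gradient only by the added lambda term (the intercept weight is typically left unpenalized):

```latex
\nabla_W J_{\text{ridge}}(W) = \frac{1}{N} X^{\top} (XW - Y) + \frac{\lambda}{N} W
```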

Quick Facts:
- Ridge regression is a special case of Tikhonov regularization
- A closed-form solution exists, since adding lambda to the diagonal of X^T X guarantees that the matrix is invertible.
- Allows a tolerable amount of additional bias in return for a large increase in efficiency (a reduction in variance).
- Used in Neural Networks, where it is referred to as Weight Decay.
- Use when you have too many predictors, or predictors have a high degree of Multicollinearity between each other.
- Equivalent to Ordinary Least Squares when lambda is 0.
- Also known as L2 regularization.
- You must scale your predictors before applying Ridge.

Because Ridge regression has a hyperparameter, lambda, we form an additional holdout set called the validation set. This is separate from the test set and lets us tune the hyperparameter without touching the test data.
Choosing Lambda:
To find the ideal lambda, we calculate the MSE on the validation set for a sequence of candidate lambda values. The function getRidgeLambda fits the model on the reduced training set for each candidate lambda and checks the MSE on the validation set. It returns the best lambda, which we then use to fit the whole training set.
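The original gist is not reproduced here; a sketch of how this search might look, reusing the earlier sketches (the 80/20 validation split, the lambda grid, and the gradient descent settings are illustrative):

```python
from sklearn.model_selection import train_test_split

# Split the training data further into a reduced training set and a validation set.
x_sub, x_val, y_sub, y_val = train_test_split(x_train, y_train,
                                              test_size=0.2, random_state=0)

def getRidgeLambda(lambdas):
    """Return the lambda with the lowest MSE on the validation set."""
    X_tr, X_va = add_intercept(x_sub), add_intercept(x_val)
    best_lambda, best_mse = None, np.inf
    for lam in lambdas:
        W, _ = gradient_descent(X_tr, y_sub, lr=0.1, epochs=5000, reg=lam)
        val_mse = calculate_mse(W, X_va, y_val)
        if val_mse < best_mse:
            best_lambda, best_mse = lam, val_mse
    return best_lambda, best_mse

ideal_lambda, _ = getRidgeLambda(np.arange(0, 20.1, 0.1))
print(ideal_lambda)
```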
The ideal lambda is 8.8, as it results in the lowest MSE on the validation data. We therefore use reg = 8.8 to obtain our ridge estimates via gradient descent on the full training set.
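Continuing with the sketches above, that step looks something like:

```python
Wridge, _ = gradient_descent(add_intercept(x_train), y_train,
                             lr=0.1, epochs=5000, reg=8.8)
ridge_test_mse = calculate_mse(Wridge, add_intercept(x_test), y_test)
print(f"Ridge test MSE: {ridge_test_mse:.3f}")
```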
Using Ridge Regression, we get an even better MSE on the test data of 0.511. Notice our coefficients have been ‘shrunk’ when compared to the coefficients estimated in least squares.
Lasso Regression:
Lasso regression (‘Least Absolute Shrinkage and Selection Operator’) also works with an alternative cost function:
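With the same scaling conventions as before (the constant in front of the penalty is again a matter of choice), the lasso cost only changes the penalty from squares to absolute values:

```latex
J_{\text{lasso}}(W) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^{\top} W \right)^2
                    + \frac{\lambda}{N} \sum_{j=1}^{p} \lvert W_j \rvert
```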

However, the L1 penalty on the weights is not differentiable at zero, so we can’t simply apply plain gradient descent. The Lasso allows a coefficient to be forced exactly to zero (see Figure 19), essentially making the Lasso a method of model selection as well as a regression technique.
Quick Facts:
- Known as a method that ‘induces sparsity’.
- Sometimes referred to as Basis Pursuit.

Since we can’t apply plain gradient descent, we use scikit-learn’s built-in implementation to calculate the weights. However, this still requires us to pick the ideal shrinkage parameter (as we did for ridge). We take the same approach as in ridge regression and search for the ideal regularization parameter using the validation data.
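A sketch of that search with scikit-learn's Lasso, reusing the validation split from the ridge section (the alpha grid is illustrative; scikit-learn calls the shrinkage parameter alpha):

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

best_alpha, best_val_mse = None, np.inf
for alpha in np.arange(0.001, 1.001, 0.001):
    lasso = Lasso(alpha=alpha, max_iter=10000).fit(x_sub, y_sub)
    val_mse = mean_squared_error(y_val, lasso.predict(x_val))
    if val_mse < best_val_mse:
        best_alpha, best_val_mse = alpha, val_mse

# Refit on the full training data with the chosen alpha and score on the test set.
lasso = Lasso(alpha=best_alpha, max_iter=10000).fit(x_train, y_train)
print(mean_squared_error(y_test, lasso.predict(x_test)), lasso.coef_)
```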
Lasso gives an MSE of 0.482 on the test data, which is even lower than both ridge and linear regression! Moreover, the Lasso sets some coefficients exactly to zero, eliminating them from the model completely.
The Elastic Net:
Finally, we come to the Elastic Net.
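Its cost function simply includes both penalty terms (the constants are again a matter of convention):

```latex
J_{\text{EN}}(W) = \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - x_i^{\top} W \right)^2
                 + \frac{\lambda_1}{N} \sum_{j=1}^{p} \lvert W_j \rvert
                 + \frac{\lambda_2}{2N} \sum_{j=1}^{p} W_j^2
```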

The elastic net has TWO parameters, so instead of searching for a single ideal parameter we need to search over a grid of combinations, which makes training a bit slower. Instead of searching for lambda1 and lambda2 directly, it is often best to search for the ideal ratio between the two and an overall alpha parameter equal to their sum.
Quick Facts:
- Linear, Ridge and the Lasso can all be seen as special cases of the Elastic net.
- In 2014, it was proven that the Elastic Net can be reduced to a linear support vector machine.
- The loss function is strongly convex, and hence a unique minimum exists.
The Elastic Net is an extension of the Lasso that combines both L1 and L2 regularization, so we need a lambda1 for the L1 penalty and a lambda2 for the L2 penalty. As with the Lasso, the penalized cost is not differentiable everywhere, so we rely on scikit-learn’s built-in solver. As noted above, we search for the ideal ratio between the two parameters together with the alpha parameter that is their sum.

We don’t code the Elastic Net from scratch, since scikit-learn provides it.
We do, however, search over the two parameters, alpha and l1_ratio, using the validation data. Once we have the ideal parameters, we train the algorithm on the full training set with the chosen values.
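A sketch of that grid search with scikit-learn's ElasticNet, again reusing the validation split from earlier (the grids are illustrative):

```python
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

best_alpha, best_ratio, best_val_mse = None, None, np.inf
for alpha in np.arange(0.01, 1.01, 0.01):
    for l1_ratio in np.arange(0.05, 1.0, 0.05):
        enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000).fit(x_sub, y_sub)
        val_mse = mean_squared_error(y_val, enet.predict(x_val))
        if val_mse < best_val_mse:
            best_alpha, best_ratio, best_val_mse = alpha, l1_ratio, val_mse

# Refit on the full training data with the chosen pair and score on the test set.
enet = ElasticNet(alpha=best_alpha, l1_ratio=best_ratio, max_iter=10000).fit(x_train, y_train)
print(mean_squared_error(y_test, enet.predict(x_test)), enet.coef_)
```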
Wow! Elastic Net provides an even smaller MSE (0.450) than all the other models.
Putting it together:
Finally, we have calculated the results for Least Squares, Ridge, Lasso and the Elastic Net. We have obtained weights for each of these methods, and also obtained the MSE on the original test dataset. We can summarize how these methods performed in a table.
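Based on the test MSEs reported above:

| Method | Test MSE |
| --- | --- |
| Least Squares (gradient descent) | 0.547 |
| Ridge (lambda = 8.8) | 0.511 |
| Lasso | 0.482 |
| Elastic Net | 0.450 |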

Simple least squares performed the worst on the test data of all the models. Ridge regression gave results similar to least squares, but it did better on the test data and shrank most of the parameters. Elastic Net provided the best test MSE by quite a wide margin: it removed lcp, gleason and age and shrank the other parameters. Lasso also removed age, lcp and gleason, but performed slightly worse than Elastic Net.
Summary:
Understanding basic least squares regression is still extremely useful, but there are improved methods that should also be considered. One issue with plain least squares is that it does nothing to guard against overfitting. Ridge regression addresses this by shrinking the parameters. The Lasso takes this a step further by allowing coefficients to be forced exactly to zero, eliminating them from the model. Finally, the Elastic Net combines the benefits of both the Lasso and ridge.
In certain cases we can derive exact closed-form solutions for least squares, and we can always derive one for ridge provided lambda > 0. Choosing lambda is the hard part (use a validation set or cross validation on the training data to choose it). We didn’t show the closed-form solutions in this guide, because it is more instructive to see how the solutions can be computed from scratch, and because closed-form solutions become impractical or unavailable in high dimensions (the Lasso, for example, has none).
Thank you for reading, and please send me any questions or comments!
Want to learn more?
If you liked these topics and want to learn more advanced regression techniques, check out the following topics:
Sources:
Images from Elements of Statistical Learning are used with permission. "The authors (Hastie) retain the copyright for all these figures. They can be used in academic presentations."
Code on GitHub:
[1] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology).
[2] Amini, A., Soleimany, A., Karaman, S., & Rus, D. (2018). Spatial uncertainty sampling for end-to-end control. Neural Information Processing Systems (NIPS).
[3] Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of urology, 141(5), 1076–1083. https://doi.org/10.1016/s0022-5347(17)41175-x
[4] Hastie, T., Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
[5] Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics.
[6] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.