Last week we took a look at how to solve Linear Regression from scratch using the normal equation. If you need a quick refresher, I highly recommend starting there before moving forward. If not, let’s dive into Ridge Regression!
Ridge Regression, like its sibling, Lasso Regression, is a way to "regularize" a linear model. In this context, regularization can be taken as a synonym for preferring a simpler model by penalizing larger coefficients. We can achieve this concretely by adding a measure of the size of our coefficients to our cost function, so that when we minimize the cost function during model training, we prefer smaller coefficients to larger ones.
In the case of Ridge Regression, this measure is the squared _ℓ₂_-norm of our coefficients (feature weights). We control the degree of regularization by multiplying this term by the scalar alpha (also commonly written as lambda; we use alpha to maintain consistency with scikit-learn-style estimators). The resulting cost function we’d like to optimize looks like this:

$$J(\boldsymbol{\theta}) = \sum_{i=1}^{m}\left(y^{(i)} - \boldsymbol{\theta}^\top \mathbf{x}^{(i)}\right)^2 + \alpha \sum_{j=1}^{n} \theta_j^2$$
Note that in our solution, the normed theta term is modified slightly: it excludes the first element of theta, the coefficient of our intercept or "bias" term (hence the penalty sum starting at j = 1 above). You may recall from the previous post on simple Linear Regression that this form lends itself neatly to representation in matrix form, which we will again make use of here with a slight modification:

$$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^\top \mathbf{X} + \alpha \mathbf{A}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$
Where A is a modified identity matrix that determines which coefficients are regularized. Since it makes little practical sense to regularize our intercept term, we will replace the first element of the identity matrix with zero, and otherwise leave the ones along the diagonal intact. For a problem where X contains three features plus a leading intercept column, A looks like this:

$$\mathbf{A} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
Note that we only include the regularization term when fitting our model. Once we have our vector of best coefficients for a given alpha, the method of prediction is the same as in Linear Regression.
Here is the code implementing the above closed-form approach:
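The sketch below shows one possible version, written as a minimal scikit-learn-style estimator; the class name `RidgeRegression` and its attribute names are illustrative rather than taken from any particular library.

```python
import numpy as np


class RidgeRegression:
    """Closed-form Ridge Regression: theta = (X^T X + alpha * A)^(-1) X^T y,
    where A is the identity matrix with its first diagonal entry set to zero
    so that the intercept term is not regularized."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.theta_ = None

    def fit(self, X, y):
        # Prepend a column of ones for the intercept ("bias") term.
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        # Modified identity matrix: zero in the top-left corner so the
        # intercept coefficient is left unpenalized.
        A = np.identity(X_b.shape[1])
        A[0, 0] = 0.0
        # Solve the regularized normal equation.
        self.theta_ = np.linalg.inv(X_b.T @ X_b + self.alpha * A) @ X_b.T @ y
        return self

    def predict(self, X):
        # Prediction is identical to plain Linear Regression.
        X_b = np.c_[np.ones((X.shape[0], 1)), X]
        return X_b @ self.theta_
```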
Ridge Regression is a rich enough topic to warrant its own article, and although the scope of this post is restricted to a small piece of one possible implementation, it is worth briefly touching on a few practical notes for using Ridge Regression successfully.
The bias/variance trade-off
The value of alpha that one selects when tuning the model has a large impact on the results. Setting alpha to zero makes Ridge Regression identical to Linear Regression. Low values of alpha lead to lower bias and higher variance (the model is prone to overfitting the training data). As alpha grows larger, the results will look more and more like a flat line through the mean of the data: higher-alpha models have higher bias and lower variance (if alpha is too high, the model will under-fit the training data). A full discussion of this trade-off is beyond the scope of this piece, but we provide a visual below, as well as some code to explore the effect.

Code for the above graphic:
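A sketch along these lines, reusing the `RidgeRegression` estimator from the snippet above, with a small made-up data set purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Toy one-dimensional data set with some noise.
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(-3, 3, size=(50, 1)), axis=0)
y = 2 * X[:, 0] + rng.normal(scale=2.0, size=50)

X_plot = np.linspace(-3, 3, 100).reshape(-1, 1)

# Fit one model per alpha; larger alphas shrink the fit toward a flat line.
for alpha in (0, 10, 100, 1000):
    model = RidgeRegression(alpha=alpha).fit(X, y)
    plt.plot(X_plot, model.predict(X_plot), label=f"alpha = {alpha}")

plt.scatter(X, y, color="gray", s=15, label="training data")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.title("Ridge Regression fits for increasing alpha")
plt.show()
```

With alpha = 0 the fit matches ordinary Linear Regression; as alpha increases, the slope shrinks toward a flat line through the mean of the data.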
We leave you with some final notes on practical use.
- Ridge Regression, and regularization in general, is sensitive to the scale of the input features, so it is good practice to scale your inputs prior to fitting a Ridge model.
- Alpha is what is known as a "hyper-parameter." To tune it properly, it is important to implement not just a train/test split, but a sound cross-validation strategy as well. Scikit-learn offers several methods of implementing such an approach, as well as numerous practical examples; a short sketch follows this list.
- A good use case for Ridge Regression is when there is a large number of features, and it is beneficial to ensure that the model does not become overly sensitive to quirks in the training data.
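As a brief illustration of the first two notes, here is one way to combine feature scaling with a cross-validated search over alpha using scikit-learn’s `StandardScaler`, pipelines, and `RidgeCV`; the data below is randomly generated purely to make the example runnable:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data: 200 samples, 10 features (substitute your own data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale the features, then search over candidate alphas with cross-validation.
model = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0]),
)
model.fit(X_train, y_train)

print("chosen alpha:", model[-1].alpha_)
print("test R^2:", model.score(X_test, y_test))
```

By default, `RidgeCV` uses an efficient leave-one-out-style cross-validation over the candidate alphas; `GridSearchCV` with `Ridge` is an equally valid choice when you want explicit control over the folds.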