
Overfitting is not the only problem Regularisation can help with

Understanding mathematically how Ridge Regression helps in cases where the number of features exceeds data points

Photo by Franki Chamaki on Unsplash

The problem of more features than data

When we talk about regularisation, we almost always talk about it in the context of overfitting, but a lesser-known fact is that it can also help when we have more features than data points.

Since data forms the very core of machine learning, you would not expect to run into such a problem. Having said that, it is common, especially in ML problems concerning biology, to have many features and very few data points. In the case of regression problems, having more features than data points makes Ordinary Least Squares Regression (OLS) perform very poorly.

Let us understand why through a ridiculously simple example.

This is the equation of a straight line:

hθ(x) = θ0 + θ1x1

To find out the values of θ0 and θ1 given x1 and hθ(x), we need at least 2 data points. Say we are predicting the weight of people based on a single feature – their age. In that case, we need data for at least two people to fit a straight line. What if we had the age of only ONE person?

Infinite solutions (Image By Author)

We can see above that multiple straight lines can fit through one point, and although I have only shown three such lines, such a system has infinitely many solutions!

The normal equation for linear regression and non-invertibility

Firstly, the cost function we minimize during linear regression is

Linear Regression Cost Function
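In standard notation, this squared-error cost function is typically written as:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2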

Here m stands for the number of data points and theta is an (n+1)-dimensional vector (n dimensions for the n features plus 1 dimension for the bias θ0).

To minimize this cost function, we set the partial derivative with respect to each entry of the theta vector to 0 and solve the resulting equations simultaneously for theta.

Here is our final normal equation

Normal Equation for Linear Regression
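In standard matrix notation, this closed-form solution is:

\theta = \left( X^T X \right)^{-1} X^T y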

A detailed derivation for those familiar with calculus can be found here on page 11.

Now X is an (m × (n+1))-dimensional matrix consisting of m data points and n features. Note that the first column of X is filled with 1s to account for the bias. Hence X has a column for each of the n features plus a column of 1s, making it (n+1) columns in total.

If we had two data points with feature values (2, 5) and (4, 7), this is what our X matrix would look like:

An example X matrix
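Assuming each pair above is one data point’s two feature values, and with the bias column of 1s prepended, that X would be:

X = \begin{bmatrix} 1 & 2 & 5 \\ 1 & 4 & 7 \end{bmatrix}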

Now, in the normal equation, whenever we have m < n, that is, whenever X has fewer rows than features, the inverse of the product of X transpose and X DOES NOT EXIST.
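Here is a quick numerical sketch of this claim (the feature values below are made up purely for illustration): build a small X with fewer rows than columns and check the rank of XᵀX.

```python
import numpy as np

# 2 data points, 3 features (m < n), plus the leading column of 1s for the bias.
# These feature values are made up purely for illustration.
X = np.array([[1.0, 2.0, 5.0, 3.0],
              [1.0, 4.0, 7.0, 6.0]])

XtX = X.T @ X                       # shape (4, 4), but its rank is at most 2
print(np.linalg.matrix_rank(XtX))   # prints 2 -> XtX is singular
# np.linalg.inv(XtX)                # would fail here because XtX is singular
```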

Here is a mathematical proof for the same. Please feel free to skip the proof if you are not familiar with Linear Algebra.

Proof 1

Takeaway

Ordinary Least Squares Regression will not give us a solution via the normal equation when the number of data points is smaller than the number of features, because the matrix XᵀX is not invertible.

Regularisation to the rescue!

Here is the equation of the regularised cost function, more specifically for Ridge Regression. Lambda is the regularisation parameter, which is usually fine-tuned.

Ridge Regression Cost Function
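In the same standard notation as before (and assuming the bias term θ0 is left out of the penalty, which is consistent with the M matrix defined below), this is typically written as:

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]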

Using the same logic as in the previous section, we differentiate this function, set the derivatives equal to zero, and solve for theta to arrive at the regularised normal equation.

Regularised Normal Equation
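Written out, this is:

\theta = \left( X^T X + \lambda M \right)^{-1} X^T y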

Here M is an ((n+1) × (n+1)) matrix with 1s on its diagonal and 0s everywhere else, except for the top-left entry, which is also 0 (so the bias term θ0 is not regularised). For example, a 3 × 3 M matrix looks like this:

3 Dimensional M matrix
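That is:

M = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}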

Ta-Da! As long as lambda is strictly greater than 0, we can prove that the matrix being inverted in the regularised normal equation, XᵀX + λM, is invertible.
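Here is a rough numerical sketch of that claim, reusing the same made-up 2-points-by-3-features setup as before (the target values y are also invented purely for illustration):

```python
import numpy as np

# Same made-up wide setup as before: 2 data points, 3 features plus a bias column.
X = np.array([[1.0, 2.0, 5.0, 3.0],
              [1.0, 4.0, 7.0, 6.0]])
y = np.array([60.0, 75.0])          # invented target values

lam = 0.1                           # any lambda strictly greater than 0
M = np.eye(X.shape[1])
M[0, 0] = 0.0                       # the bias term is not regularised

A = X.T @ X + lam * M
print(np.linalg.matrix_rank(A))     # prints 4 -> full rank, so A is invertible
theta = np.linalg.solve(A, X.T @ y) # the regularised normal equation
print(theta)
```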

Here is the proof for the mathematically curious! Please feel free to skip it if you are not familiar with linear algebra.

Proof 2

Takeaway

Regularised linear regression will give us a solution for a problem where the number of data points is smaller than the number of features, thanks to the additional regularisation term. For Ridge, that solution even comes in closed form via the regularised normal equation above; Lasso has no closed-form normal equation, but its penalty helps in a similar way.

So what about gradient descent then?

So far we have only spoken about the normal equation for linear regression, but what if you use gradient descent to optimize theta? Although this should still give you a solution, it turns out that since we have multiple solutions due to more unknowns than data points, the solution we land on may not generalize well. Adding a regularisation term helps constrain the objective further and biases the solution towards "small" values of theta (so that small changes in input don’t translate to large changes in output). This will usually help us converge to a more generalisable solution.
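For concreteness, here is a minimal sketch of what gradient descent on the ridge cost might look like (the function name, learning rate, and iteration count are arbitrary illustrative choices, not from the article):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.1, lr=0.01, n_iters=5000):
    """Minimal gradient descent on the ridge cost; X is assumed to already
    contain a leading column of 1s for the bias."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        error = X @ theta - y
        grad = (X.T @ error) / m      # gradient of the squared-error term
        reg = (lam / m) * theta       # gradient of the L2 penalty...
        reg[0] = 0.0                  # ...with the bias term left unregularised
        theta -= lr * (grad + reg)
    return theta
```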

How are such problems dealt with in practice?

When we face the problem of more features than data points in a regression problem, here is what we usually do.

  1. Choose either Ridge (L2) or Lasso (L1) regression.
  2. Use K-fold cross-validation to tune the regularisation parameter (lambda); a short sketch of this follows the list.
  3. After tuning, a lambda strictly greater than 0 should give better performance than Ordinary Least Squares Regression.
  4. In some extreme cases, if the tuned lambda turns out to be zero, the problem itself is not a good fit for the algorithm and regularisation won’t give a better solution than OLS.
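As a rough sketch of steps 1 and 2 (the dataset below is randomly generated just for illustration), scikit-learn’s RidgeCV can do the cross-validated tuning for you; note that scikit-learn calls lambda "alpha":

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Synthetic wide dataset: far more features (500) than data points (50).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 500))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(scale=0.1, size=50)

# 5-fold cross-validation over a grid of regularisation strengths.
model = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5)
model.fit(X, y)
print("chosen lambda (alpha):", model.alpha_)
```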

I know I talked about extreme examples where we only had 1 or 2 data points, but in practice you will face problems with ~500 data points and ~5000 features, giving you the option to tune lambda.

Conclusion

We saw an interesting use case of Regularisation apart from the more famous one (solving overfitting). We also saw mathematical proofs of how regularisation achieves this. I would want you to be cautious, though: just because regularisation finds a solution doesn’t mean that it is the best one. Regularisation will usually perform better than OLS in this setting, but other ML algorithms could solve your problem even better!

Moreover, there are other ways, such as dimensionality reduction, to solve this problem of more features than data points.

If you liked this article, here are some more!

Dealing with features that have high cardinality

Regex essential for NLP

Powerful Text Augmentation using NLPAUG!

Scatter Plots on Maps using Plotly

Effortless Exploratory Data Analysis (EDA)

Check out my GitHub for some other projects. You can contact me here. Thank you for your time!

