Ace your Machine Learning Interview – Part 1

Dive into Linear, Lasso and Ridge Regression and their assumptions

Introduction

These days I am doing quite a few interviews in the field of Machine Learning, as I have moved abroad and need to look for a new job.

Big companies and small startups alike want to make sure you know the fundamentals of Machine Learning, so I’m spending some of my time going over the basics again. I decided to share a series of articles about what you need to know to handle Machine Learning interviews, hoping it will help some of you as well.

Linear Regression

When we talk about Linear Regression, we have a set of points that, for simplicity, you can think of as plotted on a 2-dimensional plane (x: feature, y: label), and we want to fit these points with a straight line.

That is, we want to find that straight line that passes right ‘between’ the points as in the figure above.

We know that the equation of the straight line in red is of the form y = ax + b. So finding the right straight line means estimating the two parameters a and b.

But how do we find these parameters that we can call θ = (a,b) ?

Intuitively the mechanism is simple: we start with an arbitrary straight line and iteratively adjust it, getting a better line at each step.

But how do we improve the estimate of θ = (a,b) at each iteration? To do this we need a way to tell, for each line, how large its error is. In the image above, the ITER-0 line makes a larger error than the ITER-1 line. Now we need a formalism that makes this explicit, even if visually it is trivial.

This is where the so-called Loss Function comes to our aid. This function, given a line and our dataset points, returns a number that quantifies the error made by the model, great! The loss function in question is called Mean Squared Error (MSE).
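Written out for our line y = ax + b over n points (xᵢ, yᵢ), the MSE is:

$$\mathrm{MSE}(a, b) = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (a\,x_i + b) \big)^2$$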

In this function the errors of all the points are averaged, where the error of each point is its squared vertical distance from the line. In fact, if you look closely at the parentheses, we have the difference between the actual y and the predicted y. Note that the predicted y is the result of ax + b (where initially a and b are chosen at random).

Now that we also have a way to quantify the error of our straight line, we need a method to improve the parameters θ = (a,b) iteratively.

To do this we use the Gradient Descent algorithm! The algorithm tells us to update the parameters in the following way.
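In symbols, with α the learning rate, the update at each step is:

$$a \leftarrow a - \alpha \, \frac{\partial \, \mathrm{MSE}}{\partial a}, \qquad b \leftarrow b - \alpha \, \frac{\partial \, \mathrm{MSE}}{\partial b}$$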

Then we only need to compute the partial derivatives of the loss function with respect to the two parameters (a and b) and set a learning rate α, the hyperparameter that controls how quickly we converge (i.e. find the optimal line). If we choose too low a learning rate we may converge too slowly. Conversely, too high a learning rate may not lead us to convergence at all.

We can repeat this update until the value of the loss function, that is, the error of our model, is low; at that point we have found a line that fits our points!

We can also see this continuous updating of the parameters to minimize the loss visually, as descending toward the minimum of a function step by step.
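To make the whole loop concrete, here is a minimal NumPy sketch of gradient descent on a synthetic dataset (all names and numbers are illustrative, not code from the original article):

```python
import numpy as np

# Synthetic 2D dataset: y is roughly a linear function of x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)

# Start from a random line and improve it step by step
a, b = rng.normal(), rng.normal()
alpha = 0.01  # learning rate
for _ in range(2000):
    y_pred = a * x + b
    error = y_pred - y
    # Partial derivatives of the MSE with respect to a and b
    grad_a = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    a -= alpha * grad_a
    b -= alpha * grad_b

print(f"estimated a={a:.2f}, b={b:.2f}")  # should roughly recover a≈3, b≈2
```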

Multiple Linear Regression

Obviously, linear regression can be generalized to the case where we have n-dimensional points, that is, n−1 features instead of just one. In this case, each point in the dataset will be described by n−1 feature values.

In this case, the prediction for point i is calculated in the following way.
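A reconstruction of that formula, with the n−1 features of point i written as x_{i,1}, …, x_{i,n−1}:

$$\hat{y}_i = a_1 x_{i,1} + a_2 x_{i,2} + \dots + a_{n-1}\, x_{i,n-1} + b$$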

Here, we will have to estimate all the parameters a1, a2, …, a(n−1) and b.

Problem of Multicollinearity

The multicollinearity problem occurs when two or more features are highly correlated. One of the greatest advantages of linear regression is its explainability. In a model of the type y = a1x1 + a2x2, I know that whenever x1 increases by one unit, y increases by a1 units. This follows immediately from the formula. But if x1 and x2 are correlated, when x1 increases by one unit, x2 also changes in some way, and I can no longer tell how y will change! Think what a mess when I have n features! To detect multicollinearity one can use a correlation matrix, which is usually plotted as a heatmap. The heatmap has more intense values where the variables are correlated.
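As a small illustrative sketch (the DataFrame and its columns are made up for the example), this is how the check typically looks with pandas and seaborn:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative dataset: x2 is strongly correlated with x1
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # roughly 2 * x1
    "x3": [5, 1, 4, 2, 6, 3],
})

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix")
plt.show()
```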

Ridge Regression

Ridge Regression is a variant of Linear Regression in which a slight modification is made to the loss function.

The loss function that is used is the following.
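A reconstruction of that loss, which is just the MSE plus an L2 penalty on the coefficients:

$$\mathrm{Loss} = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{y}_i \big)^2 + \lambda \sum_{j} a_j^2$$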

The difference from the MSE seen before is the added penalty term, which is a sum of the squared parameters of the model (the straight line). So the larger the coefficients, the larger the loss. In this way, the loss function forces the Gradient Descent algorithm to find a line with small parameters; it is an additional constraint that limits the learning capacity of the model so that it does not fall into overfitting. Lambda is a multiplicative factor, and it is another hyperparameter of the model.

Lasso Regression

Lasso Regression is very similar to ridge, but the penalty changes slightly.
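A reconstruction of the lasso loss, where the penalty uses the absolute values of the coefficients:

$$\mathrm{Loss} = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat{y}_i \big)^2 + \lambda \sum_{j} |a_j|$$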

In this case, we have the absolute value of the parameters instead of the square. Lasso often drives some of the coefficients exactly to zero, which means that the corresponding features are effectively removed. Because of this, we can also use lasso regression as a form of feature selection, to figure out which features are the least useful!
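A minimal sketch of this behaviour with scikit-learn (synthetic data; the penalty strength lambda is exposed as the alpha argument in sklearn):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only the first 2 of 5 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 2))  # shrunk, but typically all non-zero
print("lasso coefficients:", np.round(lasso.coef_, 2))  # irrelevant features driven to exactly 0
```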

Basic Assumptions

Before applying linear regression there are some assumptions we need to be aware of, 4 in particular:

  1. Linearity: obviously it must be possible to fit the data with a straight line (or a hyperplane in multiple dimensions).
  2. Homoscedasticity: the variance of the residuals stays the same for every value of x. For example, if the residuals fan out as x grows (a funnel shape in the residual plot), this assumption is violated.
  3. Independence: observations are independent of one another.
  4. Normality of Errors: the residuals must be approximately normally distributed (you can check this with a QQ-Plot, as in the sketch after this list).
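A minimal sketch of that QQ-Plot check with SciPy, using illustrative residuals (in practice they would be y − y_pred from your fitted model):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Illustrative residuals: what we would compute as y - y_pred after fitting a model
rng = np.random.default_rng(0)
residuals = rng.normal(scale=1.0, size=200)

# QQ-plot: if the points lie close to the straight line,
# the residuals are approximately normally distributed
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```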

Advantages

  1. Linear regression performs very well when the relationship between the features and the target is (approximately) linear
  2. Easy to implement and train
  3. It can handle overfitting using regularization (lasso, ridge)

Disadvantages

  1. Sometimes a lot of feature engineering is required
  2. If the features are correlated it may affect performance
  3. It is sensitive to noise.

Let’s code!

A complete article I wrote previously on implementing linear regression can be found here: "Linear Regression and Gradient Descent Using Only Numpy". Now let’s see how to implement a simple linear regression with just a few lines of code using sklearn.
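Here is a minimal sketch of that sklearn version on synthetic data (not the exact snippet from the original gist):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic dataset: y ≈ 3x + 2 plus some noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```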

Final Thoughts

I have previously written an article on how to implement linear regression from scratch using only NumPy; you can find it here! Linear regression is a fundamental algorithm to start studying ML: if you understand it well, it will be much easier to understand more complicated algorithms like Neural Networks.

An evolution of linear regression is Polynomial Regression, a more complex model that can also fit non-linear datasets by introducing more complex features; please check here: https://en.wikipedia.org/wiki/Polynomial_regression.

The End

Marcello Politi

Linkedin, Twitter, CV
