Why multicollinearity isn’t an issue in Machine Learning

Whether to deal with multicollinearity depends on the purpose of the analysis. Learning to distinguish between model interpretation and prediction will shape how you prepare your data.

Tarek Ghanoum
Towards Data Science


Photo by Austin Chan on Unsplash edited by author

Having taught statistics for three years at Copenhagen Business School, I got quite used to explaining the model assumptions of multiple linear regression (look them up if you need a recap). The only difference in assumptions between a simple linear regression and a multiple one is the issue of multicollinearity.

I taught my students how to check this assumption and how to deal with it, but then a professor told me to ignore it altogether. This article outlines his answer and explains when to pay attention to the possibility of multicollinearity.

What is multicollinearity?

When features are highly correlated among themselves, we usually say that a problem of multicollinearity, or intercorrelation, exists. There are multiple ways of trying to detect multicollinearity. One way is by using a correlation matrix as shown below:

Multicollinearity of the iris dataset
Image by author

The matrix clearly shows a high correlation between petal length (cm) and petal width (cm). Another way of spotting multicollinearity is to calculate the Variance Inflation Factor (VIF), which measures how much the variance of a coefficient is inflated because the corresponding feature can be predicted from the other features. The rule of thumb is that the VIF shouldn't be higher than 10:

Image by author
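Computing the VIFs only takes a few lines. Here is a minimal sketch, assuming scikit-learn's bundled iris dataset and the variance_inflation_factor helper from statsmodels:

```python
# Minimal VIF check on the iris measurements
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_iris
from statsmodels.stats.outliers_influence import variance_inflation_factor

# The four iris measurements as a DataFrame
X = load_iris(as_frame=True).data

# Add an intercept column so each VIF is computed against a proper regression
X_const = sm.add_constant(X)

# One VIF per feature (index 0 is the constant, so we skip it)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
    name="VIF",
)
print(vif)
```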

We have obviously spotted multicollinearity in the dataset, but should we deal with it? It depends!

Model interpretation

When introducing students to statistics, the goal is usually model interpretation. In other words, the goal is to understand the change in the target variable (also known as the dependent variable) when one of the independent variables (also called features) changes. As an example, we have the following equation (ignoring the error component to keep things simple):

Y = β0 + β1·X1 + β2·X2 + … + βk·Xk

We can make it even more relatable by assuming that we are calculating the price of a house, which I assume depends on the number of rooms, the number of square meters, the base area, and the number of floors:

Price = β0 + β1·(rooms) + β2·(square meters) + β3·(base area) + β4·(floors)

By calculating the parameter estimates (the beta coefficients) we can get a sense of the relationship between the price of a house and the independent variables. Assuming that the number of square meters is independent of the other variables, an increase of one square meter means that the price of the house increases by β2.

Now assume that we have multicollinearity caused by a strong correlation between the number of rooms and the number of floors. Failing to adjust for or cope with it will result in unreliable parameter estimates and undermine the features' statistical significance. In this scenario, I would have to remove one of the two correlated variables in order to get more reliable estimates (for a more visual representation of the effect, click here).
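To see the effect on significance numerically, here is a small sketch on synthetic house data (the feature names and numbers are made up purely for illustration): adding a feature that is almost a copy of the number of rooms inflates the standard error of the rooms coefficient.

```python
# Synthetic illustration (numbers are made up): a nearly duplicated feature
# inflates the standard error of the original coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

rooms = rng.integers(2, 8, size=n).astype(float)
floors = rooms / 2 + rng.normal(scale=0.1, size=n)          # almost a copy of rooms
price = 50_000 + 30_000 * rooms + rng.normal(scale=10_000, size=n)

# Model 1: rooms only
m1 = sm.OLS(price, sm.add_constant(rooms)).fit()

# Model 2: rooms plus the nearly collinear floors feature
m2 = sm.OLS(price, sm.add_constant(np.column_stack([rooms, floors]))).fit()

print("std. error of the rooms coefficient, rooms only:  ", m1.bse[1])
print("std. error of the rooms coefficient, floors added:", m2.bse[1])
```

With the near-duplicate included, the standard error of the rooms coefficient grows sharply, which is exactly what undermines its statistical significance.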

A typical example of multicollinearity is when all the categories of a nominal feature are included in a regression model, the so-called dummy variable trap. By excluding one category, I can use it as a baseline for interpreting the parameter estimates while avoiding multicollinearity.
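As a concrete sketch (the "heating" feature below is hypothetical, not from the article), pandas can drop the baseline category for you:

```python
# Hypothetical example: one-hot encode a nominal feature and drop one category
# so it becomes the baseline.
import pandas as pd

df = pd.DataFrame({"heating": ["gas", "electric", "district", "gas", "district"]})

# drop_first=True leaves out the first category ("district"), which becomes the
# baseline; keeping all three dummies would make them sum to 1 and be perfectly
# collinear with the model's intercept.
dummies = pd.get_dummies(df["heating"], prefix="heating", drop_first=True)
print(dummies)
```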

Model prediction

When the goal is prediction rather than interpretation, we find ourselves in another scenario. This point is outlined in the book Applied Linear Statistical Models by Kutner et al. (2005):

The fact that some or all predictor variables are correlated among themselves does not, in general, inhibit our ability to obtain a good fit nor does it tend to affect inferences about mean responses or predictions of new observations.

To see why this is the case, let's imagine we have the following dataset:

Image by author

The two independent variables, X1 and X2, are perfectly correlated. This would usually mean unreliable parameter estimates, but let’s proceed with the given data and fit a multiple linear regression model to it. The first equation might be the following:

The equation fits the data perfectly, but another great fit would be the following (try inserting the numbers from the table and see for yourself):

The first equation says that X2 has a large impact on the target variable, while the second equation tells a completely different story. This means that we can easily have several models that all fit the data perfectly and make the same predictions, as long as we don't start interpreting the parameter estimates.
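A tiny numeric sketch of the same point, using made-up numbers where X2 is an exact linear function of X1: two very different sets of coefficients produce exactly the same predictions.

```python
# Made-up numbers with perfect correlation between X1 and X2:
# two very different coefficient sets give identical predictions.
import numpy as np

X1 = np.array([2.0, 8.0, 6.0, 10.0])
X2 = 5 + 0.5 * X1                      # X2 is fully determined by X1

pred_a = -87 + 1 * X1 + 18 * X2        # first set of coefficients
pred_b = -7 + 9 * X1 + 2 * X2          # a completely different set

print(pred_a)                          # [ 23.  83.  63. 103.]
print(np.allclose(pred_a, pred_b))     # True: identical predictions everywhere
```

Any coefficient combination that collapses to the same fitted line predicts equally well, which is why prediction survives even though interpretation breaks down.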

In other words, when preparing a data set for a machine learning model whose only goal is prediction, you do not have to remove columns just because they are correlated.

The code for the figure and table in the article:
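What follows is a minimal sketch along those lines, assuming seaborn, matplotlib, and scikit-learn's bundled iris dataset (the VIF table can be produced with the snippet shown earlier):

```python
# Minimal sketch of the correlation heatmap for the iris measurements
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).data

plt.figure(figsize=(6, 5))
sns.heatmap(iris.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Multicollinearity of the iris dataset")
plt.tight_layout()
plt.show()
```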

I hope you enjoyed this article as much as I have enjoyed writing it. Leave a comment if you have any difficulties understanding my code. The Data Science community has given me a lot, so I am always open to giving back.

Feel free to connect with me on LinkedIn and follow me on Medium to receive more articles.
