Why exclude highly correlated features when building a regression model?

Aishwarya V Srinivasan
Towards Data Science
Aug 23, 2019


If you have worked with data for some time, you probably know that the general practice is to exclude highly correlated features when running linear regression. The objective of this article is to explain why we need to avoid highly correlated features while building a simple linear regression model. I highly recommend reading my article on regression before continuing with this one.

What is correlation?

Correlation simply means a mutual relationship between two or more things. Consider data points (xᵢ, yᵢ), i = 1, 2, …, n in a dataset. Correlation asks whether large values of "x" tend to be paired with large values of "y" and small values of "x" with small values of "y", or, conversely, whether small values of "x" are paired with large values of "y" and vice versa.

In statistics, this relationship is measured by a statistic called the correlation coefficient. The formula is

r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )

Here x̄ and ȳ are the means of x and y respectively. When the correlation coefficient is less than 0, we say that x and y are negatively correlated; when it is greater than 0, they are positively correlated. The correlation coefficient always lies between -1 and 1.

The most important point to note is that correlation only measures the association between two variables; it does not measure causation. That is, large values of "y" are not caused by large values of "x" (or vice versa); it simply happens that such data pairs exist in the dataset.
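As a quick illustration, here is a minimal Python sketch (with made-up data) that computes the correlation coefficient exactly as in the formula above and checks it against NumPy's built-in np.corrcoef:

```python
import numpy as np

# Made-up data: y is roughly a noisy linear function of x
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

# Correlation coefficient computed directly from the formula above
x_centered = x - x.mean()
y_centered = y - y.mean()
r = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

print(r)                         # manual computation
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in version, should match
```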

Why exclude highly correlated features?

If you recall from my last article on regression, regression is all about learning the weight vector from the training data and using it to make predictions. For least squares, the weight vector is obtained as

Wₗₛ = (XᵀX)⁻¹ Xᵀ y

where X is the n × d matrix of training features and y is the vector of targets.

We take a probabilistic view of regression, where the dependent variable "y" is assumed to be normally distributed around the regression line with variance σ². Under this assumption, it can be shown mathematically that the variance of the above weight vector Wₗₛ is

Var(Wₗₛ) = σ² (XᵀX)⁻¹

For the model to be stable, this variance should be low. If the variance of the weights is high, the model is very sensitive to the data: the learned weights change drastically from one training set to another, and the model is unlikely to perform well on test data. So the natural question is,
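Here is a minimal sketch of these two formulas on synthetic data, assuming the noise level σ is known (in practice σ² would be estimated from the residuals):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
sigma = 0.5

# Synthetic training data: y = X @ w_true + Gaussian noise with std sigma
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=sigma, size=n)

# Least-squares weights: W_ls = (X^T X)^-1 X^T y
XtX_inv = np.linalg.inv(X.T @ X)
w_ls = XtX_inv @ X.T @ y

# Variance of the weights: Var(W_ls) = sigma^2 (X^T X)^-1
var_w = sigma ** 2 * XtX_inv

print(w_ls)            # estimated weights, close to w_true
print(np.diag(var_w))  # variance of each individual weight
```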

When will the variance of Wₗₛ be large?

By now you may have guessed that when we have highly correlated features, the variance of Wₗₛ will be large. That guess is right, but let us see why mathematically. Any n × d matrix X can be decomposed as

X = U S Vᵀ

The above decomposition is called the singular value decomposition (SVD). The "S" matrix in the above equation is a diagonal matrix whose entries, the singular values of X, are non-negative. Using this decomposition, the variance of Wₗₛ can be rewritten as

Var(Wₗₛ) = σ² (XᵀX)⁻¹ = σ² V S⁻² Vᵀ

When the dataset contains highly correlated features, some of the singular values in the "S" matrix will be very close to zero. Their inverse squares (the diagonal entries of S⁻² in the above equation) then become very large, which makes the variance of Wₗₛ large.
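The following sketch (again with made-up data) makes this concrete: two nearly identical features produce a near-zero singular value, and the weight variances explode compared with the uncorrelated case.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 100, 0.5

x1 = rng.normal(size=n)
x2_uncorr = rng.normal(size=n)                  # independent second feature
x2_corr = x1 + rng.normal(scale=0.01, size=n)   # almost a copy of x1

for name, x2 in [("uncorrelated", x2_uncorr), ("highly correlated", x2_corr)]:
    X = np.column_stack([x1, x2])
    s = np.linalg.svd(X, compute_uv=False)        # singular values (diagonal of S)
    var_w = sigma ** 2 * np.linalg.inv(X.T @ X)   # Var(W_ls) = sigma^2 (X^T X)^-1
    print(name,
          "| singular values:", np.round(s, 3),
          "| weight variances:", np.round(np.diag(var_w), 4))
```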

So, if two features are highly correlated, it is advisable to keep only one of them in the dataset, as sketched below. I hope this article was helpful. Please leave your queries, if any, below.
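As a practical note, one simple way to apply this advice is to compute the feature correlation matrix and drop one feature from each pair above a threshold. The helper below (drop_highly_correlated is a hypothetical name, not from this article) is a minimal pandas sketch of that idea:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from each pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of features is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Usage on made-up data: "b" is almost a copy of "a" and gets dropped
rng = np.random.default_rng(3)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] + rng.normal(scale=0.01, size=100)
df["c"] = rng.normal(size=100)
print(drop_highly_correlated(df).columns.tolist())
```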
