Handling Multi-Collinearity in ML Models

Easy ways to improve the interpretability of Linear Regression models

Vishwa Pardeshi
Towards Data Science


Image from Unsplash

Multiple linear regression models are used to model the relationship between a response/dependent variable and explanatory/independent variables. However, several problems, such as multi-collinearity, correlated error terms, non-constant variance of the error terms, and non-linearity, impact the model’s interpretability. This article discusses multi-collinearity, its effects, and techniques for dealing with it.

What is Multi-Collinearity?

When the explanatory variables, which are assumed to be independent of each other, turn out to be closely related, this correlation is referred to as collinearity. When the relationship involves three or more explanatory variables, it is known as multi-collinearity.

Multi-collinearity is particularly undesirable because it impacts the interpretability of linear regression models. A linear regression model not only helps establish the presence or absence of a relationship between the response and explanatory variables, it also helps identify the individual effect of each explanatory variable on the response variable.

Due to the presence of multi-collinearity, it becomes difficult to isolate these individual effects. In other words, multi-collinearity can be viewed as a phenomenon where explanatory variables are so highly linearly related that one can be predicted from the others with substantial accuracy.

Effects of Multi-Collinearity

Due to the presence of collinearity/multi-collinearity, it becomes difficult to isolate the individual effects of explanatory variables on the response variable.

Multi-collinearity results in the following:

  1. Uncertainty in coefficient estimates or unstable estimates: Small changes in the data (adding/removing rows or columns) result in large changes in the coefficients, as illustrated in the simulation sketch after this list.
  2. Increased standard error: Reduces the precision of the estimates and widens the confidence intervals around them.
  3. Decreased statistical significance: Due to the increased standard error, the t-statistic declines, which weakens the ability to detect a statistically significant coefficient and leads to type-II errors.
  4. Masked coefficients & inflated p-values: The importance of a correlated explanatory variable is hidden by collinearity.
  5. Overfitting: The inflated variance of the coefficient estimates makes the model prone to overfitting.
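To see the first effect concretely, here is a minimal simulation sketch in R (not from the original article): two nearly identical predictors are generated, and removing just a few rows changes the fitted coefficients dramatically even though the underlying relationship stays the same.

# Illustrative simulation: x2 is almost a copy of x1
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)           # x2 is highly collinear with x1
y  <- 2 * x1 + rnorm(n)
coef(lm(y ~ x1 + x2))                    # one set of coefficient estimates
coef(lm(y ~ x1 + x2, subset = -(1:5)))   # drop 5 rows: estimates swing wildly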

How to detect Multi-Collinearity?

In addition to observing the model’s behavior for the effects stated above, multi-collinearity is also captured quantitatively in correlation values. Thus, the following can be used:

Correlation matrix:

Pearson’s correlation between two variables varies between -1 and 1. Both variables must be numeric for the Pearson correlation to be computed.

Here, the correlation matrix for the Auto MPG dataset is computed using R. The name column contains strings and is therefore excluded. The response variable is mpg, which represents fuel efficiency.

# the Auto dataset is available in the ISLR package
library(ISLR)
# generate the correlation matrix in R, excluding the non-numeric "name" column
correlation_matrix <- cor(Auto[, -which(names(Auto) == "name")])
Correlation Matrix for Autompg dataset; Image by author
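Since the matrix itself appears above only as an image, a quick way to inspect it directly in the console (not part of the original snippet) is to round the values:

# round to two decimals for easier scanning
round(correlation_matrix, 2)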

High correlation between cylinders and displacement, and between horsepower and weight, can be observed. Additionally, there are several other pairs of explanatory variables with high positive or negative correlation. Thus, there is multi-collinearity in the data.

However, as one would notice, going through the table to identify these variables is tiresome even for 8 variables, and it only gets worse as the number of variables increases. A heatmap of the correlations is a more intuitive representation.

Heatmap of correlations:

A heatmap of correlations helps visualize the data better by using colour to distinguish positive from negative correlations and size to indicate their magnitude.

In R, the corrplot package can be used to create a heatmap of correlations.

library(corrplot)
# upper triangle only, variables ordered by hierarchical clustering, labels in black rotated 45 degrees
corrplot(correlation_matrix, type = "upper", order = "hclust", tl.col = "black", tl.srt = 45)
Heatmap of Correlation for Autompg dataset; Image by author

The heatmap is definitely more intuitive and visual. However, it only identifies correlation between pairs of variables and fails to capture collinearity that exists among three or more variables, for which the Variance Inflation Factor can be used.

Variance Inflation Factor (VIF): VIF is the ratio of the variance of a coefficient estimate when fitting the full model to the variance of that coefficient estimate when the predictor is fit on its own. The minimum possible value is 1, which indicates no collinearity. As a rule of thumb, if the value exceeds 5, the collinearity should be addressed.
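Equivalently, VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared obtained from regressing the j-th predictor on all the other explanatory variables. As an illustrative sketch of that definition (using the same Auto data frame as above), the VIF for weight could be computed by hand:

# R-squared from regressing weight on the other predictors (mpg and name excluded)
r2_weight <- summary(lm(weight ~ . - mpg - name, data = Auto))$r.squared
1 / (1 - r2_weight)   # should match the vif() output for weight below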

library(rms)
# fit the full multiple regression model, excluding the non-numeric name column
multiple.lm <- lm(mpg ~ . - name, data = Auto)
summary(multiple.lm)
# variance inflation factor for each explanatory variable
vif(multiple.lm)
VIF values for Autompg dataset; Image by author

The VIF values for cylinders, displacement, horsepower, and weight are far higher than 5 and hence should be handled, as the collinearity in the data is high.

Dealing with Multi-Collinearity

Multi-collinearity can be handled with the following two methods. Note that the correlation between independent variables creates data redundancy; eliminating this redundancy helps get rid of multi-collinearity.

  1. Introduce penalization or remove highly correlated variables: Use lasso or ridge regression to shrink or eliminate variables that carry redundant information. Redundant variables can also be identified and dropped by inspecting their VIF values (see the sketch after this list).
  2. Combine highly correlated variables: Since collinear variables contain redundant information, combine them into a single variable using methods such as PCA to generate independent explanatory variables.
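As a rough sketch of both options in R (the particular variables dropped or combined here are illustrative choices based on the VIF values above, not a prescription from the original article):

# Option 1: drop the redundant high-VIF predictors and refit
reduced.lm <- lm(mpg ~ . - name - displacement - cylinders - horsepower, data = Auto)
vif(reduced.lm)   # remaining VIF values should be much closer to 1

# Option 2: combine the collinear predictors into a single principal component
collinear <- Auto[, c("cylinders", "displacement", "horsepower", "weight")]
pc1 <- prcomp(collinear, scale. = TRUE)$x[, 1]
combined.lm <- lm(mpg ~ pc1 + acceleration + year + origin, data = Auto)
summary(combined.lm)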

For the Auto MPG linear regression model implemented in R, check out this Github repository. The repository explores multicollinearity along with interaction terms and non-linear transformations for linear regression models.

Reference:

An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
