
Statistics in Python – Collinearity and Multicollinearity

Understand how to discover multicollinearity in your dataset

Photo by Valentino Funghi on Unsplash

In my previous article, you learned about the relationships between data in your dataset, be it within the same column (variance) or between columns (covariance and correlation).

Statistics in Python – Understanding Variance, Covariance, and Correlation

Two additional terms that you will usually encounter when you embark on your machine learning journey are:

  • Collinearity
  • Multicollinearity

In this article, I want to explain the concepts of collinearity and multicollinearity, why it is important to understand them, and how to take appropriate action when preparing your data.

Correlation vs. Collinearity vs. Multicollinearity

If you recall, correlation measures the strength and direction of the relationship between two columns in your dataset. Correlation is often used to find the relationship between a feature and the target:

Image by author

For example, if one of the features has a high correlation with the target, it tells you that this particular feature is strongly associated with the target and should be included when you are training the model.

Collinearity, on the other hand, is a situation where two features are linearly associated (highly correlated) and are both used as predictors for the target.

Image by author

Multicollinearity is a special case of collinearity, where a feature exhibits a linear relationship with two or more other features.

Image by author
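To make the definition concrete, here is a tiny, made-up illustration (the column names and numbers are invented for this sketch): a feature that is an exact linear combination of two other features can be predicted from them perfectly.

import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data – x3 is an exact linear combination of x1 and x2
toy = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                    'x2': [2, 1, 4, 3, 6]})
toy['x3'] = 2 * toy['x1'] + toy['x2']

# x3 can be predicted perfectly from x1 and x2 – perfect multicollinearity
score = LinearRegression().fit(toy[['x1', 'x2']], toy['x3']).score(toy[['x1', 'x2']], toy['x3'])
print(score)   # 1.0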

Problem with collinearity and multicollinearity

Recall the formula for multiple linear regression:

Image by author
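Written out in the notation used in the discussion below, the formula pictured above is the standard multiple linear regression equation:

y = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ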

One important assumption of linear regression is that there should exist a linear relationship between each of the predictors (x₁, x₂, etc.) and the outcome y. However, if there is correlation between the predictors (e.g. x₁ and x₂ are highly correlated), you can no longer determine the effect of one while holding the other constant, since the two predictors change together. The end result is that the coefficients (w₁ and w₂) are now less precise and hence less interpretable.
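To see this effect numerically, here is a minimal sketch (entirely synthetic data, not from the article’s dataset): x₂ is almost a copy of x₁, and across repeated samples the individual coefficients swing around noticeably, even though their sum stays close to the true combined effect.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for trial in range(3):
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)            # x2 is nearly identical to x1
    y = 3 * x1 + 2 * x2 + rng.normal(scale=0.5, size=200)
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    # the individual coefficients vary a lot, but their sum stays close to 5
    print(trial, model.coef_.round(2), round(model.coef_.sum(), 2))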

Fixing Multicollinearity

When training a machine learning model, it is important that during the data preprocessing stage you sieve out the features in your dataset that exhibit multicollinearity. You can do so using a method known as VIF (Variance Inflation Factor).

VIF allows you to determine the strength of the correlation between the various independent variables. It is calculated by taking a variable and regressing it against every other variable.

VIF calculates how much the variance of a coefficient is inflated because of its linear dependencies with other predictors. Hence its name.

Here is how VIF works:

  • Assuming you have a list of features – x₁, x₂, x₃, and x₄.
  • You start with the first feature, x₁, and regress it against the other features:
x₁ ~ x₂ + x₃ + x₄

In fact, you are performing a multiple regression above. Multiple regression generally explains the relationship between multiple independent or predictor variables and one dependent or criterion variable.

  • From the multiple regression above, you extract the R² value (between 0 and 1). If R² is large, this means that x₁ can be predicted from the three features and is thus highly correlated with them – x₂, x₃, and x₄. If R² is small, this means that x₁ cannot be predicted from the three features and is thus not correlated with them.
  • Based on the R² value calculated for x₁, you can now calculate its VIF using the following formula (written out in full after this list):
Image by author
  • A large R² value (close to 1) will make the denominator small (1 minus a value close to 1 gives a number close to 0), which results in a large VIF. A large VIF indicates that this feature exhibits multicollinearity with the other features.
  • Conversely, a small R² value (close to 0) will make the denominator large (1 minus a value close to 0 gives a number close to 1), which results in a small VIF. A small VIF indicates that this feature exhibits little multicollinearity with the other features.

(1 − R²) is also known as the tolerance.

  • You repeat the process above for the other features and calculate the VIF for each feature:
x₂ ~ x₁ + x₃ + x₄   # regress x₂ against the rest of the features
x₃ ~ x₁ + x₂ + x₄   # regress x₃ against the rest of the features
x₄ ~ x₁ + x₂ + x₃   # regress x₄ against the rest of the features
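For reference, the VIF formula pictured earlier, together with the tolerance, can be written out as follows (shown for x₁; the same applies to each of the other features):

VIF₁ = 1 / (1 − R₁²)
tolerance₁ = 1 − R₁²

where R₁² is the R² obtained from regressing x₁ against x₂, x₃, and x₄.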

While a correlation matrix and scatter plots can be used to find multicollinearity, they only show the bivariate relationships between the independent variables. VIF, on the other hand, shows the correlation of a variable with a group of other variables.

Implementing VIF using Python

Now that you know how VIF is calculated, you can implement it using Python, with a little help from sklearn:

import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(df, features):
    vif, tolerance = {}, {}
    # loop over all the features that you want to examine
    for feature in features:
        # extract all the other features you will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]
        # extract R² from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)
        # calculate tolerance
        tolerance[feature] = 1 - r2
        # calculate VIF
        vif[feature] = 1 / tolerance[feature]
    # return the VIF and tolerance for every feature as a DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})
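As an aside (this is not part of the original walkthrough), the statsmodels library ships a ready-made variance_inflation_factor function that should produce the same numbers. A minimal sketch, assuming statsmodels is installed and an intercept column is added so that each auxiliary regression matches the function above:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_sm(df, features):
    # add an intercept column so each auxiliary regression includes a constant term
    X = sm.add_constant(df[list(features)])
    # index 0 is the constant, so compute the VIF for columns 1..n only
    return pd.Series([variance_inflation_factor(X.values, i)
                      for i in range(1, X.shape[1])],
                     index=X.columns[1:], name='VIF')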

Let’s Try It Out

To see VIF in action, let’s use a sample dataset named bloodpressure.csv, with the following content:

Pt,BP,Age,Weight,BSA,Dur,Pulse,Stress,
1,105,47,85.4,1.75,5.1,63,33,
2,115,49,94.2,2.1,3.8,70,14,
3,116,49,95.3,1.98,8.2,72,10,
4,117,50,94.7,2.01,5.8,73,99,
5,112,51,89.4,1.89,7,72,95,
6,121,48,99.5,2.25,9.3,71,10,
7,121,49,99.8,2.25,2.5,69,42,
8,110,47,90.9,1.9,6.2,66,8,
9,110,49,89.2,1.83,7.1,69,62,
10,114,48,92.7,2.07,5.6,64,35,
11,114,47,94.4,2.07,5.3,74,90,
12,115,49,94.1,1.98,5.6,71,21,
13,114,50,91.6,2.05,10.2,68,47,
14,106,45,87.1,1.92,5.6,67,80,
15,125,52,101.3,2.19,10,76,98,
16,114,46,94.5,1.98,7.4,69,95,
17,106,46,87,1.87,3.6,62,18,
18,113,46,94.5,1.9,4.3,70,12,
19,110,48,90.5,1.88,9,71,99,
20,122,56,95.7,2.09,7,75,99,

The dataset consists of the following fields:

  • Blood pressure (BP), in mm Hg
  • Age, in years
  • Weight, in kg
  • Body surface area (BSA), in m²
  • Duration of hypertension (Dur), in years
  • Basal Pulse (Pulse), in beats per minute
  • Stress index (Stress)

First, load the dataset into a Pandas DataFrame and drop the redundant columns:

df = pd.read_csv('bloodpressure.csv')
df = df.drop(['Pt', 'Unnamed: 8'], axis=1)
df
Image by author

Visualizing the relationships between columns

Before you do any cleanup, it would be useful to visualize the relationships between the various columns using a pair plot (with the Seaborn module):

import seaborn as sns
sns.pairplot(df)

I have identified some columns where there seem to be strong correlations:

Image by author

Calculating Correlation

Next, calculate the correlation between the columns using the corr() function:

df.corr()
Image by author

Assuming that you are trying to build a model that predicts BP, you can see that the top features that correlate with BP are Age, Weight, BSA, and Pulse:

Image by author
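If you want to list those correlations explicitly rather than eyeballing the matrix, one quick way (a small convenience snippet, not from the original article) is:

# correlation of every column with BP, strongest first
df.corr()['BP'].sort_values(ascending=False)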

Calculating VIF

Now that you have identified the columns that you want to use for training your model, you need to see which of these columns exhibit multicollinearity. So let’s use the calculate_vif() function that we wrote earlier:

calculate_vif(df=df, features=['Age','Weight','BSA','Pulse'])
Image by author

Interpreting VIF Values

Valid VIF values range from 1 to infinity. A rule of thumb for interpreting them is:

  • VIF = 1 – features are not correlated
  • 1 < VIF < 5 – features are moderately correlated
  • VIF > 5 – features are highly correlated
  • VIF > 10 – features are very highly correlated, which is a cause for concern
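If you prefer to apply this rule of thumb programmatically rather than by eye, here is a small sketch using the calculate_vif() function defined earlier (the threshold of 5 simply follows the rule of thumb above):

vif_df = calculate_vif(df=df, features=['Age', 'Weight', 'BSA', 'Pulse'])
# flag the features whose VIF exceeds the chosen threshold
print(vif_df[vif_df['VIF'] > 5])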

From the result of calculating the VIF in the previous section, you can see that Weight and BSA have VIF values greater than 5. This means that Weight and BSA are highly correlated. This is not surprising, as heavier people tend to have a larger body surface area.

So the next thing to do is to try removing one of the highly correlated features and see if the VIF results improve. Let’s first try removing Weight, since it has the higher VIF:

calculate_vif(df=df, features=['Age','BSA','Pulse'])
Image by author

Now let’s instead remove BSA (and keep Weight) and see the VIF of the other features:

calculate_vif(df=df, features=['Age','Weight','Pulse'])
Image by author

As you have observed, removing Weight results in a lower VIF for all the other features than removing BSA does. So should you remove Weight, then? Ideally, yes. But for practical reasons, it makes more sense to remove BSA and keep Weight, because later on, when the trained model is used for prediction, it is much easier to obtain a patient’s weight than his/her body surface area.
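If you follow that practical choice, the features you would carry forward into model training would look something like this (a sketch of the next step, not shown in the original article):

# keep Weight and drop BSA
X = df[['Age', 'Weight', 'Pulse']]
y = df['BP']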

One More Example

Let’s look at one more example. This time you will use the Breast Cancer dataset that comes with sklearn:

from sklearn import datasets
bc = datasets.load_breast_cancer()
df = pd.DataFrame(bc.data, columns=bc.feature_names)
df
Image by author

This dataset has 30 columns, so let’s only focus on the first 8 columns:

sns.pairplot(df.iloc[:,:8])
Image by author
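In case you want to see exactly which columns the pair plot above covers, you can print them (a small convenience snippet):

print(list(df.columns[:8]))
# ['mean radius', 'mean texture', 'mean perimeter', 'mean area',
#  'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points']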

You can immediately observe that some features are highly correlated. Can you spot them?

Let’s calculate the VIF for the first 8 columns:

calculate_vif(df=df, features=df.columns[:8])
Image by author

You can see that the following features have large VIF values:

Image by author

Let’s try to remove these features one by one and observe their new VIF values. First, remove mean perimeter:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture', 
                               'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                               'mean concavity',
                               'mean concave points'])
Image by author

Immediately there is a reduction of VIFs across the board. Let’s now remove mean area:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture',
                             # 'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                               'mean concavity',
                               'mean concave points'])
Image by author

Let’s now remove mean concave points, which has the highest VIF:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture',
                             # 'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                               'mean concavity',
                             # 'mean concave points'
                              ])
Image by author

Finally, let’s remove mean concavity:

calculate_vif(df=df, features=['mean radius', 
                               'mean texture',
                             # 'mean area', 
                               'mean smoothness', 
                               'mean compactness', 
                             # 'mean concavity',
                             # 'mean concave points'
                              ])
Image by author

And now all the VIF values are under 5.
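Based on the last call above, the columns you would carry forward from this group are mean radius, mean texture, mean smoothness, and mean compactness. As a sketch of the next step:

# the features that remain after dropping mean perimeter, mean area,
# mean concavity, and mean concave points
selected_features = ['mean radius', 'mean texture', 'mean smoothness', 'mean compactness']
X = df[selected_features]   # candidate feature matrix for model training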

Summary

In this article, you learned about the differences between correlation, collinearity, and multicollinearity. In particular, you learned that multicollinearity happens when a feature exhibits a linear relationship with two or more other features. One method to detect multicollinearity is to calculate the Variance Inflation Factor (VIF). Any feature that has a VIF of more than 5 should be removed from your training dataset.

It is important to note that VIF only works on continuous variables, and not categorical variables.
