What is multicollinearity?
In multiple regression, multicollinearity occurs when a predictor (independent variable) is highly correlated with one or more of the other predictors in the model.
Why does it matter?
### Multiple regression equation:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βᵢXᵢ + ε
As the equation shows, multiple regression uses more than one predictor to estimate the value of the dependent variable.
Each coefficient represents the mean change in the dependent variable for a one-unit change in its predictor while the other predictors are held constant.
If a predictor is highly correlated with the others, it is hard to change it without also changing the others, so this interpretation becomes unreliable.
The effect of multicollinearity
Simply put, multicollinearity affects the model coefficients: small changes in the data can produce noticeable swings in the coefficient estimates, which makes it difficult to interpret the role of each independent variable.
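Before turning to a real dataset, a quick synthetic sketch (with made-up variable names, not the article's data) shows the issue in isolation: the same kind of response is fitted once with two nearly independent predictors and once with two highly correlated predictors, and the standard errors of the coefficients inflate dramatically in the second case.
# synthetic sketch (illustrative names, not the article's dataset):
# compare coefficient standard errors with independent vs. correlated predictors
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                    # unrelated to x1
x2_corr = x1 + rng.normal(scale=0.05, size=n)    # almost a copy of x1

for label, x2 in [('independent', x2_indep), ('correlated', x2_corr)]:
    y = 3 * x1 + 2 * x2 + rng.normal(size=n)
    X = sm.add_constant(np.column_stack([x1, x2]))
    res = sm.OLS(y, X).fit()
    # standard errors of [const, x1, x2]; much larger in the correlated case
    print(label, res.bse.round(3))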
This article will explain and demonstrate the phenomenon with data visualization. Before starting, let's discuss how to determine whether a model has multicollinearity.
How to detect multicollinearity?
### VIF (Variance Inflation Factor) equation:
VIF = 1/(1 - Rᵢ²)
We can use the VIF (Variance Inflation Factor) to estimate how much the variance of a regression coefficient is inflated due to multicollinearity.
It is calculated by regressing one predictor against the other predictors to obtain an R-squared value, which is then plugged into the formula above. The subscript 'i' in the equation refers to the i-th predictor.
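As a concrete illustration, the sketch below computes the VIF of one predictor by hand on synthetic data (the variable names x1, x2, x3 are purely illustrative): regress the predictor on the others, take the R-squared, and plug it into the formula.
# minimal sketch of the VIF formula by hand (synthetic data, illustrative names)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x2 = rng.normal(size=100)
x3 = rng.normal(size=100)
x1 = 0.8 * x2 + 0.3 * x3 + rng.normal(scale=0.5, size=100)  # x1 depends on x2 and x3

# regress x1 against the other predictors to obtain R-squared, then compute the VIF
others = sm.add_constant(np.column_stack([x2, x3]))
r_squared = sm.OLS(x1, others).fit().rsquared
vif_x1 = 1 / (1 - r_squared)
print(round(r_squared, 3), round(vif_x1, 3))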
The following criteria can be used to interpret the result:
- 1 : not correlated
- 1–5 : moderately correlated
- > 5 : highly correlated
Now that the explanation is done, let's continue to the model-building part.
Multiple regression models
Start by preparing the multiple regression models for plotting. First, we will create two models from a dataset: one with moderately correlated predictors and another with highly correlated predictors.
Then we will slightly modify the data to see which model is affected more by the small modification. Start with importing the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Getting data
As an example, this article works with the Cars Data dataset, which has a public domain license. It can be downloaded directly through the Seaborn library with the seaborn.load_dataset() function.
The dataset contains the attributes of 392 cars manufactured between 1970 and 1982 in the USA, Europe, and Japan. More information about the dataset can be found here: link.
data = sns.load_dataset('mpg')
data.dropna(inplace=True) #remove null rows
data.head()
Doing exploratory data analysis to understand the dataset is always a good idea.
data.describe()
The ranges (max value minus min value) of the columns are quite different, so standardizing the data will make the coefficients easier to interpret later.
data = data.iloc[:, 0:-2]  # keep only the numeric columns (drop 'origin' and 'name')
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_s = scaler.fit_transform(data)
df = pd.DataFrame(df_s, columns=data.columns)
df.head()
Continue with plotting a heat map to show the correlation among the variables.
plt.figure(figsize=(9, 6))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True)
plt.show()
The result shows that some variables are highly correlated with others. If every predictor were put into a multiple regression model to predict 'mpg', multicollinearity would affect the model.
Calculating the VIF values
This can be quickly proved by calculating the VIF values with the Statsmodels library.
from statsmodels.stats.outliers_influence import variance_inflation_factor as vif
vif_data = pd.DataFrame()
vif_data["feature"] = df.columns
# calculating VIF for each feature
vif_data["VIF"] = [vif(df.values, i) for i in range(len(df.columns))]
print(vif_data)
Some VIF values are pretty high (> 5), which can be interpreted as highly correlated. If we directly put every predictor in a multiple regression model, the model can suffer from the multicollinearity problem.
Next, only some predictors will be selected. This article will work with two models: one with moderately correlated and another with highly correlated predictors.
The first multiple regression model uses 'cylinders' and 'acceleration' to predict 'mpg', while the second uses 'cylinders' and 'displacement'. Let's calculate the VIF values again.
df1 = df[['mpg', 'cylinders', 'acceleration']]
df2 = df[['mpg', 'cylinders', 'displacement']]
# VIF dataframe1
vif_data1 = pd.DataFrame()
vif_data1['feature'] = df1.columns
vif_data1['VIF'] = [vif(df1.values, i) for i in range(len(df1.columns))]
print('model 1')
print(vif_data1)
# VIF dataframe2
vif_data2 = pd.DataFrame()
vif_data2['feature'] = df2.columns
vif_data2['VIF'] = [vif(df2.values, i) for i in range(len(df2.columns))]
print('model 2')
print(vif_data2)
The only difference between the two models is swapping the variable 'acceleration' for 'displacement'. Notice that none of the VIF values in the first model is higher than 5, while the second model has VIF values above 10.
Creating a multiple regression model
We will use the OLS function from the statsmodels library to create a multiple regression model.
import statsmodels.api as sm
y1 = df1[['mpg']]
X1 = df1[['cylinders', 'acceleration']]
lm1 = sm.OLS(y1, X1)
model1 = lm1.fit()
model1.summary()
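As a side note, sm.OLS does not add an intercept unless a constant column is included explicitly; because the data has been standardized, the missing intercept is essentially zero here and does not change the picture. If you only need the fitted coefficients rather than the whole summary table, they are available directly on the results object (a small sketch):
# coefficients as a pandas Series, indexed by predictor name
print(model1.params)

# optional: fit with an explicit intercept (≈ 0 for standardized data)
model1_c = sm.OLS(y1, sm.add_constant(X1)).fit()
print(model1_c.params)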
Plotting the model
From the obtained model, we will define a helper function that builds a mesh grid over the two predictors; the model will be evaluated on this grid in the next step.
def run_model(v1, v2, pd_):
    # build a mesh grid that covers the ranges of the two predictors
    mesh_size = 0.02
    x_min, x_max = pd_[v1].min(), pd_[v1].max()
    y_min, y_max = pd_[v2].min(), pd_[v2].max()
    xrange = np.arange(x_min, x_max, mesh_size)
    yrange = np.arange(y_min, y_max, mesh_size)
    xx, yy = np.meshgrid(xrange, yrange)
    return xx, yy, xrange, yrange
Here comes the fun part. Let's plot the model with the Plotly library, which makes it easy to create interactive plots that we can zoom and rotate.
import plotly.express as px
import plotly.graph_objects as go
#run and apply the model
xx1, yy1, xr1, yr1 = run_model('cylinders', 'acceleration', X1)
pred1 = model1.predict(np.c_[xx1.ravel(), yy1.ravel()])
pred1 = pred1.reshape(xx1.shape)
# plot the result
fig = px.scatter_3d(df1, x='cylinders', y='acceleration', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr1, y=yr1, z=pred1, name='pred1'))
fig.show()
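If you want to keep the interactive figure outside the notebook, the Plotly figure can also be exported to a standalone HTML file (the filename below is just an example):
# export the interactive 3D plot to a standalone HTML file (example filename)
fig.write_html('model1_plane.html')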
We can perform the same process with the second model, which has the multicollinearity problem.
y2 = df2[['mpg']]
X2 = df2[['cylinders', 'displacement']]
lm2 = sm.OLS(y2, X2)
model2 = lm2.fit()
model2.summary()
#run and apply the model
xx2, yy2, xr2, yr2 = run_model('cylinders', 'displacement', X2)
pred2 = model2.predict(np.c_[xx2.ravel(), yy2.ravel()])
pred2 = pred2.reshape(xx2.shape)
# plot the result
fig = px.scatter_3d(df2, x='cylinders', y='displacement', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr2, y=yr2, z=pred2, name='pred2'))
fig.show()
Modifying the dataset
As previously mentioned, small modifications to the data can affect the coefficient estimates. To show this, we will randomly select a row and change its values; here they are multiplied by 1.25.
After that, we repeat the same process and plot the new and the original multiple regression models together to see the result.
# randomly modify a row in the dataframe
from random import randint

x = randint(0, len(df) - 1)                      # pick a random row index
mod_list = [i * 1.25 for i in df.iloc[x, :]]     # multiply its values by 1.25
df_m = df.copy()
df_m.iloc[x] = mod_list
df_m
Calculating the VIF values
From the modified dataset, calculate the VIF values to compare the differences.
df_m1 = df_m[['mpg', 'cylinders', 'acceleration']]
df_m2 = df_m[['mpg', 'cylinders', 'displacement']]
# VIF dataframe1
vif_data1 = pd.DataFrame()
vif_data1['feature'] = df_m1.columns
vif_data1['VIF'] = [vif(df_m1.values, i) for i in range(len(df_m1.columns))]
print('model m1')
print(vif_data1)
# VIF dataframe2
vif_data2 = pd.DataFrame()
vif_data2['feature'] = df_m2.columns
vif_data2['VIF'] = [vif(df_m2.values, i) for i in range(len(df_m2.columns))]
print('model m2')
print(vif_data2)
The VIF values calculated from the modified dataset change only slightly. Both models keep their status: the first model's predictors are moderately correlated and the second model's are highly correlated.
Creating multiple regression models
To compare the change in the model coefficients, let's build models from the modified dataset so we can plot them.
y_m1 = df_m1[['mpg']]
X_m1 = df_m1[['cylinders', 'acceleration']]
lm_m1 = sm.OLS(y_m1, X_m1)
model_m1 = lm_m1.fit()
model_m1.summary()
Next, do the same process with the second model that has highly correlated predictors.
y_m2 = df_m2[['mpg']]
X_m2 = df_m2[['cylinders', 'displacement']]
lm_m2 = sm.OLS(y_m2, X_m2)
model_m2 = lm_m2.fit()
model_m2.summary()
The tables above show only small changes in the coefficients. However, that is because we randomly modified just one row of the dataset, which is not enough to conclude how much multicollinearity affects the model coefficients.
To go further with the concept and the code we have so far, a for loop will be used to modify the values of each row, one row at a time, and the absolute change in the coefficients will then be compared with the original models.
Start with defining a function to compare the coefficient values.
def compare_cef(base_m, mod_m, col):
    # read the coefficient table of the original model from its summary
    val_base = base_m.summary().tables[1].as_html()
    ml = pd.read_html(val_base, header=0, index_col=0)[0]
    # read the coefficient table of the modified model
    val_diff = mod_m.summary().tables[1].as_html()
    ml_m = pd.read_html(val_diff, header=0, index_col=0)[0]
    # absolute difference between the two coefficient columns
    df_ = pd.DataFrame(abs(ml.iloc[:, 0] - ml_m.iloc[:, 0]))
    df_.rename(columns={'coef': 'r ' + str(col + 1)}, inplace=True)
    return df_
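Parsing the HTML summary table works, but essentially the same absolute differences (up to the rounding shown in the summary table) can be taken straight from the params attribute. The function below is an equivalent shortcut, not part of the original code:
# equivalent shortcut: absolute coefficient change taken directly from .params
def compare_cef_simple(base_m, mod_m, col):
    diff = (base_m.params - mod_m.params).abs()
    return diff.to_frame(name='r ' + str(col + 1))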
Use a for loop to modify the rows, one row at a time.
keep_df1, keep_df2 = [], []
for n in range(len(df)):
    # modify one row at a time (multiply its values by 1.25)
    mod_list = [i * 1.25 for i in df.iloc[n, :]]
    df_m = df.copy()
    df_m.iloc[n] = mod_list
    # refit the model with moderately correlated predictors
    df_m1 = df_m[['mpg', 'cylinders', 'acceleration']]
    y_m1, X_m1 = df_m1[['mpg']], df_m1[['cylinders', 'acceleration']]
    lm_m1 = sm.OLS(y_m1, X_m1)
    mdl_m1 = lm_m1.fit()
    # refit the model with highly correlated predictors
    df_m2 = df_m[['mpg', 'cylinders', 'displacement']]
    y_m2, X_m2 = df_m2[['mpg']], df_m2[['cylinders', 'displacement']]
    lm_m2 = sm.OLS(y_m2, X_m2)
    mdl_m2 = lm_m2.fit()
    # absolute coefficient change versus the original models
    df_diff1 = compare_cef(model1, mdl_m1, n)
    df_diff2 = compare_cef(model2, mdl_m2, n)
    keep_df1.append(df_diff1)
    keep_df2.append(df_diff2)

df_t1 = pd.concat(keep_df1, axis=1)
df_t2 = pd.concat(keep_df2, axis=1)
df_t = pd.concat([df_t1, df_t2], axis=0)
df_t
Visualize the obtained DataFrame with a heat map.
plt.figure(figsize=(16,2.5))
sns.heatmap(df_t, cmap='Reds')
plt.xticks([])
plt.show()
In the heat map, the first two rows show the absolute change in the coefficients of the model with moderately correlated predictors before and after a small change in the data.
The last two rows show the same comparison for the model with highly correlated predictors.
It can be interpreted that the model with highly correlated predictors tends to have more unstable coefficients when the data changes slightly, while the one with moderately correlated predictors suffers less.
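To attach numbers to what the heat map suggests, we can average the absolute coefficient changes over all of the single-row modifications, using the df_t1 and df_t2 DataFrames built in the loop above:
# mean absolute coefficient change across all single-row modifications
print('model 1 (moderately correlated):')
print(df_t1.mean(axis=1))
print('model 2 (highly correlated):')
print(df_t2.mean(axis=1))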
Plotting the multiple regression models
To visualize the change in coefficients, I will modify one row of the dataset as an example, create a new model, and plot it together with the original one. The new models are shown with the 'viridis' color palette (yellow-green), while the original models are plotted in the default colors (orange-blue).
If you want to select and modify other rows manually, please change the code below.
x = 6 #select row number
mod_list = [i*1.25 for i in df.iloc[x,:]]
df_m = df.copy()
df_m.iloc[x] = mod_list
y_m1 = df_m[['mpg']]
X_m1 = df_m[['cylinders', 'acceleration']]
lm_m1 = sm.OLS(y_m1, X_m1)
model_m1 = lm_m1.fit()
y_m2 = df_m[['mpg']]
X_m2 = df_m[['cylinders', 'displacement']]
lm_m2 = sm.OLS(y_m2, X_m2)
model_m2 = lm_m2.fit()
Run and plot the multiple regression models with moderately correlated predictors.
# run the model
pred_m1 = model_m1.predict(np.c_[xx1.ravel(), yy1.ravel()])
pred_m1 = pred_m1.reshape(xx1.shape)
# plot the result
fig = px.scatter_3d(df_m, x='cylinders', y='acceleration', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr1, y=yr1, z=pred1, name='pred'))
fig.add_traces(go.Surface(x=xr1, y=yr1, z=pred_m1, name='pred_m1',
                          colorscale='viridis_r'))
fig.show()
It can be seen that the two planes overlap almost completely, since their colors blend together in the result. Lastly, do the same process with the models that have highly correlated predictors.
# run the model
pred_m2 = model_m2.predict(np.c_[xx2.ravel(), yy2.ravel()])
pred_m2 = pred_m2.reshape(xx2.shape)
# plot the result
fig = px.scatter_3d(df_m, x='cylinders', y='displacement', z='mpg')
fig.update_traces(marker=dict(size=5))
fig.add_traces(go.Surface(x=xr2, y=yr2, z=pred2, name='pred'))
fig.add_traces(go.Surface(x=xr2, y=yr2, z=pred_m2, name='pred_m2',
                          colorscale='viridis_r'))
fig.show()
The result shows that the two models no longer overlap perfectly; the planes intersect each other, leaving a visible gap between them.
Please note that these plots come from modifying a single selected row of the dataset. The first model is not free of multicollinearity; it still has moderately correlated predictors, and the heat map shows that it is also affected by the change in some cases. However, the consequences are smaller than for the second model.
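Printing the coefficients side by side makes the gap concrete; here model1 and model2 are the original fits, and model_m1 and model_m2 are the fits obtained after modifying the selected row:
# absolute coefficient change after modifying a single row
print('model 1 (moderate):', (model1.params - model_m1.params).abs().values)
print('model 2 (high):    ', (model2.params - model_m2.params).abs().values)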
Summary
This article uses data visualization to show the effect of multicollinearity on multiple regression models by comparing two models: one with moderately correlated predictors and another with highly correlated predictors. The original data is also modified to see which model suffers more from small changes in the data.
The result shows that the more highly correlated a model's predictors are, the more unstable its coefficients become when the data changes. As a consequence, explaining the role of each predictor in a model with a multicollinearity problem can be difficult.
These are my data visualization articles that you may find interesting:
- 8 Visualizations with Python to Handle Multiple Time-Series Data (link)
- 9 Visualizations with Python that Catch More Attention than a Bar Chart (link)
- 7 Visualizations with Python to Express changes in Rank over time (link)
- Battle Royale – Comparison of 7 Python Libraries for Interactive Financial Charts (link)
References
- Wikimedia Foundation. (2023, February 22). Multicollinearity. Wikipedia. https://en.wikipedia.org/wiki/Multicollinearity
- Choueiry, G. (2020, June 1). Quantifying health. QUANTIFYING HEALTH. https://quantifyinghealth.com/vif-threshold/
- Frost, J. (2023, January 29). Multicollinearity in regression analysis: Problems, detection, and solutions. Statistics By Jim. https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
- Rob Taylor, P. (2022, December 1). Multicollinearity: Problem, or not? Medium. https://towardsdatascience.com/multicollinearity-problem-or-not-d4bd7a9cfb91
- Stephanie. (2020, December 16). Variance inflation factor. Statistics How To. https://www.statisticshowto.com/variance-inflation-factor/
- Cars data – dataset by dataman-udit. data.world. (2020, May 24). https://data.world/dataman-udit/cars-data