
Assumptions of Multiple Linear Regression

Implementation of assumption checks in Python

Image from Unsplash

Introduction

Linear regression lies at the core of many predictive analytics tasks. In its simplest form, it fits a linear relationship between two continuous variables of interest. Not all datasets, however, can be fitted in a linear fashion. A few assumptions must be fulfilled before jumping into the regression analysis, and some of them are critical for the model’s evaluation.

  1. Linearity
  2. Multicollinearity
  3. Homoscedasticity
  4. Multivariate normality
  5. Autocorrelation

Getting hands dirty with data

For the purpose of demonstration, I will use an open-source dataset for linear regression. The "Fish" dataset is released under the GPL 2.0 license. We will go through the various requirements for establishing a linear regression analysis on this dataset.
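
A minimal sketch of loading the data, assuming the CSV is saved locally as Fish.csv (the file name and the column names below follow the common Kaggle version of this dataset and are assumptions):

```python
import pandas as pd

# Assumed local file name for the open-source "Fish" dataset
df = pd.read_csv("Fish.csv")

print(df.head())      # Species, Weight and the length/height/width measurements
print(df.describe())  # summary statistics of the numeric columns
```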

Linearity

The first assumption is very obvious and straightforward: the variables that we are trying to fit should maintain a linear relationship. If there is no linear relationship, the data can be transformed to make it linear; typical transformations include taking the log or the square root of the response variable. Checking a scatterplot is the best and easiest way to assess linearity. Let’s do a linearity check between the weight and height variables. The scatterplot shows little to no linearity between these two variables, so we need to transform them. Performing a log operation on both gives a more linear scatterplot.
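
A rough sketch of that check, reusing the df loaded above and assuming the columns are named Weight and Height:

```python
import numpy as np
import matplotlib.pyplot as plt

# Drop non-positive weights so the log transform is well defined
data = df[df["Weight"] > 0]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Raw scatterplot: Weight vs Height
axes[0].scatter(data["Height"], data["Weight"], alpha=0.6)
axes[0].set(xlabel="Height", ylabel="Weight", title="Original scale")

# Log-transform both variables and plot again
axes[1].scatter(np.log(data["Height"]), np.log(data["Weight"]), alpha=0.6)
axes[1].set(xlabel="log(Height)", ylabel="log(Weight)", title="Log scale")

plt.tight_layout()
plt.show()
```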

Non linearity observed between the variables (Image by Author)
Linearity increased after log transformation (Image by Author)

Clearly the data still show some sort of bimodality even after the log transformation. Some portion of the data lies in the upper half of the weight distribution, while the remaining points lie separately from the former distribution. In this case, we may need to collect more data between 2.5 and 3.56 cm to eventually obtain a more normal distribution.

Takeaway 1

→ Check the scatterplot of the response variable against each of the independent variables.

→ If no linearity is observed, transform the data.

Multi-collinearity

Multicollinearity is observed when two or more independent variables are correlated with one another. When it is present, the model’s coefficient estimates become unstable and hard to interpret. One can check the Variance Inflation Factor (VIF) to determine which variables are highly correlated and potentially drop them from the model. For each independent variable, R² measures how well it can be predicted from the remaining independent variables, and the VIF is computed from this R² value as VIF = 1 / (1 − R²). If a variable is highly correlated with the others, its VIF shoots up. Typically a VIF value > 5 indicates the presence of multicollinearity. The minimum value of VIF is 1, which is evident from the equation, and it indicates that there is no multicollinearity at all.

VIF formula
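
A short sketch of computing VIF with statsmodels, assuming the predictors are the five length/height/width columns of the Fish dataset:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Independent variables (assumed column names); the constant is added
# so each VIF is computed against a model with an intercept
X = add_constant(df[["Length1", "Length2", "Length3", "Height", "Width"]])

vif = pd.DataFrame({
    "variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif[vif["variable"] != "const"])
```
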
VIF values

Observing the VIF values, it is obvious that all the variables are highly correlated. If our response variable is "Weight", we can keep any one of the remaining five variables as our independent variable. Multicollinearity can occur in a dataset for various reasons. It may be introduced by the data collection process, or arise because some variables are duplicates or transformed versions of others.

Pearson’s correlation coefficient is another useful metric in this scenario, and the pairwise correlations can conveniently be visualized in matrix form.

Pearson’s correlation coefficient matrix (Image by Author)
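
One way to get that matrix view, sketched with pandas and seaborn (the use of seaborn here is an assumption; any heatmap routine would do):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations among the numeric columns
corr = df.select_dtypes("number").corr(method="pearson")

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation matrix")
plt.show()
```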

Takeaway 2

→ Check pairwise scatterplots or the R² values (from regressing one predictor on the others) to determine if multicollinearity is present.

→ Also check VIF values to screen out highly correlated variables.

Homoscedasticity

Homoscedasticity is another assumption of multiple linear regression modeling. It requires equal variance among the data points on both sides of the linear fit; if that is not the case, the data are heteroscedastic. Often the quality of the data gives rise to this heteroscedastic behavior. If, instead of a constant spread, the response variable exhibits a cone-shaped distribution around the fit, the variance cannot be equal at every point of the model.

Cone-shaped distribution of the heteroscedastic data (Image by Author)

I have another article on this aspect of the data that readers may find useful.

Heteroscedasticity in Regression Model

In our "Fish" dataset, the variable "Weight" shows similar behavior in the scatterplot.

Unequal variance around the fitting line (Image by Author)

Plotting the residuals against the predicted values also provides an indication of heteroscedasticity. The plot below shows some sort of pattern in the residuals. In this scenario, transforming the response variable may be a good step to minimize heteroscedasticity.
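
A sketch of that residual plot, fitting a simple OLS model with statsmodels (the choice of predictors here is just for illustration):

```python
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit Weight on a couple of assumed predictors and inspect the residuals
X = sm.add_constant(df[["Height", "Width"]])
model = sm.OLS(df["Weight"], X).fit()

plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```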

Residual scatterplot (Image by Author)

Takeaway 3

→ Check the scatterplot of the response variable (or the residuals) for non-constant variance or a cone-shaped pattern.

→ Transform the variable to minimize heteroscedasticity.

Multivariate normality

This assumption states that the residuals from the model are normally distributed. After determining the model parameters, it is good practice to check the distribution of the residuals. Apart from visualizing the distribution, one should also check the Q-Q plot for a better understanding of it. Readers are encouraged to go through the basics and implementation of the Q-Q plot outlined in the article below.

Understand Q-Q plot using simple python

In our dataset, we could visualize the distribution as well as the Q-Q plot, but let’s generate some synthetic data for a clearer illustration.
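
A sketch of that exercise with synthetic data (the coefficients and noise level below are arbitrary choices):

```python
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.5, 200)   # linear signal plus Gaussian noise

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(residuals, bins=25)
axes[0].set_title("Residual distribution")

# Q-Q plot of the residuals against a normal distribution
stats.probplot(residuals, dist="norm", plot=axes[1])
plt.tight_layout()
plt.show()
```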

Generated data with fitted line (Image by Author)
Residual distribution (Image by Author)
Q-Q plot (Image by Author)

The Q-Q plot deviates from the 45° line, which seems to indicate an under-dispersed dataset with thinner tails than a normal distribution. Again, a non-linear transformation helps to establish multivariate normality in this case.

Takeaway 4

→ Check the distribution of the residuals and the Q-Q plot to determine normality.

→ Perform a non-linear transformation if there is a lack of normality.

Apart from these scenarios, there is the "no autocorrelation" assumption, which basically tells us that there should be no specific pattern in the residual scatterplot: a residual at a specific location should not depend on its surrounding residuals. The constant variance assumption is somewhat related to this.
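
The article does not show a test for this, but one common check is the Durbin-Watson statistic from statsmodels, sketched here on the residuals of an already-fitted OLS model (reusing the model object from one of the sketches above):

```python
from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest no autocorrelation; values close to 0 or 4 are a warning sign
dw = durbin_watson(model.resid)
print(f"Durbin-Watson statistic: {dw:.2f}")
```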

Conclusion

We have demonstrated how to check the assumptions of multiple linear regression. Linearity and multicollinearity deserve more attention than the other assumptions. In many machine learning or statistical problems, linear regression is the simplest solution. However, the user should be equally careful about the assumptions outlined here and take the necessary steps to minimize the effects arising from violations such as non-linearity.

GitHub Page

