
Linear Regression and its assumptions

Assumptions of linear regression such as collinearity, multivariate normality, autocorrelation, and homoscedasticity are explained with examples

Manish Sharma
Towards Data Science
10 min read · Jan 24, 2020


Last week, I was helping a friend prepare for an interview for a data scientist position. I searched the internet for relevant questions and found that the question “What are the assumptions of linear regression?” came up very frequently. Although most blogs provided an answer to this question, the details were missing. I researched the basic assumptions and would like to share my findings with you. First, I will state the assumptions briefly, and then dive into each one with an example.

The basic assumptions for the linear regression model are the following:

  • A linear relationship exists between the independent variable (X) and dependent variable (y)
  • Little or no multicollinearity between the different features
  • Residuals should be normally distributed (multivariate normality)
  • Little or no autocorrelation among the residuals
  • Homoscedasticity of the errors

Now let’s see how to verify each assumption and what to do when an assumption does not hold. Let us focus on each of these points one by one. We will take a dataset with different features of wine. This dataset has been used in several examples by fellow data scientists and is made publicly available by the UCI machine learning repository (Wine_quality data, or the CSV file from here). Another dataset I will use is a temperature dataset (available here). I came across these datasets in an article by Nagesh Singh Chauhan. I would advise you to download the data and play with it to find the number of rows and columns, whether there are rows with NaN values, etc. Pandas is a really good tool to read a CSV and explore the data. If you are new to Pandas, try reading the file using:
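The embedded snippet is not reproduced in this copy; a minimal sketch could look like the following (the file name is an assumption, so point it at wherever you saved the download; the UCI wine-quality CSV is semicolon-separated):

```python
import pandas as pd

# File name is an assumption -- adjust the path to your local copy.
# The UCI wine-quality CSV uses ';' as its separator.
dataset = pd.read_csv('winequality-red.csv', sep=';')

print(dataset.shape)          # number of rows and columns
print(dataset.isna().sum())   # count of NaN values per column
```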

Now you can play with the data using dataset.head(), dataset.tail(), dataset.describe(), etc.

1. Linear Relationship

A linear relationship should exist between the independent variable and the dependent variable. This can easily be verified using scatter plots. I will use the temperature dataset to show the linear relationship. You may plot T_max vs T_min (or T_avg vs T_min) as shown in figure 1. If you have played with the data, you might have observed that there is a month column, so we can even label (colour code) the scatter markers according to month, just to see if there is a clear distinction of temperature by month (figure 1b).
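The plotting code is also missing from this copy; a sketch along these lines would produce the two panels (the file name and the column names 'T_max', 'T_min', and 'month' are assumptions, so rename them to match your copy of the temperature dataset):

```python
import pandas as pd
import matplotlib.pyplot as plt

# File and column names are assumptions; adjust them to your copy of the data.
temp_df = pd.read_csv('temperature.csv')

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)                                   # figure 1a: plain scatter
plt.scatter(temp_df['T_min'], temp_df['T_max'], s=10)
plt.xlabel('T_min')
plt.ylabel('T_max')

plt.subplot(1, 2, 2)                                   # figure 1b: colour-coded by month
plt.scatter(temp_df['T_min'], temp_df['T_max'],
            c=temp_df['month'], cmap='viridis', s=10)
plt.colorbar(label='month')
plt.xlabel('T_min')
plt.ylabel('T_max')

plt.show()
```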

The output should look like this:

Figure 1

Here you can observe that T_max and T_min follow a linear trend. You can imagine a straight line passing through the data. Thus we have made sure that our data satisfies the first assumption.

What if the data wasn’t linear?

  • Maybe it’s the wrong idea to fit that data with a linear model. It might be more suitable for a polynomial model (nonlinear regression).
  • Data transformation: if y seems exponential in x, how about fitting a line between log(y) and x (or between y and x squared if the relationship looks quadratic)? Now you can do linear regression on this transformed data (see the sketch after this list). But why would you want to do that? Because you (and I, too) know more about linear regression than non-linear regression. We have so many tools built for linear regression, hypothesis validation, etc., which may not be readily available for non-linear regression. Moreover, once you have fitted the linear regression on the transformed variables, it is not difficult to unwind back to the original relationship between y and x.
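As a rough sketch (using synthetic data, not the article’s temperature or wine data), the transformation idea looks like this: fit a straight line in the transformed space, then exponentiate to get back to the original scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic example: y is roughly exponential in x, so log(y) is roughly linear in x.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 100).reshape(-1, 1)
y = np.exp(0.5 * x.ravel()) * rng.normal(1.0, 0.05, size=100)

model = LinearRegression().fit(x, np.log(y))   # linear regression in the transformed space
y_pred = np.exp(model.predict(x))              # unwind back to the original y-vs-x relationship
```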

2. No Multicollinearity among different features

Why ‘no multicollinearity’?

When we perform linear regression analysis we are looking for a solution of the type y = mx + c, where c is the intercept and m is the slope. The value of m determines how much y would change when x changes by one unit. For multivariate linear regression the same relationship holds for the equation y = m1x1 + m2x2 + m3x3 + … + c. Ideally, m1 denotes how much y would change on changing x1, but what if a change in x1 also changes x2 or x3? In such a case the interpretation of m1 (or m2, m3, etc.) becomes very complicated.

How to check ‘multicollinearity’?

Seaborn provides a pairplot function which plots the attributes of a dataset against each other. These are scatter plots, and we need to check whether any of these attributes show a linear relationship with one another. This is the easiest tool to visualize and get a feel for the linear relationships between different attributes, but it is useful only if the number of features involved is limited to around 10–12. In the next section, we will discuss what to do if more features are involved. Let’s first draw our pairplot. For this, I will use the Wine_quality data, as it has features that are highly correlated (figure 2).
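The call itself is short; a sketch of what was likely run (dropping the 'quality' target column to get the 11x11 grid of features is my assumption):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise scatter plots of the 11 wine features (target column dropped).
sns.pairplot(dataset.drop(columns=['quality']))
plt.show()
```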

The output should be an 11x11 figure like this:

Figure 2

If you look at features like pH and fixed acidity, you can see a linear dependence (with negative covariance). If you observe the complete plot you will find that

  • fixed acidity is correlated with citric acid, density, and pH
  • volatile acidity is correlated with citric acid

I leave it to you to find the other collinearity relationships. Let’s take a detour to understand the reason for this collinearity. If you remember your high school chemistry, pH is defined as

pH = -log[H+] = -log(concentration of acid)

thus it is very intuitive that pH is negatively correlated with acidity features such as fixed acidity and citric acid.

This exercise also serves as an example of how domain knowledge about the data helps us work with it more efficiently. Perhaps this is the reason that data science is open to scientists from all fields.

How collinear are the different features?

In the previous section, we plotted the different features to check whether they are collinear. In this section, we will answer what the measure of their collinearity is.

We can measure correlation (note ‘correlation’, not ‘collinearity’); if the absolute correlation between two features is high, we can say these two features are collinear. To measure the correlation between different features, we use the correlation matrix/heatmap. To plot the heatmap, we can use seaborn’s heatmap function (figure 3).
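A sketch of the likely call (the figure size and colour map are my choices, not necessarily the original ones):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the wine-quality DataFrame, annotated in each grid cell.
plt.figure(figsize=(10, 8))
sns.heatmap(dataset.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```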

The output should be a colour-coded matrix with the correlations annotated in the grid:

Figure 3

Now, depending upon your knowledge of statistics, you can decide on a threshold like 0.4 or 0.5; if the absolute correlation is greater than this threshold, it is considered a problem. Rather than giving a clear answer here, I will pose a question to you: what should the ideal value of this correlation threshold be? Now let’s say your dataset contains 10,000 examples (or rows); would you change your answer if the dataset contained 100,000 or just 1,000 examples? Maybe you can give your answers in the comments.

What to do if the correlation is high?

Let’s say you have made a list of the collinear relationships between different features. Now what? The first thing to consider is whether a feature can be removed. If two features are directly related, for example the amount of acid and pH, I wouldn’t hesitate to remove one, as pH is nothing but the negative log of the amount of acid. It is like having the same information on two different scales. But the bigger questions are:

  • which feature to remove: pH or the amount of acid?
  • which feature should be removed next?

I don’t have a clear answer to this. What I have learnt is to calculate the variance inflation factor (VIF). It is defined as the inverse of tolerance, where tolerance is 1 - R² and R² is obtained by regressing that feature on all the other features; in other words, VIF = 1 / (1 - R²). While I will discuss VIF here, in general the following methods are available for treating collinearity:

a) Greedy elimination

b) Recursive feature elimination

c) Lasso regularization (L1 regularization)

d) Principal component analysis (PCA)

We will use the VIF values to find which feature should be eliminated first, and then we will recalculate the VIF to check whether any other features need to be eliminated. If we work on the correlation scale, the correlations among the different variables do not change before and after an elimination; VIF gives us the advantage of measuring the effect of the elimination. We use the variance_inflation_factor function from statsmodels’ outliers_influence module to calculate VIF. It requires us to add a constant column to the model, so in the code we use add_constant(X), where X is our dataset containing all the features (the quality column is removed as it contains the target value).

The method I follow is to eliminate the feature with the highest VIF and then recalculate the VIFs. If any VIF value is still greater than 10, remove the feature with the next highest VIF; otherwise, we are done dealing with the multicollinearity.
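A sketch of this procedure, assuming dataset is the wine DataFrame loaded earlier (the threshold of 10 is from the text; the loop structure and variable names are mine):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Features only -- the 'quality' target column is removed, as described in the text.
X = dataset.drop(columns=['quality'])
Xc = add_constant(X)   # statsmodels needs an explicit constant column

vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])],
    index=Xc.columns,
)
print(vif.drop('const').sort_values(ascending=False))

# Drop the feature with the highest VIF, recompute, and repeat until all VIFs are below 10.
```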

For checking the other assumptions, we need to perform the linear regression itself. I have used the scikit-learn linear regression module to do this. We split the data into training and test sets, fit the model on the training data, and make predictions on the test data.
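A minimal sketch of that step (the split ratio and random seed are assumptions; the names X_test, y_test, and residuals are reused in the checks below):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = dataset.drop(columns=['quality'])   # features (after any VIF-based elimination)
y = dataset['quality']                  # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
residuals = y_test - y_pred             # used for the remaining assumption checks
```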

3. Multivariate Normality

The residuals should be normally distributed. This can easily be checked by plotting a Q-Q plot. We will use statsmodels’ qqplot function for this: we first import qqplot and then feed it the residual values. We get the Q-Q plot shown in figure 4.
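A sketch of the call, reusing the residuals computed above (drawing the standardized reference line with line='s' is my choice):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import qqplot

# Q-Q plot of the residuals against a theoretical normal distribution.
qqplot(residuals, line='s')
plt.show()
```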

Figure 4

Ideally, this should have been a straight line. This pattern shows that there is something seriously wrong with our model; we cannot rely on this regression model. Let’s check whether the other assumptions hold true or not.

4. No autocorrelation

The Durbin-Watson test (d-test) could help us analyze whether there is any autocorrelation among the residuals. Since there is a lot of material on the internet about this test, I will show you another way: figure 5 shows plots of the residuals with respect to each attribute, to check whether the residuals show any correlation. In the code below, dataset2 is the pandas DataFrame of X_test.
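A sketch of that loop, assuming the residuals and X_test from the regression step above (one figure per feature column):

```python
import matplotlib.pyplot as plt

dataset2 = X_test   # as stated in the text, dataset2 is the DataFrame of X_test

# One scatter plot of the residuals against each feature column.
for column in dataset2.columns:
    plt.scatter(dataset2[column], residuals, s=10)
    plt.axhline(0, color='red', linewidth=1)
    plt.xlabel(column)
    plt.ylabel('residual')
    plt.show()
```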

The output will be a series of plots (one plot per column of the test dataset).

Figure 5

Figure 5 shows that the data is well distributed without any specific pattern, thus verifying that there is no autocorrelation of the residuals. We need to verify this in all the plots (the X axis is the feature, so there will be as many plots as there are features).

5. Homoscedasticity

Plotting scatter plots of the errors against the fit line will show whether the residuals form any pattern with respect to the line. If yes, then the data is not homoscedastic, i.e. the data is heteroscedastic. If, instead, the scatter plot doesn’t form any pattern and is randomly distributed around the fit line, then the residuals are homoscedastic.

To plot the residuals (y_test - y_pred) with respect to the fitted line, one can write out the equation of the fitted line (using the model’s coef_ and intercept_ attributes). Use this equation to get the fitted y values, but plot these values on the x-axis, as we want to plot the residuals with respect to the fit line (the x-axis should be the fit line). Now the pattern of the residuals can be observed. Below is the code for the same:
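A sketch along the lines described, reconstructing the fitted values from coef_ and intercept_ (this is equivalent to regressor.predict(X_test)):

```python
import numpy as np
import matplotlib.pyplot as plt

# Fitted values from the regression equation, plotted on the x-axis,
# with the residuals (y_test - fitted) on the y-axis.
fitted = np.dot(X_test, regressor.coef_) + regressor.intercept_
residuals = y_test - fitted

plt.scatter(fitted, residuals, s=10)
plt.axhline(0, color='red', linewidth=1)
plt.xlabel('fitted values (the fit line)')
plt.ylabel('residuals')
plt.show()
```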

If you are new to Python and wish to stay away from writing your own code, you can perform the same task using the ResidualsPlot visualizer from yellowbrick.regressor. Here you just need to feed it the training and test data, and the rest is done by the visualizer itself. I am using scikit-learn’s LinearRegression() model here; you may choose to use a different one.
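A sketch of the yellowbrick version (any scikit-learn regressor can be swapped in for LinearRegression):

```python
from sklearn.linear_model import LinearRegression
from yellowbrick.regressor import ResidualsPlot

visualizer = ResidualsPlot(LinearRegression())
visualizer.fit(X_train, y_train)    # fit the wrapped regressor on the training data
visualizer.score(X_test, y_test)    # compute and plot residuals on the test data
visualizer.show()
```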

Anyhow, whichever of the above methods you use, you will end up with the same result, which is shown below in figure 6.

Here the residuals show a clear pattern, indicating something wrong with our model.

Thus our model fails to satisfy the multivariate normality and homoscedasticity assumptions (figure 4 and figure 6, respectively). This suggests that either the data is not suitable for linear regression, or the given features can’t really predict the quality of the wine. But this was a good exercise to demonstrate the basic assumptions of linear regression. In case you have a better solution for the problem, let me know.

Constructive criticism/suggestions are welcome.

Other good reads for this topic:

  1. How to check the quality of your linear regression model on python. Link
  2. Assumptions of Linear Regression Algorithm. Link

