Data Science
How to check the main assumptions supporting linear regression models?
Most firms that think they want advanced AI/ML really just need linear regression on cleaned-up data [Robin Hanson]
Beyond the sarcasm of this quote, there is a reality: of all the statistical techniques, regression analysis is often cited as one of the most significant in business analysis. Most companies use it to explain a particular phenomenon, build forecasts, or make predictions. These insights can be extremely valuable in understanding what makes a difference in the business.
When you work as a Data Scientist, building a linear regression model can sound pretty dull, especially when it’s all about AI around you. However, I want to stress that mastering the main assumptions of linear regression models is more involved than it looks and requires a solid foundation in Statistics. Another thing is that linear regression models belong to a broader family of algorithms, the Generalised Linear Models (GLM). GLMs are an important topic for Data Scientists because they fit a broad variety of real-world phenomena.
Maybe that’s why, in November 2017, James Le ranked linear regression models in the top 10 Statistical Techniques Data Scientists Need to Master.
Now that it’s clear why we should care about the Linear Regression Model, let’s get our hands on it.
You ran a linear regression analysis and the results were significant (or not). You might think that you’re done. No, not yet my friend.
After running the regression analysis, you should check whether the model fits the data well. Maybe you start with the regression results, such as slope coefficients, p-values, or R-squared. These parameters give you a first glimpse of how well the model represents the given data. But that’s not the whole picture.
In this post, I’ll walk you through how to check the main assumptions supporting linear regression models.
But wait! First things first…
The 4 peace-keepers
There are four principal assumptions which support using a linear regression model for the purpose of inference or prediction:

1. Linearity: Sounds obvious! There must be a linear relationship between our features and the response. This is required for our estimators and predictions to be unbiased.
The next three concern the residuals:
2. Normality: Residuals must be normally distributed (i.e. with a mean of zero). This is necessary for a range of statistical tests, such as the t-test. We can relax this assumption in large samples thanks to the central limit theorem.
3. Homoscedasticity: The residuals must have constant variance at every level of the independent variables.
4. Independence: Residuals must be free of autocorrelation.
The 5th trouble maker

For multiple linear regression, you also need to check:
5. The absence of multicollinearity: Multicollinearity occurs when two (or more) predictors provide the same information about the response variable. This is a problem for the model as it generates:
- Redundancy → leading to unreliable coefficient estimates for the predictors (especially for linear models)
- High variance of the estimators → the model tends to overfit, meaning that it models the random noise in the training data rather than the intended output.
- An important predictor can appear statistically insignificant.
Case study: Advertising Budgets and Sales Forecasting
To illustrate this article, we are going to try to predict the sales of a company using 3 predictors: the budget invested in Youtube ads, the budget invested in Facebook ads, and the budget invested in Newspaper ads. All variables are expressed in thousands of US dollars. The dataset can be found on Kaggle.
Dataset
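Before inspecting the first rows, here is a minimal setup sketch with the imports used throughout the article; the CSV file name is an assumption, so adjust it to match your Kaggle download.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the advertising dataset (file name assumed; rename to match your download)
df = pd.read_csv('advertising.csv')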
df.head(7)

Python Library: statsmodels
For this article, I chose to use statsmodels over scikit-learn. Even though statsmodels doesn’t offer the variety of options that scikit-learn does, it provides statistics and econometric tools that are top of the line and validated against other statistical software such as Stata and R.
🚩 Here you can find an interesting article about these 2 libraries:
Output
Let’s get started!
import statsmodels.formula.api as smf
# Fit a multiple linear regression of sales on the three advertising budgets
reg_multi = smf.ols('sales~youtube+facebook+newspaper', data=df).fit()
print(reg_multi.summary())

✅ If we check the "basic" parameters, here is what we can see:
- R-squared is quite high
- Prob (F-statistic) is very low
- p-values < alpha risk (5%), except for the newspaper predictor
R-squared: In case you forgot or didn’t know, R-squared and Adjusted R-squared are statistics that often accompany regression output. Both are supposed to assess the model performance, with values ranging from 0 to 1.
Generally, you choose the models that have higher adjusted and predicted R-squared values.
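Both values appear in the summary above; as a quick sketch, they can also be read directly from the fitted results object:
# Read the (adjusted) R-squared straight from the fitted model
print('R-squared:', round(reg_multi.rsquared, 3))
print('Adjusted R-squared:', round(reg_multi.rsquared_adj, 3))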
Now, I have this question for you: who is checking the goodness of fit of the model with R-squared?
I do. Or at least, I did.
Until I read a few articles on the topic, showing that R-squared does NOT measure the goodness of fit because it can be:
- Arbitrarily low even when the model is correct,
- Dangerously close to 1 even when the model is wrong.
I’m not going to develop further in this article how that is possible, but you can find some very good info here:
Five Reasons Why Your R-squared can be Too High – Statistics By Jim
University of Virginia Library Research Data Services + Sciences
So, based on the 3 parameters above, you could think that your regression model is quite good at predicting sales from the advertising budgets. You might then want to remove the newspaper predictor, as it does not seem significant for the model.
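For illustration only, this is what that temptation would look like as a one-line refit (a sketch, not a recommendation at this stage):
# Refit without the newspaper predictor (illustration only)
reg_no_news = smf.ols('sales~youtube+facebook', data=df).fit()
print(reg_no_news.summary())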
Of course, it would be terribly wrong to stop our analysis here.
First and foremost, you shouldn’t blindly follow what R-squared says. And second, we don’t yet know whether the 5 key assumptions are verified, so we basically don’t know whether this model is operating in a regime where it performs well.
It is time for me to walk you through each assumption and how to verify it.
1. Linearity
If you try to fit a linear model to data which are nonlinear or nonadditive, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data. To check the linearity, we can for example:
- Apply the Harvey-Collier test.
import statsmodels.stats.api as sms
# Harvey-Collier test: the null hypothesis is that the model is correctly specified as linear
sms.linear_harvey_collier(reg_multi)
>> Ttest_1sampResult(statistic=-1.565945529686271, pvalue=0.1192542929871369)
✅ A small p-value would indicate a violation of linearity. Here the p-value is higher than the alpha risk (5%), which means that the linearity assumption is verified.
- Observed vs Predicted values
# Plot Predicted values VS Real values
df['Predicted_sales'] = reg_multi.predict()
X_plot = [df['sales'].min(), df['sales'].max()]
ax = sns.scatterplot(x="sales", y="Predicted_sales", data=df)
ax.set(xlabel='sales', ylabel='Predicted_sales')
plt.plot(X_plot, X_plot, color='r')
plt.show()

✅ If the linearity condition is verified, the points should be symmetrically distributed around a diagonal line with a roughly constant variance.
- Studentized residuals vs Fitted values
Studentized residuals are more effective than standardized residuals at detecting outliers, checking linearity, and assessing the equal-variance assumption.
# Get the fitted values and the absolute studentized residuals
fitted = reg_multi.fittedvalues
student_residuals = pd.Series(np.abs(reg_multi.get_influence().resid_studentized_internal))
# Plot the studentized residuals against the fitted values
fig, ax = plt.subplots()
ax.scatter(fitted, student_residuals, edgecolors='k')
ax.set_ylabel('Studentized Residuals')
ax.set_xlabel('Fitted Values')
ax.set_title('Scale-Location')
plt.show()

✅ Ideally, all residuals should be small and unstructured (i.e. not forming any clusters). That would mean that the regression analysis has successfully explained the essential part of the variation of the dependent variable. However, if the residuals exhibit a structure or any pattern that does not seem random, it sheds a "bad light" on the regression.
2. Normality
You can start your analysis of the residuals by checking whether or not they are normally distributed. Strictly speaking, non-normality of the residuals is an indication of an inadequate model: it means that the errors the model makes are not consistent across variables and observations (i.e. the errors are not random).
There are many ways to execute that step, I’m personally a big fan of data visualization so I always start with these 2 plots to check the Normality assumption:
- Q-Q plot of the residuals
# Get the residuals and compare their distribution to a normal one in a Q-Q plot
from scipy import stats
residuals = reg_multi.resid
fig, ax = plt.subplots(figsize=(6, 2.5))
_, (__, ___, r) = stats.probplot(residuals, plot=ax, fit=True)

- Histogram of residuals distribution
# Get Residuals
residuals = reg_multi.resid
# Plot Histogram of the residuals
plt.hist(residuals, density=True)
plt.xlabel('Residuals')
plt.title('Residuals Histogram')
plt.show()

✅ If the residuals are normally distributed, we should see a symmetric, bell-shaped histogram centered on 0.
Other parameters can be used to go deeper into the understanding:
- Omnibus is a test of the skewness and kurtosis of the residuals. A high value indicates a skewed distribution, whereas a low value (close to zero) indicates a normal distribution.
- Prob(Omnibus) is the p-value of that test, i.e. the probability of observing such residuals if they really were normally distributed. We hope to see something close to 1.
Omnibus: 59
Prob(Omnibus): 0.000
🚩 Here, Prob(Omnibus) = 0.000, indicating that the residuals are not normally distributed, and Omnibus is high, confirming the skewness already detected visually on the histogram. These observations might be due to the influence of outliers.
- The Jarque-Bera statistic also indicates whether or not the residuals are normally distributed. The null hypothesis of this test is that the residuals are normally distributed. When the p-value of this test is low (< 5%), the residuals are not normally distributed, which points to potential model misspecification (i.e. a key variable may be missing from the model).
Prob(JB): 1.47e-38
🚩 The JB statistic indicates that the residuals are not normally distributed. Again, it might be due to the influence of outliers.
- The skewness coefficient reflects the symmetry of the data, which can indicate normality. We want to see something close to zero, indicating that the residual distribution is normal. A negative skew refers to a longer or fatter tail on the left side of the distribution, while a positive skew refers to a longer or fatter tail on the right.
Skew: -1.411
🚩 We can see clearly the skewness already detected in the histogram on the left side. A very skewed distribution is likely to have outliers in the direction of the skew.
There are also a variety of statistical tests for normality, including the Kolmogorov-Smirnov test, the Shapiro-Wilk test and the Anderson-Darling test. All of these tests are relatively "picky". Real data rarely has errors that are perfectly normally distributed, and it may not be possible to fit your data with a model whose errors do not violate the normality assumption at the 5% level of significance.
It is usually better to focus more on violations of the other assumptions and/or the influence of outliers which may be mainly responsible for violations of normality anyway.
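For reference, here is a minimal sketch of how two of these tests could be run on the residuals with scipy (the choice of tests is illustrative; interpret them with the caveats above in mind):
from scipy import stats
# Shapiro-Wilk: the null hypothesis is that the residuals come from a normal distribution
print(stats.shapiro(residuals))
# Anderson-Darling test against the normal distribution
print(stats.anderson(residuals, dist='norm'))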
3. Homoscedasticity
The assumption of homoscedasticity is that the residuals have equal variance for all predicted values of the dependent variable. In other words, the variance around the regression line is the same for all values of the predictor variables (X).
A violation of this assumption results in heteroscedasticity, where the variance of the residuals increases or decreases as a function of the independent variables.
Typically, homoscedasticity violations occur when one or more of the variables under investigation are not normally distributed. To check homoscedasticity, you can start with:
- Studentized residuals vs Fitted values
We have already used this plot before; in this case, we want to see the residuals distributed symmetrically around a horizontal line.

- Breusch-Pagan test
This test checks whether the variance of the errors depends on the values of the explanatory variables.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['Lagrange multiplier statistic', 'p-value',
'f-value', 'f p-value']
test = sms.het_breuschpagan(reg_multi.resid, reg_multi.model.exog)
lzip(name, test)
>> [('Lagrange multiplier statistic', 4.231064027368323),
('p-value', 0.12056912806125976),
('f-value', 2.131148563286781),
('f p-value', 0.12189895632865029)]
✅ If the test statistic has a p-value below the alpha risk (e.g. 0.05), the null hypothesis of homoscedasticity is rejected and heteroscedasticity is assumed. In our case, both p-values are well above 5%, so we validate the assumption of homoscedasticity.
There are many other ways to test for heteroscedasticity; among the long list, I’d like to mention the White test, which is especially indicated for large datasets.
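As a sketch, the White test is available in statsmodels and takes the same inputs as the Breusch-Pagan test above:
# White test for heteroscedasticity (null hypothesis: homoscedasticity)
name = ['LM statistic', 'LM p-value', 'F statistic', 'F p-value']
white_test = sms.het_white(reg_multi.resid, reg_multi.model.exog)
lzip(name, white_test)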
4. Independence
Independence of the residuals is commonly referred to as the total absence of autocorrelation. Even though uncorrelated data does not necessarily imply independence, one can check whether random variables are independent by verifying that their mutual information is close to 0 (a quick sketch follows the two cases below).
- This assumption is especially dangerous in time-series models or longitudinal datasets, where serial correlation in the residuals implies that there is room for improvement in the model.
- This assumption is also important in the case of non-time-series models or cross-sectional datasets. If residuals always have the same sign under particular conditions, it means that the model systematically underpredicts/overpredicts what happens.
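Coming back to the mutual-information remark above, here is a rough lag-1 sketch using scikit-learn’s estimator (the choice of library and of a lag-1 check are my assumptions, not part of the original analysis); a value close to 0 is consistent with independence, and the check only makes sense once the rows are ordered meaningfully, as discussed for the ACF plot below.
from sklearn.feature_selection import mutual_info_regression
# Estimate the mutual information between each residual and the previous one (lag 1)
resid = reg_multi.resid.values
mi = mutual_info_regression(resid[:-1].reshape(-1, 1), resid[1:])
print(mi)  # close to 0 -> consistent with independence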
To verify that condition, we can use ACF (autocorrelation function) plots and the Durbin-Watson test.
- ACF plot
We want to see whether the ACF is significant at any lag. In our case, we are not using time-series data, so we can use the row number instead. In such cases, rows should be sorted in a way that (only) depends on the values of the feature(s).
import statsmodels.tsa.api as smt
acf = smt.graphics.plot_acf(reg_multi.resid, alpha=0.05)

- Durbin-Watson
The test outputs a value between 0 and 4. Here is how to interpret it:
- a value of 2 means that there is no autocorrelation in the sample,
- values < 2 indicate positive autocorrelation,
- values > 2 indicate negative autocorrelation.
import statsmodels.stats.stattools as st
st.durbin_watson(residuals, axis=0)
>> 2.0772952352565546
✅ We can reasonably assume the independence of the residuals.
5. Multicollinearity
Last but not least, for multiple linear regression it’s a good idea to check for multicollinearity. Multicollinearity occurs when your model includes multiple factors that are correlated not just to your response variable, but also to each other. In other words, it results when you have factors that are a bit redundant.
And you remember, a good model is a simple model.
You can think of multicollinearity in terms of football games:
If one player tackles the opposing quarterback, it’s easy to give credit for the sack where credit’s due. But if three players are tackling the quarterback simultaneously, it’s much more difficult to determine which of the three makes the biggest contribution to the sack.
Multicollinearity makes some variables statistically insignificant when they should be significant.
You can check for collinearity between two or more variables with the Variance Inflation Factor (VIF). It measures the collinearity among predictor variables within a multiple regression.
We generally consider that a VIF above 5 or 10 (depending on the business problem) indicates a multicollinearity problem.
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Keep only the predictors (drop the response and the predicted values added earlier)
predictors = df.drop(['sales', 'Predicted_sales'], axis=1)
X = add_constant(predictors)
# For each X, calculate the VIF and save it in a dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif.round(1)  # inspect results

✅ Here, we can see that we don’t have a multicollinearity problem.
Apart from this, the correlation matrix of the predictors may also indicate the presence of multicollinearity. While not always conclusive, a correlation matrix can be a good indicator of multicollinearity and highlight the need for further investigation.
# Correlation heatmap of the predictors
plt.subplots(figsize=(15, 7))
sns.heatmap(predictors.corr(), annot=True, cmap='coolwarm')

sns.pairplot(data = df)

Conclusions
Once you have run your linear regression model and gotten some significant results, you should check if the assumptions supporting the validity of the model are verified:
- The relationship between X (the explanatory variable) and Y (the dependent variable) is linear.
- The residuals are normally distributed.
- The residuals have constant variance (i.e homoscedasticity)
- The residuals are independent.
- There is no multicollinearity.
In this article, we used a dataset to predict sales based on advertising budgets. We consider that the conditions are fulfilled, but (of course!) the model can be improved.
In fact, we could now work on:
- Assessing the impact of outliers and removing the biggest influencers;
- Adding relevant explanatory variables;
- Transforming the explanatory variables (e.g. with a log transform), which can be relevant in our case as the relationship between X and Y looks closer to a polynomial than to a straight line.
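As an illustration of that last point, a transformation can be applied directly inside the formula; the use of np.log1p below is my own choice (to avoid issues with zero budgets), so treat this as a sketch rather than a tuned model.
import numpy as np
import statsmodels.formula.api as smf
# Refit with log-transformed advertising budgets (illustration only)
reg_log = smf.ols('sales ~ np.log1p(youtube) + np.log1p(facebook) + newspaper', data=df).fit()
print(reg_log.summary())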
Finally, if you are facing a dataset where the conditions are not verified, here you can find some good tips on how to fix it:
Thanks for reading! If you liked this article, make sure to hold the clap button to support my writing. You can also follow my work on Linkedin.