(Image by Author)

When Your Regression Model’s Errors Contain Two Peaks

A Python tutorial on dealing with bimodal residuals

Towards Data Science
9 min readMay 30, 2020

--

A raw residual is the difference between the actual value and the value predicted by a trained regression model.

Residual error = Actual — Predicted (Image by Author)

The frequency distribution plot of residuals can provide a good feel for whether the model is correctly specified, i.e. whether it is the right kind of model for the data set, and whether all the important regression variables have been considered, and whether the model has fitted the data in an unbiased manner.

What you usually want to see is a normally distributed plot of residuals that is centered around zero like the one shown below.

A normally distributed frequency plot of residual errors (Image by Author)

A normally distributed frequency plot of residuals is one sign of a well-chosen, well-fitted model.

But residual plots are often skewed, or they have fat tails or thin tails, and sometimes they are not centered at zero. There are ways to address these problems. And sometimes one has to simply accept some degree of non-normality.

In this article, I’ll show you what to do when your model’s residuals turn out to be bimodal, i.e. they have two peaks instead of one, like so:

Bimodal frequency distribution (Image by Author)

The dataset

We’ll use a data set of daily usage of rental bicycles spanning two years. Here are the first 10 rows of the dataset:

Rental bike usage counts (Source: UCI Machine Learning Repository) (Image by Author)

The variables in the data set are as follows:

Instant: The row index
dteday: The day on which the measurement was taken in dd-MM-yy format
season: the prevailing weather season
yr: the prevailing year: 0=2011, 1=2012
mnth: the prevailing month: 1 thru 12
holiday: Whether the measurement was taken on a holiday (yes=1, no=0)
weekday: day of the week (0 thru 6)
workingday: Whether the measurement was taken on a working day (yes=1, no=0)
weathersit: The weather situation on the day: 1=Clear, Few clouds, Partly cloudy, Partly cloudy. 2=Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist. 3=Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds. 4=Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog.
temp: Temperature, normalized to 39C
atemp: Real feel, normalized to 50C
hum: Humidity, normalized to 100
windspeed: Wind speed, normalized to 67
casual_user_count: count of casual bicycle renters
registered_user_count: count of registered (member) bicycle renters
total_user_count: count of total bicycle renters

You can download the data set from here.

The regression model

We’ll build a regression model in which the dependent variable is registered_user_count, and explanatory variables or the covariates as they are called, are the following:
season, mnth, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed.

Since we are modeling counts, we will use the Poisson regression model from the Python statsmodels library.

Let’s begin by importing all the required packages:

import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import statsmodels.stats.stattools as st
import matplotlib.pyplot as plt

Load the dataset into a Pandas data frame:

df = pd.read_csv('bike_sharing_dataset_daywise.csv', header=0, parse_dates=['dteday'], infer_datetime_format=True)

Create the training and test data sets:

mask = np.random.rand(len(df)) < 0.8df_train = df[mask]df_test = df[~mask]print('Training data set length='+str(len(df_train)))print('Testing data set length='+str(len(df_test)))

Create the regression expression in Patsy syntax. We are saying that registered_user_count is the dependent variable and it depends on all the variables mentioned on the right side of ~.

expr = 'registered_user_count ~ season + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed'

Set up the X, y matrices:

y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

Build and train the Poisson regression model:

poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()

poisson_training_results is an object of type GLMResults. Print the model summary:

print(poisson_training_results.summary())

This prints out the following:

Output of the fitted Poisson regression model (Image by Author)

It’s good to see that all model coefficients are statistically significant at a p-value of < 0.001 i.e. at 99.999% confidence level.

As against linear regression models, models in which the dependent variable is a count, rarely produce normally distributed residual error distributions.

So we have to normalize the raw-residuals using other means. Three popular transformations are:

  • The Pearson residual
  • The Anscombe residual
  • The Deviance residual

Statsmodels makes all three kinds of residual errors available to us via GLMResults.resid_pearson, GLMResults.resid_anscombe, and GLMResults.resid_deviance variables

The raw residuals are available in GLMResults.resid_response.

We’ll print out the Skewness and the Kurtosis of all 4 kinds of residuals to see which kind is the most normally distributed. A perfectly normal distribution has a Skewness of zero, and a Kurtosis of 3.0.

raw_residual_skewness = st.robust_skewness(poisson_training_results.resid_response)[0]pearson_residual_skewness = st.robust_skewness(poisson_training_results.resid_pearson)[0]anscobe_residual_skewness = st.robust_skewness(poisson_training_results.resid_anscombe)[0]deviance_residual_skewness = st.robust_skewness(poisson_training_results.resid_deviance)[0]

raw_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_response)[0]
pearson_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_pearson)[0]anscobe_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_anscombe)[0]deviance_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_deviance)[0]

residual_stats = [
['Raw residual', raw_residual_skewness, raw_residual_kurtosis],['Pearson\'s residual', pearson_residual_skewness, pearson_residual_kurtosis],
['Anscombe residual', anscobe_residual_skewness, anscobe_residual_kurtosis],
['Deviance residual', deviance_residual_skewness, deviance_residual_kurtosis]
]

residual_stats_df = pd.DataFrame(residual_stats, columns=['Residual', 'Skewness', 'Kurtosis'])

print(residual_stats_df)

This prints out the following table:

We are looking for the residual type with Skewness that closest to zero, and Kurtosis that is closest to 3.0. Given how close all the values are in this table, it’s hard to make a good choice of residual type from this table. We’ll choose the Pearson’s residual as it’s Skewness is closest to zero.

Let’s plot the frequency distribution of the Pearson’s residuals:

poisson_training_results.resid_pearson.hist(bins=50)plt.show()

We see the following bimodal distribution!

Regression errors (Pearson’s residuals) appear to have a bimodal distribution (Image by Author)

What could be going on here that caused the regression errors to be bimodal?

When regression errors are bimodal, there can be a couple of things going on:

The dependent variable is a binary variable such as Won/Lost, Dead/Alive, Up/Down etc. But your regression model may be generating as predictions, a continuously varying real valued values. So if you have encoded (Won=1, Lost=0), or (Dead=0, Alive=1) etc. and your regression model generates predicted values in a narrow range around 0.5, for e.g. 0.55, 0.58, 0.6, 0.61, etc, then the regression errors will peak either on one side of zero (when the true value is 0), or on the other side of zero (when the true value is 1).

This in turn can happen if you have not chosen the right kind of regression model for the data set, and/or you are missing key explanatory variables without which most of the predictions are hovering around 0.5.

In our example, the dependent variable is a count, and we have used a model that is appropriate for counts, namely the Poisson model, so the above situation can be ruled out.

Another reason for bimodal residuals is that one may have left out a binary regression variable which influences the output value in the following way:

When the variable’s value is 0, the output ranges within a certain range.

When the variable’s value is 1, the output takes on a whole new range of values that are not there in the earlier range.

It turns out, that is indeed the case in our example! We have left out the binary variable yr (year of observation) which takes on two values: 0=year 2011, and 1=year 2012.

Let’s see how yr influences the dependent variable registered_user_count. We’ll plot registered_user_count against yr:

df.plot.scatter('yr', 'registered_user_count')plt.show()
Registered user count versus year (Image by Author)

All the values of registered_user_count lying in the red circle occur only when year=1 i.e. 2012. Since we left out yr in our model, our model couldn’t explain the presence of higher valued counts. It made a systematic error each time the actual value was in the 5000–7000 range, leading to the second peak developing in the residuals plot.

If this theory is correct, adding yr into the model should resolve the bimodality of residuals. So let’s add yr into the regression expression and build and train the model again:

#Set up the regression expression. This time, include yr.
expr = 'registered_user_count ~ yr + season + mnth + holiday + weekday + workingday + weathersit + temp + atemp + hum + windspeed'

#Set up the X and y matrices for the training and testing data sets
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

#Using the statsmodels GLM class, train the Poisson regression model on the training data set
poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()

#print out the training summary
print(poisson_training_results.summary())

This yields the following output:

Poisson regression model results (Image by Author)

This time, notice that yr is one of the regression variables and its coefficient is statistically significant at 99.999% confidence level.

Also, the Log-likelihood of this model is -52683, which is considerably greater than the Log-likelihood (-114530) of the previous model without yr. Log-Likelihood is a measure of goodness-of-fit. In this case, it indicates that adding yr has considerably improved the model’s goodness of fit. This is consistent with our theory that the absence of yr was preventing our model from explaining all those high-value counts.

So far so good.

Let’s inspect all 4 kinds of residuals of the revised model containing yr:

deviance_residual_skewness = st.robust_skewness(poisson_training_results.resid_deviance)[0]

raw_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_response)[0]
pearson_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_pearson)[0]anscobe_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_anscombe)[0]deviance_residual_kurtosis = st.robust_kurtosis(poisson_training_results.resid_deviance)[0]

residual_stats = [
['Raw residual', raw_residual_skewness, raw_residual_kurtosis],['Pearson\'s residual', pearson_residual_skewness, pearson_residual_kurtosis],
['Anscombe residual', anscobe_residual_skewness, anscobe_residual_kurtosis],
['Deviance residual', deviance_residual_skewness, deviance_residual_kurtosis]
]

residual_stats_df = pd.DataFrame(residual_stats, columns=['Residual', 'Skewness', 'Kurtosis'])

print(residual_stats_df)
(Image by Author)

Once again, the Pearson’s residual shines among the lot with a Skewness that is closest to zero. Although, the Kurtosis of the raw residual is closest to 3.0.

Let’s plot the frequency distribution of the Pearson’s residual of the revised model:

df.plot.scatter('yr', 'registered_user_count')plt.show()
Frequency distribution of model’s Pearson’s residuals (Image by Author)

This time, the bimodality has been extinguished to a large extent with the addition of the yr variable.

Let’s look at the two plots side by side, without yr on the left, and with yr on the right:

Regression residuals (left=model without yr, right=model with yr) (Image by Author)

We have successfully fixed most of the bimodality in the residuals by including an important binary regression variable. Adding the missing variable has also increased the model’s goodness-of-fit score by a considerable amount.

Let’s fetch the model’s predictions on the test data set and we’ll also plot the predicted versus actual counts:

#fetch the predictions
poisson_predictions = poisson_training_results.get_prediction(X_test)
#.summary_frame() returns a pandas DataFrame
predictions_summary_frame = poisson_predictions.summary_frame()
#print the predictions
print(predictions_summary_frame)
#The mean column contains the predicted count
predicted_counts=predictions_summary_frame['mean']
#get the actual count from y_test
actual_counts = y_test['registered_user_count']

#Plot the predicted counts versus the actual counts for the test data.
fig = plt.figure()
fig.suptitle('Predicted versus actual user counts')predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')plt.legend(handles=[predicted, actual])plt.show()

We get the following plot:

Predicted versus actual counts (Image by Author)

Here is the complete source code used in this article:

Source code

You can download the data set from here.

Citation for the data set:

Fanaee-T, Hadi, and Gama, Joao, “Event labeling combining ensemble detectors and background knowledge”, Progress in Artificial Intelligence (2013): pp. 1–15, Springer Berlin Heidelberg, doi:10.1007/s13748–013–0040–3.

Thanks for reading! If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on regression, and time series analysis.

--

--