An Illustrated Guide to the Poisson Regression Model

And a tutorial on Poisson regression using Python

Sachin Date

Published in

Towards Data Science

13 min readSep 22, 2019

In this article, we’ll cover the following topics:

Features of count based data: Count data sets are some of the most commonly occurring data in the world. We’ll look at what makes count based data different.
Regression models for forecasting counts: We’ll look at Poisson regression model in detail. The Negative Binomial (NB) regression model is another commonly used model for count based data. I’ll cover that in a future article.
Python tutorial on Poisson regression: I will take you through a step-by-step tutorial on how to create a Poisson regression model in Python using the GLM class of statsmodels, and how to train it on a real world data set.

Before we begin, a few pointers…

For the Python tutorial on Poisson regression, scroll down to the last couple of sections of this article.
The Github gist for the Python code is over here.
A real world data set of bicyclist counts used in this article is over here.
For a primer on random variables, the Poisson process, and a Python program to simulate a Poisson process, click here:
The Poisson Process: Everything you need to know

Now let’s begin!

What is count based data?

Count based data contains events that occur at a certain rate. The rate of occurrence may change over time or from one observation to next. Here are some examples of count based data:

Data source: Wikipedia: potentially habitable exoplanets (Image by Author)

Number of vehicles crossing an intersection per hour,
Number of people visiting a doctor’s office per month,
Number of earth-like planets spotted per month.

A data set of counts has the following characteristics:

Whole number data: The data consists of non-negative integers: [0… ∞] Regression techniques such as Ordinary Least Squares Regression may not be appropriate for modeling such data as OLSR works best on real numbers such as -656.0, -0.00000345, 13786.1 etc.
Skewed Distribution: The data may contain a large number of data points for just a few values, thereby making the frequency distribution quite skewed. See for example above histogram.
Sparsity: The data may reflect the occurrence of a rare event such as a gamma ray burst, thereby making the data sparse.
Rate of occurrence: For the sake of creating a model, it can be assumed that there is a certain rate of occurrence of events λ that drives the generation of such data. The event rate may drift over time.

A real world data set of counts

The following table contains counts of bicyclists traveling over various NYC bridges. The counts were measured daily from 01 April 2017 to 31 October 2017.

Source: Bicycle Counts for East River Bridges (Data source: NYC OpenData) (Image by Author)

Here is a time sequenced plot of the bicyclist counts on the Brooklyn bridge:

Background image: The Brooklyn bridge as seen from Manhattan island

Regression models for counts

The Poisson regression model and the Negative Binomial regression model are two popular techniques for developing regression models for counts. Other possibilities are Ordered Logit, Ordered Probit and Nonlinear Least Squares models.

Regression strategy

It’s good practice to start with the Poisson regression model and use it as the “control” for either more complex, or less constrained models. In their book Regression Analysis of Count Data, Cameron and Trivedi say the following:

“A sound practice is to estimate both Poisson and negative binomial models.”

In this article, we’ll use the Poisson regression model for regressing the bicyclist counts observed on the Brooklyn bridge.

Introducing the Poisson model

The Poisson distribution has the following Probability Mass Function.

Probability of seeing k events in time t, given λ events occurring per unit time (Image by Author)

The expected value (mean) for a Poisson distribution is λ. Thus in the absence of other information, one should expect to see λ events in any unit time interval such as 1 hour, 1 day, etc. For any interval t, one would expect to see λt events.

A Poisson regression model for constant λ

If the event rate λ is constant, one can simply use a modified Mean Model for predicting future counts of events. In this case, one would set all predicted values of counts to this constant value λ.

The following figure illustrates the constant λ scenario:

Actual and predicted counts for a constant rate model (Image by Author)

The following Python code was used to generate the blue dots (actual counts in the past time steps) using a Poisson process with λ=5. The orange dots (predictions) are all set to the same value 5.

A Python program to generate event counts using a Poisson process

A Poisson regression model for a non-constant λ

Now we get to the fun part. Let us examine a more common situation, one where λ can change from one observation to the next. In this case, we assume that the value of λ is influenced by a vector of explanatory variables, also known as predictors, regression variables, or regressors. We’ll call this matrix of regression variables, X.

The job of the regression model is to fit the observed counts y to the matrix of regression values X .

In the NYC bicyclist counts data set, the regression variables are Date, Day of Week, High Temp, Low Temp and Precipitation. We can also introduce additional regressors such as Month and Day of Month that are derived from Date, and we have the liberty to drop existing regressors such as Date.

The fitting of y to X happens by fixing the values of a vector of regression coefficients β.

In a Poisson Regression model, the event counts y are assumed to be Poisson distributed, which means the probability of observing y is a function of the event rate vector λ.

The job of the Poisson Regression model is to fit the observed counts y to the regression matrix X via a link-function that expresses the rate vector λ as a function of, 1) the regression coefficients β and 2) the regression matrix X.

The following figure illustrates the structure of the Poisson regression model.

**Scan from left to right**: The structure of the Poisson regression model (Image by Author)

What might be a good link function f(.) connecting λ with X? It turns out the following exponential link-function works great:

The exponential link function of the Poisson regression model (Image by Author)

This link function keeps λ non-negative even when the regressors X or the regression coefficients β have negative values. This is a requirement for count based data.

In general, we have:

A Formal Specification of the Poisson Regression Model

The complete specification of the Poisson regression model for count based data is given as follows:

For the ith observation in the data set denoted by y_i corresponding to the row of regression variables x_i, the probability of observing the count y_i is Poisson distributed as per the following PMF:

Probability of observing count y_i given **x_i** *(as per the Poisson PMF formula)* (Image by Author)

Where the mean rate λ_i for the ith sample is given by the exponential link function shown earlier. We reproduce it here:

Once the model is fully trained on the data set, the regression coefficients β are known, and the model is ready to make predictions. To predict the event count y_p corresponding to an input row of regressors x_p that one has observed, one uses this formula:

The prediction equation for the Poisson regression model (Image by Author)

All of this hinges on our ability to train the model successfully so that the regression coefficients vector β is known.

Let’s look at how this training takes place.

Training the Poisson regression model

Training a Poisson regression model involves finding the values of the regression coefficients β which would make the vector of observed counts y most likely.

The technique for identifying the coefficients β is called Maximum Likelihood Estimation (MLE).

Let’s get acquainted with the technique of MLE.

Understanding Maximum Likelihood Estimation (MLE)

I’ll illustrate the MLE technique using the bicyclist counts data set. Take a look at the first few rows of this data set:

Daily counts of bicyclists on the Brooklyn bridge (Image by Author)

Our assumption is that the bicyclist counts shown in the red box arise from a Poisson process. Hence we can say that their probabilities of occurrence is given by the Poisson PMF. Here are the probabilities for the first 4 occurrences:

Probabilities of observing the bicyclist counts for the first few occurrences given corresponding regression vectors (Image by Author)

We can similarly calculate the probabilities for all n counts observed in the training set.

Note that in the above formulae, λ_1, λ_2, λ_3,…,λ_n are calculated using the link function as follows:

The event rates corresponding to the counts for the first few days (Image by Author)

Where x_1, x_2, x_3, x_4 are the first 4 rows of the regression matrix.

The probability of occurrence of the entire set of n counts y_1, y_2,…,y_n in the training set is the joint probability of occurrence of the individual counts.

The counts y are Poisson distributed, y_1, y_2,…,y_n are independent random variables, given correspondingly x_1, x_2,…,x_n. Hence the joint probability of occurrence of y_1, y_2,…,y_n can be expressed as a simple multiplication of the individual probabilities. Here is how the joint probability looks like for the entire training set:

The **likelihood function** L(β) *expressed as a joint probability mass function* (Image by Author)

Let’s recollect that λ_1, λ_2, λ_3,…,λ_n are linked to the regression vectors x_1, x_2,x_3,…,x_n via the regression coefficients β.

What value of β will make the given set of observed counts y most likely? It is the value of β for which the joint probability shown in the above equation achieves the maximum value. In other words, it is the value of β for which the rate of change of the joint probability function w.r.t. β is 0. Put another way, it is the solution of the equation obtained from differentiating the joint probability equation w.r.t. β and setting this differential equation to 0.

It is easier to differentiate the logarithm of the joint probability equation than the original equation. The solution to the logged equation yields the same optimal value of β.

This logarithmic equation is called the log-likelihood function. For the Poisson regression, the log-likelihood function is given by the following equation:

log-likelihood function for the Poisson regression model (Image by Author)

The above equation is obtained by taking the natural logarithm of both sides of the joint probability function shown earlier, after substituting the λ_i with exp(x_i*β).

As mentioned earlier, we differentiate this log-likelihood equation w.r.t. β, and set it to zero. This operation gives us the following equation:

The Poisson MLE for β *is the solution to this equation* (Image by Author)

Solving this equation for the regression coefficients β will yield the Maximum Likelihood Estimate (MLE) for β.

To solve the above equation one uses an iterative method such as Iteratively Reweighted Least Squares (IRLS). In practice, one does not solve this equation by hand. Instead, you use statistical software such as the Python statsmodels package which will do all the calculations for you while training the Poisson regression model on your data set.

Summary of steps for performing Poisson Regression

In summary, here are the steps for performing a Poisson Regression on a count based data set:

First, make sure that your data set contains counts. One way to tell is that it contains only non-negative integer values that represent the number of occurrences of some event during some interval. In the bicyclist counts data set, it is the count of bicyclists crossing the Brooklyn bridge per day.
Find out (or guess) the regression variables that will influence the observed counts. In the bicyclist counts data set the regression variables are Day of Week, Min Temp, Max Temp, Precipitation etc.
Carve out a training data set that your regression model will train on, and a test data set that should keep aside. Do not train the model on the test data.
Use a suitable statistical software such as the Python statsmodels package to configure and fit the Poisson Regression model on the training data set.
Test the performance of the model by running it on the test data set so as to generate predicted counts. Compare them with the actual counts in the test data set.
Use a goodness-of-fit measure to determine how well your model has trained on the training data set.

How to do Poisson Regression in Python

Let’s put into practice what we have learnt. The Python statmodels package has excellent support for doing Poisson regression.

Let’s use the Brooklyn bridge bicyclist counts data set. You can pick up the data set from here.

Our goal is to build a Poisson regression model for the observed bicyclist counts y. We will use the trained model to predict daily counts of bicyclists on the Brooklyn bridge that the model has not seen during training.

Daily count of bicyclists on the Brooklyn bridge (Image by Author)

We’ll start by importing all the required packages.

import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

Create a pandas DataFrame for the counts data set.

df = pd.read_csv('nyc_bb_bicyclist_counts.csv', header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])

We’ll add a few derived regression variables to the X matrix.

ds = df.index.to_series()df['MONTH'] = ds.dt.monthdf['DAY_OF_WEEK'] = ds.dt.dayofweekdf['DAY'] = ds.dt.day

We will not use the Date variable as a regressor since it contains an absolute date value but we don’t need to do anything special to drop Date as it is already consumed as the index of the pandas DataFrame. So it will not be available to us in the X matrix.

Let’s create the training and testing data sets.

mask = np.random.rand(len(df)) < 0.8df_train = df[mask]df_test = df[~mask]print('Training data set length='+str(len(df_train)))print('Testing data set length='+str(len(df_test)))

Setup the regression expression in patsy notation. We are telling patsy that BB_COUNT is our dependent variable and it depends on the regression variables: DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T and PRECIP.

expr = """BB_COUNT ~ DAY  + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP"""

Set up the X and y matrices for the training and testing data sets. patsy makes this really simple.

y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

Using the statsmodels GLM class, train the Poisson regression model on the training data set.

poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()

Print the training summary.

print(poisson_training_results.summary())

This prints out the following:

Training summary for the Poisson regression model (Image by Author)

So how well did our model do? Let’s make some predictions on the test data set.

poisson_predictions = poisson_training_results.get_prediction(X_test)#.summary_frame() returns a pandas DataFrame
predictions_summary_frame = poisson_predictions.summary_frame()print(predictions_summary_frame)

Here are the first few rows of the output:

First few rows of output from **poisson_predictions.summary_frame()** (Image by Author)

Let’s plot the predicted counts versus the actual counts for the test data.

predicted_counts=predictions_summary_frame['mean']actual_counts = y_test['BB_COUNT']fig = plt.figure()fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge')predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')plt.legend(handles=[predicted, actual])plt.show()

Here’s the output:

Predicted versus actual bicyclist counts on the Brooklyn bridge (Image by Author)

The model seems to be more or less tracking the trend in the actual counts although in many cases its predictions are way off the actual value.

Let’s also plot Actual versus Predicted counts.

plt.clf()fig = plt.figure()fig.suptitle('Scatter plot of Actual versus Predicted counts')plt.scatter(x=predicted_counts, y=actual_counts, marker='.')plt.xlabel('Predicted counts')plt.ylabel('Actual counts')plt.show()

Here is the plot:

Scatter plot of Actual versus Predicted counts (Image by Author)

Here is the complete source code for doing Poisson regression using Python:

Goodness-of-fit of the Poisson regression model

Recollect that both the expected value (i.e. mean) and the variance, of the Poisson distribution is λ. This rather strict condition is violated by most real-world data.

A common source of failure of the Poisson regression model is that the data does not satisfy the mean = variance criterion imposed by the Poisson distribution.

The summary() method on the statsmodels GLMResults class shows a couple of useful goodness-of-fit statistics to help you evaluate whether your Poisson regression model was able to successfully fit the training data. Let’s look at their values:

The reported values of Deviance and Pearson chi-squared are very large. A good fit is virtually impossible given these values. To make a quantitative determination of the goodness-of-fit at some confidence level, say 95% (p=0.05), we look up the value in the χ2 table for p=0.05 and Degrees of freedom of residuals=163. (DF Residuals = No. Observations minus DF model]). We compare this Chi-Squared value with the observed statistic, in this case, the Deviance or the Pearson’s chi-squared value reported in GLMResults. We find that at p=0.05 and DF Residuals = 163, the chi-squared value from a standard Chi-Squared table is 193.791 which is much smaller than the reported statistic of 23030 and 23300. Hence as per this test, the Poisson regression model, in spite of demonstrating an ‘okay’ visual fit for the test data set, has fit the training data rather poorly.

Conclusion and next steps

For counts based data, a useful starting point is the Poisson regression model. One can then compare its performance with other popular counts based models, such as:

A Zero Inflated Poisson model if you suspect that your data contains excess zeros i.e. many more zeroes than what the regular Poisson model can explain.
The The Negative Binomial regression model which does not make the mean = variance assumption about the data.
Generalized Poisson regression models which also work well with over-dispersed or under-dispersed data sets.

Happy modeling!

Thanks for reading! I write about topics in data science, with a focus on regression and time series analysis.

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression and time series analysis.