The Poisson Hidden Markov Model for Time Series Regression

How a mixture of two powerful random processes can be used to model time series data

Sachin Date
Towards Data Science
Nov 27, 2021


A Poisson Hidden Markov Model uses a mixture of two random processes, a Poisson process and a discrete Markov process, to represent counts-based time series data.

Counts-based time series data contain only whole-numbered values such as 0, 1, 2, 3, etc. Examples of such data are the daily number of hits on an eCommerce website, the number of bars of soap purchased each day at a department store, and so on.

Such data cannot be adequately represented using models such as the Linear model or the ARIMA model because in those models, the dependent variable (y) is assumed to be real-valued, with both positive and negative values allowed. Counts-based time series data require models which assume that the dependent variable is 1) discrete, and 2) non-negative; in other words, whole-numbered.

The Poisson and Poisson-like models, such as the Generalized Poisson and the Negative Binomial models, often work well for modeling whole-numbered datasets.

Unfortunately, the Poisson family of models does not explicitly account for, and is therefore unable to adequately capture, the auto-correlation in the data, which happens to be a hallmark of time series datasets.

Over the years, researchers have devised several modifications to the basic Poisson model to enable it to account for the auto-correlation in time series data. Notable among them are the Poisson Auto-Regressive model and the Poisson Integer ARIMA model, in which the Poisson conditional mean is expressed as a linear combination of not only the regression variables X but also lagged copies of the dependent variable y.

The Poisson Hidden Markov Model goes one step further by mixing in a discrete k-state Markov model that is ’hidden’ in that at each time step in the time series, one does not know for sure which Markov regime the Markov process is in. Instead, at each time step, one estimates the effect of the possible existence of each regime on the mean value that is predicted by the Poisson model. This makes the Poisson process mean a random variable whose expected value is a function of the probability that the underlying Markov Model is in a particular state.

In this article, we will precisely formulate this relationship between the mean predicted by the visible Poisson model and the Hidden Markov Model.

If you are new to Markov Processes or Hidden Markov Models, please go through the following two articles:

I’d also suggest reviewing the following article on the Markov Switching Dynamic Regression Model to get a detailed understanding of how the MSDR model is built. The Poisson HMM is an MSDR model where the ‘visible’ portion of the model obeys a Poisson process.

Detailed specification of the Poisson Hidden Markov Model

We will first describe the ‘visible’ Poisson process, and then show how the Markov process ‘mixes’ into it.

Consider the following model equation that incorporates an additive error component:

y_t expressed as the sum of the modeled mean and an error term (Image by Author)

In the above model, the observed value y_t is the sum of the predicted value μ_cap_t and the residual error ε_t. We further assume that ε_t is a homoskedastic (constant variance) normally distributed random variable with a mean of zero, represented as N(0, σ²).

y_t is the t-th element of the [n x 1] vector of observed values y:

The dependent variable y (Image by Author)

We assume that y obeys a Poisson process. Therefore, y_t for t in [1…n] are n independent, Poisson-distributed random variables, each with a possibly different mean μ_t, as shown below:

A time series y that obeys the Poisson process (Image by Author)

The Probability Mass Function (PMF) of y_t, which is just another way of saying the probability of observing y_t, is given by:

The PMF of a Poisson-distributed y with mean μ_t (Image by Author)

In a trained regression model, we replace μ_t with the ‘fitted’ mean μ_cap_t.
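
To make the PMF concrete, here is a minimal numerical sketch (the mean and counts below are made-up values, not from the article) that evaluates the Poisson PMF with scipy:

```python
import numpy as np
from scipy.stats import poisson

# A hypothetical fitted mean for some time step t
mu_cap_t = 3.2

# Probability of observing y_t = 0, 1, ..., 5 under a Poisson distribution with mean mu_cap_t
y_values = np.arange(6)
pmf_values = poisson.pmf(y_values, mu=mu_cap_t)

for y, p in zip(y_values, pmf_values):
    print(f"P(y_t = {y} | mu_cap_t = {mu_cap_t}) = {p:.4f}")
```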

Let X be an [n x (m+1)] matrix of regression variables as shown below. The first column of this matrix is a placeholder for the intercept, and x_t is one row of this matrix.

The regression variables matrix X (Image by Author)

Let β_cap be an [(m+1) x 1] vector of regression coefficients. The ‘cap’ on β indicates that it is the fitted value of the coefficient produced from training the model.

Vector of fitted regression coefficients β_cap (Image by Author)

We will express the mean μ_cap_t as the exponentiated linear combination of x_t and β_cap as follows:

The exponentiated mean of the Poisson regression model (Image by Author)

Here, the dot product between x_t and β_cap can be expressed in expanded form as follows:

A linear combination of regression variables and fitted regression coefficients (Image by Author)

The exponentiation of the dot product ensures that the mean and therefore the model’s prediction is never negative. This is a key requirement for modeling a data set of whole numbered counts.
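
As a small sketch of this computation (the data and coefficient values below are illustrative assumptions), the vector of fitted means can be obtained in one line once X and β_cap are in hand:

```python
import numpy as np

# Illustrative [n x (m+1)] regression matrix X with n=4 rows and m=2 regressors;
# the leading column of 1s is the placeholder for the intercept
X = np.array([
    [1.0, 2.5, 0.3],
    [1.0, 1.8, 0.7],
    [1.0, 3.1, 0.1],
    [1.0, 2.0, 0.5],
])

# Illustrative fitted coefficients beta_cap of size [(m+1) x 1]
beta_cap = np.array([0.2, 0.4, -0.1])

# Exponentiating the dot product x_t . beta_cap keeps every fitted mean strictly positive
mu_cap = np.exp(X @ beta_cap)   # shape (n,)
print(mu_cap)
```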

This completes the specification of the Poisson portion of the Poisson HMM. Now, let’s turn our attention to the Markov portion.

Mixing-in the Hidden Markov Model

We will see how to ‘mix’ a discrete Markov model into the Poisson regression model.

Consider a k-state Markov process that is assumed to be in some state j ∈ [1,2,3,…,k]. We do not know which state the Markov process is in at time t. We only assume that it influences the Poisson process model in the following manner:

The observed value y_t expressed as a sum of the predicted mean μ_cap_t_j and residual error ε_t (Image by Author)

Notice that the fitted mean μ_cap_t_j is now indexed with the Markov state j. μ_cap_t_j is expressed as follows:

The exponentiated mean of the Poisson HMM when the underlying Markov process is in state j (Image by Author)

Notice that we are now working with a Markov state-specific regression coefficients vector β_cap_j corresponding to the jth Markov state.

If the Markov model operates over k states [1,2,…,j,…,k], the regression coefficients take the form of a matrix of size [(m+1) x k] as follows:

The coefficients matrix β_cap_s of size [(m+1) x k] (Image by Author)

The intuition here is that depending on which Markov state, or ‘regime’, the process is in, the regression model will switch to the appropriate regime-specific coefficient vector β_cap_j from the matrix β_cap_s.
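
Here is a small sketch of these regime-switching means (all values are illustrative assumptions): the coefficients sit in an [(m+1) x k] matrix, and exponentiating X times that matrix yields one column of fitted means per Markov state:

```python
import numpy as np

# Illustrative [n x (m+1)] regression matrix with an intercept column (n=3, m=2)
X = np.array([
    [1.0, 2.5, 0.3],
    [1.0, 1.8, 0.7],
    [1.0, 3.1, 0.1],
])

# Illustrative coefficients matrix beta_cap_s of size [(m+1) x k] for k=2 states;
# column j holds the regime-specific coefficient vector beta_cap_j
beta_cap_s = np.array([
    [0.2,  0.8],
    [0.4,  0.1],
    [-0.1, 0.3],
])

# mu_cap[t, j] = exp(x_t . beta_cap_j): the Poisson mean at time t if the process is in state j
mu_cap = np.exp(X @ beta_cap_s)   # shape (n, k)
print(mu_cap)
```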

The k-state Markov process is governed by the following state transition matrix P, where each element p_ij is the probability of transitioning to state j at time t given that the process was in state i at time (t-1):

The state transition matrix P of the Markov process (Image by Author)

The Markov process also has the following state probability distribution π_t at time step t:

The state probability distribution vector of the k-state Markov process (Image by Author)
Quick tip: Some texts call the state vector δ_t instead of π_t, and they call the Markov state transition probabilities γ_ij instead of p_ij.

Knowing/assuming some value for π_0 at t=0, we calculate π_t as follows:

The formula for the state probability distribution of a Markov process at time t, given the probability distribution at t=0 and the transition matrix P (Image by Author)
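
A minimal sketch of this calculation (the transition matrix and initial distribution below are assumed values), treating π as a row vector so that π_t = π_0 P^t:

```python
import numpy as np

# Illustrative 2-state transition matrix P: row i holds the probabilities of
# moving from state i to each of the k states
P = np.array([
    [0.9, 0.1],
    [0.3, 0.7],
])

# Assumed state probability distribution pi_0 at t = 0
pi_0 = np.array([0.5, 0.5])

# State probability distribution at time t: pi_t = pi_0 . P^t
t = 5
pi_t = pi_0 @ np.linalg.matrix_power(P, t)
print(pi_t)          # the components always sum to 1.0
```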

Let’s circle back to the state-specific formula for the Poisson mean:

The exponentiated mean of the Poisson HMM at time t, when the underlying Markov process is in state j (Image by Author)

μ_cap_t_j is the predicted mean of the Poisson regression model at time t assuming that the underlying Markov process is in state j. Since we don’t actually know which state the Markov process is in at time t, at each time step, for a k-state Markov process, we have to work with k such predicted means as follows:

The vector of k predicted means from the Poisson regression model corresponding to the k possible states that the Markov process could be in (Image by Author)

It would be absurd to generate k predictions for y_t at each time step. Therefore, we collapse these k predictions into a single prediction μ_cap_t using the formula for expectation. The trick lies in realizing that μ_cap_t is a random variable that has k possible values, each associated with a probability of occurrence. And this probability is simply π_tj, the unconditional probability that the underlying Markov process is in state j at time t.

Therefore:

The predicted mean of the Poisson HMM is the expectation across all possible Markov states as follows:

The predicted mean expressed as the expected value of the random variable μ_cap_t (Image by Author)

The probabilities π_tj are the component values of the vector π_t. At each time step t, we calculate π_t using the following formula:

The formula for the state probability distribution of a Markov process at time t, given the probability distribution at t=0 and the transition matrix P (Image by Author)
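
A tiny numerical sketch of this expectation (the state-specific means and probabilities below are made-up values):

```python
import numpy as np

# Illustrative state-specific means mu_cap_t_j for one time step t of a 2-state model
mu_cap_t_states = np.array([2.1, 6.4])

# Illustrative state probability distribution pi_t at time t (components sum to 1.0)
pi_t = np.array([0.75, 0.25])

# Collapse the k state-specific means into a single prediction via the expectation formula
mu_cap_t = np.sum(pi_t * mu_cap_t_states)
print(mu_cap_t)      # 0.75*2.1 + 0.25*6.4 = 3.175
```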

Training and estimation

Training of the Poisson Hidden Markov model involves estimating the coefficients matrix β_cap_s and the Markov transition probabilities matrix P. The estimation procedure is usually either Maximum Likelihood Estimation (MLE) or Expectation Maximization.

We’ll describe how MLE can be used to find the optimal values of P and β_cap_s that would maximize the joint probability density of observing the entire training data set y. In other words, we would want to maximize the following product:

The Likelihood of observing the data set (Image by Author)

In the above product, the probability P(y=y_t) is the Probability Mass Function (PMF) of the Poisson process:

The Poisson conditional PMF (Image by Author)

The probability on the L.H.S. is read as the conditional probability of observing y_t at time t given the fitted mean rate μ_cap_t. μ_cap_t is the expected value of the predicted mean across all possible regimes as calculated using the formula for expectation that we saw earlier.

There is a second way to calculate the Poisson PMF.

Let’s once again look at the formula for the predicted Poisson probability of observing y_t, assuming that the underlying Markov process is in state j:

Predicted Poisson probability of observing y_t, assuming that the underlying Markov process is in state j (Image by Author)

Where:

The exponentiated mean of the Poisson HMM at time t, when the underlying Markov process is in state j (Image by Author)

The Markov state-specific linear combination is expressed as follows:

The Markov state-specific linear combination of the regression variables vector x_t and the state-specific regression coefficients β_cap_j (Image by Author)

Using the Poisson PMF, and assuming a k-state Markov process, at each time step t, we would get k probabilities of observing y_t, each one conditional upon the Markov process being in state j. The following vector captures these k probabilities:

The k Poisson probabilities of observing y_t, each one conditional upon the Markov process being in state j at time t (Image by Author)

As before, for each time step, we’d want to collapse these k probabilities into a single probability. To do so, we appeal to the Law of Total Probability, which states that if event A can take place jointly with exactly one of the mutually exclusive events A1, A2, A3, and so on, then the unconditional probability of A can be expressed as follows:

The Law of Total Probability (Image by Author)

Using this law, we will calculate the probability predicted by the Poisson HMM of observing y_t at time t as follows:

The probability density of y under the influence of a k-state Markov model (Image by Author)

In the above summation, P(s_t=1), P(s_t=2), and so on, are the Markov process’s state probabilities at time t. We know from our earlier discussion that these are simply π_tj, the different components of the vector π_t.
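
Here is a short sketch of this second way of computing P(y=y_t) (the means, probabilities, and observed count below are illustrative assumptions):

```python
import numpy as np
from scipy.stats import poisson

# Illustrative state-specific Poisson means at time t for a 2-state model
mu_cap_t_states = np.array([2.1, 6.4])

# Illustrative state probabilities P(s_t=j) at time t, i.e. the components of pi_t
pi_t = np.array([0.75, 0.25])

# An illustrative observed count at time t
y_t = 3

# Law of Total Probability: P(y=y_t) = sum over j of P(y=y_t | s_t=j) * P(s_t=j)
conditional_pmfs = poisson.pmf(y_t, mu=mu_cap_t_states)   # the k conditional probabilities
p_y_t = np.sum(pi_t * conditional_pmfs)
print(p_y_t)
```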

Effectively, we have figured out two different ways of calculating the Poisson probability P(y=y_t).

Let’s circle back to the Likelihood equation:

The Likelihood of observing the data set (Image by Author)

It is often expeditious to maximize the natural logarithm of this product, since the logarithm converts the product into a sum, and sums are easier to differentiate (as we will do shortly). Hence, we will maximize the following log-likelihood denoted by the stylized ℓ:

The Log-Likelihood of observing the data set (Image by Author)

Maximization of the log-likelihood is done using the following procedure (a minimal sketch of this setup appears after the list):

  1. We take the partial derivative of the log-likelihood w.r.t. each transition probability p_ij in P, and each state-specific coefficient β_cap_q_j in the coefficients matrix β_cap_s, where q lies in [0,1,…,m] and the Markov state j lies in [1,2,…,k].
  2. We set each partial derivative to zero, and,
  3. We solve the resulting system of (k² + (m+1)*k) equations (in practice, far fewer than that number), corresponding to the k² Markov transition probabilities and the (m+1)*k coefficients in β_cap_s, using an optimization algorithm such as Newton-Raphson, Nelder-Mead, or Powell’s method.
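
The following is a highly simplified sketch of this estimation step, not the article’s implementation. It assumes a row-wise softmax as the Q-to-P standardization (discussed below), a uniform initial state distribution π_0, synthetic data, and Powell’s method via scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def negative_log_likelihood(params, y, X, k):
    """Negative log-likelihood of a k-state Poisson HMM.
    params packs the proxy matrix Q (k*k values) followed by beta_cap_s ((m+1)*k values)."""
    n, m_plus_1 = X.shape
    Q = params[:k * k].reshape(k, k)
    B = params[k * k:].reshape(m_plus_1, k)

    # Assumed standardization of Q into a valid transition matrix P (row-wise softmax)
    expQ = np.exp(Q - Q.max(axis=1, keepdims=True))
    P = expQ / expQ.sum(axis=1, keepdims=True)

    # Assumed initial state distribution: uniform over the k states
    pi_t = np.full(k, 1.0 / k)

    # State-specific Poisson means: mu[t, j] = exp(x_t . beta_cap_j)
    mu = np.exp(X @ B)                           # shape (n, k)

    log_likelihood = 0.0
    for t in range(n):
        pi_t = pi_t @ P                          # advance the state distribution: pi_t = pi_0 . P^t
        p_y_t = np.sum(pi_t * poisson.pmf(y[t], mu=mu[t]))   # Law of Total Probability
        log_likelihood += np.log(p_y_t + 1e-12)  # small constant guards against log(0)
    return -log_likelihood

# Illustrative synthetic data and optimization call (Powell's method, one of the options above)
rng = np.random.default_rng(0)
n, m, k = 200, 1, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = rng.poisson(lam=3.0, size=n)
initial_guess = rng.normal(scale=0.1, size=k * k + (m + 1) * k)
result = minimize(negative_log_likelihood, initial_guess, args=(y, X, k), method="Powell")
print(result.x)
```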

There is a little wrinkle that we need to smooth out. Throughout the optimization process, the Markov state transition probabilities p_ij need to lie in the [0.0, 1.0] interval, and the probabilities across any row of P need to sum to 1.0. In equation form, these two constraints are expressed as follows:

The constraints obeyed by all Markov state transition probabilities (Image by Author)

The summation-to-1.0 constraint becomes obvious when one notes that if the Markov process is in state i at time (t-1), then at the next time step t, it has to be in one of the available states [1,2,3,…,k].

During optimization, we tackle these constraints by defining a matrix Q of size (k x k) as follows:

The proxy matrix Q (Image by Author)

Q acts as a proxy for P. Instead of optimizing P, we optimize Q by allowing q_ij to range freely from -∞ to +∞. In each optimization iteration, we obtain p_ij by standardizing the locally optimized q values to the interval [0.0, 1.0], as follows:

Standardization of the Q matrix to get the P matrix (Image by Author)
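
The exact standardization is shown in the image above; one common choice (an assumption here, not necessarily the article’s formula) is a row-wise softmax, which pushes every p_ij into [0.0, 1.0] and makes each row of P sum to 1.0:

```python
import numpy as np

# Illustrative proxy matrix Q whose entries range freely over (-inf, +inf)
Q = np.array([
    [ 1.2, -0.4],
    [-2.0,  0.9],
])

# Row-wise softmax: subtract the row max for numerical stability, exponentiate, normalize
expQ = np.exp(Q - Q.max(axis=1, keepdims=True))
P = expQ / expQ.sum(axis=1, keepdims=True)

print(P)
print(P.sum(axis=1))   # each row sums to 1.0
```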

In next week’s article, we will build and train a Poisson Hidden Markov Model using Python and statsmodels, so do join me then. Happy modeling!

References and Copyrights

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

Thanks for reading! If you liked this article, please follow me to receive tips, how-tos and programming advice on regression and time series analysis.
