Bayesian Inference for AR(p) Models

Implementing AR models the “Bayesian” way. The code, written in Julia, can be found here (Jupyter notebook with the plots and outputs) or here (.jl file).

Saumya Shah
Towards Data Science


Example of Time Series Data; Photo by Isaac Smith on Unsplash

Hello!

A univariate time series roughly refers to a sequence that has a value (of some desired quantity) at each equally spaced, ordered time instant. In the multivariate case, there is more than one value (of different desired quantities) at each time instant. This article focuses on univariate time series, which will simply be referred to as time series from here on.

Examples of time series data include the monthly amount of rainfall in a particular region and the daily maximum temperature of a specific city, both over the past few years. As is evident from these two examples, obtaining accurate predictions for a given time series can be extremely useful. These predictions allow us to have a peek into the future that can help save countless lives.

AR models can be used to obtain predictions for a time series. This article describes the process of parameter estimation for such models using Bayesian Inference. For those unfamiliar with this technique, I will provide a brief description. The user provides the data along with prior distributions on the parameters that are to be estimated. These prior distributions, as the name suggests, encode the beliefs that the user may have about the parameters before seeing the data. Based on these inputs, the model outputs the distribution of the parameters given the data. This distribution, called the posterior distribution, can then be used to obtain predictions for unseen data, which in our case means the future values of our time series.

This was an extremely brief description of Bayesian Inference, which does not quite do justice to its vast and exciting applications. However, it will suffice for the purpose of this article, though I strongly encourage you to further read up on this topic. Don’t worry if you feel confused right now. Things will get clearer as we move on :).

I am currently contributing to Turing.jl under Julia Season of Contributions (JSoC) 2019, and this article describes one of the models that I implemented as part of JSoC. Turing is a Probabilistic Programming Language (PPL) written in Julia that makes defining models written on paper a seamless process. Moreover, it helps in easily sampling from compositional distributions.

Starting off with an AR(p) Model

The AR(p) model refers to an autoregressive model with p lag terms. Autoregression means that the variable is linearly regressed against its own past values. The lag parameter p determines how many past values are used at each instant. For a given time series denoted by x, we get the following forecasting equation for an AR(p) model:

xₜ = α + β₁xₜ₋₁ + β₂xₜ₋₂ + … + βₚxₜ₋ₚ + εₜ

where εₜ denotes white noise. Now, with this basic information, let’s have a look at the code.

Importing Libraries

We begin by importing the required libraries. We will use Turing.jl, a Probabilistic Programming Language in Julia. From defining the model to getting samples from the posterior distribution, it will make the entire process smooth and effortless.
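The original import cell is not reproduced here; a minimal set of packages that covers the steps below might look like this (package names other than Turing are assumptions based on what the article does with them):

```julia
using Turing          # probabilistic programming and MCMC sampling
using CSV, DataFrames # reading the shampoo sales dataset into a table
using Plots           # plotting the series and the predictions
using StatsPlots      # chain visualisations such as plot(chain) and corner(chain)
using StatsBase       # autocor and pacf for the ACF/PACF plots
using Random

Random.seed!(42)      # assumed seed, only for reproducibility of the sketches below
```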

Loading and Visualising the Dataset

Let us first load in the dataset. We will use a Shampoo Sales Dataset that can be downloaded from here¹. The original dataset is credited to Makridakis, Wheelwright and Hyndman (1998)². It contains monthly sales of a shampoo over a period of 3 years. The code for this part and the graph obtained on plotting this dataset are as follows:
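A sketch of that step, assuming the dataset has been saved locally as shampoo.csv with a Sales column (the file and column names are assumptions, not taken from the article):

```julia
# Read the CSV and pull out the sales values as a Float64 vector.
df = CSV.read("shampoo.csv", DataFrame)
s  = Float64.(df.Sales)

# Plot the complete series of 36 monthly observations.
plot(s, xlabel = "Month", ylabel = "Sales", label = "Shampoo sales", legend = :topleft)
```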

Plot of the Complete Time Series

Splitting into Train and Test Data

We use roughly a 90:10 train-test split. Since there are a total of 36 observations (3 years of data at monthly frequency), we will predict the last 4 months (test set) using the data from the first 32 months (train set).
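A sketch of the split, assuming s holds the 36 sales values loaded above:

```julia
# Hold out the last 4 observations as the test set and train on the first 32.
train = s[1:32]
test  = s[33:36]

# Plot the training set on its own.
plot(train, xlabel = "Month", ylabel = "Sales", label = "Training set")
```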

Plot of the Training Set

Analysing ACF and PACF Plots

Correlation is a statistical measure between two quantities (say A and B) that describes how their values increase or decrease with respect to each other. If A increases as B increases (or vice versa), they are said to have a positive correlation. For example, the greater the force on an object of a given mass, the greater its acceleration. If A increases as B decreases (or vice versa), i.e. if the two quantities change in opposite directions, they are said to have a negative correlation.

A plot showing the correlation of a time series with its past values is called an autocorrelation function (ACF). Auto comes from the fact that we are computing the correlation of a series with the values of that same series.

The partial autocorrelation between xₜ (the element of the time series at time index t) and its lag xₜ₋ₚ is the correlation between these two quantities after removing the effect of the correlations of xₜ and xₜ₋ₚ with each of xₜ₋₁, xₜ₋₂, …, xₜ₋₍ₚ₋₁₎, i.e. with the lags shorter than p. A plot showing the partial autocorrelations of a time series with its past values is called a partial autocorrelation function (PACF).

Shown below are the ACF and PACF plots for our training data:
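One way to compute and plot them, using autocor and pacf from StatsBase (the original code may use a different package or lag range):

```julia
# Autocorrelation and partial autocorrelation of the training set up to 15 lags.
lags = 1:15
acf_vals  = autocor(train, lags)
pacf_vals = pacf(train, lags)

p1 = bar(lags, acf_vals,  title = "ACF",  xlabel = "Lag", label = "")
p2 = bar(lags, pacf_vals, title = "PACF", xlabel = "Lag", label = "")
plot(p1, p2, layout = (1, 2))
```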

ACF and PACF plots

An autoregressive model is marked by the following observations in the ACF and PACF plots:

  • Positive autocorrelation at lag 1
  • PACF cuts off sharply at some lag k. This value of k is equal to the value of p that we should use in our AR(p) model.
  • ACF plot gradually reducing to 0

We can see that our ACF and PACF plots satisfy the conditions in the above three points. Moreover, the value of p that we obtain for our AR(p) model is p = 2, since the PACF plot cuts off at the second lag. We move on to defining the AR(2) model.

Model Definition

We now define the AR(p) model using the forecasting equation discussed earlier. We take p = 2, which is obtained with the help of ACF and PACF plots. We can define the model quite easily with Turing’s intuitive interface:
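A sketch of what that definition might look like; the prior on the white-noise scale σ is an assumption of this sketch (the article only specifies priors for α and the β coefficients), and the original code may structure the model differently:

```julia
@model function AR2(x)
    N = length(x)

    # Priors described in the text: Normal(0, 1) for the intercept α,
    # Uniform(-1, 1) for the two lag coefficients.
    α  ~ Normal(0, 1)
    β1 ~ Uniform(-1, 1)
    β2 ~ Uniform(-1, 1)

    # Assumed prior on the white-noise standard deviation (not specified in the article).
    σ ~ truncated(Normal(0, 100), 0, Inf)

    # Likelihood: each observation is centred on the AR(2) forecasting equation.
    for t in 3:N
        μ = α + β1 * x[t - 1] + β2 * x[t - 2]
        x[t] ~ Normal(μ, σ)
    end
end
```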

Here, we have assumed Uniform(-1, 1) priors for the β coefficients and a Normal(0, 1) prior for α. Now, with the model definition in place, it’s time to sample our posterior!

Sampling

We sample using the No-U-Turn Sampler (NUTS) with 5000 iterations (pulling out 5000 samples). If you are unfamiliar with Markov chain Monte Carlo sampling, you can treat it like a black box that gives out the posterior when provided with the model definition and the relevant parameters as input. For more information on using this sampler in Turing and the parameter definitions, you can refer to the docs here.
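A sketch of the sampling call for the AR2 model above; the NUTS target acceptance rate is an assumed setting, not taken from the article:

```julia
# Draw 5000 samples from the posterior over α, β1, β2 (and σ) using NUTS.
chain = sample(AR2(train), NUTS(0.65), 5000)
```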

Analysing the Sampled Parameters

Running the above code will produce an output providing basic information about the chain and summary statistics of the estimated parameters as shown below:

With just a single line of code, we can get the distribution of the sampled parameters along with a plot of their values over all iterations.

We can also view a corner plot of these sampled parameters with another line of code.
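Both come from the plotting recipes loaded with StatsPlots:

```julia
plot(chain)    # trace plots and marginal densities of the sampled parameters
corner(chain)  # pairwise scatter plots of the sampled parameters
```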

Output of: L: corner(chain), R: plot(chain)

We now have all the tools required to get that peek into the future we discussed earlier. We will first remove the warmup samples from the sampled chain. These are the samples drawn during the first few iterations, before the chain has converged to the posterior distribution, so they are discarded. We then take the mean of the sampled values of each parameter as a point estimate of that parameter.
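A sketch of that step; the number of warmup samples discarded here (200) is an assumption, as the article does not state the exact figure:

```julia
# Drop an assumed 200 warmup iterations, then take posterior means as point estimates.
warmup = 200
α_samples  = vec(chain[:α])
β1_samples = vec(chain[:β1])
β2_samples = vec(chain[:β2])

α_hat  = mean(α_samples[warmup + 1:end])
β1_hat = mean(β1_samples[warmup + 1:end])
β2_hat = mean(β2_samples[warmup + 1:end])
```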

It is these point estimates that we will plug back into the forecasting equation to get the predictions. Note that for predicting the first few elements, we will have to use elements from the end of the training set (exactly how many depends on the value of p). This is because we need the previous lagged values of the time series to make the prediction for the current value. For instance, in the AR(2) model that we are using, the first two predictions will use values from the training set. Thus, we can compute the predicted time series s_pred as shown in the code below:
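A sketch under the assumptions above; note that it forecasts recursively, feeding each prediction back in as a lag, whereas the original code may instead use the observed test values for the lags:

```julia
# Predict the 4 held-out values one step at a time. The first prediction uses the
# last two training observations; later ones reuse earlier predictions as lags.
history = copy(train)
s_pred  = Float64[]
for _ in 1:length(test)
    x_next = α_hat + β1_hat * history[end] + β2_hat * history[end - 1]
    push!(s_pred, x_next)
    push!(history, x_next)
end
```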

Plotting the predictions along with the test set (original data):
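For example (axis labels and styling assumed):

```julia
# Overlay the point predictions on the held-out test values.
plot(33:36, test, xlabel = "Month", ylabel = "Sales", label = "Observed (test set)")
plot!(33:36, s_pred, label = "Predicted", linestyle = :dash)
```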

We can see that out of the four values to be predicted, only one was close while the other three were pretty inaccurate. A reason for this could be that this time series is not covariance stationary. We can improve upon this shortcoming by using more complex ARIMA models, which use differencing to first make the series stationary and then model the differenced series using autoregressive and moving average terms.

This concludes my article on Bayesian Inference for AR(p) models. I would like to thank Cameron Pfiffer for providing great help and guidance during the course of this implementation and over the past few months. If you have any questions or doubts regarding this article, feel free to contact me at sshah@iitk.ac.in or you can tag me on the Julia slack with @Saumya Shah. Thanks for reading! :).

References

[1] Jason Brownlee, 7 Time Series Datasets for Machine Learning, Machine Learning Mastery, Available from https://machinelearningmastery.com/time-series-datasets-for-machine-learning/, accessed August 26th, 2019.

[2] Source: Time Series Data Library (citing: Makridakis, Wheelwright and Hyndman (1998))
