
Forecasting with Stochastic Models

Using ARIMA models to Forecast Walmart Sales data

Figure 1: Photo by Drew Beamer on Unsplash

We all want to know the future. Imagine the power we’d possess if we knew what was going to happen in the future: we could alter it to get more suitable results, bet on it for financial gain, or even budget better. Although we cannot outright determine what will happen in the future, we can build somewhat of an intuition of what it may be like.

We often hear from self-improvement evangelists that we can understand how we arrived at our present point in life by reflecting on our past actions, and thereby, to some degree, predict the trajectory of our lives if we continue on a particular path. This is the essence of time-series analysis. As stated in Adhikari, R. and Agrawal, R. (2013), _An Introductory Study on Time Series Modeling and Forecasting_: "The main aim of time-series modelling is to carefully collect and rigorously study the past observations of a time-series to develop an appropriate model which describes the inherent structure of the series. This model is then used to generate future values for the series, i.e. to make forecast. Time-series forecasting thus can be termed as the act of predicting the future by understanding the past."

"The present moment is an accumulation of past decisions" – Unknown

A popular and frequently used stochastic time-series model is the ARIMA model. It assumes that the time-series is linear and follows a particular known statistical distribution, such as the normal distribution, and it builds on a family of simpler models: the Autoregressive (AR) model, the Moving Average (MA) model, and the Autoregressive Moving Average (ARMA) model, on which ARIMA is based. Before effectively applying the ARIMA model to a problem, there are some things that we should understand about our data, as you’ll come to see by the end of this post.

Things to know – The 4 main components of a time series:

Trend → The propensity of a time series to increase, decrease or stagnate over a long period of time.

Seasonality → Fluctuations within a year that are regular and predictable.

Cyclical → Medium term changes in the series that repeat in cycles.

Irregularities → Unpredictable influences that are not regular and do not repeat in a particular pattern.

Data

To download the data, click this Link and follow the instructions.

The data I will be using in this article is from the M5 Forecasting – Accuracy competition on Kaggle, which is still live at the time of writing. The competition provides competitors with hierarchical sales data – thanks to Walmart – from stores in three states (California, Texas and Wisconsin) and challenges them to forecast sales 28 days into the future. The code generated in this article is available in a Kaggle Notebook that I created, which can be found here or in the link below.

A look at the ARIMA model for forecasting

Here are the frameworks we must import to perform the task at hand.

import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import plotly.graph_objects as go 
from plotly.subplots import make_subplots
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.stattools import adfuller

I have done some preprocessing to the data to make use of its hierarchical structure.

# store the sales data columns (daily sales columns are prefixed with "d_")
d_cols = full_df.columns[full_df.columns.str.contains("d_")]
# group the sales columns by store_id and transpose so each store is a column
df = full_df.groupby(full_df["store_id"]).sum()[d_cols].T
df.head()
Figure 2: Data grouped by the store_id

The competition is evaluated on RMSSE (Root Mean Squared Scaled Error), which is derived from the MASE (Mean Absolute Scaled Error) that was designed to be scale-invariant and symmetric – you can learn more about forecast accuracy metrics here. The difference for this competition is that the A (Absolute) in MASE is replaced with S (Squared), giving the Mean Squared Scaled Error, and we take the root of this for RMSSE.
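To make the metric concrete, here is a minimal sketch of how RMSSE could be computed for a single series. The arrays y_train, y_true and y_pred are hypothetical, and the official competition score additionally weights and aggregates the per-series values across the sales hierarchy.

# minimal RMSSE sketch for one series (relies on the numpy import above)
def rmsse(y_train, y_true, y_pred):
    # denominator: mean squared one-step naive error on the training data
    naive_error = np.mean(np.diff(y_train) ** 2)
    # numerator: mean squared forecast error over the horizon
    forecast_error = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    return np.sqrt(forecast_error / naive_error)

rmsse(np.array([3, 4, 2, 5, 6]), np.array([5, 4]), np.array([4, 4]))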

Concept of Stationarity

Understanding the concept of stationarity is important as it has a high impact on the type of model that we can fit to our data to forecast future values. We refer to a time-series as stationary when its properties do not depend on the time at which the series was observed. Some criteria for stationarity are as follows:

  • Constant mean in the time-series
  • Constant variance in the time-series
  • No seasonality

Simply, a time-series that is stationary will have no predictable patterns in the long term. For the mathematicians, a random process is known to be stationary when the joint distribution remains the same over time. Let’s look at some random items from our data to see whether they are stationary.

The ARMA model is a combination of the Autoregressive and Moving Average models. This traditional approach requires the data to be stationary; however, things do not always work out as we’d expect in the real world. In fact, real-world data is much more likely to be non-stationary, hence the birth of ARIMA, which uses a clever technique called differencing to make non-stationary data stationary.

Differencing

Differencing computes the change between consecutive observations in the original series, which helps to stabilize the mean since it removes changes in the level of a series – this has the effect of eliminating (or reducing) trend and seasonality. The technique is widely used for non-stationary data such as financial and economic data. The ARIMA model adopts differencing to convert a non-stationary time-series into a stationary one. We can express the differenced series mathematically as shown in Figure 2.

Figure 2: First difference
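Written out, the first difference shown in Figure 2 takes the standard form

y'_t = y_t - y_{t-1}

where y_t is the observation at time t.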

When the differenced data does not appear to be stationary, we can difference a second time – it is almost never necessary to go past the second order in practice – which can be expressed mathematically as in Figure 3.

Figure 3: Formula for First difference of the 2nd degree.
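In the same notation, the second-order difference is simply the difference of the already-differenced series:

y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2}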

We can also take the difference between an observation and the corresponding observation from the previous season. This technique is known as seasonal differencing.

Figure 4: Formula for first degree seasonal differencing
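With m denoting the length of the season (for example, m = 7 for weekly seasonality in daily sales data), the standard form of the seasonal difference is

y'_t = y_t - y_{t-m}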

Occasionally, we may need to take both the ordinary differences (the differencing technique discussed in Figure 2, referred to as first differences, i.e. differences at lag 1) and seasonal differences to make our data stationary.

Figure 5: Formula for first differences and Seasonal difference
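In pandas, all of these operations can be expressed with the diff() method. Below is a minimal sketch using the CA_1 store column from the grouped data above; the lag-7 seasonal period is my own assumption for daily data, not something taken from the competition.

# ordinary (lag-1) first difference
first_diff = df["CA_1"].diff()
# second-order difference: difference the differenced series
second_diff = df["CA_1"].diff().diff()
# seasonal difference at lag 7 (weekly seasonality in daily data)
seasonal_diff = df["CA_1"].diff(7)
# first difference followed by a seasonal difference
combined = df["CA_1"].diff().diff(7)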

In Python, we can use visualization and/or a unit root test to determine whether differencing is required for our data – note that there are other methods of determining stationarity. There are many different unit root tests, each with different assumptions, but we will be using the Augmented Dickey-Fuller test. Below I will visualize a store; try to determine whether you think it is stationary before you look at the results of the Dickey-Fuller test.

Figure 6: Total sales per store – note that I have illuminated the store "CA_1". In the notebook you can click on whichever store you’d like to for it to be illuminated or visualize them all at the same time.
# Augmented Dickey-Fuller statistical test
def ad_fuller(timeseries: pd.DataFrame, significance_level= 0.05):

    non_stationary_cols= []
    stationary_cols= []

    for col in timeseries.columns:
        dftest= adfuller(timeseries[col], autolag="AIC")
        if dftest[1] < significance_level:
            # p-value below the significance level: reject the unit-root
            # null hypothesis, so the series is stationary
            stationary_cols.append(
                       {col:{"Test Statistic": dftest[0],
                             "p-value": dftest[1],
                             "# Lags": dftest[2],
                             "# Observations": dftest[3],
                             "Critical Values": dftest[4],
                             "Stationary": True}})
        else:
            # p-value at or above the significance level: fail to reject
            # the unit-root null hypothesis, so the series is non-stationary
            non_stationary_cols.append(
                       {col:{"Test Statistic": dftest[0],
                             "p-value": dftest[1],
                             "# Lags": dftest[2],
                             "# Observations": dftest[3],
                             "Critical Values": dftest[4],
                             "Stationary": False}})
    return non_stationary_cols, stationary_cols
non_stationary_cols, stationary_cols= ad_fuller(df[stores])
len(non_stationary_cols), len(stationary_cols)
>>>> (10, 0) 
non_stationary_cols[0]
Figure 7: Augmented Dickey-Fuller results for store CA_1.

The p-value is greater than the significance level that we set (0.05), therefore we do not reject the null hypothesis that there is a unit root in our data. In other words, our data is non-stationary – it does not meet the criteria for stationarity that we described above, hence we must do some differencing for our data to become stationary. Pandas has a handy function, DataFrame.diff(), that does this for us – you can read more in the documentation here.

# making the data stationary
# first difference of CA_1; the leading NaN is filled with the original
# first value so that a cumulative sum can recover the original series later
df["lag-1_CA_1"]= df["CA_1"].diff().fillna(df["CA_1"])

ACF and PACF plots

The ARIMA model has hyperparameters p, d and q that must be defined. Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots make determining the orders p and q of the model easier.

The ACF plot shows the autocorrelation of the time-series, meaning that we can measure the relationship between y_t and y_{t-k}. The simplest way to put this is as the coefficients of correlation between a time-series and lags of itself.

Note: "y_t" denotes y with a subscript t, i.e. the value of the series at time t.

PACF plots show a measure of the relationship between y_t and y_{t-k} after the effects of the intermediate lags are removed. If we think of correlation, it is the interdependence of variables. "Partial" correlation refers to the correlation between two variables that is not explained by their mutual correlations with a specified set of other variables. Applied to autocorrelation, it is the correlation between a time-series and a lag of itself that is not explained by correlations at lower-order lags.

This is a great resource to learn more about ACF and PACF plots.

Let’s see some of our visualizations…

_, ax= plt.subplots(1, 2, figsize= (10,8))
plot_acf(df["lag-1_CA_1"], lags=10, ax=ax[0])
plot_pacf(df["lag-1_CA_1"], lags=10, ax=ax[1])
plt.show()
Figure 8: Autocorrelation Function and Partial Autocorrelation Function for lag-1_CA_1; As stated in Identifying the orders of AR and MA terms in an ARIMA model, by mere inspection of the PACF you can determine how many AR terms you need to use to explain the autocorrelation pattern in a time series: if the partial autocorrelation is significant at lag k and not significant at any higher order lags – i.e., if the PACF "cuts off" at lag k – then this suggests that you should try fitting an autoregressive model of order k;

This suggests that we should try fitting an AR(8) model to our data, which I have done in the next section.

Lag/Backshift Notation

Lag/Backshift notation is an extremely useful notational device. Various sources use different symbols to denote it: Lag, L, or Backshift, B.

Figure 9: Backshift operator notation
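In its standard form, the backshift (or lag) operator shifts a series back by one period, and powers of it shift further back:

B y_t = y_{t-1},   B^k y_t = y_{t-k}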

The Autoregressive, AR(p), model generates forecasts by using a linear combination of past values of the variable. We can think of autoregression as a regression of the variable against itself.

Figure 10: Autoregression model (without Lag/Backshift notation)
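Written out without backshift notation, the AR(p) model takes the standard form

y_t = c + φ_1 y_{t-1} + φ_2 y_{t-2} + … + φ_p y_{t-p} + ε_t

where ε_t is white noise and the φ terms are the autoregressive coefficients.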

The Moving Average, MA(q), model, on the other hand, uses past forecast errors instead of past values in a regression-like model. Therefore, we can think of each forecast value as a weighted moving average of the past few forecast errors.

Figure 11: Moving average model (without backshift notation)
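In the same notation, the MA(q) model takes the standard form

y_t = c + ε_t + θ_1 ε_{t-1} + θ_2 ε_{t-2} + … + θ_q ε_{t-q}

where the θ terms weight the past forecast errors.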

The ARIMA model

All roads lead to this point. If we combine differencing, the autoregressive model and the moving average model, we get ARIMA(p, d, q).

Figure 12: ARIMA formulation. Source: Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on 09/06/2020
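With y'_t denoting the differenced series, the ARIMA(p, d, q) model shown in Figure 12 combines both parts:

y'_t = c + φ_1 y'_{t-1} + … + φ_p y'_{t-p} + θ_1 ε_{t-1} + … + θ_q ε_{t-q} + ε_t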

Note that it is often much easier to use lag notation to denote the ARIMA model. You can learn more about how to do this here.

p = The order of the Autoregressive part of the model

d = The degree of first differencing in our model

q = The order of the Moving average part of the model

Figure 13: Special cases of ARIMA. Source: Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on 09/06/2020
# fitting the model
model= ARIMA(df["lag-1_CA_1"], order=(8,1,0))
results= model.fit(disp=-1)
# visualizing the fitted values 
fig= go.Figure(data=
               [go.Scatter(x= df["date"],
                           y= df["lag-1_CA_1"],
                           name= "original", 
                           showlegend=True,
                           marker=dict(color="blue"))])
fig.add_trace(
    go.Scatter(x= df["date"],
               y=results.fittedvalues,
               name= "fitted values",
               showlegend= True, 
               marker=dict(color="red")))
fig.update_layout(
    title="Fitted values",
    xaxis_title="Dates",
    yaxis_title="Units Sold",
    font=dict(
        family="Arial, monospace",
        size=14,
        color="#7f7f7f"
    )
)
fig.show()
Figure 14: Fitted values from ARIMA model.

We can have a closer look…

# a closer look
_, ax= plt.subplots(figsize=(12,8))
results.plot_predict(1799, 1940, dynamic=False, ax=ax)
plt.show()
Figure 15: Closer look at Actual vs forecast

To see how we did against the actual values, we must first transform back to the original scale of the data for comparison. There is a useful cumsum() function in pandas that we can use – Documentation.

# invert the differencing with a cumulative sum and line the fitted
# values up against the actual sales, indexed by day
compare_df= pd.DataFrame({"actual": df["CA_1"],
                          "predictions": pd.Series(results.fittedvalues.cumsum(), copy=True),
                          "d": df["d"]}).set_index("d")
compare_df.loc["d_1", "predictions"]= 0

Then we plot this…

Figure 16: Actual vs Predictions of the model.
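For the competition itself we also need forecasts 28 days beyond the observed data. Below is a minimal sketch of one way to do this with the fitted results object from the older statsmodels ARIMA API; the back-transformation from the differenced scale mirrors the cumulative-sum trick above and is my own addition rather than part of the original notebook.

# forecast 28 steps ahead; the older ARIMA API returns the point
# forecasts, their standard errors and confidence intervals
forecast, stderr, conf_int = results.forecast(steps=28)
# the forecasts are on the differenced (lag-1_CA_1) scale, so undo the
# differencing by cumulatively summing from the last observed sales value
predicted_sales = df["CA_1"].iloc[-1] + np.cumsum(forecast)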

I have joined this competition a little late, but there is still sufficient time to better this result (which I will be sharing with you all).

Useful Resource: Workflow Guide

A useful flow chart for this workflow is provided by Rob Hyndman in the book Forecasting: Principles and Practice. A link to the online book is in the Other Resources section below.

Figure 16: The ARIMA flow chart. Source – Hyndman, R.J., & Athanasopoulos, G. (2018) Forecasting: principles and practice, 2nd edition, OTexts: Melbourne, Australia. OTexts.com/fpp2. Accessed on 09/06/2020

Final Word

Thank you for taking the time to read this article. I am a self-taught Data Scientist from London, England. I can be reached on LinkedIn at https://www.linkedin.com/in/kurtispykes/. Please do not hesitate to reach out to me, meeting new people is awesome.

Other Resources:

Forecasting: Principles and Practice, 2nd Edition

An Introductory Study on Time Series Modeling and Forecasting

Autoregressive Integrated Moving Average

