A Blueprint for Time Series Forecasting

An exploration of exponential smoothing and ARIMA models

Jerry Paul
Towards Data Science



As mentioned in my previous story, where I discussed how to ‘Tableau’ your predictions, let’s get on board with time series forecasting.

Let’s start with the basics. What is a time series?

A time series represents data that is ordered over a period of time, that is to say, we have observations at equal intervals of time. Why does the order matter? Because we assume that what happens today depends on what happened yesterday, and so on. Since the chronology is important here, it isn’t right to treat a time series as a collection of independent observations. This implies that we cannot build a random sampling distribution from time series data.

Why would one use time series forecasting instead of predictive modeling techniques like regression? Hmm.

Sometimes, we may not have enough information (predictors) available to predict the future (target). Or we may not be able to establish a significant correlation between the available predictors and our target variable.

So unlike predictive modeling, where you have predictors for explaining your target variable, time series forecasting relies on the target variable itself from previous time periods (also called lags) for making predictions.

Illustration:

Predictive modelling: Revenue = 1.78*Profit + 2.5 (This is a simple linear regression equation that predicts revenue from profit information.) Assume we do not have profit information. We turn to time series.

Time series model: Revenue(t+1) = a*Revenue(t) + a(1-a)*Revenue(t-1) + a(1-a)²*Revenue(t-2) + … (In this model, revenue depends on the previous time periods’ revenue with weights that decay as we go back in time, controlled by a smoothing parameter ‘a’.)


Let’s get started by looking at our data. We will fire up our Jupyter notebook.
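A minimal sketch of the loading step (the file name and path are my assumptions; point them at wherever your copy of the superstore data lives):

import pandas as pd

#Load the superstore dataset; the file name is an assumption
superstore = pd.read_csv('superstore.csv', parse_dates=['Order Date'])
superstore.head()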

Top 5 records of our superstore dataset

Let’s have a quick look at our metadata.

superstore.info()
We are going to work with Sales and Order Date for time series

We are safe to use Order Date and Sales without having to go for any missing value treatment. Let’s build our time series.
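A minimal sketch of that aggregation, assuming the superstore frame loaded above; resampling with 'MS' gives one month-start timestamp per month:

#Keep only Order Date and Sales, then aggregate Sales for each month
ts = superstore[['Order Date', 'Sales']].set_index('Order Date').resample('MS').sum()
ts.head()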

We are aggregating Sales for each month

Time for visualization.

There are many libraries for visualizing your data in Python; the most widely used are matplotlib and seaborn. Here I have used plotly, which renders an interactive visual. If you are interested in knowing more about plotly, go here. You can use the code below:
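The original snippet isn’t reproduced here; a minimal plotly sketch, assuming the monthly ts frame built above, could look like this:

import plotly.express as px

#Interactive line chart of monthly sales
fig = px.line(ts, x=ts.index, y='Sales', title='Monthly Superstore Sales')
fig.show()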

Alright. Let’s dig deeper into the components of a time series. We will explore sample plots to understand the concepts better.

The main components of a time series are: Trend, Seasonality, Cyclic and White Noise (error) components. Trend, seasonality and cyclicity are the systematic components that can be modeled, and the error is the random component.

There is a clear upward trend

Trend quantifies the overall upward or downward movement of the series.

Seasonality quantifies the intra-year fluctuations in your data that repeat every year. For example, umbrella sales go up during the rainy season, ice cream sales during summer, and a distribution company tends to bill more every quarter of the year; these patterns are most likely to repeat every year. Question: Will a yearly time series have seasonality? No.

Seasonality in champagne sales

A cyclic pattern exists when the data exhibits rises and falls that are not of a fixed period. It is therefore different from seasonality, where the pattern is fixed. Also, the average length of a cycle is much longer than a season and generally spans years.

Mostly exhibiting cyclic behaviour over the years without a fixed pattern

Lastly, White Noise is the random error component, assumed to follow a normal distribution with mean µ = 0 and constant variance 𝛔².

We will focus on Trend, Seasonality and White Noise. Cyclic components are complex to model (at least to my knowledge so far) and we will try to explain the model using the other three components. Many times, cyclic components are explained by trend itself.

We can combine the components in two ways :

Additive Model : Y(t) = Trend(t) + Seasonality(t) + Error(t)

Multiplicative Model : Y(t) = Trend(t) * Seasonality(t) * Error(t)

An additive model assumes more or less constant seasonality over time. A multiplicative model, on the other hand, allows the magnitude of the seasonality to change along with the trend over time. Multiplicative models are often more realistic. However, in order to work with a multiplicative model, we can log-transform our data to turn it into an additive model and then easily extract the trend and seasonality components. If you are familiar with logarithms, you should be able to relate to the equations below:

log(Y(t)) = log(Trend(t) * Seasonality(t) * Error(t))

log(Y(t)) = log(Trend(t)) + log(Seasonality(t)) + log(Error(t))

Visualization is a good start to understand the nature of our time series.

This looks like a multiplicative model
Illustration for additive model (What’s your take on the cyclic pattern? Yes or no?)

Visualizations have their own limits. We need a way to quantify and confirm our findings — enter Decomposition.

Decomposition helps us to extract and quantify the components of time series. It also helps us understand what is more significant in explaining the time series.

Let’s get back to our superstore. After looking at the series, I will go ahead with an additive model. Let’s decompose our series; I have illustrated two of the various methods used.

Classic decomposition
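A minimal sketch using statsmodels’ seasonal_decompose, again assuming the monthly ts frame from earlier:

from statsmodels.tsa.seasonal import seasonal_decompose

#Additive decomposition into trend, seasonal and residual components
decomposition = seasonal_decompose(ts['Sales'], model='additive')
decomposition.plot();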

Decomposing our time series

Observe the trend and seasonal components carefully. We can see an upward trend and more or less constant seasonality across time. Also, the residuals are centred around zero. This substantiates our assumption of an additive model.

STL — robust method

STL works only with additive models; if we have a multiplicative model, we use a log transformation first. We won’t explore that in this story.
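A minimal STL sketch on the assumed ts frame; period=12 for monthly data, and robust=True down-weights outliers:

from statsmodels.tsa.seasonal import STL

#Robust STL decomposition of the monthly sales
stl_result = STL(ts['Sales'], period=12, robust=True).fit()
stl_result.plot();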

I’d like to highlight another way to identify seasonality in particular. We can leverage month_plot to see if any months have significant repetitive patterns across the years.
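A minimal sketch of that call; month_plot expects a series with a monthly frequency, which the assumed ts frame has after resampling:

from statsmodels.graphics.tsaplots import month_plot

#One panel per calendar month; the horizontal line is that month's mean across years
month_plot(ts['Sales']);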

Using month_plot to identify seasonality
Seasonal behaviour observed: (Jan, Feb) have the lowest sales and (Sept, Nov, Dec) the highest

Let’s explore various models we can work with for forecasting. We will walk through exponential smoothing methods and ARIMA to forecast our time series.

Exponential Smoothing Models

In exponential smoothing, we forecast values based on weighted averages of past observations. These weights decay as we go back in time. This implies that we give more significance to recent observations. Parameters are used to calculate the weights and they take a value between 0 and 1.

Simple Exponential Smoothing

This is used when there is neither significant trend nor seasonality in the data. We simply try to smooth the level of the series by using the parameter ⍺. The level is nothing but the local mean. Mathematically, we can represent it as below:

Y(t+1) = ⍺*Y(t) + ⍺(1-⍺)*Y(t-1) + ⍺(1-⍺)²*Y(t-2) + … The weights decay as we go back in time, and ⍺ is the smoothing parameter, which can be tuned between 0 and 1.

Let’s code this.
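The notebook cell isn’t reproduced here; a minimal statsmodels sketch, where the train/test split and the fixed smoothing level are my own assumptions:

import numpy as np
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import SimpleExpSmoothing

#Hold out the last 12 months for evaluation (split size is an assumption)
train, test = ts['Sales'][:-12], ts['Sales'][-12:]

#Fit with a fixed smoothing level; drop these arguments to let statsmodels optimize it
ses_fit = SimpleExpSmoothing(train).fit(smoothing_level=0.2, optimized=False)
ses_forecast = ses_fit.forecast(len(test))

print('Root mean squared error =', np.sqrt(mean_squared_error(test, ses_forecast)))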

SES Model
#Let's check accuracy score by using RMSE
Root mean squared error = 36203.014759074365

You can see that the forecast isn’t great. You can try tuning ⍺ to see if you get a better forecast. Simple exponential smoothing will always give a flat forecast and hence should be used only to predict the next data point.

Holt’s Model or Double Exponential Smoothing

Here we consider two attributes: level and trend. This implies that we will have two smoothing parameters, ⍺ and β respectively. We apply this type of model when there is a trend but no seasonality.

Let’s jump right into the code.
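A minimal sketch, reusing the train/test split from the SES example above:

from statsmodels.tsa.holtwinters import Holt

#Level + trend; let the optimizer pick the smoothing parameters
holt_fit = Holt(train).fit()
holt_forecast = holt_fit.forecast(len(test))

print('Root mean squared error =', np.sqrt(mean_squared_error(test, holt_forecast)))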

Holt’s Linear Trend Model
#Let's check accuracy score by using RMSE
Root mean squared error = 19269.25908614305

We can see that the forecast is doing better, with a lower RMSE. The forecast also picks up a linear trend component, though still without any seasonality.

Other than the inability to capture seasonality, this method also doesn’t stabilize, or rather ‘dampen’, the trend over time. It will simply shoot higher and higher into future periods. This is not advisable, especially when we forecast for a long period. One workaround is to add a damping parameter. This is an extension to Holt’s linear trend model.
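A minimal sketch of that extension, reusing the objects above; the argument is called damped_trend in recent statsmodels versions (damped in older ones):

#Holt's model with a damped trend
holt_damped_fit = Holt(train, damped_trend=True).fit()
damped_forecast = holt_damped_fit.forecast(len(test))

print('Root mean squared error =', np.sqrt(mean_squared_error(test, damped_forecast)))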

Holt’s Linear Trend with damping
#Let's check accuracy score by using RMSE
Root mean squared error = 19646.815465033545
#You can see our RMSE has suffered to some extent due to damping. This is a trade-off that we need to consider while selecting the model.

Holt-Winter’s Model or Triple Exponential Smoothing

Let’s bring back our seasonality. This model takes into account the level, trend and seasonality in the series. Since trend and seasonality are present, Holt-Winter’s model can be additive or multiplicative. There are three smoothing parameters: ⍺, β and 𝜸 for level, trend and seasonality respectively. Recall that our superstore data had both trend and seasonal components in the decomposition plot, so we can expect this model to give us better accuracy. Also, we assumed an additive model.
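A minimal sketch, again reusing the earlier split; trend and seasonality are both set to 'add' to match our additive assumption, with a 12-month seasonal cycle:

from statsmodels.tsa.holtwinters import ExponentialSmoothing

#Additive trend + additive seasonality, seasonal cycle of 12 months
hw_fit = ExponentialSmoothing(train, trend='add', seasonal='add', seasonal_periods=12).fit()
hw_forecast = hw_fit.forecast(len(test))

print('Root mean squared error =', np.sqrt(mean_squared_error(test, hw_forecast)))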

Holt Winter’s Model
#Let's check accuracy score by using RMSE
Root mean squared error = 17660.235617816732
#Note that the model automatically applies smoothing parameters; you can manually tune them by specifying smoothing_level, smoothing_slope and smoothing_seasonal values.

You can observe that the model has captured both trend and seasonality in the plot. Our RMSE score has also improved.

In a nutshell, we have seen that our series has both trend and seasonal components, and they are stitched together in an ‘additive’ manner. Consequently, Holt-Winter’s model gave us the best results. Next we will have a look at a more sophisticated class of autoregressive models.

ARIMA model

ARIMA stands for Auto-Regressive Integrated Moving Average. Let’s break it down.

An Auto Regressive model tries to identify significant correlations between the series and its lags (previous time periods); these make up the AR(p) component. For example, an AR(1) process implies that the value at time t is a linear function of the value at time t−1 (in other words, lag 1). Mathematically, an AR(1) model will look like:

Y(t) = 𝛿 + ɸ*Y(t-1) + w(t), where w(t) is the white noise at time t, which is normally distributed with N(0,𝛔²)

But what is correlation and how is it calculated? Let’s have a look at the formula. Consider two variables x and y. The correlation is given by:

Correlation = Cov(x,y)/(𝛔x * 𝛔y)

The numerator is the covariance of x and y, and the denominator is the product of the standard deviations of x and y. The correlation coefficient, which is the result of this formula, estimates how strongly x and y are associated, taking values between -1 and 1. If you are familiar with Pearson’s r, you will find the similarity here. For an AR(p) process, x and y are nothing but Y(t) and Y(t-p).

A Moving Average term MA(q) in a time series model is a past error (multiplied by a coefficient). Each of the error terms at the corresponding lags is assumed to have a normal distribution with constant variance and mean zero. Mathematically, an MA(1) process will look like:

Y(t) = µ + w(t) + 𝛉*w(t-1), where w(t) is the white noise at time (t) which is normally distributed with N(0,𝛔²) and µ is the mean of the series

Integrated stands for the number of times the series is differenced. Differencing is the process of subtracting the previous time period’s value from the current one; doing it once gives a differencing of order one. We can do it ‘d’ times, which makes our series a differenced series of order ‘d’. But why do we difference our series? Enter the concept of stationarity, which is a mandatory criterion for autoregressive models.

So let’s understand stationarity.

A series is stationary if and only if :

a) The mean of the time series is constant — this implies that a time series with trend or seasonality is non-stationary. A trend or seasonal component will change the level of the series at different periods.

b) The variance/covariance is constant over time.

c) The correlation between time period t and time period t + m depends only on the lag m and stays constant over time — this implies that, for a lag of two months, January and November will have the same correlation as December and October.

Source: beingdatum.com

Overall, a stationary series is one where the statistical properties do not change over time.

There are a couple of ways to make a non-stationary series stationary: log transformations (to smooth out the variance), eliminating trend & seasonality after decomposition, calculating moving averages, and differencing the series. Among these, differencing, which we have already described, is the most widely used.

Let’s test our data for stationarity. We will use the Augmented Dickey-Fuller test, where the null hypothesis is that our series is not stationary. If we get a significant p-value, we reject the null hypothesis and conclude that our series is stationary. If, on the contrary, we fail to reject the null hypothesis, we difference the series and run the test again.
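A minimal sketch of the test using statsmodels’ adfuller on the assumed ts frame:

from statsmodels.tsa.stattools import adfuller

#Null hypothesis: the series has a unit root (is non-stationary)
adf_stat, p_value, used_lags, n_obs, critical_values, _ = adfuller(ts['Sales'])
print('Test statistic  :', adf_stat)
print('p-value         :', p_value)
print('Critical values :', critical_values)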

Test Statistic well over the critical value with 99% confidence

Looking at the results, we can reject our null hypothesis and say with 99% confidence that our data is stationary.

If we were to use decomposition instead of differencing, we could test the residuals for stationarity using the same method.

Let’s move on. The next step is to explore the ACF (auto-correlation function) and PACF (partial auto-correlation function). Remember we talked about AR(p) and MA(q) processes? ACF and PACF plots help us identify the orders p and q.

ACF looks for correlations with the lags of the series. A series with lag 1 is one that is shifted by one time period. The table below illustrates lagged series of order 1, 2, 3 and so on.

An illustration of lags

ACF plots the correlations between the original series and lag 1, lag 2, etc. Let’s see how to do it in Python.
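A minimal sketch; the number of lags shown is an arbitrary choice and is capped by the length of the series:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

#Spikes outside the shaded band are statistically significant
plot_acf(ts['Sales'], lags=24);
plot_pacf(ts['Sales'], lags=20);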

We can see that lag(1) is significant

Any spike outside the blue band is significant. We can see that lag(1) and lag(12) are significant. Also, we can see that there are rises and falls across lags with a gradual decay in correlations (observe every 12th lag). There is a high chance that the correlation between, say, lag(1) and lag(4) is influenced by the effects of the in-between lags — lag(2) and lag(3). Hence it is important to eliminate these effects and get a true picture — PACF plots are used for exactly that purpose.

We can see that lag(1) is significant among other shocks

The above PACF plot shows lag(1) is significant among other signals or shocks. Lag(12) is significant here as well.

How do we interpret ACF and PACF plots?

The below matrix should help us choose our best order:

Here ‘m’ is the seasonal cycle

Based on our plots, it looks like there are both seasonal and non-seasonal components. The spikes at every 12th lag denote seasonality, and we may have to apply seasonal differencing. We can iterate through different combinations and check the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) scores. Our goal is to minimize these scores for better model performance.
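The arima_model helper called below isn’t reproduced in this story, so here is a rough sketch of what such a function might look like, built on statsmodels’ ARIMA; the function name matches the calls below, but the body, the in-sample scoring and the 6-month forecast horizon are my own assumptions:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arima_model(p, d, q, steps=6):
    #Fit on the full series (no train/test split, as noted at the end of this story)
    result = ARIMA(ts['Sales'], order=(p, d, q)).fit()
    print(result.summary())
    #Score the in-sample predictions against the actuals
    pred = result.predict(start=ts.index[0], end=ts.index[-1])
    print('MAPE:', np.mean(np.abs((ts['Sales'] - pred) / ts['Sales'])))
    print('RMSE:', np.sqrt(np.mean((ts['Sales'] - pred) ** 2)))
    #Forecast the next few months for plotting
    return result.forecast(steps)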

Build ARMA, summary statistics and accuracy scores
#Call the function and fine tune parameters
# d=0 as we did not difference the series
#Try with p=1, q=1 based on our ACF and PACF plots
#Alternatively, start with (1,0,0) AR model and (0,0,1) MA model
arima_model(1,0,1) #ARMA model
ARMA model that forecasts for 6 months
#MAPE
0.6571285433084803
#RMSE
23037.553921953575

The accuracy isn’t great. We can try different combinations to see which gives us the best accuracy scores. Also, the MA(1) coefficient isn’t statistically significant; we can drop it and look at our results again. This iteration can go on until we find an optimal accuracy. Note that we haven’t included seasonality yet. This paves the way for seasonal ARIMA.

Let’s have a look at Seasonal ARIMA. We will keep the parameters for the non-seasonal ARIMA (p,d,q) and add a corresponding seasonal order (P,D,Q)m, where m is the seasonal cycle; since our data is at a monthly level and the patterns repeat every year, m=12.
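The sarimamodel helper used below isn’t shown either; a rough sketch under the assumption that it wraps statsmodels’ SARIMAX, keeps the non-seasonal order (1,0,1) from the earlier experiments, and takes the seasonal order (P,D,Q) and cycle m as arguments:

from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarimamodel(P, D, Q, m):
    #Non-seasonal order (1,0,1) is an assumption carried over from the ARMA run above
    result = SARIMAX(ts['Sales'], order=(1, 0, 1), seasonal_order=(P, D, Q, m)).fit(disp=False)
    print(result.summary())
    print('AIC:', result.aic, ' BIC:', result.bic)
    #Q-Q plot, histogram/KDE, standardized residuals and correlogram
    result.plot_diagnostics(figsize=(10, 8))
    return result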

#Since we saw significant spikes at the 12th lag in our ACF and PACF plots, we can try the following
sarimamodel(0,1,0,12)
sarimamodel(1,0,0,12)
sarimamodel(0,0,1,12)
#We need to minimize our AIC and BIC values
We can see a significant improvement in MAPE
Analyze residuals to see if they are normally distributed

We can see a significant improvement in the AIC, BIC and MAPE scores after bringing seasonality into the model. The model residuals also follow an approximately normal distribution, as explained by:

Q-Q plot: this is a scatterplot of two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a roughly straight line.

Standardized residuals: this shows that our residuals, after being standardized, are centred around zero.

Histogram and KDE: plots the distribution of the standardized residuals and compares it with a normal distribution.

Correlogram: used to see if there are any significant autocorrelations among the residuals; we don’t have any. This implies that the residuals are stationary and do not have any pattern.

Stationarity of residuals is a must for finalizing a model

However, if you observe the results, we still do not have significant coefficients and we should iterate for other combinations.

Fortunately, Python provides a ready-made function, auto_arima (part of the pmdarima package), which saves us the effort of manually identifying the optimal ARIMA order. Below is an illustration:
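A minimal sketch of that call; parameter choices such as stepwise and trace are sensible defaults here, not necessarily the exact settings used for the results shown:

import pmdarima as pm

#Search over (p,d,q)(P,D,Q)[12] combinations and keep the model with the lowest AIC
auto_model = pm.auto_arima(ts['Sales'], seasonal=True, m=12, stepwise=True, trace=True, suppress_warnings=True)
print(auto_model.summary())
auto_model.plot_diagnostics(figsize=(10, 8))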

Automatically searches for the best model that minimizes AIC
Normality of residuals

We can see that auto_arima has chosen ARIMA(1,0,1)(1,0,0)[12] as the best model. The residuals also look stationary. If you do not get statistically significant parameters, you can difference the series and run the ARIMA model again by following the same steps. On quickly running auto_arima, it gave me ARIMA(2,1,0)(1,0,0)[12]. I will leave it at that and encourage you to try differencing and modeling.

#Differencing (first order); 'data' here is presumably the monthly sales series
import matplotlib.pyplot as plt
data_diff = data - data.shift()
data_diff.dropna(inplace=True)   #drop the NaN created by shifting
plt.plot(data_diff)
Differenced series

Note that auto_arima doesn’t always do well at capturing seasonality. Hence simply relying on auto_arima will not necessarily yield the best model. We have to carefully examine the ACF and PACF plots, understand seasonal effects if any, run auto_arima for further experimentation, and apply business domain knowledge to finalize the model parameters.

I hope this gives you an idea of how to start implementing time series forecasting. I’d like to highlight that we did not split the data into train and test sets for the ARIMA models. This is because we did not have the luxury of many data points to model. It is a decision that you make depending on the volume of data you can work with.

