End-to-End Time Series Analysis and Forecasting: a Trio of SARIMAX, LSTM and Prophet (Part 1)

Predicting aggregate daily energy consumption for a selected number of locations in the City of Helsinki

Bruce Nguyen
Towards Data Science


Photo by Tapio Haaja on Unsplash

In collaboration with Alex Le.

Part 2: End-to-End Time Series Analysis and Forecasting: a Trio of SARIMAX, LSTM and Prophet (Part 2) | by Son Le | Dec, 2021 | Medium

Introduction

A time series, or a series of data points indexed in time order, is a ubiquitous type of data. Economists analyze economies by looking at how they performed in the past; weather forecasts are generated partly from historical numbers, and so on. Overall, any quantitative problem with a temporal dimension will inevitably involve working with time-series data. Thus, time series analysis and forecasting have been an actively researched area, with tangible rewards promised for academics and businesses alike.

In this article, we will walk you through 3 of the most popular techniques/tools currently used to forecast a time series: a classical SARIMAX model, an LSTM neural network and Prophet. All the content will be based on our project mentioned above. Its scope is to analyze and predict the aggregate energy consumption level from a selected list of locations in the City of Helsinki. In particular, we will go through the post in this order:

  1. The Data
  2. Exploratory Data Analysis (EDA)
  3. Baseline Model
  4. SARIMAX
  5. LSTM
  6. Prophet

For the more ‘hardcore’ viewers, you can get right into the code (commented, unfortunately :) ) in our notebooks on the project website.

The Data

We fetched the dataset from www.avoindata.fi, an open data repository maintained by the Finnish Digital and Population Data Services Agency. You can find it in the Energy category, titled ‘Energy consumption data of the City of Helsinki’s utility and service properties.’ On the dataset’s page, there is a link that takes you to the API documentation.

For our project, we used three years of daily and hourly electricity consumption data, from the start of 2017 up to (but not including) 2020. We found this to be just the right amount of data for modeling: older data adds little, and the year 2020 itself has been a significant outlier in terms of energy consumption due to the pandemic. Here is what the procured raw dataset roughly looks like:

Right away, we can see that there is a lot of redundant information, even including the building addresses. To get the aggregate energy consumption, the values along the locationName feature need to be summed up. However, this process was complicated by how common missing values are across locations, which called for interpolation. Nevertheless, we arrived at the desired form of the data after some work (of course, you can see the details in the notebooks; the core idea is sketched right below):
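A minimal sketch of that aggregation step, with illustrative names (only locationName comes from the dataset; raw, timestamp and value are placeholders):

import pandas as pd

# raw: one row per location per timestamp (raw, timestamp and value are illustrative names)
raw['timestamp'] = pd.to_datetime(raw['timestamp'])

# one column per location, fill gaps per location, then sum into one aggregate daily series
wide = raw.pivot_table(index='timestamp', columns='locationName', values='value')
ts = (wide.interpolate(limit_direction='both')
          .sum(axis=1)
          .resample('D')
          .sum())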

That’s it for data preparation. Now we get to the fun part of visualizations!

Exploratory Data Analysis

A plot of the whole dataset

From a first overview of the data, we can already recognize the notable yearly (seasonal) electricity demand pattern, which peaks during winter and spring and bottoms out in summer. We can check this assumption further by looking at the data from each year.

The data behaves just as we expected, with an almost identical pattern across the three years. That said, we can see an abnormality in September 2018, where energy demand unexpectedly drops by a visible margin. We investigated this further, but with inconclusive results.

To go even further, we can zoom in on the data within every year, analyzing the patterns across the months, weeks and days.

Take the first two months of data in 2017 as an example. The repeating pattern here is very striking. By comparing it with the respective dates, it is evident that the data has a 7-day seasonality, in other words a weekly one. From the plot, we can see that energy consumption peaks during the weekdays and drops significantly over the weekends. This very much fits our expectations, since factories and offices, which consume a lot of energy, follow such a schedule.
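Reproducing that zoomed-in view takes only a couple of lines (assuming ts is the aggregated daily series):

import matplotlib.pyplot as plt

# zoom in on January and February 2017 to show the weekly cycle
ts.loc['2017-01':'2017-02'].plot(figsize=(12, 5))
plt.ylabel('kWh')
plt.show()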

Time Series Decomposition

Now that we have a better feel for the data, we will get more technical by using a statistical technique called time series decomposition. Conceptually, it works by dividing the time series into 3 components:

  • Trend-cycle: long-term increases or decreases in the data, not necessarily of a fixed frequency. Example: energy demand goes up over time as there are more people and businesses.
  • Seasonal: the pattern that occurs when a time series is affected by seasonal factors. Example: the weekly pattern we just saw in the data.
  • Residuals: the remainder after removing the two components above. Since every time series is inherently a stochastic process, there will always be random noise in the data. Example: a building may have a few broken light bulbs, or some employees are sick, and the energy level fluctuates accordingly.

In particular, we used the Loess decomposition implemented in the class STL within the package statsmodels. For clarity, we only apply the decomposition to the data from one year:

from statsmodels.tsa.seasonal import STL

# STL (Loess) decomposition of the 2017 data; seasonal=7 is the length of the seasonal smoother
stl = STL(ts2017, seasonal=7, robust=True)
res = stl.fit()
fig = res.plot()

It can be seen that the plot only further confirms our existing belief about the data. With the decomposition done, the exploratory phase comes to an end. Now, it’s modeling time!

Baseline Model

In every modeling process, there needs to be a baseline model whose results can be used to assess our primary ones. In our case, we chose to use a Linear Regression model because of its simplicity and efficiency.

Regarding the modeling procedure itself, our test set comprised all the data from December 2019. You may have already guessed that forecasting this period would be very challenging due to the holiday season. Worry not, this is precisely our intention! We want to see just how the models handle exceptional times. Besides, a little challenge only makes things more interesting, doesn't it?

Without further ado, we get the forecasts from linear regression with a few lines of code:
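The exact baseline lives in our notebooks; a minimal sketch of the idea, using ts for the aggregated daily series and purely illustrative calendar features, could look like this:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# illustrative calendar features: one-hot encoded day of week plus a linear time trend
X = pd.get_dummies(pd.Series(ts.index.dayofweek, index=ts.index), prefix='dow')
X['trend'] = np.arange(len(ts))

# hold out December 2019 as the test set
X_train, X_test = X.loc[:'2019-11-30'], X.loc['2019-12-01':]
y_train, y_test = ts.loc[:'2019-11-30'], ts.loc['2019-12-01':]

lr = LinearRegression().fit(X_train, y_train)
print('MAE:', round(mean_absolute_error(y_test, lr.predict(X_test)), 1))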

Now we know that our models need to do better than an error of roughly 90,000 kWh!

SARIMAX

To use this technique, we first need to know the basics.

What is SARIMAX?

Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors, or SARIMAX, is an extension of the ARIMA class of models. Intuitively, ARIMA models comprise two parts: the autoregressive term (AR) and the moving-average term (MA). The former views the value at one time step as a weighted sum of past values. The latter models that same value as a weighted sum of past residuals (cf. time series decomposition). There is also an integrated term (I) that differences the time series (we will discuss this further below). Since this is such a rich topic with a lot of math involved, we strongly recommend further reading to get a better understanding.
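For concreteness, the combined ARMA(p, q) part models the value y_t as

y_t = c + φ₁·y_{t−1} + … + φ_p·y_{t−p} + θ₁·ε_{t−1} + … + θ_q·ε_{t−q} + ε_t

where the φ coefficients weight past values, the θ coefficients weight past residuals, and ε_t is white noise.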

Overall, ARIMA is a very decent class of models. However, the problem with this vanilla version is that it cannot handle seasonality, which is a big weakness. Enter SARIMA, the predecessor of SARIMAX. One shorthand notation for SARIMA models is:

SARIMA(p, d, q)(P, D, Q)S

where p = non-seasonal autoregressive (AR) order, d = non-seasonal differencing, q = non-seasonal moving average (MA) order, P = seasonal AR order, D = seasonal differencing, Q = seasonal MA order, and S = length of the repeating seasonal pattern. We will use this notation from now on. By adding the seasonal AR and seasonal MA components, SARIMA solves the seasonality problem.

SARIMAX extends this framework by adding the capability to handle exogenous variables. Holidays are the go-to option, but you can also engineer your own domain-specific features if needed. In our case, we fetched the list of holidays in Finland from the package holidays:
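A minimal sketch of how such a holiday indicator can be built (the variable names here are illustrative, not the exact ones from our notebooks):

import pandas as pd
import holidays

# Finnish public holidays for the years covered by the data
fi_holidays = holidays.Finland(years=[2017, 2018, 2019])

# binary exogenous feature: 1 on a public holiday, 0 otherwise
idx = pd.date_range('2017-01-01', '2019-12-31', freq='D')
exo = pd.Series([int(day in fi_holidays) for day in idx], index=idx, name='holiday')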

The SARIMAX implementation we use in the project also comes from the package statsmodels.

Lag

Lags are simply delays in time steps within a series. Consider a time index t: the lag-1 time index with respect to t is simply t-1, lag 2 is t-2, and so on.
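In pandas, for instance, lagged copies of a series are one shift call away:

# shift(k) aligns the value from time t-k with index t
lag1 = ts.shift(1)
lag2 = ts.shift(2)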

Stationarity

A stationary time series is one whose mean, variance and autocorrelation structure do not change over time. In other words, it has no cycle/trend or seasonality. The ARMA family of models is actually built on this concept.

Source: Wikipedia

Autocorrelation function (ACF) and Partial Autocorrelation function (PACF)

Both of these functions measure how correlated the data at time t is with its past values at t-1, t-2, … There is one crucial difference, however: the ACF also includes indirect correlations up to the lag in question, while the PACF does not. In practice, their plots are vital for many tasks, especially choosing the parameters of a SARIMAX model. You can read more about how to interpret such plots here.

With the basics out of the way, we proceed to model the time series using the Box-Jenkins procedure.

Model Identification

Differencing to achieve stationarity

We start by making sure that our data is stationary. Looking at the plots we made initially, it is evident that the data is not stationary with such a clear trend and seasonality. However, we can be more scientific with our guess by employing a statistical test: the Augmented Dickey-Fuller test, also implemented in the statsmodels package.

import pandas as pd
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries, maxlag):
    # Perform the Augmented Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, maxlag=maxlag, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4],
                         index=['Test Statistic', 'p-value', '#Lags Used',
                                'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(round(dfoutput, 3))

Just as expected, the p-value is greater than 0.05. Thus, we cannot reject the null hypothesis, and the time series is not stationary. The question now is: how do we make it so? The answer in the procedure is differencing, represented by the d and D orders of the integrated term. Differencing simply means taking the difference between each data point and its lagged version. Intuitively, this is analogous to taking the derivative of a function.
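In pandas this is a one-liner; here we apply a first difference followed by a seasonal difference at lag 7, which are the orders we end up using:

# first difference to remove the trend, then a seasonal difference at lag 7
ts_diff = ts.diff(1).diff(7).dropna()
ts_diff.plot(figsize=(12, 5))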

This looks more like it! We turn to the Dickey-Fuller test again for verification.

Perfect! Now we know the parameters for the Integrated terms in SARIMAX: 1 for both seasonal and non-seasonal ones. The next step in the modeling process is finding the order for the AR and MA terms using ACF and PACF plots.

Identify p and q using ACF and PACF plots

In the plots, we find intricate autocorrelation patterns with no clear-cut interpretation. Thus, the orders of both the seasonal and non-seasonal AR and MA terms cannot be chosen decisively, which complicates our process. This demonstrates how real-life statistics can be much messier than textbook examples, and we have to make do with the uncertainty of our own choices. Fortunately, we can avoid unscientific guessing by employing an optimization method called Grid Search.
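For reference, plots like these can be produced from the differenced series with statsmodels (a sketch; ts_diff is the differenced series from above, and the number of lags shown is illustrative):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(ts_diff, lags=40, ax=axes[0])    # autocorrelation of the differenced series
plot_pacf(ts_diff, lags=40, ax=axes[1])   # partial autocorrelation
plt.show()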

Model Estimation

Grid Search

This algorithm simply conducts an exhaustive search over all combinations of parameters. The best combination is then chosen according to a criterion of our choice. In our case, we use the popular Akaike Information Criterion (AIC), as is standard in the ARMA modeling process.

import itertools
from statsmodels.tsa.statespace.sarimax import SARIMAX

def sarimax(ts, exo, all_param):
    results = []
    for param in all_param:
        try:
            mod = SARIMAX(ts,
                          exog=exo,
                          order=param[0],
                          seasonal_order=param[1])
            res = mod.fit()
            results.append((res, res.aic, param))
            print('Tried out SARIMAX{}x{} - AIC:{}'.format(
                param[0], param[1], round(res.aic, 2)))
        except Exception as e:
            print(e)
            continue
    return results

# set parameter ranges (d and D fixed to 1, weekly seasonality s = 7)
p, d, q = range(0, 3), [1], range(0, 3)
P, D, Q, s = range(0, 3), [1], range(0, 3), [7]

# list of all parameter combinations
pdq = list(itertools.product(p, d, q))
seasonal_pdq = list(itertools.product(P, D, Q, s))
all_param = list(itertools.product(pdq, seasonal_pdq))

all_res = sarimax(train, exo_train, all_param)

We end up with our top 5 models after the search:

Right away, we can use the first set of parameters. Even though the parsimony principle (the sum of the orders should be less than 6) is violated for this model, the margin is small enough, and we gain some flexibility in return.
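Pulling the best fit out of all_res is straightforward (each entry is a (results, aic, params) tuple from the grid search above):

# keep the fit with the lowest AIC as our model going forward
res, best_aic, best_param = min(all_res, key=lambda entry: entry[1])
print('Selected SARIMAX{}x{} - AIC: {}'.format(best_param[0], best_param[1], round(best_aic, 2)))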

Model Validation

Residual Diagnostics

To determine the goodness of fit of the model, we can examine its residuals under the standard assumption: they should resemble white noise, i.e. be uncorrelated and roughly normally distributed around 0.

We can check this by looking at the various plots showing the distribution of the residuals. This can be generated conveniently using the plot_diagnostics method. In addition, the Ljung-Box test can also be used to do this more precisely.

import matplotlib.pyplot as plt

# diagnostic plots: standardized residuals, histogram, Q-Q plot and correlogram
res.plot_diagnostics(figsize=(15, 12))
plt.show()

# Ljung-Box test for remaining autocorrelation in the residuals
print("Ljung-Box p-values:\n" + str(res.test_serial_correlation(method='ljungbox')[0][1]))
res.summary()

In the plots, the residuals do seem to be normally distributed around 0, which is the condition we need, albeit with slightly heavy tails. However, looking at the Ljung-Box statistics, the p-values fall below α = 0.05 for some lags from 6 onwards, so we have to reject the hypothesis that the residuals are independently distributed; some autocorrelation remains.

Nevertheless, let us use this model to predict on the test set and judge it for ourselves.
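The variables pred, pred_ci and demo_start used below come from code not shown here; a sketch of how they could be produced with statsmodels (demo_start being a date string that marks the start of the plotted window, and test/exo_test holding the December 2019 data and its holiday feature):

# in-sample predictions from demo_start onwards, with confidence intervals
pred = res.get_prediction(start=demo_start)
pred_ci = pred.conf_int()

# the test-set forecast can be obtained along the same lines
forecast = res.get_forecast(steps=len(test), exog=exo_test)
forecast_ci = forecast.conf_int()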

# plot the actual series, the predicted mean and the confidence interval
fig, ax = plt.subplots(figsize=(12, 7))
ax.set(title='Energy consumption', ylabel='kWh')
ts.plot(ax=ax, style='o')
pred.predicted_mean.plot(ax=ax, style='o')
ci = pred_ci.loc[demo_start:]
ax.fill_between(ci.index, ci.iloc[:, 0], ci.iloc[:, 1], color='r', alpha=0.1)

Overall, the in-sample prediction seems to fit the time series pretty well! There is no notable pattern of error, except that the model seems to predict the energy level during weekdays better than that of the weekends. Now, let’s see how it forecasts on the test set:

The out-of-sample prediction also looks very satisfactory! During the first two weeks of the month, the forecast values fit the actual ones well, with no systematic error. There is possibly one exception: the values on the 6th of December are wildly off. Fortunately, we know the simple reason for it: it was Independence Day in Finland.

With regard to the winter holiday season, the model unfortunately did not do as well. During that period, the forecasts consistently over-predicted electricity consumption, despite the addition of the exogenous holiday variable. Apparently, it is not only the Christmas holidays that see energy consumption drop, but the whole winter holiday period, and that information is not incorporated in the model. This explains why the error remains very high, not much of an improvement over the baseline model.

We redid the process behind the scenes with a different train-test split, and the results expectedly got much better: the error dropped to around 30,000 kWh, or less than 5%. Not bad for a class of models theorized 50 years ago, right? Nevertheless, all of this shows the challenges of forecasting during exceptional periods and emphasizes the need for even better techniques. That aside, the model still shows promise when the data behaves predictably.

Oops… this post has gotten too long already! It also seems like a good place to stop, with the whole SARIMAX workflow covered. In the next post, we will get to the 'modern' section of the Time Series Analysis textbook, so to speak, and walk through fitting the time series with LSTM and Prophet. That said, thank you very much for getting this far through walls of text :). See you next time!

Part 2: End-to-End Time Series Analysis and Forecasting: a Trio of SARIMAX, LSTM and Prophet (Part 2) | by Son Le | Dec, 2021 | Medium

P.S.: Again, for the interested readers out there craving more future-predicting, you can find all the code as well as the more technical comments in our notebooks on the project website.
