Time Series Forecasting in Real Life: Budget forecasting with ARIMA

In-depth example on how to forecast with ARIMA

Carolina Bento
Towards Data Science


We're surrounded by phenomena that can be described by a time-series. This is a fancy way of saying that a lot of things, or events, can be described as sets of observations that happen over the course of a certain period.

It still sounds complicated, so here are a few examples of "things" that can be represented as time-series:

  • Stock prices
  • Weather conditions in specific regions
  • Electricity consumption in a household
  • Heart rate monitoring
  • Total sales in a store

But time-series are not just things that happen over time. There are a handful of components that make them the way they are:

  1. Trend: are the values in the time-series increasingly higher or lower over time?
  2. Seasonality: is there a pattern in the dataset that is influenced by the seasons? For instance, do online sales increase during the holidays, or do ice cream sales increase in the Summer and decrease during the Winter?
  3. Cycle: can we see a change pattern that is independent of the time frame, i.e., one that is not attached to a certain season?
  4. Irregularity: what other random factors may affect the data?

If we understand these components, and have a big enough dataset, we can use past observations, i.e. historical data, and what other information we know about the time-series to predict how it is going to behave in the future. Like a weather forecast, or the sales volume forecast for next month.

New Year's resolutions are a big deal, and because this year is just starting, it's the perfect time to set goals.

Your New Year's resolution is to be more financially conscious, so you decided to create a monthly budget. Problem is, you don't quite know where to draw the line.

Sure, monthly expenses are not always constant, but some patterns might emerge throughout the year, like spending more during Christmas time or when you take a vacation.

We can think about our monthly expenses as a time-series, something that can be measured over time.

So you start digging into old bank statements to create your expenses dataset.
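If you'd rather follow along without digging through real bank statements, here's a minimal sketch of how similar dummy data could be generated; the column names, date range, and noise parameters below are all assumptions, not the article's actual data:

import numpy as np
import pandas as pd

# 5 years of monthly dates, with a mild upward trend plus random noise
dates = pd.date_range(start='2014-01-01', periods=60, freq='MS')
trend = np.linspace(2000, 2500, num=60)
noise = np.random.normal(loc=0, scale=150, size=60)

expenses = pd.DataFrame({'date': dates, 'expenses': trend + noise})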

Beautiful dummy data

Awesome! Monthly expenses ✅ Values over time ✅

Let’s forecast! Well … not so fast 😁

Slow down, until it's stationary

To build a time-series model, one that you can use to predict future values, the dataset needs to be stationary.

This means that first we need to remove any trend the series might have, such that the dataset has the following properties:

  1. Mean is constant
  2. Variance is constant, i.e., the variance at different points in the time-series is the same. This property is usually referred to as homoscedasticity
  3. Auto-covariance between two data points, one at time t1 and the other at time t2, is not dependent on time. The auto-covariance between these data points only depends on the difference between t1 and t2

Is my time-series stationary?

As with many data problems, the answer to this question is a two-step process: 1) plot the data, and 2) test your assumptions

Plotting the data

It's very important and valuable to spot-check the data and get more familiar with it before starting any analysis. You'll find it easier to spot data quality issues or outliers that should be removed or analyzed separately if you spend some time looking at the data.

Plot of the monthly expenses dataset
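A minimal sketch of that spot-check plot, assuming the dummy expenses DataFrame from the earlier sketch (the 12-month rolling mean is an extra line, not in the original figure, that helps eyeball a trend):

import matplotlib.pyplot as plt

# plot the monthly expenses alongside a 12-month rolling mean
fig, ax = plt.subplots(figsize=(15, 7))
ax.plot(expenses['date'], expenses['expenses'], label='monthly expenses')
ax.plot(expenses['date'], expenses['expenses'].rolling(window=12).mean(), label='12-month rolling mean')
ax.legend()
plt.show()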

Testing your assumptions

You might not be able to see if the dataset is stationary by simply looking at it. In this case, it's really hard to tell!

That's where the Dickey-Fuller Test can help us. It is a statistical test, where the Null Hypothesis states there is a unit root for the given series, while the alternative hypothesis states that the series is stationary.

Like in any other statistical test, we're going to reject the Null Hypothesis if the p-value is less than or equal to the significance level, which is typically 1%, 5% or 10%. Let's set our significance level at 1%, such that we reject the Null Hypothesis with 99% confidence.

For our time series to be stationary, the p-value has to be ≤ 0.01.

Testing if the time series is stationary

Which is not the case ☹️

So, we'll have to transform the dataset and perform the Dickey-Fuller test again. A common transformation in Mathematics, used because it doesn't impact the underlying properties of the data, is the log-transformation. This means we'll compute the logarithm of each data point in the time-series.

Testing if the time series is stationary, after a log-transform

We're not there yet: even after the log-transform, our time series is not stationary.

Here’s the code I used to run the Dickey-Fuller test, with the option of doing a log-transform.

# log_dataset: boolean indicating if we want to log-transform the dataset before running the Augmented Dickey-Fuller test
from math import log
from statsmodels.tsa.stattools import adfuller

def adf_test(dataset, log_dataset):
    ds = dataset
    if log_dataset:
        ds = dataset.apply(lambda x: log(x))
        ds.dropna(inplace=True)

    result = adfuller(ds)

    print('Augmented Dickey-Fuller Test')
    print('test statistic: %.10f' % result[0])
    print('p-value: %.10f' % result[1])
    print('critical values')
    for key, value in result[4].items():
        print('\t%s: %.10f' % (key, value))
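For reference, this is how it could be called, assuming the expenses DataFrame and column name from the earlier sketches:

# test the raw values first, then the log-transformed values
adf_test(expenses['expenses'], log_dataset=False)
adf_test(expenses['expenses'], log_dataset=True)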

Our time series is still not stationary

We've tested the original dataset as well as the log-transformed dataset, but our time series is still not stationary.

What other options do we have? We can apply other techniques that transform the data, without changing its properties:

  1. Differencing: subtract from each data point the value at another specific point in the series, e.g., always subtract the value of the previous period
  2. Decomposition: isolate each component of the time-series mentioned at the beginning (trend, seasonality, cycle, irregularity) and work with the residuals (there's a quick sketch right below)
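For completeness, here's a minimal sketch of decomposition using statsmodels' seasonal_decompose; the expenses series, the additive model, and the 12-month period are assumptions carried over from the earlier sketches:

from statsmodels.tsa.seasonal import seasonal_decompose

# split the series into trend, seasonal and residual components (period=12 for monthly data)
decomposition = seasonal_decompose(expenses['expenses'], model='additive', period=12)
residuals = decomposition.resid.dropna()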

In our case, we're going to try differencing the dataset.

Differencing

Since differencing is subtracting, let's keep it simple and start off by differencing each data point from the data point before it, i.e., differencing consecutive values.

# data: our dataset
# column_name: column to difference
import numpy as np
import pandas as pd

n_diff_dataset = pd.DataFrame(data=np.diff(np.array(data[column_name])))
n_diff_dataset.columns = [column_name]

# dropping NaN values
n_diff_dataset.dropna(inplace=True)

Running the Dickey-Fuller test again, we see that we're now able to reject the Null Hypothesis with a significance level of 1%.

Results of Dickey-Fuller test after differencing consecutive values

The p-value is ≤ 0.01. The dataset is stationary 🎉🎉🎉

Let’s take a look at the stationary time-series.

Time series after differencing — stationary at last!

N-order differencing and seasonal differencing

Our time series is finally stationary, after differencing. But if that was not the case, we could continue differencing the time series.

N-order differencing doesn't mean you're subtracting the value from n periods before, i.e., subtracting lagged values. That's seasonal differencing.

Seasonal differencing

The name gives it away, well … a bit. If you're applying seasonal differencing to your dataset, you're subtracting a previous data point from the same season.

In our example we're dealing with monthly data, so each year will correspond to a season containing 12 months. So, calculating the seasonal difference for the month of January of any given year means subtracting from the current value the value of January of the previous year.

Generalizing, it looks something like this:

Seasonal differencing: y'_t = y_t - y_(t-n)

In the monthly expenses example one season is one year, so n=12.
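In pandas, this kind of lag-n differencing can be sketched with the diff method; the expenses series below is an assumption carried over from the earlier sketches:

# lag-12 (seasonal) differencing: each value minus the value 12 months before
seasonal_diff = expenses['expenses'].diff(periods=12)
seasonal_diff.dropna(inplace=True)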

N-order differencing

That's what we did with our dataset: we applied first-order differencing, which in practice means subtracting from each data point in the time series the data point in the period right before it, as in, lag=1.

First-order differencing: y'_t = y_t - y_(t-1)

But what if we were to keep on differencing? Well, we’d literally just keep on differencing the differences 🤯🤯🤯

In this case, looking at the math actually helps! Let’s take the example of second-order differencing, where we're differencing twice.

Second-order differencing: y''_t = y'_t - y'_(t-1) = (y_t - y_(t-1)) - (y_(t-1) - y_(t-2)) = y_t - 2y_(t-1) + y_(t-2)

By looking at the formula, it now makes more sense and it's easier to see that n-order differencing doesn't mean a lag of n periods, but actually performing the differencing operation n times.

And if you need to difference your dataset an arbitrary number of times, you can use the diff method in numpy and set the parameter n.
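For instance, a quick sketch of second-order differencing with numpy, assuming the expenses values from the earlier sketches:

import numpy as np

values = np.array(expenses['expenses'])

# differencing applied twice, i.e., differencing the differences
second_order_diff = np.diff(values, n=2)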

Now, back to forecasting 📈

Time series modeling

There are several ways you can model a time series, the most popular are:

Simple moving average

With this approach, you’re saying the forecast is based on the average of the n previous data points.

Exponential Smoothing

It exponentially decreases the weight of previous observations, such that increasingly older data points have less impact on the forecast.
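As a side note, both of these can be sketched in a couple of lines with pandas; the window size and smoothing factor below are arbitrary assumptions:

# simple moving average: the forecast for each month is the mean of the 3 previous months
moving_avg_forecast = expenses['expenses'].rolling(window=3).mean().shift(1)

# exponential smoothing: more recent observations get exponentially larger weights
exp_smooth = expenses['expenses'].ewm(alpha=0.5).mean()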

ARIMA: Auto-regressive Integrated Moving Average

This is the approach we’re going to use.

The ARIMA model can be broken down into three different components, each one with a parameter representing the characteristics of the time series.

1. Auto-regressive: AR(p)

Auto-regressive models explain random processes as linear combinations, such that the output variable depends linearly on its previous values and a random variable.

In short, it’s a model based on prior values or lags.

If you’re predicting the future price of a stock, the AR model will make that forecast, or prediction, based on the previous values of the stock.

If we look at the math, we can describe the AR(p) model with parameter p as: X_t = c + φ_1·X_(t-1) + … + φ_p·X_(t-p) + ε_t, where ε_t is a white-noise term.

The parameter p indicates the number of autoregressive terms, as in, the number of terms in your linear combination.

2. Integrated: I(d)

The name is misleading, but this actually has to do with how many times the dataset was differenced, which is indicated by the value of parameter d.

3. Moving Average: MA(q)

Similar to auto-regressive models, in moving-average models the output variable is explained linearly, but this time in terms of the past forecast errors: X_t = μ + ε_t + θ_1·ε_(t-1) + … + θ_q·ε_(t-q).

Putting it all together, the formula for ARIMA(p,d,q) looks like this, where y' is the series after differencing d times: y'_t = c + φ_1·y'_(t-1) + … + φ_p·y'_(t-p) + θ_1·ε_(t-1) + … + θ_q·ε_(t-q) + ε_t.

But now the question is how do we figure out which parameters to use?

Picking parameters for ARIMA(p,d,q)

We already know how many times we've had to difference the dataset, so the value of parameter d is 1.

But we still need to figure out the values of p and q. And for that we’re going to look for autocorrelation, AR(p), and moving average, MA(q), profiles.

Autocorrelation (AR) profile

This is done by testing the correlation of the time series with itself at different lags, i.e., at different points in time.

For that, we’ll use the Autocorrelation Function plot, ACF plot for short.

With the ACF and PACF plots, we can spot the autocorrelation (AR) profile when:

  • ACF data points are sinusoidal or exponentially decaying
  • PACF has a spike, or a few consecutive spikes, and cuts off sharply after

Moving Average (MA) profile

To determine the moving average profile we'll use a variation of the ACF, the Partial Autocorrelation Function plot, usually referred to as the PACF plot.

PACF represents the autocorrelation at different lags, but it removes the lower-order correlations, i.e., for a given lag, all the correlations at lags 1 through lag-1, because everything in between is going to be inherently correlated.

We can spot the moving average (MA) profile when we see the reverse of what was described for the AR profile:

  • PACF data points are sinusoidal or exponentially decaying
  • ACF has a spike, or a few consecutive spikes, and cuts off sharply after

On top of this, the spikes in the plot have to be statistically significant, meaning they are outside the area of the confidence interval. This confidence band is either represented by horizontal lines or an area like in an area chart, depending on the software you use.

Let’s take a look at ACF and PACF plots side by side.

Autocorrelation and partial autocorrelation plots for our dataset
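A minimal sketch of how these two plots can be produced with statsmodels, assuming the differenced dataset n_diff_dataset and column_name from the differencing snippet above:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# ACF and PACF of the stationary (differenced) series, side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
plot_acf(n_diff_dataset[column_name], ax=ax1)
plot_pacf(n_diff_dataset[column_name], ax=ax2)
plt.show()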

The ACF and PACF at lag=0 are always 1, because each data point is perfectly correlated with itself.

Analyzing the ACF plot, we can see a spike slightly outside of the confidence band, so we'll assume AR(2).

As for the PACF plot we can see the first spike at lag=2, so we’ll pick MA(2).

We have our model, ARIMA(2,1,2) 🙌

To fit the model I decided to split the dataset between training and testing subsets, using the last 30% of the observations as test data.

from statsmodels.tsa import arima_model

# split dataset between training and testing
# n_diff_dataset: the differenced dataset; column_name: the expenses column
cutoff_30pct = len(n_diff_dataset) - int(len(n_diff_dataset) * 0.3)
y_train = n_diff_dataset[column_name][:cutoff_30pct]
y_test = n_diff_dataset[column_name][cutoff_30pct:]

# building the model with the parameters we've discovered and fitting it to the training set
model = arima_model.ARIMA(y_train, order=(2, 1, 2))
arima_model_fit = model.fit(disp=-1)
print(arima_model_fit.summary())
Results for ARIMA(2,1,2)

AIC and BIC values are used to compare the quality of fit of different models, when applied to the same dataset. In our case, we’re not comparing multiple models, so we’re not going to look too much at these values.

To understand the quality of this particular model, we’ll need to use other metrics in our toolbox.

Is this model good for our dataset?

To evaluate the quality of the model we’ll first compare the forecasted values with the actual values that we set aside in the test subset.

Comparing actual values with forecasted values

From here we can see the forecasted values, in green, are a bit off compared with the actual values, in orange.

It becomes clearer when you forecast against the entire dataset.

import matplotlib.pyplot as plt

# fit the model to the entire differenced dataset
fig, ax = plt.subplots(figsize=(15, 7))
model_all = arima_model.ARIMA(n_diff_dataset[column_name], order=(2, 1, 2))
arima_all_fit = model_all.fit(disp=0)

# dynamic=False meaning in-sample lagged values are used for prediction
arima_all_fit.plot_predict(dynamic=False, ax=ax)
Forecasting against the entire dataset

Ok, we know that our forecasts are a bit off, but how off?

To get a sense of the difference between actual values and forecast, we can use the Mean Absolute Error.
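Here, forecast holds the model's predictions for the test period; a minimal sketch of how it could be obtained, assuming the older statsmodels arima_model API used above:

# forecast as many steps ahead as there are observations in the test set
forecast, stderr, conf_int = arima_model_fit.forecast(steps=len(y_test))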

from sklearn.metrics import mean_absolute_error

arima_mae = mean_absolute_error(y_test.values, forecast)
print('Mean absolute error %.2f' % arima_mae)

Our ARIMA(2,1,2) has a mean absolute error of 235.89, which means that, on average, the forecasts are 235.89 units off.

This might mean:

  • there’s not enough data to make accurate predictions
  • ARIMA parameters could be further adjusted
  • ARIMA might not be the best model for this problem, one idea is to try a simple linear regression or exponential smoothing and compare the AIC and BIC

Conclusion

In the example we've been working on, the data is randomly generated with a few tweaks to create a bit of a trend, so this result could be slightly off. The main goal of this article was to walk through the different steps of fitting an ARIMA model.

With real data, you need to look at the mean absolute error and residuals in the particular context of your problem.

It might sound a bit vague, but the context and your knowledge of the problem are very important in Data Science. If you're predicting the monthly revenue of a multi-million dollar company, being off by $235 might not be significant. On the other hand, if you're predicting a household's monthly budget, being $235 off is more worrisome.

Hope you enjoyed reading through this example, and happy forecasting 📈

Thanks for reading!
