
Forecast Different Levels: Introducing Scalecast Pt. 3

Directly compare models run on stationary and non-stationary data

Photo by Chor Tsang on Unsplash

Welcome back to the concluding part of the scalecast series. Part 1 introduced forecasting on one series. Part 2 scaled the approach to many series. This part shows how to model the same series on different levels and compare the results fairly and easily.

GitHub – mikekeith52/scalecast: A flexible, minimal-code forecasting object.

Note: this post was originally written to showcase the differencing and leveling features of an early version of scalecast. Automated and more sophisticated series transformations have since been introduced. See the scalecast introductory notebook. Most of what is written here is still possible with the current version of scalecast.

Why should we care about levels?

Before forecasting any time series, some care must be taken to ensure that it is stationary, meaning its mean and variance do not change over time. Otherwise, much of what you try to model will be random noise. Models such as Prophet and Silverkite deal with non-stationary data by using piecewise trend functions: Prophet uses a Bayesian regression approach, while Silverkite uses a linear model with regularization. ARIMA stands for Autoregressive Integrated Moving Average; "Integrated" in this case refers to the differencing mechanism within ARIMA that handles non-stationarity. Exponential Smoothing weighs recent observations more heavily than earlier ones. In all of these ways, non-stationary trends are handled so that accurate forecasts can be evaluated.

Other machine learning models commonly used for forecasting, such as decision trees, non-piecewise linear regressions, k-nearest neighbors, support vector machines, and neural networks, give no consideration to non-stationary data. If you feed them such data anyway, the random trend variations heavily influence the models, causing noise to be interpreted as signal and producing spurious results. You may get lucky and still obtain seemingly good forecasts, but the model is unlikely to generalize many steps into the future.

To remedy this, we can take a series’ first or second difference before feeding it to certain models; every observation in the time series then becomes the difference between itself and the previous observation. Thus, much of the noise is stripped out of the series, leaving behind the signals that can actually be modeled, such as seasonality, lagged effects, and specified covariates. The only downsides to this strategy are a loss of interpretability and more work on the back end to make the forecasts usable to others.

A good statistical test to determine whether your data is stationary is the Augmented Dickey-Fuller test. Its null hypothesis is that the data is not stationary, and if you want to be safe, that should also be your default assumption when working with time series. Assuming your data is stationary when it is not is a far more serious sin than the reverse, or at least that is what my econ teachers always told me.
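For example, here is a quick way to run the test with statsmodels (this snippet is illustrative and not part of the original script; `series` stands in for any pandas Series you want to check):

```python
from statsmodels.tsa.stattools import adfuller

# null hypothesis: the series is NOT stationary
stat, pval, *_ = adfuller(series.dropna())
print(f'ADF statistic: {stat:.3f}, p-value: {pval:.3f}')

# a small p-value (e.g. < 0.05) rejects the null; otherwise, difference
# the series and test again
stat_d, pval_d, *_ = adfuller(series.diff().dropna())
print(f'ADF p-value after first difference: {pval_d:.3f}')
```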

Run models on different levels

What do you do when you want to compare forecast results from models that take stationarity into account and others that don’t? One answer is to run all models on stationary data to be safe, but a more dynamic approach is to run some models on level data and others on differenced data. Error and accuracy metrics can then be compared based on each model’s performance at the original level of the data.

Executing this strategy can be challenging, but it is easy when using scalecast (see the full script used in the example here). First, install the package and take care of other requirements:

pip install scalecast
pip install pandas-datareader
pip install lightgbm
pip install fbprophet
pip install greykite

Next, import libraries and load data:
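A minimal sketch of that step is below; it assumes the same FRED housing-starts series (HOUSTNSA) used in part 1 and the early-version Forecaster API (the date range is illustrative):

```python
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

# pull the housing-starts series from FRED
df = pdr.get_data_fred('HOUSTNSA', start='1959-01-01', end='2021-12-31')
f = Forecaster(y=df['HOUSTNSA'], current_dates=df.index)
```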

We examined this data fairly well in part 1. We use the same preprocessing steps this time: differencing the series to make it stationary; specifying test and validation periods; and adding seasonal regressors, autoregressive terms, a time trend, and a few other variables.
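A sketch of those steps, using the early-version API, might look like the following (the exact lengths and regressors shown here are illustrative; part 1 and the linked script have the author's choices):

```python
f.set_test_length(12)        # hold out the last 12 observations for testing
f.set_validation_length(6)   # validation slice used for hyperparameter tuning
f.generate_future_dates(24)  # 24-period forecast horizon

f.add_seasonal_regressors('month', raw=False, sincos=True, dummy=True)
f.add_ar_terms(4)            # lags of the dependent variable
f.add_AR_terms((2, 12))      # seasonal lags
f.add_time_trend()

f.diff()                     # first-difference the series (early-version method)
```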

Let’s apply four models to the differenced data: K-nearest Neighbors, Support Vector Machine, Light Gradient Boosted Machine, and Multilayer Perceptron. We can also select four models that are designed to work on data whether or not it is stationary: ARIMA, Holt-Winters Exponential Smoothing, Facebook Prophet, and LinkedIn Silverkite. We apply these latter four to the data at its original level. We also add two combination models for each of the two model types, for four combination models in total.

To tune each one of these models, we place a Grids.py file in the working directory. An easy way to do that is by using:

from scalecast import GridGenerator
GridGenerator.get_example_grids()

When running your own applications, you are encouraged to manually alter the grids in this file as you see fit.
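With the grids in place, each model follows the same set_estimator/tune/auto_forecast pattern. Below is a minimal sketch for the differenced-data models and one of their combinations (estimator names follow scalecast conventions, but the call_me nickname is illustrative). The level models, ARIMA, Holt-Winters, Prophet, and Silverkite, follow the same pattern on the undifferenced series; the script linked above shows the author's exact sequence, including how the series is returned to its original level.

```python
# models run on the differenced series
for m in ('knn', 'svr', 'lightgbm', 'mlp'):
    f.set_estimator(m)
    f.tune()           # searches the grid for m found in Grids.py
    f.auto_forecast()  # re-forecasts with the best hyperparameters found

# a simple-average combination of the differenced-data models
f.set_estimator('combo')
f.manual_forecast(
    how='simple',
    models=['knn', 'svr', 'lightgbm', 'mlp'],
    call_me='avg_diff',  # illustrative nickname
)
```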

Visually compare results

The interesting part of the analysis comes when comparing results. First, for the sake of brevity, let’s select a single error/accuracy metric to trust (comparing several metrics could be more interesting, but it is outside the scope of this post). Since the forecasts were run at different levels, we should choose a metric that makes sense for all of them. I like LevelTestSetMAPE. When delivering forecasts to senior decision makers in an organization, people are usually most concerned with how far off the results are expected to be from actuals. In those situations it is appreciated when you can quickly state a percentage, so that no confusion about scale or statistics arises. Mean Absolute Percentage Error (MAPE) allows us to do that, and LevelTestSetMAPE evaluates fairly on most series you are likely to forecast. Let’s explore the plotted forecasts, ordered best to worst according to this metric:
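In scalecast, a plot like the one below can be produced with a call along these lines (the order_by and level arguments assume the early-version plotting signature):

```python
import matplotlib.pyplot as plt

# plot every forecast at the original level, ordered by level test-set MAPE
f.plot(order_by='LevelTestSetMAPE', level=True)
plt.show()
```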

Image by author

K-nearest Neighbors returned the best results (0.07 MAPE), followed by the Multilayer Perceptron (0.08 MAPE); both were run on differenced data but are shown here at the original level. The best model run on level data was the simple average combination model (0.08 MAPE); however, looking at that model visually, its forecast falls well below what looks intuitive. Let’s take a closer look at those test-set results:
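That test-set view comes from a call like the following (again assuming the early-version signature):

```python
import matplotlib.pyplot as plt

# compare test-set predictions to actuals at the original level
f.plot_test_set(order_by='LevelTestSetMAPE', include_train=False, level=True)
plt.show()
```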

Image by author

It looks like the favorable test-set metric for avg_lvl was due to the ARIMA model evaluating quite a bit above the other level models for that particular segment of data (note: this was actually due to a bug in the code, addressed in scalecast version 0.9.1). I wouldn’t trust either the ARIMA or the combination models for my final forecast; they don’t perform reliably or consistently out of sample. Sometimes you see weird results like this: some of the models were so far off in opposite directions that the combination model seemed reasonable at first glance. You have to use common sense when making final decisions.

What’s interesting is how much worse the models run on level data performed than the others. Silverkite returned a MAPE of 0.17 and Prophet 0.23. Perhaps more importantly, their results don’t look believable. Although these models have mechanisms to deal with non-stationarity, it can still be a good idea to run them on stationary data; we probably would have obtained better results had we done that. Another option would have been to spend more time on their changepoint parameters (the points in the series at which the piecewise functions change) to see whether they could pick up on changes in the series’ trend more effectively. Observing the fitted values (in-sample predictions) shows where the models run on level data may have gone wrong:
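The fitted values are available through a call like this one (the method name is from the early-version API):

```python
import matplotlib.pyplot as plt

# in-sample (fitted) values for every evaluated model
f.plot_fitted()
plt.show()
```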

Image by author

We see that Prophet wasn’t able to pick up on any changepoints in the series, following one trend throughout; it was not dynamic enough to reliably forecast this data. Silverkite also followed a single trend and wasn’t able to pick up on any of the series’ seasonality, possibly because it was over-regularized. The other two look like they might have overfit.

Although not every model evaluated on the differenced data beat the models run at the original level, differencing proved to be an important part of the process. We easily derived two very believable forecasts (KNN and MLP) simply by using stationary data.

Finally, I think it’s interesting to explore the selected hyperparameters of the best-performing models:
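One way to pull them out is through the Forecaster object’s history dictionary (a sketch that assumes the early-version attribute layout):

```python
# print the hyperparameters chosen during tuning for the two best models
for m in ('knn', 'mlp'):
    print(f'{m} HyperParams: {f.history[m]["HyperParams"]}')
```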

knn HyperParams: {'n_neighbors': 11, 'weights': 'uniform'}
mlp HyperParams: {'activation': 'relu', 'hidden_layer_sizes': (25, 25), 'solver': 'adam', 'random_state': 20}

Knowing this information serves as a starting point for obtaining even better results with future forecasting iterations.

Conclusion

In exploring these results, we saw one example of forecasting on different levels and got a better idea of the challenges that can arise when doing so. The scalecast package makes this process easy.

Thank you for following along. That concludes the series on scalecast. There is a lot this package can do, and I hope this series makes its functionality easier to access. I have written many more articles on the package since the original series concluded. Follow me for more forecasting and data science articles.
