Welcome to part 2 of the series overviewing Scalecast, a new hope for scalable, minimal-code forecasting. Part 1 overviewed forecast results from 10 models on one series; Part 3 will showcase how easily the library moves between differenced and undifferenced series. In this part, we apply 11 models each to 108 time series. In doing so, we can see which models produce the best results most often.
By the end of this post, the following will be revealed:
- The Prophet model, on average, returns the lowest error metrics
- The Linear Regression model from Scikit-learn is most frequently considered the "best" model
- By employing 11 different models, we derive better forecast results than just using 1 model for all series
This analysis uses the Avocados dataset. View scalecast on GitHub, give it a star, and install it with pip:
pip install --upgrade scalecast
Let’s begin with library imports and loading the dataset into a pandas dataframe:
import pandas as pd
import numpy as np
import pickle
from tqdm.notebook import tqdm as log_progress
from ipywidgets import widgets
from IPython.display import display, clear_output
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from scalecast.Forecaster import Forecaster
data = pd.read_csv('avocado.csv',parse_dates=['Date'])
data = data.sort_values(['region','type','Date'])
The dataset we are using measures avocado sales volumes by region across the United States, split between conventional and organic avocado types. Most of the series include complete weekly data from January 2015 through March 2018, but a few have missing dates. Therefore, we use the pandas.date_range() function to build the complete weekly date range for each series and fill missing dates with 0, assuming they are missing because there were no sales those weeks. This may not be a sound assumption, but it affects only a handful of series and works for the sake of this example. We store the resulting data into a scalecast.Forecaster() object for each series and place these objects into a dictionary. A sketch of that step is below.
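The sketch assumes the target is the 'Total Volume' column and a regular weekly (Sunday) cadence; the region and type attributes are set here only so plots can be labeled later.

# Sketch of the loading step -- 'Total Volume' as the target is an assumption
fdict = {}
for (region, typ), df in data.groupby(['region', 'type']):
    df = df.set_index('Date').sort_index()
    # build the complete weekly index and fill any missing weeks with 0 sales
    full_range = pd.date_range(start=df.index.min(), end=df.index.max(), freq='W')
    y = df['Total Volume'].reindex(full_range).fillna(0)
    f = Forecaster(y=y.values, current_dates=y.index)
    # keep region/type on the object so plots can be labeled later
    f.region = region
    f.type = typ
    fdict[f'{region}_{typ}'] = f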
This leaves us with 108 series, each with 168 weeks of historical data. We can view the last loaded series’ partial autocorrelation plot:
f.plot_pacf(diffy=True, lags=26)  # PACF of the differenced series over 26 lags
plt.title(f'{f.type} {f.region} PACF Plot')
plt.show()
This shows significant lags at 1 and 3, and then again at 22 and 23. We can add 3 autoregressive terms, but lags 22 and 23 seem arbitrary and may not work for all series. Let's instead use 26 as a seasonal lag, a half year of weekly data. Let's also add "week", "month", and "quarter" regressors and, like last time, elect the sine/cosine transformation in lieu of raw integers for these regressors. We also add a "year" variable and a time trend.
Some of the series may be stationary; some may not be. Instead of examining each one, we will trust the Augmented Dickey-Fuller test to decide whether to difference a series or keep it at its level.
For our holdout and forecast periods we use:
- A 26-week test length
- A 13-week validation length (for tuning models)
- A 52-week forecast length
In practice, the model preprocessing looks something like the sketch below.
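The method and argument names in this sketch are assumptions based on scalecast's Forecaster API and may need adjusting for your version of the library.

# Sketch of the preprocessing applied to each series; argument names are assumptions
for f in log_progress(fdict.values()):
    f.set_test_length(26)        # 26-week test set
    f.set_validation_length(13)  # 13 weeks held out for tuning
    f.generate_future_dates(52)  # 52-week forecast horizon
    f.integrate(critical_pval=0.05)  # difference only if the ADF test calls for it
    f.add_ar_terms(3)            # lags 1 through 3, suggested by the PACF
    f.add_AR_terms((1, 26))      # one seasonal lag at 26 weeks (half a year)
    f.add_seasonal_regressors('week', 'month', 'quarter', raw=False, sincos=True)
    f.add_seasonal_regressors('year')
    f.add_time_trend()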
Now, to forecast. We use the 9 models specified in the first line of the code below. We also use two combination models: a weighted average of all models and a simple average of the top-5 performing models according to the validation process. Don't forget to place a Grids.py file with your hyperparameter grids in your working directory so the tuning step can find them.
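Here is a sketch of that loop; the model nicknames and combo-model arguments are assumptions and may differ from the models actually run in the original analysis.

# Sketch of the forecasting loop -- model nicknames are assumptions and depend on Grids.py
models = ('mlr', 'knn', 'svr', 'xgboost', 'elasticnet', 'mlp', 'arima', 'prophet', 'silverkite')
for f in log_progress(fdict.values()):
    f.tune_test_forecast(models)  # tune each model on the validation slice, then test and forecast
    f.set_estimator('combo')
    # weighted average of all models, weighted by validation performance
    f.manual_forecast(how='weighted', determine_best_by='ValidationMetricValue', call_me='weighted_avg')
    # simple average of the top-5 models from the validation process
    f.manual_forecast(how='simple', models='top_5', determine_best_by='ValidationMetricValue', call_me='avg_top5')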
We now have results and there are four error/accuracy metrics available to compare for each model: MAE, RMSE, MAPE, and R2. All are available in-sample as well as on test-set data and on level and non-level results.
MAPE offers an easily interpretable metric that is good for comparing across time series. An RMSE of 1,000 can mean vastly different things depending on the range of the data being forecast; MAPE measures the percentage difference between actuals and predictions, regardless of scale. Its downside is that it cannot be evaluated when there are 0s in the actual data and can become misleading when the actuals are very small. For this reason, we elect to use the LevelTestSetMAPE metric to make comparisons, which can be evaluated fairly across all of the avocado series. It is interesting to see how results change with different metrics, however, and you are encouraged to explore that on your own.
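To make those quirks concrete, here is a small, self-contained illustration; the mape helper below is just for demonstration and is not scalecast code.

import numpy as np

def mape(actual, pred):
    # mean absolute percentage error, written out for illustration
    actual, pred = np.asarray(actual, dtype=float), np.asarray(pred, dtype=float)
    return np.mean(np.abs((actual - pred) / actual))

print(mape([100, 200, 300], [110, 190, 330]))  # ~0.08 -> off by roughly 8% on average
print(mape([2, 200, 300], [10, 190, 330]))     # ~1.38 -> one tiny actual dominates the score
print(mape([0, 200, 300], [10, 190, 330]))     # inf   -> undefined when an actual is 0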
Now that we have chosen an error metric, let's write all model statistics to a single CSV file.
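One way to do that is sketched below; the export() arguments, and the assumption that dfs='model_summaries' returns a single DataFrame of per-model metrics, are based on my reading of scalecast's API.

# Sketch: collect every Forecaster's model summaries into one CSV
summaries = []
for label, f in fdict.items():
    ms = f.export(dfs='model_summaries', determine_best_by='LevelTestSetMAPE')
    ms['Series'] = label  # tag each row with the series it came from
    summaries.append(ms)

results = pd.concat(summaries, ignore_index=True)
results.to_csv('model_summaries.csv', index=False)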
We uncover the following information (a sketch of how these figures could be derived from the exported CSV follows the list):
- The median MAPE metric across all series is 0.24
- Among models designated the "best" for any given series, the median MAPE is 0.16
- Some models performed very badly on test-set data, causing a heavy right skew in the error distribution
- On average, Prophet and Silverkite models returned the lowest MAPE
- Some models suffer from exponential trend modeling, which can cause their MAPEs to be very high; transformations can be a way around this
- The simple average combination model returned the lowest median MAPE
- The mlr model was most often designated the best on a given series (23 times)
- Every utilized model was designated the best at least three times
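The sketch below shows one way the figures above could be derived from the exported CSV; the column names ('LevelTestSetMAPE', 'ModelNickname', 'best_model') are assumptions about what scalecast's model-summaries export contains.

# Sketch: summarizing the exported results -- column names are assumptions
results = pd.read_csv('model_summaries.csv')

# median MAPE across every model/series pair
print(results['LevelTestSetMAPE'].median())

# median MAPE by model -- which model has the lowest typical error?
print(results.groupby('ModelNickname')['LevelTestSetMAPE'].median().sort_values())

# rows where the model was designated the best for its series
best = results.loc[results['ModelNickname'] == results['best_model']]
print(best['ModelNickname'].value_counts())  # how often each model "won"
print(best['LevelTestSetMAPE'].median())     # median MAPE among the winners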
Originally, this blog post reported more results in this section, but with updates to scalecast, I believe there are better ways to express those results now. Please see the new introductory notebook and its section on scaled automated forecasting for a similar example to the one shown here.
Conclusion
We applied 11 models to each of 108 series, for 1,188 total forecasts. We compared model performance and demonstrated a scalable approach to forecasting that uses minimal code, is flexible and dynamic, and produces results that are easy to examine. Thank you for following along, and look out for part 3!