Several methods exist to find the best model specification for a time series, depending on the model being employed. For ARIMA models, a popular approach is to monitor an information criterion while searching through different AR, I, and MA orders. This has proven to be an effective technique, and popular libraries in R and Python offer auto-ARIMA implementations for users to experiment with. Similar methods can be used for other classical statistical time series methods, such as Holt-Winters Exponential Smoothing and TBATS.
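As a quick illustration of that idea, below is a minimal sketch using the pmdarima library (one of the Python auto-ARIMA options alluded to above); the wineind dataset and the monthly seasonal period are placeholders standing in for your own series.

from pmdarima import auto_arima
from pmdarima.datasets import load_wineind

y = load_wineind()  # a monthly example series bundled with pmdarima
model = auto_arima(
    y,
    seasonal=True,
    m=12,                         # monthly seasonal period (assumption for this example)
    information_criterion='aic',  # the criterion monitored while searching AR, I, and MA orders
    stepwise=True,                # stepwise search instead of a full grid
)
preds = model.predict(n_periods=12)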
For Machine Learning models, selecting a specification can be slightly more complicated, and other than complex deep learning models (such as N-Beats, N-HiTS, and a few others), there aren’t many automated pure ML methods that consistently outperform the classical models (Makridakis et al., 2020).
The Python library scalecast offers a function called auto_Xvar_select() that can be used to automatically select the best trend, seasonality, and look-back representations (or lags) for any given series using models from scikit-learn.
pip install --upgrade scalecast
The function works by first searching for the ideal representation of the given series’ trend, then its seasonality, then its look-back, each separately. "Ideal" in this context means minimizing some out-of-sample error (or maximizing R2) with a selected model (multiple linear regression, or MLR, by default). After each of these has been found separately, the ideal combination of all of the above representations is searched for, with the option to consider irregular cycles and other regressors as the user sees fit.
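Here is a minimal sketch of what that looks like in code, using a hypothetical CSV file with 'date' and 'y' columns as the data source (the M4 example further down shows a fuller configuration):

import pandas as pd
from scalecast.Forecaster import Forecaster

data = pd.read_csv('data.csv', parse_dates=['date'])  # hypothetical data file
f = Forecaster(y=data['y'], current_dates=data['date'], future_dates=24)
f.set_test_length(24)     # hold out data so the search has an out-of-sample metric to monitor
f.auto_Xvar_select()      # MLR (the default) picks trend, seasonality, and lag representations
f.set_estimator('knn')    # forecast with any supported estimator
f.manual_forecast()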
It is an interesting function. When applied to the 100,000 series from the M4 competition, it returns results with varying accuracy, depending on the series’ frequency. For the hourly frequency group, the KNN, LightGBM, and XGBoost models each obtain an OWA of under 0.6, with the representations for these models searched using the default MLR model. For context, this means that these models can be expected to outperform a naïve model with seasonal adjustments by over 40% (1 – 0.6) on average. This is a very solid result, on par with what sktime published using pure ML methods on the same series (Loning et al., 2019).
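For readers unfamiliar with OWA (Overall Weighted Average), the M4 competition computes it by dividing a model’s SMAPE and MASE by the corresponding values from the Naïve 2 benchmark and averaging the two ratios; the sketch below uses made-up numbers just to show the arithmetic.

def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    # average of the two error ratios relative to the Naive 2 benchmark
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)

print(owa(11.5, 1.6, 18.4, 2.6))  # hypothetical values -> roughly 0.62

The code used to produce the hourly results above is shown next.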
# evaluate the hourly series
# Hourly, info, and models are defined earlier in the notebook
# (the M4 hourly dataset, its metadata, and the list of model names)
from tqdm import tqdm
import pandas as pd
from scalecast.Forecaster import Forecaster

for i in tqdm(Hourly.index):
    y = Hourly.loc[i].dropna()
    sd = info.loc[i,'StartingDate']
    fcst_horizon = info.loc[i,'Horizon']
    cd = pd.date_range(
        start = sd,
        freq = 'H',
        periods = len(y),
    )
    f = Forecaster(
        y = y,
        current_dates = cd,
        future_dates = fcst_horizon,
    )
    f.set_test_length(fcst_horizon)
    f.integrate(critical_pval=.99,max_integration=1)
    f.set_validation_length(fcst_horizon)
    f.set_validation_metric('mae')
    if len(f.y) > 300:
        # longer series: search up to 48 lags and truncate the history
        f.auto_Xvar_select(
            monitor='LevelTestSetMAE',
            max_ar = 48,
            exclude_seasonalities = [
                'quarter',
                'month',
                'week',
                'day',
            ]
        )
        f.determine_best_series_length(
            monitor='LevelTestSetMAE',
            step=50,
            min_obs = 300,
        )
    else:
        # shorter series: fewer lags and an explicit weekly cycle (168 hours)
        f.auto_Xvar_select(
            monitor='LevelTestSetMAE',
            irr_cycles = [168], # weekly
            exclude_seasonalities = [
                'quarter',
                'month',
                'week',
                'day',
                'dayofweek',
            ],
            max_ar = 24,
        )
    f.tune_test_forecast(
        models,
        error='ignore',
    )
See the notebook that was used to run all models with scalecast on the M4 series here, and the notebook to evaluate each model’s performance in the same process here.
My question is whether using estimators other than the default in the auto_Xvar_select() function can lead to consistently better results. Searching for model specifications in this way can be time-consuming, but using the MLR to do it usually doesn’t slow things down too much, which is why it is the default in that function.
Unfortunately, fully researching this question would take a long time. To simplify the process, I limited the models to:
- Multiple Linear Regression (MLR)
- ElasticNet
- Gradient Boosted Trees (GBT)
- K-Nearest Neighbors (KNN)
- Support Vector Regression (SVR)
- Multi-Layer Perceptron (MLP)
Each of these models was used both to find representations and to forecast, with all combinations tried. I also used only default parameters for each model, although I believe tuning them with a grid search could significantly improve performance. Finally, I used only a sample of 50 series randomly selected from the 414 series in the hourly group. Even with these simplifications, the process took over 17 hours to run on my Windows computer. The resulting forecasts were evaluated by their respective average Symmetric Mean Absolute Percentage Error (SMAPE), one of the metrics used in the M4 competition to evaluate the models. Find the full notebook here.
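The loop below references a models collection and a results accumulator defined earlier in the notebook; my assumption of that setup for this experiment, using scalecast’s short names for the six estimators, looks roughly like this:

import pandas as pd

models = ['mlr','elasticnet','gbt','knn','svr','mlp']
# rows: model used to find representations; columns: model used to forecast
results = pd.DataFrame(0.0, index=models, columns=models)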
for i in tqdm(Hourly.index):
    y = Hourly.loc[i].dropna()
    sd = info.loc[i,'StartingDate']
    fcst_horizon = info.loc[i,'Horizon']
    cd = pd.date_range(
        start = sd,
        freq = 'H',
        periods = len(y),
    )
    f = Forecaster(
        y = y,
        current_dates = cd,
        future_dates = fcst_horizon,
    )
    f.set_test_length(fcst_horizon)
    f.integrate(critical_pval=.99,max_integration=1)
    for xvm in models:        # model used to search for representations
        for fcstm in models:  # model used to make the forecast
            f2 = f.deepcopy()
            f2.auto_Xvar_select(
                estimator = xvm,
                monitor='LevelTestSetMAE',
                max_ar = 48,
                exclude_seasonalities = [
                    'quarter',
                    'month',
                    'week',
                    'day',
                ],
            )
            f2.set_estimator(fcstm)
            if fcstm in ('mlp','gbt','xgboost','lightgbm','rf'):
                f2.proba_forecast(dynamic_testing=False)
            else:
                f2.manual_forecast(dynamic_testing=False)
            point_fcst = f2.export('lvl_fcsts')[fcstm]
            results.loc[xvm,fcstm] += metrics.smape(
                Hourly_test.loc[i].dropna().to_list(),
                point_fcst.to_list(),
            )
This led to a few interesting points that are worth observing:
- The KNN and GBT models consistently outperformed the others when it came to measuring actual forecasting accuracy, regardless of which model was used to search for the optimal series representations. This is not surprising as these were the best-performing classes of model over the entire M4 hourly series.
- The worst-performing models were SVR, ElasticNet, and MLP, and the gap between the best and worst models on average was 60.8 (!!) percentage points.
- The best models at finding the ideal series representations were ElasticNet and MLP, with only 4 percentage points separating the best from the worst models in this regard.
- The best-performing model combination on average was KNN to make the forecasts and MLP to find the series representation.
Therefore, it would appear from this experiment that the model used to make the forecast matters more for accuracy than the model used to search for the ideal representations. However, it also appears that the best models in the two roles are reversed: the models that were best at finding representations were worse at making forecasts, and vice versa. It could be that weak estimators are more reliant on finding the ideal trend, seasonality, and look-back to even have a chance at making good predictions, so it may be a good idea to pair weaker estimators for finding the best representations with stronger estimators for actually making the forecasts. But that’s just an idea, and more research would be needed to bear it out.
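If that idea holds, applying it with scalecast is straightforward; a minimal sketch on a single Forecaster object f (set up as in the earlier snippets) could look like this:

f.auto_Xvar_select(estimator='mlp')   # a weaker learner searches for trend, seasonality, and lags
f.set_estimator('knn')                # a stronger learner makes the actual forecast
f.manual_forecast()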
Conclusion
It is difficult to find a fully automated machine learning technique for forecasting that consistently outperforms classical statistical methods. Many times, the best results come from combining different ML estimators. In this case, I found that using a weaker model to find the ideal trend, seasonality, and look-back representation for a given series, and combining that representation with a stronger model to make the forecast, generally leads to the best results on the sample of hourly series I tested. All of these methods were evaluated on out-of-sample data, and the scalecast package offers a very straightforward interface for performing this kind of analysis.
Links
- Scalecast: Github / Read the Docs
- M4 models with scalecast: GitHub
- M4 model evaluation with scalecast: GitHub
- Auto Xvar Experiment with 50 series using scalecast: GitHub / Read the Docs
Works Cited
Markus Loning, Anthony J. Bagnall, Sajaysurya Ganesh, Viktor Kazakov, Jason Lines, and Franz J. Kiraly. sktime: A unified interface for machine learning with time series. CoRR, abs/1909.07872, 2019. URL http://arxiv.org/abs/1909.07872.
Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 Competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020. doi: 10.1016/j.ijforecast.2019. URL https://ideas.repec.org/a/eee/intfor/v36y2020i1p54-74.html.