
Producing a point estimate is only half of a forecasting application; just as important is determining how far off the actual value is likely to be from the prediction. Few forecasts are perfectly accurate, so having a good sense of the range of possibilities becomes crucial when putting a model into production. For models with an underlying functional form, such as ARIMA, confidence intervals can be derived from the assumed distribution of the residuals and the standard errors of the estimation. These intervals behave logically: they widen the further the forecast moves from the last known value, so accumulating uncertainty is represented in a way that matches our intuition. And if the model assumptions hold, a 95% confidence interval can be expected to contain the actual value about 95% of the time.
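To make that concrete, here is a minimal sketch of pulling such a model-based interval out of statsmodels; the series and the (1,1,1) order are placeholders of my choosing, not part of the analysis below:
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
# placeholder series - any univariate pandas Series would work here
y = pd.Series(np.random.default_rng(0).normal(size=200)).cumsum()
# fit an ARIMA with an assumed (1,1,1) order and request a 95% interval
res = ARIMA(y, order=(1, 1, 1)).fit()
fcst = res.get_forecast(steps=12)
point = fcst.predicted_mean        # point estimates
bounds = fcst.conf_int(alpha=0.05) # lower/upper bounds that widen with the horizon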
Conformal Prediction
However, when dealing with a machine learning model that has no simple functional form and assumes no distribution in the underlying data, creating a sound confidence interval becomes more of a challenge. A popular solution to this problem is conformal prediction. The GitHub repository Awesome Conformal Prediction lists many great resources for diving into the idea more thoroughly.
One of these resources is mapie, a library that offers a conformal estimator that can be wrapped around scikit-learn estimators. One of the best ways I’ve seen this method applied to time series is with the tspiral library, overviewed here. However, mapie’s conformal prediction can only take you so far if the time series in question has been differenced to achieve stationarity before the scikit-learn estimator is applied. The conformal approach can be applied to the series at the differenced level, but what happens when we want to revert to the original level? If we simply undifference the confidence intervals the same way we undifference the point estimates, the resulting intervals will usually be too wide.
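For context, wrapping a scikit-learn estimator with mapie looks roughly like the sketch below. The data and estimator are placeholders, and the predict signature shown is the one from the 0.x releases of the library, so treat it as an assumption if you are on a newer version:
from mapie.regression import MapieRegressor
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# placeholder data standing in for lagged time-series features
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
mapie = MapieRegressor(LinearRegression(), cv=5) # cross-conformal estimator
mapie.fit(X[:-20], y[:-20])
# returns point predictions plus lower/upper bounds for each alpha
y_pred, y_pis = mapie.predict(X[-20:], alpha=0.05)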
That’s where scalecast can come in. It uses a type of "naive" conformal prediction, where a test set is used to find a percentile range to apply to predictions over an unknown horizon, as sketched below. When the underlying series has been differenced, both the test-set actuals and predictions are undifferenced first, and a percentile function is applied to the resulting out-of-sample residuals to find the likely coverage of the predictions. It’s not totally scientific; for instance, no effort is made to correct for autocorrelation in the residuals. But I believe it works, and by measuring its effectiveness with an empirical metric, perhaps I can convince you of the same.
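In spirit, the approach looks something like this sketch; the function and variable names are mine, not scalecast internals:
import numpy as np
def naive_conformal_interval(test_actuals, test_preds, future_preds, cilevel=0.95):
    # absolute out-of-sample residuals, measured after any undifferencing
    resids = np.abs(np.asarray(test_actuals) - np.asarray(test_preds))
    # the percentile of those residuals sets the interval's half-width
    spread = np.percentile(resids, cilevel * 100)
    future_preds = np.asarray(future_preds)
    return future_preds - spread, future_preds + spread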
MSIS
The Mean Scaled Interval Score (MSIS) is based on the interval score introduced by Gneiting & Raftery (2007) to measure the quality of prediction intervals; lower scores are better. Makridakis et al. (2020) used MSIS to evaluate the confidence intervals submitted to the M4 competition. This is what they write about it:
The following algorithm illustrates how the MSIS is estimated in practice and highlights its logic when it is used for comparing the precisions of the intervals generated by two different forecasting methods:
• A penalty is calculated for each method at the points at which the future values are outside the specified bounds. This captures the coverage rate of each method.
• The width of the prediction interval is added to the penalty, if any, to get the interval score (IS). In this respect, the methods of larger intervals are penalized over those of smaller ones, regardless of the coverage rate achieved.
• The ISs estimated at the individual points are averaged to get the mean interval score (MIS).
• MIS is scaled by dividing its value by the mean absolute seasonal difference of the series, as is done for the case of the MASE used in M4, in order to make the measure scale-independent.
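Put into code, those quoted steps amount to something like the hand-rolled function below, where m is the seasonal period and alpha the significance level (1 minus the confidence level):
import numpy as np
def msis(actuals, lower, upper, in_sample, m=12, alpha=0.05):
    actuals, lower, upper = (np.asarray(x, dtype=float) for x in (actuals, lower, upper))
    in_sample = np.asarray(in_sample, dtype=float)
    width = upper - lower                                       # interval width
    below = (2 / alpha) * (lower - actuals) * (actuals < lower) # penalty below the lower bound
    above = (2 / alpha) * (actuals - upper) * (actuals > upper) # penalty above the upper bound
    mis = np.mean(width + below + above)                        # mean interval score
    scale = np.mean(np.abs(in_sample[m:] - in_sample[:-m]))     # MASE-style seasonal scaling
    return mis / scale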
We can use MSIS to measure the effectiveness of the scalecast intervals on machine learning models and benchmark these models against a more traditional and trusted time series model with an underlying functional form – ARIMA. To make the analysis more comprehensive, we will try three different datasets:
- Daily Visitors: stationary, large, aggregated at the daily level, and fairly easy to predict
- Housing Starts: large, monthly, and will be first differenced to achieve stationarity
- Avocados: small, weekly, and will be both first differenced and first seasonally differenced
All datasets can be shared publicly. You can see the full notebook containing the analysis here.
Code Syntax
The following installations are needed to run the code:
pip install --upgrade scalecast
pip install tqdm
To be somewhat brief, I will only share the notebook code that uses the Avocados dataset. First, library imports and loading the data:
import pandas as pd
import numpy as np
from scalecast.Forecaster import Forecaster
from scalecast import GridGenerator
from scalecast.util import metrics, find_optimal_transformation
from scalecast.notebook import tune_test_forecast
from scalecast.SeriesTransformer import SeriesTransformer
import matplotlib.pyplot as plt
import seaborn as sns
import time
from tqdm.notebook import tqdm
avocados = pd.read_csv('avocado.csv',parse_dates = ['Date'])
volume = avocados.groupby('Date')['Total Volume'].sum()
Let’s split the data to make sure everything is fairly tested out-of-sample:
val_len = 20
fcst_len = 20
volume_sep = volume.iloc[-fcst_len:]
volume = volume.iloc[:-fcst_len]
This dataset is small, and after differencing and using autoregressive terms to make predictions, the number of observations to work with becomes even smaller. Therefore, we use 20 observations in the test set, and the same number in a validation set, to construct the intervals. Twenty observations is the minimum needed to reliably form the 95% confidence intervals we will be using, since the 5% of residuals expected to fall outside the bounds amounts to one observation in twenty. Let’s now create a Forecaster object:
f = Forecaster(
y = volume,
current_dates = volume.index,
future_dates = fcst_len,
test_length = val_len,
validation_length = val_len,
cis = True, # adjust the width using the cilevel attribute
)
We apply a first difference, then a first seasonal difference to the data, assuming a 52-period cycle:
transformer = SeriesTransformer(f)
f = transformer.DiffTransform(1)
f = transformer.DiffTransform(52) # seasonal differencing
We now automatically select covariates to apply to the forecasts using auto_Xvar_select():
f.auto_Xvar_select(
estimator='elasticnet',
alpha=.2,
max_ar=26,
monitor='ValidationMetricValue', # not test set
decomp_trend=False,
)
f
By calling the object instance, we see what was selected:
Forecaster(
DateStartActuals=2016-01-10T00:00:00.000000000
DateEndActuals=2017-11-05T00:00:00.000000000
Freq=W-SUN
N_actuals=96
ForecastLength=20
Xvars=['AR1', 'AR2', 'AR3', 'AR4', 'AR5', 'AR6', 'AR7', 'AR8', 'AR9']
Differenced=0
TestLength=20
ValidationLength=20
ValidationMetric=rmse
ForecastsEvaluated=[]
CILevel=0.95
CurrentEstimator=None
GridsFile=Grids
)
In this case, it chose only 9 autoregressive terms (series lags). It could have also selected trend and seasonal covariates, but the algorithm didn’t think those would improve the models’ accuracy. Since the series has already been first differenced and seasonally differenced, that is not surprising. Now, we choose our machine learning models and evaluate one forecast with each of them:
models = (
'mlr',
'elasticnet',
'ridge',
'knn',
'xgboost',
'lightgbm',
'gbt',
) # these are all scikit-learn models or APIs
tune_test_forecast(
f,
models,
dynamic_testing = fcst_len,
)
Finally, we revert the forecasts to the original series level and plot the results:
# revert differencing
f = transformer.DiffRevert(52)
f = transformer.DiffRevert(1)
fig, ax = plt.subplots(figsize=(12,6))
f.plot(ci=True,models='top_1',order_by='TestSetRMSE',ax=ax)
sns.lineplot(
y = 'Total Volume',
x = 'Date',
data = volume_sep.reset_index(),
ax = ax,
label = 'held-out actuals',
color = 'green',
alpha = 0.7,
)
plt.show()

The forecast from the top-performing model, KNN, tracks the held-out actuals fairly well. However, one model on one series never tells the whole story. Let’s look at the results from all applied models on all three of the chosen datasets.
All Results
A comparison of MSIS scores on all three time series shows the dispersion of the scores:

What do these figures actually mean? It’s hard to say. MSIS isn’t a metric many are accustomed to using, so these scores only make sense when benchmarked against more familiar approaches. Using an auto-ARIMA process, we now score the more standard intervals from the StatsModels package – the type of intervals that have an underlying functional form and assumed distribution. We can also use the same ARIMA models but apply a conformal interval to them to complete our benchmark. The final results look like this:

Cells are green where the scalecast interval outperformed the statsmodels ARIMA interval and red where it did not. The good news is that we see a lot of green, which validates the naive conformal interval approach. It’s not all good, however. The ARIMA models that received a conformal interval did not do as well as the benchmark on the whole, and only three of the seven ML models outperformed ARIMA on the Housing Starts dataset. That is all worth looking into further, but at the very least, we can be pleased that the machine learning models, applied to diverse datasets with and without differencing, usually produced better intervals than the more classical ARIMA approach.
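As a rough illustration of that benchmarking step, an auto-ARIMA interval could be produced and scored with something like the sketch below. It uses pmdarima (an extra install) for the model search, which is my stand-in here rather than the exact procedure from the notebook, together with the msis() function from the earlier sketch:
import pmdarima as pm
# fit an auto-selected seasonal ARIMA on the training portion of the series
# (a seasonal search with m=52 can be slow on weekly data)
auto_model = pm.auto_arima(volume, seasonal=True, m=52, suppress_warnings=True)
# forecast the held-out horizon with model-based 95% bounds
point, bounds = auto_model.predict(n_periods=fcst_len, return_conf_int=True, alpha=0.05)
# score the interval against the held-out actuals
score = msis(volume_sep.values, bounds[:, 0], bounds[:, 1], volume.values, m=52)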
Conclusion
This post overviewed the conformal intervals applied by the scalecast package. MSIS was used to score intervals from seven machine learning models across three diverse datasets, and those scores were then benchmarked against ARIMA. The machine-learning approach outperformed the ARIMA approach in most, but not all, instances. Thank you for following along, and be sure to give scalecast a star on GitHub!
GitHub – mikekeith52/scalecast: The practitioner’s forecasting library
References
Gneiting, Tilmann & Raftery, Adrian E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation," Journal of the American Statistical Association, 102(477), 359–378. DOI: 10.1198/016214506000001437
Makridakis, Spyros & Spiliotis, Evangelos & Assimakopoulos, Vassilios (2020). "The M4 Competition: 100,000 time series and 61 forecasting methods," International Journal of Forecasting, 36(1), 54–74.