Provide Prediction Uncertainty Estimates in Any Machine Learning Pipeline

Everyone involved in a forecasting task starts by building a model that produces point forecasts. In most prediction applications, this reflects the need for forecasts that anticipate the future as accurately as possible. Such predictions are valuable in almost every business scenario: given the expected future value of some KPIs, it is possible to plan an adequate strategy to hit the targets.
Accuracy is not everything, though. A model fitted (and properly validated) to produce future mean predictions is simply telling us what value we can expect to see in the next few days. While that is valuable information, it may not be enough for the business to make the right decision.
Prediction intervals come to our rescue and enrich our forecasting report. They tell us where we can expect to see the values in the next few days. They provide an upper and a lower limit around the point forecast (which lies in between) within which the true value is expected to fall. Prediction intervals always come with a coverage level that quantifies the uncertainty of our estimates. In other words, we expect our prediction intervals to capture the true values X% of the time in the future.
In this post, we introduce a simple yet effective methodology to make our model produce prediction intervals. The main advantage is that it works with any regression model, essentially for free.
METHODOLOGY
The method we adopt to build the prediction intervals is mainly based on residual bootstrapping. Bootstrapping is a resampling technique widely used in statistics and Machine Learning to approximate unknown quantities. Residual bootstrapping, as the name suggests, consists of sampling with replacement from the residual distribution, i.e. the raw differences between the targets and the model's predictions on our train/validation set.
For example, to obtain a 95% prediction interval for our forecasts, we have to sample with replacement from the residual distribution and extract its 0.025 and 0.975 quantiles. By simply adding these bootstrapped quantiles to the point forecasts, we obtain our prediction intervals.
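As a minimal sketch of the idea (the function name, array names, and number of resamples below are illustrative placeholders, not code from the original experiments):

```python
import numpy as np

def bootstrap_prediction_interval(point_forecasts, residuals, alpha=0.05,
                                  n_boot=10_000, seed=42):
    """Build (1 - alpha) prediction intervals by bootstrapping residuals."""
    rng = np.random.default_rng(seed)
    # Sample with replacement from the out-of-fold residual distribution.
    boot = rng.choice(residuals, size=n_boot, replace=True)
    # Extract the residual quantiles (e.g. 0.025 and 0.975 for a 95% interval).
    lower_q, upper_q = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    # Shift the point forecasts by the bootstrapped quantiles.
    return point_forecasts + lower_q, point_forecasts + upper_q
```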
As introduced above, the procedure is extremely simple, but that is no reason to undervalue it. There are some hidden pitfalls that, if not properly considered, can invalidate our work.
First of all, the residual distributions must be built on unseen data. Only then can we expect to approximate the model's out-of-sample behavior and give a reliable uncertainty interpretation to our forecasting intervals. The best practice is to fit through cross-validation and estimate the residual distributions on the validation folds, as sketched below.
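One way to obtain such out-of-fold residuals with scikit-learn is the following; the estimator, the number of folds, and the `X_train`/`y_train` arrays are placeholders, and for strictly temporal data a time-aware splitting scheme may be preferable:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Out-of-fold predictions: each residual comes from data the model did not see.
model = Ridge()
cv = KFold(n_splits=5, shuffle=False)  # no shuffling, to preserve temporal order
oof_predictions = cross_val_predict(model, X_train, y_train, cv=cv)
residuals = y_train - oof_predictions
```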
Secondly, we have to choose a model that is adequate for the hidden dynamics of the system. A wrong model will distort the bootstrapped residual statistics and produce misleading prediction intervals. This last aspect is analyzed more deeply in the artificial examples below, where we build forecasting intervals under two different regimes.
STATIONARY DATA
Given a stationary system composed of 3 variables (Y, X1, and X2), we try to predict Y one step ahead in the future, together with the corresponding prediction intervals.

We start with standard feature engineering, creating some rolling features. Then we find the best parameters for a Ridge and a RandomForest with a simple grid search. The same cross-validation strategy is used to obtain the out-of-fold residuals for the best parameter combination of each model, as sketched below.
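A hedged sketch of this step follows; the `df` DataFrame, the rolling window length, and the parameter grids are illustrative assumptions, not the exact setup used in the experiments:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Lagged and rolling features built from the raw predictors.
features = pd.concat(
    [df[["X1", "X2"]].shift(1).add_suffix("_lag1"),
     df[["X1", "X2"]].shift(1).rolling(7).mean().add_suffix("_roll7")],
    axis=1,
).dropna()
target = df["Y"].loc[features.index]

# Same cross-validation strategy for both models' grid searches.
cv = KFold(n_splits=5, shuffle=False)
searches = {
    "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=cv),
    "forest": GridSearchCV(RandomForestRegressor(random_state=0),
                           {"n_estimators": [100, 300], "max_depth": [3, None]},
                           cv=cv),
}
for name, search in searches.items():
    search.fit(features, target)
```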


The residual distributions are approximately normal and the residual autocorrelation doesn't show any significant pattern. Both are indicators of a good fit for our models.
If we want to build 95% prediction intervals for our models, we only have to compute the bootstrapped quantiles on the residual distributions. The results on the unseen test data confirm what we expect: nearly 5% of the test observations fall outside the prediction bands for both models.
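A quick way to run these diagnostics on the out-of-fold residuals (using statsmodels for the autocorrelation plot; the `residuals` array comes from the cross-validation sketch above):

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Visual checks: residuals should look roughly normal and show no autocorrelation.
fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].hist(residuals, bins=30)
axes[0].set_title("Residual distribution")
plot_acf(residuals, lags=30, ax=axes[1], title="Residual autocorrelation")
plt.tight_layout()
plt.show()
```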
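Checking the empirical coverage on the held-out test set is then straightforward; `test_forecasts`, `y_test`, and the interval helper sketched earlier are all placeholders:

```python
import numpy as np

lower, upper = bootstrap_prediction_interval(test_forecasts, residuals, alpha=0.05)
# Fraction of test observations inside the bands; for 95% intervals
# we expect this to be close to 0.95.
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Empirical coverage: {coverage:.1%}")
```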

NON-STATIONARY DATA
We replicate the same experiment in a non-stationary system composed of 4 variables (Y, X1, X2, and X3), where X3 is a random walk with a negative trend. Again, we want to predict Y one step ahead in the future, together with the corresponding prediction intervals.

We repeat the same forecasting pipeline as before, composed of feature engineering, grid-search tuning, and computation of the residual distributions.


The residual distributions are again approximately normal, but the autocorrelation reveals some problems. In the case of the RandomForest, a strong cyclical autocorrelation is a sign that the model is not capturing the dynamics properly.
The resulting 95% prediction intervals, built on the bootstrapped residual distributions, reflect the poor performance of the RandomForest. For the Ridge, again nearly 5% of the test observations fall outside the prediction bands. For the RandomForest this number is far larger, as a result of the bad fit.

This behavior is to be expected. Tree-based models are not suited to extrapolate temporal shifts in the data; a first-order differencing may solve this (see the sketch below). The takeaway is that prediction intervals are not magic: following our procedure, we obtain good uncertainty bands only if we correctly model the underlying system. Otherwise, bad models produce bad predictions, for both point and uncertainty forecasts.
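For illustration, a first-order difference of the trending feature could look like this (again assuming the hypothetical `df` DataFrame from the sketches above):

```python
# Differencing turns the random walk X3 into (approximately) stationary increments,
# which tree-based models can handle much better than a drifting level.
df["X3_diff"] = df["X3"].diff()
model_features = df[["X1", "X2", "X3_diff"]].dropna()
```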
SUMMARY
In this post, we introduced a method to produce prediction intervals as a way to provide uncertainty estimates in forecasting. The procedure is based on residual bootstrapping and can be plugged, nearly for free, into any machine learning pipeline. It is not model-dependent, and its domain of application extends to any regression task.
Keep in touch: Linkedin