Generating Confidence Intervals for Regression Models

An explanation and study of several model-agnostic methods

Tiago Toledo Jr.
Towards Data Science


It is very common for a data scientist to develop regression models to predict some continuous variable in their daily job. What may not be so common, especially when we are learning about regression methods, is how to define a confidence interval for a given prediction of our model.

Of course, there are models that have a built-in way of defining this interval, and when dealing with classification, most models will output a probability that can help us deal with prediction uncertainty.

But what if we have a regression model that has no built-in way of outputting that? We need a model-agnostic way of generating confidence intervals for regressors, and this is what we are going to explore in this post.

For this, we will study the paper Predictive Inference with the Jackknife+ [1], which explains several methods of generating those intervals along with their advantages and disadvantages. We will review the main methods and code them up to better consolidate the concepts.

The code for this post is also available on Kaggle and on Github.

The Naive Method

The naive method may be the first thing that comes to mind when we are trying to generate confidence intervals. The idea is to use the residuals of our model to estimate how much deviation we can expect from new predictions.

The algorithm goes as follows:

  • Train the model on the training set
  • Calculate the residuals of the predictions on the training set
  • Select the (1 - alpha) quantile of the distribution of the residuals
  • Add this quantile to and subtract it from each prediction to get the limits of the confidence interval

One expects that, since the distribution of the residuals is known, the new predictions should not deviate much from it.

However, this naive solution is problematic because our model may overfit. Even if it doesn’t, the error on the training set will most of the time be smaller than the error on the test set; after all, those points are known by the model.

This may lead to over-optimistic confidence intervals. Therefore, this method should never be used.

To better understand it, let’s code up this method. For this and every other method in this post, we will be using the Diabetes dataset [2], which is freely available in the sklearn package (including for commercial use). We will apply a Random Forest Regressor to create the predictions.

First, let’s import the required libraries:

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold

Now, let’s import the dataset, define the alpha, and split it into training and test sets:

X, y = load_diabetes(return_X_y=True)
alpha = 0.05
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.1)

We will now define a function that will generate the resulting dataset for us:

def generate_results_dataset(preds, ci):
    df = pd.DataFrame()
    df['prediction'] = preds
    # ci is the residual quantile; handle the (rare) case where it is negative
    if ci >= 0:
        df['upper'] = preds + ci
        df['lower'] = preds - ci
    else:
        df['upper'] = preds - ci
        df['lower'] = preds + ci

    return df

This function receives the predictions and the CI (the (1 - alpha) quantile of the residual distribution) and generates the lower and upper bounds for our predictions.

Now let’s generate the naive method results. For this, we will fit the regressor and calculate the training residuals, then we will get the quantile and the predictions on the test set to use our function:

rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
residuals = y_train - rf.predict(X_train)
ci = np.quantile(residuals, 1 - alpha)
preds = rf.predict(X_test)
df = generate_results_dataset(preds, ci)

One can visualize the prediction intervals, as well as the predictions and real values, using a function available in the notebook on Kaggle.
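That function is not reproduced here, but a minimal matplotlib sketch of the same idea (the sorting and styling below are my own choices, not the notebook’s) could look like this:

import matplotlib.pyplot as plt

# Sort by predicted value so the interval reads as a band
order = np.argsort(df['prediction'].values)
x = np.arange(len(order))

plt.fill_between(x, df['lower'].values[order], df['upper'].values[order], alpha=0.3, label='interval')
plt.plot(x, df['prediction'].values[order], label='prediction')
plt.scatter(x, y_test[order], s=10, c='black', label='real value')
plt.legend()
plt.show()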

The Jackknife Method

The Jackknife method tries to overcome the over-optimistic result of the naive method. To do so, it does not use the residuals from the training set; instead, it uses the residuals from leave-one-out predictions.

For those who don’t know, the leave-one-out procedure consists of training one model per data point in our dataset, each time on the entire training set with that single sample removed.

With this procedure, each residual is calculated on a leave-one-out prediction. As you can see, this is effectively an out-of-sample prediction for every point, since each model never saw that point during training.

Then, to predict the result (or the middle point of the confidence interval), one fits the last model on the entire training set and uses it to make the prediction.

Let’s see what this looks like in code:

kf = KFold(n_splits=len(y_train), shuffle=True, random_state=42)
res = []
for train_index, test_index in kf.split(X_train):
    X_train_, X_test_ = X_train[train_index], X_train[test_index]
    y_train_, y_test_ = y_train[train_index], y_train[test_index]

    rf.fit(X_train_, y_train_)
    res.extend(list(y_test_ - rf.predict(X_test_)))

Here we use the KFold class from the sklearn library with the number of splits equal to the number of instances, which generates exactly the splits of the leave-one-out procedure.

Then, for each model trained, we calculate the residual value and save it on a list.
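As a side note, sklearn also ships a LeaveOneOut class that produces exactly these splits, so the loop above could be written equivalently as (a sketch under the same setup):

from sklearn.model_selection import LeaveOneOut

res = []
for train_index, test_index in LeaveOneOut().split(X_train):
    rf.fit(X_train[train_index], y_train[train_index])
    res.extend(list(y_train[test_index] - rf.predict(X_train[test_index])))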

Now, it is time to fit the model on the entire training set and generate the results:

rf.fit(X_train, y_train)
ci = np.quantile(res, 1 - alpha)
preds = rf.predict(X_test)
df = generate_results_dataset(preds, ci)
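A quick sanity check for this (and every other method in this post) is the empirical coverage: the fraction of test targets that fall inside their intervals, which should be close to 1 - alpha:

# Fraction of test points whose true value falls inside the interval
coverage = np.mean((y_test >= df['lower']) & (y_test <= df['upper']))
print(f"Empirical coverage: {coverage:.2f} (target: {1 - alpha:.2f})")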

This method works better than the naive one; however, it still has some problems:

  • The leave-one-out procedure is highly costly and may be unfeasible in many applications
  • When the number of dimensions in the regression approaches the number of instances, this method loses coverage.

This loss of coverage is weird, isn’t it? The intuition is that the model fitted on the entire training set has seen one more point than every model used to generate the residuals, which makes the two not directly comparable.

The Jackknife+ Method

The Jackknife+ method tries to solve the loss of coverage from the Jackknife method that happens because of the model fitted on the entire training set.

To do so, it uses all the leave-one-out trained models to generate the prediction. This way, the models making the prediction for the middle point of the interval are the same ones that generated the residuals, so the two are comparable.

But how does it do that? According to the paper, the idea is to take the median of the predictions from all the models. One can, however, use a quantile of the predictions instead if the desire is to be more or less conservative.
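For reference, the interval the paper itself defines at a new test point x is built from the leave-one-out models \hat{\mu}_{-i} and their absolute residuals R_i = |y_i - \hat{\mu}_{-i}(x_i)|:

\hat{C}_{\alpha}(x) = \left[ \hat{q}^{-}_{\alpha}\{\hat{\mu}_{-i}(x) - R_i\}, \; \hat{q}^{+}_{\alpha}\{\hat{\mu}_{-i}(x) + R_i\} \right]

where \hat{q}^{-}_{\alpha} and \hat{q}^{+}_{\alpha} are the lower and upper empirical alpha-quantiles taken over i = 1, ..., n. The code below follows a simplified variant of this, reusing a single residual quantile instead of the per-model R_i.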

In practical terms, however, notice that a model will usually not change its behavior much with the loss of a single point, so the distribution of predictions tends to have little variance, which makes this method yield results very similar to the Jackknife. The advantage of this method comes from degenerate cases where one point may wildly change the model.

Let’s code that up:

from sklearn.base import clone

kf = KFold(n_splits=len(y_train), shuffle=True, random_state=42)
res = []
estimators = []
for train_index, test_index in kf.split(X_train):
    X_train_, X_test_ = X_train[train_index], X_train[test_index]
    y_train_, y_test_ = y_train[train_index], y_train[test_index]

    model = clone(rf)  # fresh copy, so previously fitted models are not overwritten
    model.fit(X_train_, y_train_)
    estimators.append(model)
    res.extend(list(y_test_ - model.predict(X_test_)))

The code here is very similar to the one we used for the Jackknife method; this time, however, we also save every model used to generate the residuals. Note the clone call: without it, the list would hold references to a single model object, and every entry would end up being the last fitted model.

Now, our prediction will be a little different, because we will need to predict with every model:

y_pred_multi = np.column_stack([e.predict(X_test) for e in estimators])

Now, let’s calculate the lower and upper bounds:

ci = np.quantile(res, 1 - alpha)
top = []
bottom = []
for i in range(y_pred_multi.shape[0]):
    if ci > 0:
        top.append(np.quantile(y_pred_multi[i] + ci, 1 - alpha))
        bottom.append(np.quantile(y_pred_multi[i] - ci, 1 - alpha))
    else:
        top.append(np.quantile(y_pred_multi[i] - ci, 1 - alpha))
        bottom.append(np.quantile(y_pred_multi[i] + ci, 1 - alpha))

And finally, let’s generate the results using the median prediction:

preds = np.median(y_pred_multi, axis=1)
df = pd.DataFrame()
df['prediction'] = preds
df['upper'] = top
df['lower'] = bottom

Note that this method does not solve the problem of the time needed to generate the confidence interval: it saves only one fitting procedure while adding the overhead of predicting with every model.

The CV+ Method

The CV+ method tries to solve the problem of the time needed to generate the intervals. It is exactly the same as the Jackknife+; however, it uses K-Fold cross-validation instead of leave-one-out.

The paper’s simulations show that this method is a little worse than the Jackknife+; however, it is much faster. In practical terms, this is probably the method that will be used in most cases.

Its implementation is the same as the Jackknife+; we just need to change the KFold to use a number of folds smaller than the length of the dataset:

kf = KFold(n_splits=5, shuffle=True, random_state=42)

This simple change will implement the CV+ method.
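Putting it all together, here is a self-contained sketch of CV+ following the simplified scheme used throughout this post (the function name cv_plus_intervals is mine, and it assumes the residual quantile ci is positive, as in the examples above):

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import KFold

def cv_plus_intervals(model, X_train, y_train, X_test, alpha=0.05, n_splits=5):
    # Fit one model per fold, collecting out-of-fold residuals and the models
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    res, estimators = [], []
    for train_index, test_index in kf.split(X_train):
        m = clone(model)  # fresh copy so each fold keeps its own fitted model
        m.fit(X_train[train_index], y_train[train_index])
        estimators.append(m)
        res.extend(list(y_train[test_index] - m.predict(X_train[test_index])))

    # One prediction per model for every test point
    y_pred_multi = np.column_stack([e.predict(X_test) for e in estimators])
    ci = np.quantile(res, 1 - alpha)  # assumed positive, as in the post
    return pd.DataFrame({
        'prediction': np.median(y_pred_multi, axis=1),
        'upper': np.quantile(y_pred_multi + ci, 1 - alpha, axis=1),
        'lower': np.quantile(y_pred_multi - ci, 1 - alpha, axis=1),
    })

With the setup above, calling it would look like df = cv_plus_intervals(RandomForestRegressor(random_state=42), X_train, y_train, X_test, alpha=alpha).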

Conclusion

These methods may be useful for many data scientists in need of prediction intervals for their regression models. They are also implemented in the open-source MAPIE library.
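As a rough sketch of what that looks like with MAPIE (based on the library’s historical API; argument names and return shapes may differ in newer releases, so check the current docs):

from mapie.regression import MapieRegressor

# method="plus" with an integer cv corresponds to CV+; cv=-1 would be Jackknife+
mapie = MapieRegressor(RandomForestRegressor(random_state=42), method="plus", cv=5)
mapie.fit(X_train, y_train)
y_pred, y_pis = mapie.predict(X_test, alpha=alpha)  # y_pis holds the lower/upper bounds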

The Jackknife+ is usually too expensive to be used in practice. However, if you already use cross-validation in your pipeline, you can apply the CV+ method with little overhead or loss of performance.

[1] Barber, R. F., Candès, E. J., Ramdas, A., & Tibshirani, R. J. (2021). Predictive inference with the jackknife+. Annals of Statistics, 49(1), 486–507. doi:10.1214/20-AOS1965.

[2] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression (with discussion). Annals of Statistics, 32(2), 407–499.
