Metrics for uncertainty evaluation in regression problems

How to evaluate uncertainty with Validity, Sharpness, Negative Log-Likelihood, and Continuous Ranked Probability Score (CRPS) metrics

Slava Kisilevich
Towards Data Science



Many real-world problems require uncertainty estimates for future outcomes to support better decision-making. However, most state-of-the-art machine learning algorithms produce only a single-valued prediction, usually the mean or the median of the conditional distribution, which is supposed to match the real outcome well. What a single-valued prediction cannot tell us is how confident the model is in that prediction. Therefore, more complex approaches have recently been proposed, such as NGBoost, PGBM, Bayesian methods, and Deep Learning-based models, that can estimate a conditional distribution for each data point. Since these methods estimate the conditional distribution, the question arises of how we can evaluate them. Clearly, the commonly used accuracy metrics for single-point predictions, such as RMSE or MAE, won’t do. Therefore, in this article, I show several approaches to evaluating models that estimate the conditional distribution, and how to evaluate quantile regressions, which do not estimate the conditional distribution directly but supply prediction intervals. Specifically, I evaluate three algorithms:

  • LightGBM with Quantile Regression
  • NGBoost — a gradient boosting algorithm with probabilistic prediction
  • Probabilistic Regression Neural Network that models mean and standard deviation

and use four metrics for uncertainty evaluation:

  • Validity — evaluation of the reliability of quantiles and bias in a probabilistic context
  • Sharpness — estimating the concentration of the probabilities (prediction intervals)
  • Negative Log-Likelihood (NLL) — the likelihood of the observed data given the inferred parameters of the conditional distribution
  • Continuous Ranked Probability Score (CRPS) — estimating how close the conditional distribution is to the observed point

Data

I am using the California housing dataset provided by scikit-learn. This is a small dataset with 20640 records and only 8 numerical features. The target variable is the median house value for California districts, in hundreds of thousands of dollars.

from sklearn.datasets import fetch_california_housing

california_housing = fetch_california_housing(as_frame=True)

#get the data
data = california_housing.frame
data
Data snapshot. Image by the author
#get the target
target = california_housing.target
target
Target variable snapshot. Image by the author

As usual, the data is split into train and test sets. I fit the models on the train set and evaluate them on the test set. Since the focus of this article is evaluation metrics, I do not perform any hyperparameter tuning and simply apply the default parameters of the algorithms.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42, shuffle=True)

I standardize the features for the Deep Learning solution:

from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler()
X_train_scaled = x_scaler.fit_transform(X_train)
X_test_scaled = x_scaler.transform(X_test)

Algorithms

I’ve picked three algorithms for evaluation.

LightGBM with Quantile Regression

LightGBM, a gradient boosting algorithm, is widely used in the machine learning community. Probably the most straightforward way to get prediction intervals with existing algorithms is to build at least two quantile regression models targeting a low and a high conditional quantile. For example, a 90% prediction interval requires fitting two quantile regressions at the 5% and 95% quantile levels. Fortunately, LightGBM implements quantile regression, but any other algorithm supporting quantile regression would work too.

I define 13 quantile levels and build 14 models: one that predicts the expected mean of the median prices, and 13 that predict the median prices at the given quantile levels:

import lightgbm as lgb

quantiles = [0.05, 0.1, 0.15, 0.2, 0.3,
             0.4, 0.5, 0.6, 0.7, 0.8,
             0.85, 0.9, 0.95]
#save quantile predictions
quantile_predictions = {}

train_data = lgb.Dataset(X_train,
                         label=y_train,
                         free_raw_data=False)

#first model that predicts the expected mean
params = {'objective': 'regression'}
lgb_model = lgb.train(params=params,
                      train_set=train_data,
                      num_boost_round=100)
lgb_prediction = lgb_model.predict(X_test)

#train one model per quantile level
for quantile in quantiles:
    print(f"modeling quantile {quantile}")
    params = {'objective': 'quantile', 'alpha': quantile}
    lgb_model = lgb.train(params=params,
                          train_set=train_data,
                          num_boost_round=100)
    pred = lgb_model.predict(X_test)
    quantile_predictions[quantile] = pred

NGBoost

NGBoost is a gradient-boosting algorithm that estimates parameters of the conditional (normal) distribution.

import ngboost
from ngboost import learners, distns, scores

ngb = ngboost.NGBoost(Base=learners.default_tree_learner,
                      Dist=distns.Normal,
                      Score=scores.LogScore,
                      natural_gradient=True,
                      verbose=True)
ngb.fit(X_train, y_train)

#predicted mean
ngb_mean_pred = ngb.predict(X_test)
#predicted distribution parameters
ngb_dist_pred = ngb.pred_dist(X_test)

Regression Neural Network

The Regression Neural Network estimates the parameters of the conditional (normal) distribution using the TensorFlow Probability framework.

The architecture of the neural network is built up step by step below.

The loss function is defined as the negative log-likelihood of the underlying normal distribution:

import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow.keras import layers, models, optimizers

tfd = tfp.distributions

def nll_loss(y, distr):
    return -distr.log_prob(y)

The following function receives an input consisting of two nodes, one for the mean and one for the standard deviation, and turns them into a normal distribution (the softplus transform keeps the standard deviation positive):

def model_distribution(params):
    return tfd.Normal(loc=params[:, 0:1],
                      scale=tf.math.softplus(params[:, 1:2]))

The mean and standard deviation are modeled separately.

The mean:

inputs = layers.Input(shape=(len(X_test.columns),))

hidden1 = layers.Dense(100, activation="relu", name="dense_mean_1")(inputs)
hidden2 = layers.Dense(50, activation="relu", name="dense_mean_2")(hidden1)
output_mean = layers.Dense(1, name="mean_output")(hidden2)

The standard deviation:

hidden1 = layers.Dense(100, activation="relu", name="dense_sd_1")(inputs)
hidden1 = layers.Dropout(0.1)(hidden1)
hidden2 = layers.Dense(50, activation="relu", name="dense_sd_2")(hidden1)
hidden2 = layers.Dropout(0.1)(hidden2)
hidden3 = layers.Dense(20, activation="relu", name="dense_sd_3")(hidden2)
output_sd = layers.Dense(1, name="sd_output")(hidden3)

Both are concatenated and propagated into the distribution layer:

mean_sd_layer = layers.Concatenate(name = "mean_sd_concat")([output_mean, output_sd]) 
dist = tfp.layers.DistributionLambda(model_distribution)(mean_sd_layer)

The final steps are to compile the distribution model and to build two auxiliary models that share its layers and output the predicted mean and standard deviation separately.

dist_mean = tfp.layers.DistributionLambda(
    make_distribution_fn=model_distribution,
    convert_to_tensor_fn=tfp.distributions.Distribution.mean)(mean_sd_layer)
dist_std = tfp.layers.DistributionLambda(
    make_distribution_fn=model_distribution,
    convert_to_tensor_fn=tfp.distributions.Distribution.stddev)(mean_sd_layer)

model_distr = models.Model(inputs=inputs, outputs=dist)
model_distr.compile(optimizers.Adagrad(learning_rate=0.001), loss=nll_loss)

#the model for the mean
model_mean = models.Model(inputs=inputs, outputs=dist_mean)
#the model for the standard deviation
model_sd = models.Model(inputs=inputs, outputs=dist_std)

history = model_distr.fit(X_train_scaled, y_train,
                          epochs=150,
                          verbose=1,
                          batch_size=2**7,
                          validation_data=(X_test_scaled, y_test))

#mean predictions
dl_mean_prediction = model_mean.predict(X_test_scaled).reshape(-1)
dl_mean_prediction

#standard deviation predictions
dl_sd_prediction = model_sd.predict(X_test_scaled).reshape(-1)
dl_sd_prediction

Metrics

The four metrics described below are the ones most commonly used in research. Methods based on the direct estimation of prediction intervals, such as quantile regression or conformalized quantile regression, most often use coverage and interval length (sharpness) metrics (Paper), while methods that estimate the conditional distribution use the Negative Log-Likelihood (NLL) (Paper 1, Paper 2) or the CRPS (Paper).

Validity / Coverage

For a quantile prediction at some level 0 < α < 1, we expect that (100*α)% of the observations fall below the predicted quantile. For example, if α is 0.8, we expect the quantile predictions to cover 80% of the observations.

To make it more intuitive, let’s generate some data from a normal distribution and check the proportion of data that is covered by some quantile level, say 10% (α = 0.1):

from scipy import stats
import numpy as np

r = stats.norm.rvs(size=1000)

We need to find out the Z-score of the 10% quantile:

quantile = 0.1
percent_point = stats.norm.ppf(q = quantile)
#-1.281552

Now let’s find the empirical proportion of cases where the 10% quantile covers the generated observations:

len(r[r < percent_point])/len(r)
#10.5%

A strong indication of non-valid conditional quantiles is when the empirical quantiles deviate too much from the nominal ones.

Let’s calculate validity for LightGBM Quantile Regression:

empirical_quantiles = []
for quantile in quantiles:
    empirical = (quantile_predictions[quantile] >= y_test).mean()
    empirical_quantiles.append(empirical)

and visualize the results:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(16, 10))
sns.set_context("notebook", font_scale=2)
sns.lineplot(x=quantiles, y=quantiles, color="magenta", linestyle='--', linewidth=3, label="ideal")
sns.lineplot(x=quantiles, y=empirical_quantiles, color="black", linestyle="dashdot", linewidth=3, label="observed")
sns.scatterplot(x=quantiles, y=empirical_quantiles, marker="o", s=150)
plt.legend()
plt.xlabel("True Quantile")
plt.ylabel("Empirical Quantile")
_ = plt.title("Reliability diagram: assessment of quantile predictions")
Reliability diagram: assessment of quantile predictions. Image by the author

The empirical quantiles (black line) are very close to the ideal quantiles (magenta line), which indicates that the quantile predictions generated by LightGBM are valid.

Sharpness / Interval Length

A perfect probabilistic prediction would place 100% of the probability on a single value. In reality, a conditional distribution allocates different probabilities to each possible outcome. Sharpness is a measure of how tight the predictive densities are: the narrower the prediction interval, the more confident we can be about the prediction.

The procedure for calculating sharpness is the following:

  • Choose the prediction interval for evaluation
  • Iterate through the predictions of the mean and the standard deviation of each observation (NGBoost and Neural Network)
  • For each observation in the test set, calculate the boundaries of the interval
  • Calculate the length of the interval
  • Calculate the average of all intervals

To make it more intuitive, let’s take an example of two predictions:

  • prediction 1: mean = 0.5, std = 0.08
  • prediction 2: mean = 0.5, std = 0.15

and compare the width of their corresponding prediction intervals covering 60% of the probabilities:

The two predictive distributions and their 60% prediction intervals. Image by the author

The dashed green lines are the boundaries of the interval; 60% of the probability mass is concentrated between the two lines. The second prediction has higher uncertainty than the first, whose interval is narrower.

Knowing the parameters of the distribution, we can calculate the interval width using the scipy package:

probability = 0.6
interval = stats.norm.interval(probability, loc = 0.5, scale=0.08)
interval
#(0.43267030131416684, 0.5673296986858332)
width = interval[1] - interval[0]
#0.135

In the case of quantile regression, the procedure is slightly different. We have predictions for each quantile, so to calculate the interval width corresponding to 60% coverage, we take the two quantile predictions at the 20% and 80% levels and compute the difference between them.
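Putting this together, here is a minimal sketch of how the average interval width could be computed for the three models at 60% coverage, reusing the ngb_dist_pred, dl_mean_prediction, dl_sd_prediction, and quantile_predictions objects created above:

from scipy import stats
import numpy as np

coverage = 0.6

#Gaussian models (NGBoost, Regression Neural Network):
#interval boundaries from the predicted mean and standard deviation
lower, upper = stats.norm.interval(coverage,
                                   loc=ngb_dist_pred.params["loc"],
                                   scale=ngb_dist_pred.params["scale"])
sharpness_ngboost = np.mean(upper - lower)

lower, upper = stats.norm.interval(coverage,
                                   loc=dl_mean_prediction,
                                   scale=dl_sd_prediction)
sharpness_dl = np.mean(upper - lower)

#quantile regression: width between the 20% and 80% quantile predictions
sharpness_lgb = np.mean(quantile_predictions[0.8] - quantile_predictions[0.2])

(sharpness_ngboost, sharpness_dl, sharpness_lgb)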

And finally, we can compare three models using the sharpness metric at different intervals:

Average prediction interval width (sharpness) of the three models at different coverage levels. Image by the author

As the interval coverage increases from 60% to 90%, the average prediction interval of LightGBM Quantile Regression grows at a much higher rate than that of the other algorithms. For very high interval coverage (i.e., very low and very high quantile levels), quantile regressors might not give optimal results. NGBoost shows the highest confidence in its predictive densities.

It is also possible that an algorithm is confident about the prediction but is still wrong. How do we measure this? We can count the proportion of real observations that are not within the predictive boundaries of a given interval.

The procedure is almost the same as above, but instead of calculating the average interval width, we calculate the proportion of observations that fall outside the interval.
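A minimal sketch of this check at 60% coverage, again reusing the objects created above:

from scipy import stats
import numpy as np

coverage = 0.6

#Gaussian models: boundaries from the predicted parameters
lower, upper = stats.norm.interval(coverage,
                                   loc=ngb_dist_pred.params["loc"],
                                   scale=ngb_dist_pred.params["scale"])
missed_ngboost = np.mean((y_test < lower) | (y_test > upper))

#quantile regression: boundaries are the 20% and 80% quantile predictions
missed_lgb = np.mean((y_test < quantile_predictions[0.2]) |
                     (y_test > quantile_predictions[0.8]))

(missed_ngboost, missed_lgb)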

Proportion of observations falling outside the prediction intervals for the three models. Image by the author

As the interval coverage increases, the proportion of missed observations decreases, but as we can see, LightGBM Quantile Regression has the highest proportion of observations outside its intervals, followed by NGBoost.

Negative Log-Likelihood

The likelihood function describes the joint probability of the observed data as a function of the parameters of the chosen statistical model

This metric is suitable for algorithms that generate conditional probability distributions. Both NGBoost and the Regression Neural Network also use the NLL as their loss function.

The NLL of the normal distribution is defined as the negative log of its probability density function:
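For an observation y and a normal distribution with mean μ and standard deviation σ:

$$\mathrm{NLL}(y \mid \mu, \sigma) = -\log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)\right) = \frac{1}{2}\log\left(2\pi\sigma^2\right) + \frac{(y-\mu)^2}{2\sigma^2}$$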

The NLL is evaluated at each observation using the parameters of the (normal) distribution predicted for it. To compare the NLL between algorithms, we average the NLLs across all observations. The lower the (average) NLL, the better the fit.

Visual comparison of the predictive distributions for the first observation and their corresponding NLL values for the Regression Neural Network and NGBoost.

Let’s see an example. The observation is 4, and the mean and standard deviation of the normal distribution are 3 and 1. What is the NLL for this observation?

#first, let's calculate the likelihood
#using the formula
L1 = (1/np.sqrt(2 * np.pi * 1**2)) * np.exp(-(4 - 3)**2/(2 * 1**2))

#using the scipy norm pdf function
L2 = stats.norm.pdf(4, loc=3, scale=1)
#0.2419707

#both ways give the same value (up to floating-point precision)
assert np.isclose(L1, L2)

#NLL
-np.log(L2)
#1.418939

NLL for NGBoost can be calculated in several ways:

The first approach: NGBoost provides the pred_dist function, which returns an object with a logpdf method that calculates the log-likelihood of each observation. To get the NLL, we just negate the results; the final NLL is the average of the individual NLLs.

ngb_dist_pred = ngb.pred_dist(X_test)
nll_ngboost = -ngb_dist_pred.logpdf(y_test)
nll_ngboost.mean()
#-4.888

The second approach is to get the parameters of each conditional distribution from the ngb_dist_pred object and calculate the NLLs manually. For example, the parameters of the conditional distribution corresponding to the first observation can be extracted as follows:

ngb_dist_pred[0].params
#{'loc': 0.47710276090295406, 'scale': 0.0022844303231773374}

Parameters of all observations can be extracted this way:

ngb_dist_pred.params
#{'loc': array([0.477,...,1.517]),
#'scale': array([0.002,...,0.002])}
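With these parameter arrays, the manual calculation is a short sketch using scipy's vectorized logpdf (it should reproduce the result of the logpdf call above):

from scipy import stats

#NLL computed manually from the extracted distribution parameters
nll_ngboost_manual = -stats.norm.logpdf(y_test,
                                        loc=ngb_dist_pred.params["loc"],
                                        scale=ngb_dist_pred.params["scale"])
nll_ngboost_manual.mean()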

In the case of the Regression Neural Network, we iterate through all observations and their corresponding distribution parameters:

#extract mean predictions
dl_mean_prediction = model_mean.predict(X_test_scaled).reshape(-1)
#extract standard deviations
dl_sd_prediction = model_sd.predict(X_test_scaled).reshape(-1)

#iterate over all observations
nll_dl = []
for (true_value, mean_temp, sd_temp) in zip(y_test,
                                            dl_mean_prediction,
                                            dl_sd_prediction):
    nll_temp = -stats.norm.logpdf(true_value,
                                  loc=mean_temp,
                                  scale=sd_temp)
    nll_dl.append(nll_temp)

#NLL
np.mean(nll_dl)
#-1.566

In terms of NLL, NGBoost got a lower value (-4.888) than the Regression Neural Network (-1.566), so we can conclude that NGBoost has a better fit.

Continuous Ranked Probability Score

The Continuous Ranked Probability Score, known as CRPS, is a score that measures how well a proposed distribution approximates the data, without knowledge of the true distribution of the data.

Here is the link with a good explanation of the formula behind the CRPS. In plain words, the CRPS measures the squared distance between the predictive cumulative distribution and the ideal (step-function) cumulative distribution concentrated at the observed value, integrated over all possible values.
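For reference, the CRPS of a predictive cumulative distribution function F for an observed value y is defined by the integral below; for a normal predictive distribution with mean μ and standard deviation σ there is also a well-known closed form, with z = (y − μ)/σ and Φ and φ denoting the standard normal CDF and PDF:

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{\infty} \big(F(x) - \mathbb{1}\{x \ge y\}\big)^2 \, dx$$

$$\mathrm{CRPS}\big(\mathcal{N}(\mu, \sigma^2), y\big) = \sigma \left[ z\,\big(2\Phi(z) - 1\big) + 2\varphi(z) - \frac{1}{\sqrt{\pi}} \right]$$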

I calculated the CRPS using the properscoring package:

import properscoring as ps

#crps_gaussian computes the CRPS of observations x relative to a normal
#distribution with mean mu and standard deviation sig
crps_ngboost = ps.crps_gaussian(y_test,
                                ngb_dist_pred.params["loc"],
                                ngb_dist_pred.params["scale"]).mean()
crps_dl = ps.crps_gaussian(y_test,
                           dl_mean_prediction,
                           dl_sd_prediction).mean()
(crps_ngboost, crps_dl)
#(0.001, 0.0266)

NGBoost has a far smaller CRPS than the Regression Neural Network, which again suggests that NGBoost has a better fit.

Conclusion

In this article, I covered four metrics to evaluate prediction uncertainties in regression problems. Validity and sharpness are mostly suitable for methods that estimate prediction intervals directly such as quantile and conformal regression. Negative log-likelihood and CRPS are applied to compare algorithms that model conditional distributions.

The complete code can be downloaded from my GitHub repo.

Thanks for reading!
