
How confident can we be in a machine learning model’s prediction? This question has been a prominent area of research over the last decade, and it has major implications for high-stakes machine learning applications such as finance and healthcare. While many classification models, particularly calibrated models, come with uncertainty quantification by predicting a probability distribution over target classes, quantifying uncertainty in regression tasks is much more nuanced.
Amongst many proposed methods, quantile regression is one of the most popular because it makes no assumptions about the shape of the target distribution. Until recently, the main disadvantage of quantile regression was that one model had to be trained per predicted quantile. For instance, in order to predict the 10th, 50th, and 90th quantiles of a target distribution, three independent models would need to be trained. Catboost has since addressed this issue with the multi-quantile loss function, which enables a single model to predict an arbitrary number of quantiles.
This article will explore two examples using the multi-quantile loss function on synthetic data. While these examples aren’t necessarily reflective of real-world datasets, they will help us understand how well this loss function quantifies uncertainty by predicting the quantiles of a target distribution. For a quick refresher on noise and uncertainty in machine learning, see this article:
Understanding Noisy Data and Uncertainty in Machine Learning
A follow-up to this article is now available:
Another (Conformal) Way to Predict Probability Distributions
A Brief Overview of Quantile Regression
In supervised machine learning, the usual task is to train a model that predicts the expected value of a target given a set of input features:

$$\hat{y} = \mathbb{E}[Y \mid X = x]$$
Notably, training a model this way produces a single prediction indicating what the model believes is the most likely value of the target given the features. In regression tasks, this is usually the mean of the target distribution conditioned on the features.
However, as illustrated in a previous article, most machine learning models are trained on noisy datasets, and simply predicting the conditional expected value of the target does not sufficiently characterize the target distribution. To see this, consider the following noisy regression dataset:

Even though the model prediction fits the data optimally, it doesn’t quantify uncertainty. For instance, when x is around 0.4, the line of best fit predicts that y is 3.8. While it is true that 3.8 is the most likely value of y when x is near 0.4, there are plenty of examples in the data where y is much higher or lower than 3.8. Said differently, the conditional distribution of the target has high variability beyond its expectation, and quantile regression helps us describe this.
In quantile regression, the loss function is modified to encourage a model to learn the desired quantile of the conditional target distribution. For a desired quantile α, the quantile (or "pinball") loss for a single training example is:

$$L_{\alpha}(y, \hat{y}) = \max\big(\alpha \cdot (y - \hat{y}),\ (\alpha - 1) \cdot (y - \hat{y})\big)$$

where y is the true target and ŷ is the model’s prediction.
To gain a better intuition about this loss function, suppose a model is being trained to learn the 95th quantile of a target distribution (α = 0.95). In the first case, consider a single training example where the target was 45 and the model predicted 60 (i.e. the model overestimated the target by 15). Assume also that each training example has a weight of 1. The loss function evaluated at these values looks like this:

$$L_{0.95}(45, 60) = \max\big(0.95 \cdot (45 - 60),\ -0.05 \cdot (45 - 60)\big) = \max(-14.25,\ 0.75) = 0.75$$
Now, with the same target of 45, assume that the model predicted 30 (i.e. the model underestimated the target by 15). The value of the loss function looks much different:

$$L_{0.95}(45, 30) = \max\big(0.95 \cdot (45 - 30),\ -0.05 \cdot (45 - 30)\big) = \max(14.25,\ -0.75) = 14.25$$
Despite the model prediction being off by 15 in both examples, the loss function penalized the underestimation far more heavily than the overestimation. Because the 95th quantile is being learned, the loss function penalizes any prediction below the target by a factor of 0.95, and any prediction above the target by a factor of 0.05. Hence, the model is "forced" to favor overprediction over underprediction when learning the 95th quantile. This is the case when learning any quantile above the median; the opposite is true when learning quantiles below the median. To see this better, we can visualize the loss function for each predicted quantile:
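A minimal sketch to produce such a plot, writing the loss as a function of the residual y − ŷ (the asymmetric slopes on either side of zero, α on one side and 1 − α on the other, are what push the model toward the target quantile):

import numpy as np
import matplotlib.pyplot as plt

def quantile_loss(residual, alpha):
    # Pinball loss as a function of the residual (y - y_hat)
    return np.maximum(alpha * residual, (alpha - 1) * residual)

residuals = np.linspace(-20, 20, 500)
fig, ax = plt.subplots(figsize=(10, 6))
for alpha in [0.05, 0.50, 0.95]:
    ax.plot(residuals, quantile_loss(residuals, alpha), label=f'alpha = {alpha}')
ax.set_xlabel(r'$y - \hat{y}$')
ax.set_ylabel('Loss')
ax.legend()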



Catboost now extends this idea by allowing the base decision trees to output multiple values per leaf, one for each specified quantile. This allows a single model to predict multiple quantiles by minimizing a new loss function, which sums the single-quantile losses over all k specified quantiles:

$$L(y, \hat{y}) = \sum_{i=1}^{k} \max\big(\alpha_i \cdot (y - \hat{y}_{\alpha_i}),\ (\alpha_i - 1) \cdot (y - \hat{y}_{\alpha_i})\big)$$

where $\hat{y}_{\alpha_i}$ is the model’s prediction for quantile $\alpha_i$.
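As a rough illustration (assuming equal example weights), here’s a minimal numpy sketch of what this loss computes for one training example:

import numpy as np

def multi_quantile_loss(y, y_pred, alphas):
    # y: scalar target; y_pred: one prediction per quantile in alphas
    residuals = y - np.asarray(y_pred)
    return np.sum(np.maximum(alphas * residuals, (alphas - 1) * residuals))

# Target of 45 with predictions for the 5th, 50th, and 95th quantiles
alphas = np.array([0.05, 0.50, 0.95])
print(multi_quantile_loss(45, [30, 45, 60], alphas))
# Expected: 0.75 + 0.0 + 0.75 = 1.5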
Example 1 – Simple Linear Regression
To understand how the multi-quantile loss function works, let’s start with a simple dataset. The following Python code generates a synthetic linear dataset with Gaussian additive noise:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostRegressor
sns.set()
# Number of training and testing examples
n = 1000
# Generate random x values between 0 and 1
x_train = np.random.rand(n)
x_test = np.random.rand(n)
# Generate random noise for the target
noise_train = np.random.normal(0, 0.3, n)
noise_test = np.random.normal(0, 0.3, n)
# Set the slope and y-intercept of the line
a, b = 2, 3
# Generate y values according to the equation y = ax + b + noise
y_train = a * x_train + b + noise_train
y_test = a * x_test + b + noise_test
The resulting training data should look similar to this:
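A quick sketch to visualize it:

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_train, y_train)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Training Data for Example 1')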

Next, the quantiles of the target distribution need to be specified for prediction. To illustrate the power of the multi-quantile loss, this model will predict 99 different quantile values for each observation. This can loosely be thought of as taking a sample of size 99 from each predicted conditional distribution.
# Store quantiles 0.01 through 0.99 in a list
quantiles = [q/100 for q in range(1, 100)]
# Format the quantiles as a string for Catboost
quantile_str = str(quantiles).replace('[','').replace(']','')
# Fit the multi quantile model
model = CatBoostRegressor(iterations=100,
                          loss_function=f'MultiQuantile:alpha={quantile_str}')
model.fit(x_train.reshape(-1,1), y_train)
# Make predictions on the test set
preds = model.predict(x_test.reshape(-1, 1))
preds = pd.DataFrame(preds, columns=[f'pred_{q}' for q in quantiles])
The resulting predictions DataFrame looks something like this:

Each row corresponds to a testing example and each column gives a predicted quantile. For instance, the 10th quantile prediction for the first test example was 3.318624. Since there is only one input feature, we can visualize a few predicted quantiles overlaid on the testing data:
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_test, y_test)
for col in ['pred_0.05', 'pred_0.5', 'pred_0.95']:
    ax.scatter(x_test.reshape(-1,1), preds[col], alpha=0.50, label=col)
ax.legend()

Visually, the 95th and 5th quantiles appear to do a good job of accounting for the uncertainty in the data. Moreover, the 50th quantile (i.e. the median) roughly approximates the line of best fit. When working with predicted quantiles, one metric we’re often interested in analyzing is coverage: the percentage of targets that fall between two desired quantiles. As an example, coverage can be computed using the 5th and 95th quantiles as follows:
coverage_90 = np.mean((y_test <= preds['pred_0.95']) & (y_test >= preds['pred_0.05']))*100
print(coverage_90)
# Output: 91.4
Using the 5th and 95th quantiles, assuming perfect calibration, we would expect a coverage of 95 - 5 = 90%. In this example, the predicted quantiles were slightly off but still close, giving a coverage value of 91.4%.
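Coverage only checks one quantile pair at a time. A minimal calibration sketch, using the preds DataFrame from above, compares nominal and empirical coverage for several symmetric pairs:

# Compare nominal coverage to the fraction of targets inside each interval
for lower, upper in [(0.05, 0.95), (0.1, 0.9), (0.25, 0.75)]:
    inside = (y_test >= preds[f'pred_{lower}']) & (y_test <= preds[f'pred_{upper}'])
    print(f'Nominal: {round((upper - lower) * 100)}%, Empirical: {np.mean(inside) * 100:.1f}%')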
Let’s now analyze the entire predicted distribution that the model outputs. Recall that the true underlying line is y = 2x + 3. Therefore, for any input x, the mean of the predicted distribution should be around 2x + 3. Likewise, because random Gaussian noise was added to the data with a standard deviation of 0.3, the standard deviation of the predicted distribution should be around 0.3. Let’s test this:
# Give the model a new input of x = 0.4
x = np.array([0.4])
# We expect the mean of this array to be about 2*0.4 + 3 = 3.8
# We expect the standard deviation of this array to be about 0.30
y_pred = model.predict(x.reshape(-1, 1))
mu = np.mean(y_pred)
sigma = np.std(y_pred)
print(mu) # Output: 3.836147287742427
print(sigma) # Output: 0.3283984093786787
# Plot the predicted distribution
fig, ax = plt.subplots(figsize=(10, 6))
_ = ax.hist(y_pred.reshape(-1,1), density=True)
ax.set_xlabel('$y$')
ax.set_title(f'Predicted Distribution $P(y|x=0.4)$, $\\mu$ = {round(mu, 3)}, $\\sigma$ = {round(sigma, 3)}')

Amazingly, the predicted distribution seems to align with our expectations. Next, let’s try a more difficult example.
Example 2 – Non-Linear Regression with Variable Noise
To see the true power of multi-quantile regression, let’s make the learning task more difficult. The following code creates a non-linear dataset with heteroscedastic noise, where the noise level depends on the region of the domain:
# Create regions of the domain that have variable noise
bounds = [(-10, -8), (-5, -4), (-4, -3), (-3, -1), (-1, 1), (1, 3), (3, 4), (4, 5), (8, 10)]
scales = [18, 15, 8, 11, 1, 2, 9, 16, 19]
x_train = np.array([])
x_test = np.array([])
y_train = np.array([])
y_test = np.array([])
for b, scale in zip(bounds, scales):
    # Randomly select the number of samples in each region
    n = np.random.randint(low=100, high=200)
    # Generate values of the domain between b[0] and b[1]
    x_curr = np.linspace(b[0], b[1], n)
    # For even scales, noise comes from an exponential distribution
    if scale % 2 == 0:
        noise_train = np.random.exponential(scale=scale, size=n)
        noise_test = np.random.exponential(scale=scale, size=n)
    # For odd scales, noise comes from a normal distribution
    else:
        noise_train = np.random.normal(scale=scale, size=n)
        noise_test = np.random.normal(scale=scale, size=n)
    # Create training and testing sets
    y_curr_train = x_curr**2 + noise_train
    y_curr_test = x_curr**2 + noise_test
    x_train = np.concatenate([x_train, x_curr])
    x_test = np.concatenate([x_test, x_curr])
    y_train = np.concatenate([y_train, y_curr_train])
    y_test = np.concatenate([y_test, y_curr_test])
The resulting training data looks like this:

We’ll fit the Catboost regressor in the same way as example 1 and visualize the predictions on a test set:
model = CatBoostRegressor(iterations=300,
                          loss_function=f'MultiQuantile:alpha={quantile_str}')
model.fit(x_train.reshape(-1,1), y_train)
preds = model.predict(x_test.reshape(-1, 1))
preds = pd.DataFrame(preds, columns=[f'pred_{q}' for q in quantiles])
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_test, y_test)
for col in ['pred_0.05', 'pred_0.5', 'pred_0.95']:
    quantile = int(float(col.split('_')[-1])*100)
    label_name = f'Predicted Quantile {quantile}'
    ax.scatter(x_test.reshape(-1,1), preds[col], alpha=0.50, label=label_name)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title('Testing Data for Example 2 with Predicted Quantiles')
ax.legend()

Upon visual inspection, the model has characterized this non-linear, heteroscedastic relationship well. Notice how, near x = 0, the three predicted quantiles converge towards a single value. This is because the region near x = 0 has almost no noise; any model that correctly predicts the conditional probability distribution in this region should predict a small variance. Conversely, when x is between 7.5 and 10.0, the predicted quantiles are much further apart because of the inherent noise in the region. 90% coverage can be computed as before:
coverage_90 = np.mean((y_test <= preds['pred_0.95']) & (y_test >= preds['pred_0.05']))*100
print(coverage_90)
# Output: 90.506
Using the 5th and 95th quantiles, assuming perfect calibration, the expected coverage is 95 - 5 = 90%. In this example, the predicted quantiles were even better calibrated than in example 1, giving a coverage of 90.506%.
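Beyond a single coverage number, we can quantify how the predicted uncertainty varies across the domain by plotting the width of a predicted interval against x. A minimal sketch, again using the preds DataFrame from above:

# Width of the predicted 90% interval as a proxy for local uncertainty
interval_width = preds['pred_0.95'] - preds['pred_0.05']
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(x_test, interval_width, alpha=0.50)
ax.set_xlabel('x')
ax.set_ylabel('Predicted 90% interval width')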
Finally, let’s look at a few inputs and their corresponding predicted distributions. First, the predicted distribution for an input near x = 0:
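The plots below can be reproduced with a sketch like the one from example 1: predict the 99 quantiles for a single input and plot their histogram (the inputs shown here, 0 and -8.56, are the ones discussed below):

for x_val in [0.0, -8.56]:
    # Predict the 99 quantiles for one input and histogram them
    y_pred = model.predict(np.array([[x_val]]))
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.hist(y_pred.reshape(-1), density=True)
    ax.set_xlabel('$y$')
    ax.set_title(f'Predicted Distribution $P(y|x={x_val})$')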

The distribution above does a nice job of capturing the target value with a relatively small variance. This is to be expected, as the target values in the training data near x = 0 have little noise. Contrast this with the predicted target distribution when x = -8.56:

This distribution is right-skewed and has a much higher variance. This is expected for this region of the data because the noise was sampled from an exponential distribution with a large scale parameter.
Final Thoughts
This article demonstrated the power of multi-quantile regression for learning arbitrary conditional target distributions. We only explored two one-dimensional examples to visually inspect model performance, but I would encourage anyone interested to try the multi-quantile loss function on higher-dimensional data. It’s important to note that quantile regression makes no statistical or algorithmic guarantees of convergence, and the performance of these models will vary depending on the nature of the learning problem. Thanks for reading!
Become a Member: https://harrisonfhoffman.medium.com/membership
Buy me a coffee: https://www.buymeacoffee.com/HarrisonfhU
References
- Catboost Loss Functions – https://catboost.ai/en/docs/concepts/loss-functions-regression#MultiQuantile