Conformal Prediction for Machine Learning Classification — From the Ground Up

Implementing conformal prediction for classification without the need for bespoke packages, and how to balance coverage (recall) across classes

Michael Allen
Towards Data Science

--

This blog post is inspired by Christoph Molnar’s book, Introduction to Conformal Prediction with Python. Christoph is brilliant at making new machine learning techniques accessible to others; I’d also especially recommend his books on explainable machine learning.

A GitHub repository with the full code (and a link to running the code online) may be found here: Conformal Prediction.

What is Conformal Prediction?

Conformal prediction is both a method of uncertainty quantification and a method of classifying instances (which may be fine-tuned for classes or subgroups). Uncertainty is conveyed by classifying instances into sets of potential classes rather than making single predictions.

Conformal prediction specifies a coverage: the probability that the true outcome is covered by the prediction region. The interpretation of prediction regions in conformal prediction depends on the task. For classification we get prediction sets, while for regression we get prediction intervals.
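To make the distinction concrete, here is a minimal, made-up illustration of what the two kinds of prediction region look like (the values are invented, not produced by any model):

# Illustrative only: hypothetical prediction regions at 95% coverage
prediction_set = {'orange', 'blue'}      # classification: a set of plausible classes
prediction_interval = (12.3, 17.8)       # regression: lower and upper bounds on the value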

Below is an example of the difference between ‘traditional’ classification (balance of likelihood) and conformal prediction (sets).

The difference between ‘normal’ classification based on most likely class, and conformal prediction which creates sets of possible classes.

The advantages of this method are:

  • Guaranteed coverage: Prediction sets generated by conformal prediction come with coverage guarantees of the true outcome — that is, they will contain the true value at whatever rate you set as a minimum target coverage. Conformal prediction does not depend on a well-calibrated model — the only requirement is that, as with all machine learning, the new samples being classified come from a similar data distribution to the training and calibration data. Coverage can also be guaranteed across classes or subgroups, though this takes an extra step in the method, which we will cover.
  • Easy to use: Conformal prediction approaches can be implemented from scratch with just a few lines of code, as we will do here.
  • Model-agnostic: Conformal prediction works with any machine learning model. It uses the normal outputs of whatever your preferred model is.
  • Distribution-free: Conformal prediction makes no assumptions about underlying distributions of data; it is a non-parametric method.
  • No retraining required: Conformal prediction can be used without retraining your model. It is another way of looking at, and using, model outputs.
  • Broad application: Conformal prediction works for tabular data classification, image or time-series classification, regression, and many other tasks, though we will demonstrate just classification here.

Why should we care about uncertainty quantification?

Uncertainty quantification is essential in many situations:

  • When we use model predictions to make decisions. How sure are we of those predictions? Is using just ‘most likely class’ good enough for the task we have?
  • When we want to communicate the uncertainty associated with our predictions to stakeholders, without talking about probabilities or odds, or even log-odds!

Alpha in conformal prediction — describes coverage

Coverage is key to conformal prediction. In classification it can be thought of as the region of the feature space that a particular class normally inhabits. Coverage is equivalent to sensitivity or recall: the proportion of observed values that are captured by the prediction sets. We can tighten or loosen coverage by adjusting 𝛼 (coverage = 1 − 𝛼).
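As a minimal sketch of what this means in code (the function and variable names here are illustrative, and not part of the article’s later code), coverage is simply the proportion of prediction sets that contain the observed class, and we target it by choosing 𝛼:

import numpy as np

# Target coverage is set through alpha: coverage = 1 - alpha
alpha = 0.05
target_coverage = 1 - alpha  # 0.95

def empirical_coverage(prediction_sets, y_true):
    # prediction_sets: boolean array of shape (n_samples, n_classes)
    # y_true: integer array of observed class indices
    return np.mean(prediction_sets[np.arange(len(y_true)), y_true])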

Let’s code!

Import packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

Create synthetic data for classification

Example data will be produced using scikit-learn’s `make_blobs` function.

n_classes = 3
# Make train and test data
X, y = make_blobs(n_samples=10000, n_features=2, centers=n_classes, cluster_std=3.75, random_state=42)

# Reduce the size of the first class to create an imbalanced dataset

# Set numpy random seed
np.random.seed(42)
# Get index of when y is class 0
class_0_idx = np.where(y == 0)[0]
# Get 30% of the class 0 indices
class_0_idx = np.random.choice(class_0_idx, int(len(class_0_idx) * 0.3), replace=False)
# Get the index for all other classes
rest_idx = np.where(y != 0)[0]
# Combine the indices
idx = np.concatenate([class_0_idx, rest_idx])
# Shuffle the indices
np.random.shuffle(idx)
# Split the data
X = X[idx]
y = y[idx]

# Split off model training set
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=42)
# Split rest into calibration and test
X_Cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

# Set class labels
class_labels = ['blue', 'orange', 'green']
# Plot the data
fig, ax = plt.subplots(figsize=(5, 5))
for i in range(n_classes):
    ax.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1],
               label=class_labels[i], alpha=0.5, s=10)
legend = ax.legend()
legend.set_title("Class")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
plt.show()
Generated data (the data is created to be imbalanced — the blue class has only about 30% of the data points of either the green or orange classes).

Build a classifier

We will use a simple logistic regression model here, but the method works with any model, from logistic regression on tabular data through to 3D ConvNets for image classification.
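As an aside, and as a hedged sketch only (this is not used in the rest of the article), swapping in a different model changes nothing downstream, because the conformal steps only rely on `predict_proba`:

# Sketch: any classifier with predict_proba could be used instead
from sklearn.ensemble import RandomForestClassifier

alt_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
alt_classifier.fit(X_train, y_train)
alt_probs = alt_classifier.predict_proba(X_Cal)  # would replace the logistic regression probabilities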

# Build and train the classifier
classifier = LogisticRegression(random_state=42)
classifier.fit(X_train, y_train)

# Test the classifier
y_pred = classifier.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print(f"Accuracy: {accuracy:0.3f}")

# Test recall for each class
for i in range(n_classes):
    recall = np.mean(y_pred[y_test == i] == y_test[y_test == i])
    print(f"Recall for class {class_labels[i]}: {recall:0.3f}")
Accuracy: 0.930
Recall for class blue: 0.772
Recall for class orange: 0.938
Recall for class green: 0.969

Note how recall for the minority class is lower than for the other classes. Recall, otherwise known as sensitivity, is the proportion of instances in a class that are correctly identified by the classifier.

s_i, or the non-conformity score

In conformal prediction, the non-conformity score, often denoted s_i, is a measure of how much a new instance deviates from the existing (calibration) instances. It is used to determine whether a particular class should be included in the prediction set for a new instance.

In the context of classification, the most common non-conformity measure is 1 − the predicted probability of the given label. So, if the predicted probability of a new instance belonging to a certain class is high, the non-conformity score will be low, and vice versa.

For conformal prediction we obtain s_i scores for all calibration instances (note: here we only look at the model output for the true class of an instance, even when another class has a higher predicted probability). We then find a threshold of scores that contains (or covers) 95% of the data. Prediction sets built with this threshold will then cover the true class for 95% of new instances (so long as our new data is similar to our training and calibration data).
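As a toy illustration with made-up numbers (not the dataset used below): if the predicted probabilities for the true classes of three calibration instances are 0.98, 0.72 and 0.40, their non-conformity scores are 0.02, 0.28 and 0.60, and the threshold is the score below which the target share of calibration scores fall:

import numpy as np

# Made-up probabilities of each calibration instance's *true* class
true_class_probs = np.array([0.98, 0.72, 0.40])
s_i = 1 - true_class_probs            # non-conformity scores: [0.02, 0.28, 0.60]
threshold = np.quantile(s_i, 0.95)    # with only 3 points this is purely illustrative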

Calculate conformal prediction threshold

We will now predict classification probabilities of the calibration set. This will be used to set a classification threshold for new data.

# Get predictions for calibration set
y_pred = classifier.predict(X_Cal)
y_pred_proba = classifier.predict_proba(X_Cal)

# Show first 5 instances
y_pred_proba[0:5]
array([[4.65677826e-04, 1.29602253e-03, 9.98238300e-01],
[1.73428257e-03, 1.20718182e-02, 9.86193899e-01],
[2.51649788e-01, 7.48331668e-01, 1.85434981e-05],
[5.97545130e-04, 3.51642214e-04, 9.99050813e-01],
[4.54193815e-06, 9.99983628e-01, 1.18300819e-05]])

Calculate non-conformity scores

We will calculate s_i scores based only on the probability associated with the observed (true) class. For each instance we will get the predicted probability for that instance’s class. The s_i score (non-conformity) is 1 − probability. The higher the s_i score, the less that example conforms to its class in comparison to other classes. Other methods of calculating a non-conformity score are available!

si_scores = []
# Loop through all calibration instances
for i, true_class in enumerate(y_cal):
    # Get predicted probability for observed/true class
    predicted_prob = y_pred_proba[i][true_class]
    si_scores.append(1 - predicted_prob)

# Convert to NumPy array
si_scores = np.array(si_scores)

# Show first 5 instances
si_scores[0:5]
array([1.76170035e-03, 1.38061008e-02, 2.51668332e-01, 9.49187344e-04,
1.63720201e-05])

Get 95th percentile threshold

The threshold determines what coverage our classification will have. Coverage refers to the proportion of predictions that actually contain the true outcome.

The threshold is the percentile corresponding to 1 − 𝛼. To get 95% coverage, we set an 𝛼 of 0.05.

When used in real life, the quantile level (based on 𝛼) requires a finite-sample correction to calculate the corresponding quantile q. We multiply 0.95 by (n+1)/n, which means that q_level would be approximately 0.951 for n = 1000.

number_of_samples = len(X_Cal)
alpha = 0.05
qlevel = (1 - alpha) * ((number_of_samples + 1) / number_of_samples)
threshold = np.percentile(si_scores, qlevel*100)
print(f'Threshold: {threshold:0.3f}')
Threshold: 0.598

Show chart of s_i values, with cut-off threshold.

x = np.arange(len(si_scores)) + 1
sorted_si_scores = np.sort(si_scores)
index_of_95th_percentile = int(len(si_scores) * 0.95)

# Color by cut-off
conform = 'g' * index_of_95th_percentile
nonconform = 'r' * (len(si_scores) - index_of_95th_percentile)
color = list(conform + nonconform)

fig = plt.figure(figsize=(6, 4))
ax = fig.add_subplot()

# Add bars
ax.bar(x, sorted_si_scores, width=1.0, color=color)

# Add lines for 95th percentile
ax.plot([0, index_of_95th_percentile], [threshold, threshold],
        c='k', linestyle='--')
ax.plot([index_of_95th_percentile, index_of_95th_percentile], [threshold, 0],
        c='k', linestyle='--')

# Add text
txt = '95th percentile non-conformity threshold'
ax.text(5, threshold + 0.04, txt)

# Add axis labels
ax.set_xlabel('Sample instance (sorted by $s_i$)')
ax.set_ylabel('$s_i$ (non-conformity)')

plt.show()
s_i scores for all data. The threshold is the s_i level that contains 95% of all data (if 𝛼 is set at 0.05).

Get samples/classes from test set classified as positive

We can now find, for each test instance, all classes whose non-conformity scores are below the threshold.

It is possible for an individual example to have no class (an empty set), or more than one class, below the threshold.

Let’s get the classifications that are below our non-conformity threshold and look at the first 10 examples. Each prediction set is a list of True/False values, one for each possible class.

prediction_sets = (1 - classifier.predict_proba(X_test) <= threshold)
# Show first ten instances
prediction_sets[0:10]
array([[ True, False, False],
[False, False, True],
[ True, False, False],
[False, False, True],
[False, True, False],
[False, True, False],
[False, True, False],
[ True, True, False],
[False, True, False],
[False, True, False]])

Get prediction set labels, and compare to standard classification.

# Get standard predictions
y_pred = classifier.predict(X_test)

# Function to get set labels
def get_prediction_set_labels(prediction_sets, class_labels):
    # Get set of class labels for each instance in prediction sets
    prediction_set_labels = [
        set([class_labels[i] for i, x in enumerate(prediction_set) if x])
        for prediction_set in prediction_sets]
    return prediction_set_labels

# Collate predictions
results_sets = pd.DataFrame()
results_sets['observed'] = [class_labels[i] for i in y_test]
results_sets['labels'] = get_prediction_set_labels(prediction_sets, class_labels)
results_sets['classifications'] = [class_labels[i] for i in y_pred]
results_sets.head(10)

  observed          labels classifications
0     blue          {blue}            blue
1    green         {green}           green
2     blue          {blue}            blue
3    green         {green}           green
4   orange        {orange}          orange
5   orange        {orange}          orange
6   orange        {orange}          orange
7   orange  {blue, orange}            blue
8   orange        {orange}          orange
9   orange        {orange}          orange

Note that instance 7 is actually in the orange class, but has been classified by the simple classifier as blue. Conformal prediction places it in a set of both orange and blue.

Plot the data, highlighting instance 7, which is predicted to possibly be in two classes:

# Plot the data
fig, ax = plt.subplots(figsize=(5, 5))
for i in range(n_classes):
    ax.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1],
               label=class_labels[i], alpha=0.5, s=10)
# Add instance 7
set_label = results_sets['labels'].iloc[7]
ax.scatter(X_test[7, 0], X_test[7, 1], color='k', s=100, marker='*', label='Instance 7')
legend = ax.legend()
legend.set_title("Class")
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
txt = f"Prediction set for instance 7: {set_label}"
ax.text(-20, 18, txt)
plt.show()
Scatter plot showing how test instance 7 was classified as belonging to two possible classes: {‘blue’, ‘orange’}.

Show coverage and average set size

Coverage is the proportion of prediction sets that actually contain the true outcome.

Average set size is the average number of predicted classes per instance.

We will define some functions to calculate the results.

# Get class counts
def get_class_counts(y_test):
    class_counts = []
    for i in range(n_classes):
        class_counts.append(np.sum(y_test == i))
    return class_counts

# Get coverage for each class
def get_coverage_by_class(prediction_sets, y_test):
    coverage = []
    for i in range(n_classes):
        coverage.append(np.mean(prediction_sets[y_test == i, i]))
    return coverage

# Get average set size for each class
def get_average_set_size(prediction_sets, y_test):
    average_set_size = []
    for i in range(n_classes):
        average_set_size.append(
            np.mean(np.sum(prediction_sets[y_test == i], axis=1)))
    return average_set_size

# Get weighted coverage (weighted by class size)
def get_weighted_coverage(coverage, class_counts):
    total_counts = np.sum(class_counts)
    weighted_coverage = np.sum((coverage * class_counts) / total_counts)
    weighted_coverage = round(weighted_coverage, 3)
    return weighted_coverage

# Get weighted set size (weighted by class size)
def get_weighted_set_size(set_size, class_counts):
    total_counts = np.sum(class_counts)
    weighted_set_size = np.sum((set_size * class_counts) / total_counts)
    weighted_set_size = round(weighted_set_size, 3)
    return weighted_set_size

Show results for each class. The average set size is the average number of predicted classes per instance, for each class. Higher numbers represent more overlap between classification regions for the different classes.

results = pd.DataFrame(index=class_labels)
results['Class counts'] = get_class_counts(y_test)
results['Coverage'] = get_coverage_by_class(prediction_sets, y_test)
results['Average set size'] = get_average_set_size(prediction_sets, y_test)
results
        Class counts  Coverage  Average set size
blue             241  0.817427          1.087137
orange           848  0.954009          1.037736
green            828  0.977053          1.016908

Show overall results

weighted_coverage = get_weighted_coverage(
    results['Coverage'], results['Class counts'])

weighted_set_size = get_weighted_set_size(
    results['Average set size'], results['Class counts'])

print(f'Overall coverage: {weighted_coverage}')
print(f'Average set size: {weighted_set_size}')
Overall coverage: 0.947
Average set size: 1.035

NOTE: Though our overall coverage is as desired, being very close to 95%, coverage of the different classes varies, and is lowest (about 82%) for our smallest class. If coverage of individual classes is important, we can set thresholds for each class independently, which is what we will now do.

Conformal classification with equal coverage across classes

When we want to be sure of coverage across all classes, we can set thresholds for each class independently.

Note: we could also do this for subgroups of data, such as ensuring equal coverage for a diagnostic across racial groups, if we found coverage using a shared threshold led to problems.
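As a hedged sketch of that subgroup idea (the `groups_cal` array is hypothetical and not part of this article’s dataset), the per-class approach used below generalises to any grouping of the calibration data:

import numpy as np

def subgroup_thresholds(si_scores, groups_cal, alpha=0.05):
    # One s_i threshold per subgroup, using the same finite-sample
    # correction as the per-class method below
    thresholds = {}
    for g in np.unique(groups_cal):
        scores_g = si_scores[groups_cal == g]
        n = len(scores_g)
        q = min((1 - alpha) * ((n + 1) / n) * 100, 100)  # cap at the 100th percentile
        thresholds[g] = np.percentile(scores_g, q)
    return thresholds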

Get thresholds for each class independently

For each class we will find the threshold s_i score that covers 95% of instances in that particular class.

# Set alpha (1 - coverage)
alpha = 0.05
thresholds = []
# Get predicted probabilities for calibration set
y_cal_prob = classifier.predict_proba(X_Cal)
# Get 95th percentile score for each class's s-scores
for class_label in range(n_classes):
    mask = y_cal == class_label
    y_cal_prob_class = y_cal_prob[mask][:, class_label]
    s_scores = 1 - y_cal_prob_class
    q = (1 - alpha) * 100
    class_size = mask.sum()
    correction = (class_size + 1) / class_size
    q *= correction
    threshold = np.percentile(s_scores, q)
    thresholds.append(threshold)

print(thresholds)
[0.9030202125697161, 0.6317149025299887, 0.26033562285411]

Apply class-specific threshold to each class classification

We will test instances for whether they are below the threshold for each class.

# Get Si scores for test set
predicted_proba = classifier.predict_proba(X_test)
si_scores = 1 - predicted_proba

# For each class, check whether each instance is below the threshold
prediction_sets = []
for i in range(n_classes):
    prediction_sets.append(si_scores[:, i] <= thresholds[i])
prediction_sets = np.array(prediction_sets).T

# Get prediction set labels and show first 10
prediction_set_labels = get_prediction_set_labels(prediction_sets, class_labels)

# Get standard predictions
y_pred = classifier.predict(X_test)

# Collate predictions
results_sets = pd.DataFrame()
results_sets['observed'] = [class_labels[i] for i in y_test]
results_sets['labels'] = get_prediction_set_labels(prediction_sets, class_labels)
results_sets['classifications'] = [class_labels[i] for i in y_pred]

# Show first 10 results
results_sets.head(10)
  observed          labels classifications
0     blue          {blue}            blue
1    green         {green}           green
2     blue          {blue}            blue
3    green         {green}           green
4   orange        {orange}          orange
5   orange        {orange}          orange
6   orange        {orange}          orange
7   orange  {blue, orange}            blue
8   orange        {orange}          orange
9   orange        {orange}          orange

Check coverage and set size across classes

We now have about 95% coverage across all classes. The conformal prediction method gives us better coverage of the minority class than the standard method of classification.

results = pd.DataFrame(index=class_labels)
results['Class counts'] = get_class_counts(y_test)
results['Coverage'] = get_coverage_by_class(prediction_sets, y_test)
results['Average set size'] = get_average_set_size(prediction_sets, y_test)
results
        Class counts  Coverage  Average set size
blue             241  0.954357          1.228216
orange           848  0.956368          1.139151
green            828  0.942029          1.006039

weighted_coverage = get_weighted_coverage(
    results['Coverage'], results['Class counts'])

weighted_set_size = get_weighted_set_size(
    results['Average set size'], results['Class counts'])

print(f'Overall coverage: {weighted_coverage}')
print(f'Average set size: {weighted_set_size}')
Overall coverage: 0.95
Average set size: 1.093

Summary

Conformal prediction was used to classify instances in sets rather than single predictions. Instances on the borders between two classes were labelled with both classes rather than picking the class with highest probability.

When it is important that all classes are detected with the same coverage, the threshold for classifying instances may be set separately for each class (this method could also be used for subgroups of data, for example to guarantee the same coverage across different ethnic groups).

Conformal prediction does not change the model predictions. It just uses them in a different way to traditional classification. It may be used alongside more traditional methods.

(All images are by the author)
