
In classification tasks, machine learning models output estimated probabilities, also called confidences (see image above). These tell us how certain a model is in its label predictions. For most models, however, these confidences are not aligned with the true frequencies of the events they predict. They need to be calibrated!
Model calibration aims to align a model’s predictions with the true probabilities, thereby ensuring that its predictions are reliable and accurate (see this blog post for more details on the importance of model calibration).
Alright, so model calibration is important, but how do we measure it? There are a few options, but this article focuses on explaining and running through one simple yet reasonably effective measure of model calibration: the Expected Calibration Error (ECE). It computes a weighted average of the error in the estimated "probabilities", resulting in a single value that we can use to compare different models.
We will run through the ECE formula as described in the paper On Calibration of Modern Neural Networks. To keep it simple, we will look at a small example with 9 data points and binary targets. We will then code up this example in Python and, lastly, showcase the code on a multi-class classification example.
Definition
ECE measures how well a model’s estimated "probabilities" match the true (observed) probabilities by taking a weighted average over the absolute difference between accuracy (acc) and confidence (conf):
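In the notation of the paper, this is:

ECE = Σₘ (|Bₘ|/n) · |acc(Bₘ) − conf(Bₘ)|

where the sum runs over the bins m = 1, …, M and n is the total number of samples.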

The measure involves splitting the probability range [0, 1] into M equally spaced bins, where B stands for "bin" and m for the bin number; bin Bₘ collects the samples whose confidence falls into the interval ((m−1)/M, m/M]. We’ll get back to the individual parts of this formula, such as Bₘ, |Bₘ|, acc(Bₘ) and conf(Bₘ), in more detail later. Let’s first look at our example, which will help make the formula easier to digest step by step.
Example
We have 9 samples with estimated probabilities, also called ‘confidences’ (p̂ᵢ), for predicting either label 0 or label 1. If the estimated probability for label 0 is above 0.5, the predicted label is 0. If it is below 0.5, the probability is higher for label 1, so the predicted label is 1 (see table below). The final column shows the true label of sample i.
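For reference, the 9 samples look as follows (these are the same values used in the numpy code further below):

| Sample i | p̂ᵢ (label 0) | p̂ᵢ (label 1) | True label yᵢ |
|---|---|---|---|
| 1 | 0.78 | 0.22 | 0 |
| 2 | 0.36 | 0.64 | 1 |
| 3 | 0.08 | 0.92 | 0 |
| 4 | 0.58 | 0.42 | 0 |
| 5 | 0.49 | 0.51 | 0 |
| 6 | 0.85 | 0.15 | 0 |
| 7 | 0.30 | 0.70 | 1 |
| 8 | 0.63 | 0.37 | 1 |
| 9 | 0.17 | 0.83 | 1 |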

From the table above we can see that we have 9 samples, n = 9. To determine the remaining terms of the formula, we first need to split our samples into bins.
Only the probability that determines the predicted label, i.e. the maximum probability across labels, is used in calculating ECE. Therefore, we bin samples based on their maximum probability (see table 2). To keep the example simple, we split the probability range into 5 equally spaced bins, M = 5: B₁ = (0, 0.2], B₂ = (0.2, 0.4], B₃ = (0.4, 0.6], B₄ = (0.6, 0.8] and B₅ = (0.8, 1.0] (see binning plot 1 on the right). Let’s assign each bin a colour:

Now if we look at each sample’s maximum estimated probability, we can group it into one of the 5 bins. Sample i=1 has an estimated probability of 0.78, which is higher than 0.6 but not higher than 0.8, so we group it into B₄ (see image below). Now let’s look at sample i=3, which has an estimate of 0.92. This falls between 0.8 and 1, so it goes into bin B₅. We repeat this for every sample i and end up with the categorisation in table 2 (see below).
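Derived from the maximum estimated probabilities above, the assignment is:

| Sample i | max p̂ᵢ | Bin | Predicted label ŷᵢ | True label yᵢ |
|---|---|---|---|---|
| 1 | 0.78 | B₄ | 0 | 0 |
| 2 | 0.64 | B₄ | 1 | 1 |
| 3 | 0.92 | B₅ | 1 | 0 |
| 4 | 0.58 | B₃ | 0 | 0 |
| 5 | 0.51 | B₃ | 1 | 0 |
| 6 | 0.85 | B₅ | 0 | 0 |
| 7 | 0.70 | B₄ | 1 | 1 |
| 8 | 0.63 | B₄ | 0 | 1 |
| 9 | 0.83 | B₅ | 1 | 1 |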


B₁ and B₂ don’t contain any samples (in the binary case the maximum probability is always ≥ 0.5). B₃ contains 2 samples, 4 samples fall into bin B₄ and 3 end up in B₅. This already gives us some information for filling in the ECE formula from above. Specifically, we can calculate the empirical probability of a sample falling into bin m: |Bₘ|/n (see red highlight below).

We know that n equals 9 and from the binning process above we also know the size of each bin: |Bₘ| (the size of a set S is written as |S| – for the values, see numbers above). If we split out the formula colour-coded for each bin, this gives us the following:
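With n = 9 and bin sizes |B₃| = 2, |B₄| = 4 and |B₅| = 3 (the empty bins B₁ and B₂ contribute nothing), this reads:

ECE = 2/9 · |acc(B₃) − conf(B₃)| + 4/9 · |acc(B₄) − conf(B₄)| + 3/9 · |acc(B₅) − conf(B₅)|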

From the binning done above we can now also determine conf(Bₘ), which represents the average estimated probability in bin m, defined as follows in the paper:
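conf(Bₘ) = (1 / |Bₘ|) · Σ p̂ᵢ, with the sum running over all samples i in bin Bₘ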

To calculate conf(Bₘ) we take the sum of the maximum estimated probabilities p̂ᵢ of the samples in bin m (table 2) and divide it by the size of the bin |Bₘ|, see below on the right:
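conf(B₃) = (0.58 + 0.51) / 2 = 0.545
conf(B₄) = (0.78 + 0.64 + 0.70 + 0.63) / 4 = 0.6875
conf(B₅) = (0.92 + 0.85 + 0.83) / 3 ≈ 0.8667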


We can then update the ECE calculation with these values:
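ECE = 2/9 · |acc(B₃) − 0.545| + 4/9 · |acc(B₄) − 0.6875| + 3/9 · |acc(B₅) − 0.8667|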

And now we are only left with filling in acc(Bₘ), which represents the accuracy per bin m, defined as follows in the paper:
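acc(Bₘ) = (1 / |Bₘ|) · Σ 1(ŷᵢ = yᵢ), with the sum running over all samples i in bin Bₘ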

1 is the indicator function: when the predicted label ŷᵢ equals the true label yᵢ it evaluates to 1, otherwise to 0. In other words, we count the number of correctly predicted samples per bin m and divide it by the size of the bin |Bₘ|. To do this we first need to determine whether each sample was predicted correctly or not. Let’s use the following colours

and apply them to the last 2 columns; we can then colour the samples in the plot on the right in the same way:


Looking at the right side of the plot above, we can see that in bin B₃ we have 2 samples and 1 correct prediction, meaning that the accuracy for B₃ is 1/2. Repeating this for B₄ gives an accuracy of 3/4 as we have 3 correct predictions and 4 samples in bin B₄. Lastly, looking at B₅ we have 3 samples and 2 correct predictions, so we end up with 2/3. This gives us the following accuracy values for each bin:
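acc(B₃) = 1/2 = 0.5
acc(B₄) = 3/4 = 0.75
acc(B₅) = 2/3 ≈ 0.6667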

We now have all elements for calculating the ECE:
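ECE = 2/9 · |0.5 − 0.545| + 4/9 · |0.75 − 0.6875| + 3/9 · |0.6667 − 0.8667|
    = 2/9 · 0.045 + 4/9 · 0.0625 + 3/9 · 0.2
    ≈ 0.10444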

In our small example of 9 samples we end up with an ECE of 0.10444. A perfectly calibrated model would have an ECE of 0. The larger the ECE, the more miscalibrated the model.
ECE is a helpful first measure and is widely used for assessing model calibration. However, ECE has drawbacks that one should be aware of when using it to measure calibration _(see: Measuring Calibration in Deep Learning)._
Python Code
Numpy
First we will set up the same example from above:
import numpy as np

# Binary Classification
samples = np.array([[0.78, 0.22],
                    [0.36, 0.64],
                    [0.08, 0.92],
                    [0.58, 0.42],
                    [0.49, 0.51],
                    [0.85, 0.15],
                    [0.30, 0.70],
                    [0.63, 0.37],
                    [0.17, 0.83]])

true_labels = np.array([0, 1, 0, 0, 0, 0, 1, 1, 1])
We then define the ECE function as follows:
def expected_calibration_error(samples, true_labels, M=5):
    # uniform binning approach with M number of bins
    bin_boundaries = np.linspace(0, 1, M + 1)
    bin_lowers = bin_boundaries[:-1]
    bin_uppers = bin_boundaries[1:]

    # get max probability per sample i
    confidences = np.max(samples, axis=1)
    # get predictions from confidences (positional in this case)
    predicted_label = np.argmax(samples, axis=1)

    # get a boolean list of correct/false predictions
    accuracies = predicted_label == true_labels

    ece = np.zeros(1)
    for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
        # determine if sample is in bin m (between bin lower & upper)
        in_bin = np.logical_and(confidences > bin_lower.item(), confidences <= bin_upper.item())
        # calculate the empirical probability of a sample falling into bin m: (|Bm|/n)
        prob_in_bin = in_bin.mean()

        if prob_in_bin.item() > 0:
            # get the accuracy of bin m: acc(Bm)
            accuracy_in_bin = accuracies[in_bin].mean()
            # get the average confidence of bin m: conf(Bm)
            avg_confidence_in_bin = confidences[in_bin].mean()
            # calculate |acc(Bm) - conf(Bm)| * (|Bm|/n) for bin m and add to the total ECE
            ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prob_in_bin
    return ece
Calling the function on the binary example returns the same value as we calculated above: 0.10444 (rounded).
expected_calibration_error(samples, true_labels)
In addition to the binary example, we can now also quickly run through a multi-class classification case, using James D. McCaffrey’s example. This gives us 5 target classes and the associated sample confidences. For our calculation we really only need the target indices [0,1,2,3,4] and can, as far as ECE is concerned, ignore the class names they correspond to. Looking at sample i=1, we can see that we now have 5 estimated probabilities, one for each class: [0.25, 0.2, 0.22, 0.18, 0.15].
# Multi-class Classification
samples_multi = np.array([[0.25, 0.2, 0.22, 0.18, 0.15],
                          [0.16, 0.06, 0.5, 0.07, 0.21],
                          [0.06, 0.03, 0.8, 0.07, 0.04],
                          [0.02, 0.03, 0.01, 0.04, 0.9],
                          [0.4, 0.15, 0.16, 0.14, 0.15],
                          [0.15, 0.28, 0.18, 0.17, 0.22],
                          [0.07, 0.8, 0.03, 0.06, 0.04],
                          [0.1, 0.05, 0.03, 0.75, 0.07],
                          [0.25, 0.22, 0.05, 0.3, 0.18],
                          [0.12, 0.09, 0.02, 0.17, 0.6]])

true_labels_multi = np.array([0, 2, 3, 4, 2, 0, 1, 3, 3, 2])
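To reproduce the value reported below, the function has to be called with three bins rather than the default five (three bins appears to be the binning McCaffrey uses; with the default M=5 the result would come out differently):

expected_calibration_error(samples_multi, true_labels_multi, M=3)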
Calling the function on the multi-class example returns 0.192 (which differs from McCaffrey’s calculation by 0.002 due to differences in rounding!).
Give the Google Colab Notebook a go and try it out for yourself in numpy or PyTorch (see below).
You should now know how to calculate ECE by hand and using numpy 🙂
Link to Google Colab Notebook that runs through the binary and multi-class classification examples in both numpy and PyTorch. Note: the code in this article was adapted from the paper’s ECE torch class in their GitHub repo.