Beware of the Hidden Error in Your Test Score

Why you should report confidence intervals on your test set

Hylke C. Donker
Towards Data Science


Fig. 1: The test score of your machine learning model is subject to statistical fluctuations. Image by author.

In experimental sciences, we are used to reporting estimates with error bars and significant digits. For example, when you weigh a sample in the lab, you can read off its mass up to, say, three digits. In machine learning, this is different. When you evaluate your model's accuracy, you get a value that is numerically exact up to machine precision. It is almost as if the accuracy estimate spat out by your model is reliable up to seven decimals. Unfortunately, looks can be deceiving. There is a hidden error in your test score: an insurmountable variation intrinsic to the stochastic nature of data, an error potentially so large that it completely determines the reliability of your model's performance score.

I am talking about statistical fluctuations.

Case

Imagine you've just been brought into a new biotech company as a data scientist. Your task? To predict if a patient needs life-saving surgery using their cutting-edge measurement device. The CEO has expressed great confidence in you, and has allocated €100,000 for your project. Since the technology is still in its infancy, each measurement is fairly expensive, costing €2,500 per sample. You decide to spend your entire budget on data acquisition and set out to collect 20 training and 20 test samples.

(You can follow the narrative by executing the Python code blocks.)

from sklearn.datasets import make_blobs

centers = [[0, 0], [1, 1]]
X_train, y_train = make_blobs(
    centers=centers, cluster_std=1, n_samples=20, random_state=5
)
X_test, y_test = make_blobs(
    centers=centers, cluster_std=1, n_samples=20, random_state=1005
)
Fig. 2: Training data for the positive label (red crosses) and negative label (blue circles). Image by author.

After completing the measurements, you visualise the training dataset (Fig. 2). It is still rather difficult to make out distinct patterns with so little data. You therefore start by establishing a baseline performance using a simple linear model: logistic regression.

from sklearn.linear_model import LogisticRegression

baseline_model = LogisticRegression(random_state=5).fit(X_train, y_train)
baseline_model.score(X_test, y_test)  # Output: 0.85.

Actually, that's not bad: 85 % accuracy on the test set. Having established a strong baseline, you venture into more complex territory. After some deliberation, you decide to give gradient-boosted trees a go, given their success on Kaggle.

from sklearn.ensemble import GradientBoostingClassifier

tree_model = GradientBoostingClassifier(random_state=5).fit(X_train, y_train)
tree_model.score(X_test, y_test)  # Output: 0.90.

Wow! An accuracy of 90 %. Full of excitement, you report your findings back to the CEO. She appears delighted by your great success. Together, you decide to deploy the more complex classifier into production.

Shortly after putting the model into production, you start receiving complaints from your customers. It seems that your model may not perform as well as your test set accuracy suggested.

What’s going on? And what should you do? Roll back to the simpler, but worse performing, baseline model?

Statistical fluctuations

To understand statistical fluctuations, we have to look at the sampling process. When we collect data, we are drawing samples from an unknown distribution. We say unknown because if we knew the data-generating distribution, our task would already be accomplished: we could classify the samples as well as possible (up to the irreducible error).

Fig. 3: Assume that you collect samples from a distribution containing easy cases (correctly classifiable, blue), as well as difficult cases (incorrectly classifiable, red). In small datasets, you have a considerable chance of getting mostly easy, or mostly difficult, cases. Image by author.

Now, colour the easy cases (those your model predicts correctly) blue, and the difficult cases (those it classifies incorrectly) red (Fig. 3, left). By building a dataset, you are essentially drawing a set of red and blue balls (Fig. 3, middle). Accuracy, in this case, is the fraction of blue balls among all balls drawn (Fig. 3, right). Each time you construct a dataset, the fraction of blue balls, and hence your model's accuracy, fluctuates around its "true" value.

As you can see, by drawing only a handful of balls, you have a fair chance of getting mostly red or mostly blue balls: statistical fluctuations are large! As you gather more data, the size of the fluctuations shrinks, and the fraction of blue balls converges to its "true" value.
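To get a feel for how quickly these fluctuations shrink, here is a small simulation (not part of the original analysis; it assumes a hypothetical "true" accuracy of 80 %): every test case is a ball that is blue with probability 0.8, and we repeatedly draw test sets of 20 and of 2,000 balls.

import numpy as np

# Hypothetical setup: the model's "true" accuracy is 80 %,
# i.e., each drawn ball is blue with probability 0.8.
rng = np.random.default_rng(5)
true_accuracy = 0.8

for n in (20, 2_000):
    # Draw 10,000 hypothetical test sets of size n and record the observed accuracy.
    observed = rng.binomial(n, true_accuracy, size=10_000) / n
    print(f"n = {n}: accuracy {observed.mean():.2f} +/- {2 * observed.std():.3f} (2 std)")

# With 20 samples the observed accuracy easily swings by about 0.18;
# with 2,000 samples the swing shrinks to roughly 0.018.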

Another way to think of it is that statistical fluctuations are the errors in your estimates. In experimental sciences, we usually report the mean, µ, and the standard deviation, σ. What we mean by that is that, if µ and σ are correct, we expect the value to fluctuate within [µ-2σ, µ+2σ] about 95 % of the time (assuming Gaussian fluctuations). In machine learning and statistics, we often deal with distributions more exotic than the Gaussian. It is therefore more common to report the 95 % confidence interval (CI): the range containing 95 % of the fluctuations, irrespective of the distribution.
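As a quick, illustrative check (not part of the original article): for Gaussian data the 2.5th-97.5th percentile range practically coincides with [µ-2σ, µ+2σ], but unlike the µ ± 2σ recipe, percentiles also give sensible intervals for skewed distributions.

import numpy as np

rng = np.random.default_rng(0)

# Gaussian data: the 2.5th-97.5th percentile range reproduces mu +/- 2 sigma.
gaussian = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(np.percentile(gaussian, [2.5, 97.5]))  # Roughly [-1.96, 1.96].

# Skewed (exponential) data: mu +/- 2 sigma yields a nonsensical negative lower
# bound, whereas the percentile interval stays within the distribution's support.
skewed = rng.exponential(scale=1.0, size=100_000)
print(skewed.mean() - 2 * skewed.std(), skewed.mean() + 2 * skewed.std())  # Roughly -1 and 3.
print(np.percentile(skewed, [2.5, 97.5]))  # Roughly [0.03, 3.7].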

Let’s put this theory into practice.

Resolution: estimates with error bars

Let's return to your task at the biotech startup: predicting whether a patient needs life-saving surgery. Having learnt about statistical fluctuations, you are beginning to suspect that they may be at the heart of your problem. If my test set is small, then statistical fluctuations must be large, you reason. You therefore set out to quantify the range of accuracies that you might reasonably expect.

One way to quantify the statistical fluctuations in your model's score is a statistical technique called bootstrapping. Bootstrapping means that you repeatedly resample your data with replacement, recompute the metric on each resample, and use the spread of those values to estimate the uncertainty. A helpful Python package is statkit (pip3 install statkit), which we specifically designed to integrate with scikit-learn.
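To make the procedure concrete, here is a minimal do-it-yourself sketch of a percentile bootstrap. The helper bootstrap_ci is hypothetical, and the exact internals of statkit's bootstrap_score may differ; in the remainder we simply let bootstrap_score do this work for us.

import numpy as np
from sklearn.metrics import accuracy_score

def bootstrap_ci(y_true, y_pred, metric=accuracy_score, n_resamples=2_000, seed=5):
    """Hypothetical helper: percentile bootstrap of a test-set metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample (label, prediction) pairs with replacement.
        idx = rng.integers(0, n, size=n)
        scores.append(metric(y_true[idx], y_pred[idx]))
    lower, upper = np.percentile(scores, [2.5, 97.5])
    return metric(y_true, y_pred), (lower, upper)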

You start by computing the confidence interval of the baseline model.

from sklearn.metrics import accuracy_score
from statkit.non_parametric import bootstrap_score

y_pred_simple = baseline_model.predict(X_test)
baseline_accuracy = bootstrap_score(
    y_test, y_pred_simple, metric=accuracy_score, random_state=5
)
print(baseline_accuracy)  # Output: 0.85 (95 % CI: 0.65-1.0)

So while the accuracy of your baseline model was 85 % on the test set, we can expect the accuracy to lie between 65 % and 100 % most of the time. Evaluating the accuracy range of the more complex model,

y_pred_tree = tree_model.predict(X_test)
tree_accuracy = bootstrap_score(
    y_test, y_pred_tree, metric=accuracy_score, random_state=5
)
print(tree_accuracy)  # Output: 0.90 (95 % CI: 0.75-1.0)

We find that it is about the same (between 75 % and 100 %). So contrary to what you and the CEO initially believed, the more complex model is not really better.

Having learnt from your mistake, you decide to roll back to your simpler baseline model. Wary of angering more customers, you clearly communicate the bandwidth of your model's performance and stay in close contact with them to get feedback early. After some time of diligent monitoring, you manage to collect additional data.

X_large, y_large = make_blobs(
    centers=centers, cluster_std=1, n_samples=10000, random_state=0
)

These additional measurements allow you to more accurately estimate performance.

baseline_accuracy_large = bootstrap_score(
    y_large,
    baseline_model.predict(X_large),
    metric=accuracy_score,
    random_state=5
)
print('Logistic regression:', baseline_accuracy_large)
# Output: 0.762 (95 % CI: 0.753-0.771)

tree_accuracy_large = bootstrap_score(
    y_large,
    tree_model.predict(X_large),
    metric=accuracy_score,
    random_state=5
)
print('Gradient boosted trees:', tree_accuracy_large)
# Output: 0.704 (95 % CI: 0.694-0.713)

The larger dataset confirms: your simpler baseline model was indeed better.

Conclusion

Don't be deceived by your test scores: they may be a statistical fluke. Especially for small datasets, the error due to statistical fluctuations can be large. Our advice: embrace the unknown and quantify the uncertainty in your estimates using 95 % confidence intervals. This will prevent you from being caught off guard when real-world performance turns out lower than the test set's point estimate suggested.

Acknowledgements

I would like to thank Rik Huijzer for proofreading.
