Digit Significance in Machine Learning

An End to Chasing Statistical Ghosts

Vincent Vanhoucke
Towards Data Science

--

The ghost in the Gaussian. Drawing from the author.

Standards for reporting performance metrics in machine learning are, alas, not often discussed. Since there doesn’t appear to be an explicit, widely shared agreement on the topic, I thought it might be interesting to offer the standard that I have been advocating, and that I attempt to follow as much as possible. It derives from a simple premise, one my science teachers drilled into me from middle school onward:

A general rule for scientific reporting is that every digit you write down ought to be ‘true,’ for whichever definition of ‘true’ is applicable.

Let’s examine what this means for statistical quantities such as test performance. When you write the following statement in a scientific publication:

The test accuracy is 52.34%.

you are stating that, to the best of your knowledge, the probability of success of your model on unseen data drawn from the test distribution lives between 0.52335 and 0.52345.

This is a very strong statement to be making.

Consider a test set of N samples drawn IID from the correct test distribution, on which your model scores s successes. The number of successes can be modeled as a binomial random variable, and the probability of success p estimated from the sample mean: p = s / N

Its standard deviation is: σ = √(p(1-p))

Which is bounded above by 0.5, a bound attained at p = 0.5.

Under the Normal approximation, the standard deviation of the estimator is: δ = σ/√N

This error δ on the accuracy estimate looks like this, in the worst case of ~50% accuracy:

Table of sample size vs. error. From the author.

In other words, in order to report the 52.34% accuracy of the example above with any confidence, the size of your test set should be at least on the order of 30M samples! This crude analysis readily translates to any countable quantity besides accuracy, though not to continuous figures like likelihood or perplexity.
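To make the rule of thumb concrete, here is a small Python sketch (the function names are my own, for illustration) that computes the worst-case error δ = 0.5/√N and, conversely, the sample size at which that error drops below one unit of the last reported digit:

import math

def worst_case_error(n_samples: int) -> float:
    # Standard error of an accuracy estimate in the worst case, p = 0.5.
    return 0.5 / math.sqrt(n_samples)

def samples_needed(error: float) -> float:
    # Sample size at which the worst-case error shrinks to `error`.
    return (0.5 / error) ** 2

# Taking one unit of the last reported digit as the error budget:
# XX%    -> 1%    -> ~2,500 samples      (the "3,000" rule of thumb)
# XX.X%  -> 0.1%  -> ~250,000 samples    (the "300k" rule of thumb)
# XX.XX% -> 0.01% -> ~25,000,000 samples (the "30M" rule of thumb)
for label, error in [("XX%", 1e-2), ("XX.X%", 1e-3), ("XX.XX%", 1e-4)]:
    print(f"{label:>7s} -> ~{samples_needed(error):,.0f} samples")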

Here are some illustrations on some common machine learning datasets:

How many digits of accuracy can one reasonably report on ImageNet? Accuracies are in the ~80% range, and the test set is 150k images:

√(0.8*0.2/150000) ≈ 0.103%

Which means you can almost report XX.X% figures, and in practice everyone does.

How about MNIST, with accuracies in the 99% range and a test set of 10k images:

√(0.99*0.01/10000) ≈ 0.099%

Phew, just OK to report XX.X% as well!
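The same one-liner reproduces both figures above; a throwaway snippet, using the accuracies and test set sizes quoted in the text:

import math

def accuracy_error(p: float, n_samples: int) -> float:
    # Standard error of an accuracy estimate p measured on n_samples examples.
    return math.sqrt(p * (1 - p) / n_samples)

print(f"ImageNet: {accuracy_error(0.80, 150_000):.3%}")  # -> 0.103%
print(f"MNIST:    {accuracy_error(0.99, 10_000):.3%}")   # -> 0.099%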

The biggest caveat, however, is that in most cases performance figures are not reported in isolation, but are used to compare multiple methods on the same test set. In that scenario, the sampling variance shared between the arms of the experiment cancels out, and the difference in accuracy between them may be statistically significant even with a smaller sample size. A simple way to estimate the variance of such a comparison is bootstrap resampling; a more rigorous, and generally tighter, approach is a paired difference test or, more generally, an analysis of variance.
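As an illustration of the bootstrap approach, one can resample the per-example correctness of the two models jointly, so that the comparison stays paired. A minimal sketch, assuming each model’s results are available as a 0/1 vector of per-example correctness (the function name and the 95% level are my choices):

import numpy as np

def paired_bootstrap_ci(correct_a, correct_b, n_boot=10_000, alpha=0.05, seed=0):
    # Bootstrap confidence interval for the accuracy difference between two
    # models evaluated on the same test set.
    rng = np.random.default_rng(seed)
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    n = len(correct_a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test examples, keeping pairs
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

If the resulting interval excludes zero, the difference between the two models is unlikely to be an artifact of test set sampling; if it straddles zero, the extra digits separating them are not worth reporting.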

It can be tempting to report digits beyond their intrinsic precision, since performance figures can appear more significant when set against a baseline, or when one treats the test set as fixed rather than as a sample drawn from the test distribution. This practice leads to surprises when models are deployed in production and the fixed-test-set assumption suddenly vanishes, taking the insignificant improvements with it. More generally, it leads straight down the dark path of overfitting to the test set.

So what does it mean for a digit to be ‘true’ in our field? Well, it’s complicated. For engineers, it’s easy to argue that dimensions should not be reported beyond tolerances; for physicists, that physical quantities should not be reported beyond measurement errors. Machine learning practitioners have to contend not only with the sampling uncertainty of the test set, but also with the model uncertainty across independent training runs, different initializations, and shufflings of the training data.

By that standard, it can be very difficult to ascertain which digits are ‘true’ in machine learning. The antidote, of course, is to report confidence intervals whenever possible. Confidence intervals are a more granular way to report uncertainty: they can fold in all sources of randomness and support significance testing beyond simple variance estimates. Their presence also signals to your reader that you’ve thought through the meaning of what you report, beyond the number your code spits out. A figure presented with a confidence interval can be reported beyond its nominal precision, though beware: you now have to decide how many digits to report the uncertainty itself with, as this blog post explains. It’s turtles all the way down.
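For the test set sampling component alone, that can be as simple as the sketch below, using the Normal approximation from earlier; the 95% level (z ≈ 1.96) and the one-decimal rounding are my choices, not a universal convention:

import math

def accuracy_with_ci(successes: int, n_samples: int, z: float = 1.96):
    # Accuracy estimate and half-width of a Normal-approximation 95% interval.
    p = successes / n_samples
    half_width = z * math.sqrt(p * (1 - p) / n_samples)
    return p, half_width

p, hw = accuracy_with_ci(5234, 10_000)
print(f"accuracy = {p:.1%} ± {hw:.1%}")  # -> "accuracy = 52.3% ± 1.0%"

On a 10k-sample test set, the interval makes it immediately clear that the last two digits of ‘52.34%’ carry no information.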

Fewer digits make for less clutter and better science.

Avoid reporting digits that are beyond statistical significance, unless you provide an explicit confidence interval for them. Reporting them anyway can rightfully be considered scientific malpractice, particularly when they are used to argue that one number is better than another in the absence of paired significance testing. Papers are routinely rejected on this basis alone. It is a healthy habit to always be skeptical of accuracy figures reported with a large number of digits. Remember the 3000, 300k, and 30M rule-of-thumb sample sizes required, in the worst case, for each additional reported digit to be statistically meaningful, and use them as a ‘smell test.’ It will save you from chasing statistical ghosts.

(With thanks to a number of colleagues who contributed invaluable comments to earlier versions of this article.)

--

I am a Distinguished Scientist at Google, working on Machine Learning and Robotics.