Why measuring accuracy is hard (and very important)!
Part 1. Why measuring right is important.
We are kicking off a new series of blog articles on measuring accuracy. There is a lot of content to cover — when I first started writing this article, it quickly turned into 12 hours of keyboard bashing and 20 pages of thoughts and notes. So we will be breaking it up into 4 separate articles:
- An overview of the problem of measuring accuracy, why it’s important, and why it’s hard
- Detailed description of most of the major problems that come up in measuring accuracy
- Detailed description of some of the more unusual and interesting situations that we have in measuring accuracy
- What we can do about it — how to measure accuracy better
The problem of measuring accuracy
Why is measuring the accuracy of a machine learning model hard? Many machine learning models would appear to have a very simple definition of accuracy.
Let's go over a simple example — an algorithm that detects diabetes. The algorithm reports that an individual does in fact have diabetes. That prediction is either correct (the individual did in fact have diabetes), or wrong (the individual did not have diabetes). It would seem that accuracy is defined by a pretty simple question: how many of the predictions were actually correct? It would seem that measuring the accuracy of a model should be straightforward.

A slightly closer look at the question, and at how the algorithm is used in production, will give us a more nuanced answer. The typical disease-prediction algorithm does not give a bare TRUE vs FALSE prediction of disease, with no other information. Even the most basic implementations will at least provide a confidence score, varying between 0% and 100%. The algorithm could predict a 73% confidence of diabetes, and a 27% confidence of no diabetes. The doctor can then use the confidences, combined with their own knowledge and analysis, to make up their mind. But what is accuracy then? Is the algorithm expected to be 73% accurate when it reports a 73% confidence?
What does accuracy mean in this situation? Does accuracy mean the number of answers with more than 50% confidence that were correct? Is 50% the right threshold? How do we count the null answers? What if both yes and no have less than the required confidence? How is that counted? A slightly more nuanced and experienced data scientist will then arrive at the Receiver Operating Characteristic curve, or ROC curve. The ROC is no longer a single number, but a curve representing the trade-off between false positives and true positives that occurs when you set the threshold at different spots. It shows, for each confidence threshold, how many mistakes you will get versus how many correct answers. Here is an example of an ROC curve for a chat-bot. This one is pretty good:

Here is another one taken from the Scikit-learn website, which is probably more realistic for a medical classifier:

So now we don’t just have one number for accuracy. We have our choice of different accuracies. Now I should mention that ROC curves come with an associated single-number metric, AUC or Area Under the Curve, which is literally the area underneath the ROC curve. AUC is a robust and very useful metric, but it is not easy to explain to a stakeholder and has no intuitive interpretation for us to lean on. Maybe then we resort to using precision and recall? We find some acceptable point on the ROC curve, set the threshold to that point, and use these two new numbers to describe its outputs. But this quickly gets complicated when you have multiple classes in your outputs — we now have a whole series of precisions and recalls — and it quickly becomes bewildering to try to explain how a model behaves and performs.
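To make the "family of accuracies" concrete, here is a minimal sketch with scikit-learn. The labels and confidence scores below are made up for illustration (they are not from any real diabetes model): the ROC curve gives one false-positive/true-positive pair per threshold, AUC collapses the curve to a single number, and picking one operating point yields a precision and recall for that threshold only.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])      # ground truth (made up)
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2,           # model confidences (made up)
                    0.9, 0.3, 0.73, 0.6, 0.05])

# The ROC curve: one (false positive rate, true positive rate) pair per
# candidate threshold -- a whole family of possible "accuracies".
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC collapses the curve back into a single number.
auc = roc_auc_score(y_true, y_score)

# Picking one operating point (here, the conventional 0.5 threshold)
# gives a single precision and recall for that threshold only.
y_pred = (y_score >= 0.5).astype(int)
print("AUC:", auc)                                       # 0.96
print("precision:", precision_score(y_true, y_pred))     # 1.0
print("recall:", recall_score(y_true, y_pred))           # 0.8
```

Note that moving the threshold away from 0.5 would trade precision against recall — exactly the choice the ROC curve lays out.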
The measurement of accuracy can even be affected by raw mathematics, which can make an improvement actually look worse. This example came to me when I was reading the Google AI blog article on a diabetes detection algorithm: http://ai.googleblog.com/2018/12/improving-effectiveness-of-diabetic.html
They mention that they switched their algorithm from doing a binary classification (diabetic / not diabetic) to one on a 5-point severity scale. The data being analyzed was the same — only the outputs changed. But now think about this: on a binary classification, the system can guess totally randomly and has a 50–50 chance of getting the correct answer. On a 5-point grading system, the system only has a 20% chance of getting the correct answer when it guesses randomly.
So now imagine you switch from a 2-point to a 5-point grading system in order to give your model richer information to learn from. If you measure the raw accuracy, however, you might notice that your switch reduced the accuracy. In the 5-point grading system, the system has a much lower chance of getting the correct answer by chance alone. The authors of the article get around this by using a Kappa score (which adjusts for the chance of a randomly correct answer) rather than vanilla accuracy. But the basic problem gets to the heart of my topic in this article.
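A quick sketch of this effect, using simulated labels rather than anything from the Google study: a uniform random guesser looks more than twice as "accurate" on a 2-class problem as on a 5-class one, while Cohen's kappa (the chance-adjusted score available in scikit-learn) stays near zero in both cases.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
n = 100_000  # enough samples to make the chance rates obvious

for n_classes in (2, 5):
    y_true = rng.integers(0, n_classes, size=n)
    y_pred = rng.integers(0, n_classes, size=n)   # pure random guessing
    acc = accuracy_score(y_true, y_pred)          # ~1/n_classes
    kappa = cohen_kappa_score(y_true, y_pred)     # ~0 regardless of n_classes
    print(f"{n_classes} classes: accuracy={acc:.3f}, kappa={kappa:.3f}")
```

Raw accuracy rewards the guesser just for there being fewer classes; kappa does not, which is why it is the fairer yardstick when the number of output classes changes.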
Measuring your accuracy just isn’t as straightforward as it might seem. And how we measure accuracy will alter what changes we make to our model to improve it, how much we trust a model, and most importantly, how stakeholders like business people, engineers, government, healthcare, or social services organizations adopt, integrate and use these algorithms.
Why measuring accuracy correctly is important
Why measure accuracy? This might seem to be an easy question to answer. Without measuring accuracy, there is no way to know if your model is working or not. Unlike regular code, which can be tested with the prior assumption that it works perfectly, 100% of the time, as designed, machine learning code is expected to fail on some number of samples. So measuring that exact number of failures is the key to testing a machine learning system.
But I want to take a moment to touch on some of the reasons that we measure accuracy and why it becomes important to do it right.
Improving your Model
The first and easiest-to-understand reason you measure the accuracy of a model is so that you can improve its accuracy. When you are trying to improve accuracy, you can use almost any metric. As long as that metric has a clearly defined better and worse, the exact value of the metric doesn’t matter. What you care about is whether the metric is improving or getting worse. So what can go wrong?
Well, if you measure the accuracy of your model incorrectly, you could actually be modifying your model in ways that would hurt your real-world performance, while they appear to be improving your metric. Take, for example, the problem of generalization vs overfitting.

If you are measuring your accuracy incorrectly, you could be making changes that appear to improve your metric, but are instead just making your model overfit the data your metric is measured against.
The standard way of solving this problem is to break your data 80/20 into training / testing. But this is also fraught with difficulty, since we sometimes use measurements of accuracy to do things like early stopping or setting confidence thresholds, so the testing data itself then becomes part of your training process. You could be over-fitting your confidence thresholds. So then you decide to break your data three ways, 70/20/10, with an extra validation set you can measure your accuracy with at the end. But now what if your dataset is relatively small, or it’s not perfectly representative of the real-world data it has to operate on?
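One way to sketch that 70/20/10 split in scikit-learn is two calls to `train_test_split` (the features and labels here are stand-ins, and the proportions are just the ones from the text): carve off the held-out validation set first, then split the remainder into train and test.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # stand-in features
y = np.arange(1000) % 2              # stand-in labels

# First carve off the 10% validation set, held out until the very end.
X_rest, X_val, y_rest, y_val = train_test_split(
    X, y, test_size=0.10, random_state=0)

# Then split the remaining 90%: 2/9 of it is the 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=2 / 9, random_state=0)

print(len(X_train), len(X_test), len(X_val))   # 700 200 100
```

The validation set only protects you if you genuinely never touch it until the end — the moment you use it to tune a threshold, it has quietly joined the training process.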
You now have to worry about another type of over-fitting I call architectural overfit, where the design and parameters of your model become too perfectly tuned to the dataset, and do not generalize to new samples or learn them very well when they are added to the dataset. This can happen, for example, if you prepare a lot of custom-made features based on your dataset, only to find they don’t apply when the dataset is grown over time, modified significantly, or merged with some other dataset. You get excellent training, testing, and validation accuracy. But you’re still overfitting the dataset.
Now what if your dataset has noise? What if there are consistent mistakes in the data? You might think your model is amazing — and indeed find that it is: it has perfectly learned the consistent patterns of mistakes in the dataset. You gleefully put the model into the next product, only to have your ass handed to you when the product is launched. It seems your model does not work so great in the real world. “It was perfectly accurate in testing,” you might think. “What could have gone wrong?”
What if your accuracy measurement itself has noise? Let’s say you have a 1–3% spread in accuracy across multiple runs. This makes it harder to make incremental improvements. Every improvement needs to be larger than that spread for you to be able to reliably confirm it with a single run. Either you spend more CPU power to get clear answers using averages over many runs, or you risk spinning in circles, looking only for big wins and forgoing incremental improvements.
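A small simulation of the noise problem, with an assumed per-run standard deviation of 1.5% (a hypothetical figure, chosen to sit inside the 1–3% spread above): a single run of an 80%-accurate model and an 81%-accurate model can easily land in the wrong order, but averaging n runs shrinks the noise by a factor of √n.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.015                     # assumed per-run accuracy noise (1.5%)
runs = 25

# Simulated measured accuracies for two models that truly differ by 1%.
model_a = rng.normal(0.80, sigma, size=runs)   # true accuracy 80%
model_b = rng.normal(0.81, sigma, size=runs)   # true accuracy 81%

# Single runs overlap heavily; the means separate as the noise averages out.
print("one run each:", model_a[0], model_b[0])
print("averaged:", model_a.mean(), model_b.mean())
print("std error of the mean:", sigma / np.sqrt(runs))   # 1.5% / 5 = 0.3%
```

With 25 runs, the standard error drops from 1.5% to 0.3%, so a genuine 1% improvement becomes distinguishable from noise — at 25× the compute cost, which is exactly the trade-off described above.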
Measuring your accuracy better means that when you make changes to your model, you can be confident as to whether those changes are leading to a better model in the ways you care about. When you measure your accuracy wrong, you can end up tearing up changes or going back to the drawing board, because that model you “thought” was 99.9% accurate doesn’t actually work anywhere close to that in production. Better measurement of accuracy means faster research, better products, and more accolades for you. It can even save lives.
Communicating to Stakeholders that use our Models
Another reason we measure the accuracy of a model is so that we can communicate it to the stakeholders who use our models. Models are never just pieces of math and code — they must sit and operate in the real world, having real effects on real people’s lives.
If a doctor is going to use an algorithm to make medical decisions, it’s important for them to know that the algorithm could be wrong, and how often that is the case. If a company is going to replace a team of data-entry people with a computer, it’s important for them to know how often it makes mistakes, because that can affect the company’s processes. If we claim a model will only make mistakes 3% of the time, but it actually makes them 5% of the time, we might brush that off as a small difference. But that could represent a 67% increase in calls made to the support department by all the people affected by the algorithm’s mistakes. That increase in cost could completely nullify any benefits of implementing the algorithm in the first place.
Stakeholders need to understand the accuracy of an algorithm, and its typical failure cases, because accuracy has real-world implications. The accuracy could affect budgets and balance sheets, the lives and health of real people, and even the outcome of our democracy (when it comes to fact-checking algorithms now used by journalists). It could make or break new AI products, and create real disconnects between the engineers who create the technology and the consumers who use it. Making mistakes in the measurement of accuracy could very well mean that lives are lost and new products fail.
Why Measuring Accuracy is Hard
So what is it about measuring accuracy that makes it so difficult? Why has a seemingly easy question become so difficult to answer?
In part 2 of this series, we will go over some of the common problems that show up in measuring accuracy:
- The data you’re training the algorithm on is not the same as the data it’s expected to work on in production
- You care about certain types of failures more than other types
- Your model, dataset or measurement may be inherently noisy or stochastic
- Your pipeline may have multiple different points at which accuracy can be measured
- Your model may have several different metrics with different levels of granularity
- You might only have ground-truth data for an intermediate step in the system, but not the final result
- Your dataset might be broken down into different categories with wildly different performance between them
In part 3 of this series, we will address some of the more difficult and interesting problems that we face when measuring accuracy:
- You may not have any ground-truth data for your model
- There may be no off-the-shelf metric which measures accuracy for your model
- There may not be a clear way to define accuracy for your model
- The outcome you actually care about can’t be measured easily
- Measuring accuracy effectively is too computationally expensive
- Your algorithm might be working in tandem with humans
- Your dataset might be changing and evolving, for example if it’s being actively grown by a data annotation team
- Your problem space might be continuously evolving over time
In light of all these potential problems in measuring accuracy, I’ve come to appreciate a basic piece of wisdom: no matter how you measure it or what metric you use, we can usually agree on what perfectly correct and completely wrong look like. It’s everything in between that matters.
Look for Part 2 of this series coming out soon!
Originally published at www.electricbrain.io.