Tips and Tricks, PRACTICAL INSIGHTS
What should you do to avoid misuse when assessing the performance of a classifier?

Would you consider a classifier with 99% accuracy to be good or bad? A good chunk of people would assume the correct answer is "good". It's shocking, because the correct answer is neither.
Let that sink in.
I have been in numerous meetings where people assume that a model must be good if the associated performance metric is above a specific value. As a data scientist, you should know better.
Let me explain. Without context, no data scientist should have a threshold in mind for any performance metric that decides whether a model is good or not. Data science is applied science, and it is heavily context-dependent. I will illustrate this with two widely used performance metrics: accuracy and the area under the Receiver Operating Characteristic (ROC) curve. Both are widely used to assess the performance of a classifier.
When you finish reading this article, you will clearly understand why the correct answer to the opening question is neither.
Use and Misuse of ROC
The Use
I assume you already understand what an ROC curve is, so I will only introduce it briefly here. However, I have a more detailed introduction if you are new to it or a little rusty.
The ROC curve is one of the most widely used tools for assessing a classifier's performance.
Briefly, the ROC plots sensitivity on the y-axis and 1-specificity on the x-axis. Sensitivity estimates the proportion of positive cases that are correctly identified. Specificity measures the proportion of negative cases that are correctly identified. There is a trade-off between sensitivity and specificity depending on the choice of threshold.
Rather than deciding on a single threshold, we can consider all possible pairs of sensitivity and specificity by varying the threshold from one extreme to the other. The ROC plot does just that: it plots every possible pair of sensitivity and specificity values, creating a "curve".
A single metric that summarises the overall performance, rather than looking at the trade-off at each threshold, is the AUC (Area Under the Curve). The ROC is great for comparing classifiers: the classifier with the higher AUC is chosen for further consideration.
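To make this concrete, here is a minimal sketch of how such a comparison is typically done in Python with scikit-learn. The synthetic dataset, the train/test split, and the two models are illustrative assumptions, not part of the examples discussed in this post.

```python
# Minimal sketch: comparing two classifiers by AUC on a synthetic dataset.
# The dataset, the models, and the split are illustrative assumptions only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]        # probability of the positive class
    fpr, tpr, thresholds = roc_curve(y_test, scores)  # every (1 - specificity, sensitivity) pair
    print(f"{name}: AUC = {roc_auc_score(y_test, scores):.3f} "
          f"({len(thresholds)} candidate thresholds on the curve)")
```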

The Misuse
However, this is where the problem begins. We don't only use the ROC for selecting the better model; more often than not, we also present it as the final part of our data science story. ROC metrics should not be presented as the final result of a data science project.
This is because an ROC represents a model operating at all possible thresholds. In practice, however, a deployed model classifies using a single threshold.
Using the ROC as the final part of the analysis is akin to saying: we don't know which threshold would work, so let's consider all possibilities, plot the results, summarise them with the AUC, and leave it at that.
Sounds lazy, doesn't it?
This approach is adequate for choosing one model over another, but it is insufficient for figuring out whether the proposed model can solve the problem at hand. That requires further thinking and analysis that data scientists are not formally trained in (and that you won't find in a typical course).
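To make the contrast concrete, here is a minimal sketch of the step the ROC alone skips: committing to a single operating threshold and reporting the sensitivity and specificity the model actually delivers there. It reuses `scores` and `y_test` from the sketch above, and the 0.85 sensitivity requirement is purely illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Assumption: `scores` and `y_test` come from the earlier ROC sketch.
fpr, tpr, thresholds = roc_curve(y_test, scores)

required_sensitivity = 0.85                    # illustrative requirement from the problem context
idx = np.argmax(tpr >= required_sensitivity)   # first (i.e. highest) threshold meeting it
operating_threshold = thresholds[idx]

print(f"threshold = {operating_threshold:.3f}, "
      f"sensitivity = {tpr[idx]:.3f}, specificity = {1 - fpr[idx]:.3f}")

# In deployment the model runs at this single threshold, not across the whole curve:
y_pred = (scores >= operating_threshold).astype(int)
```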
You see, people are commonly taught that an AUC above 0.5 means you are doing better than chance. While this is true, it doesn't by itself make a model useful.
A model would be useful if it’s better than the status quo. The status quo is not always a random classifier.
In a situation with no existing solution, an AUC of 0.68 could be useful. One practical example is our previous work on developing an early warning system for patients with a chronic disease (Chronic Obstructive Pulmonary Disease, for anyone interested in digging further). The system was meant to help patients manage their condition themselves at home. Because the status quo in this particular case was doing nothing at all, a model with an AUC of 0.68 was useful.
However, a model with an AUC of 0.80 that provides early warning to help doctors better manage patients in hospitals is useless if the existing solution already provides better sensitivity and specificity.
The bottom line? You can't figure out whether a model with a given AUC is good or useless without context. A model with an AUC of 0.80 could be useless, and a model with an AUC of 0.60 could be good, depending on the context.
You will likely work with all sorts of people, from domain experts to software developers to marketing and sales. The ROC is a great tool, but it is not simple for beginners. As a data scientist, you should first be clear on what a good AUC is in the given context.
If a meeting ever regresses to fixating on a specific AUC value, you had better be ready with evidence to explain that treating any specific AUC value as a gold standard is plain nonsense.
Use and Misuse of Accuracy
Accuracy is perhaps the most intuitive performance metric for everyone to comprehend. It is typically reported as a percentage: a classifier with 99% accuracy is correct 99 times out of every 100 outputs. We are primed to believe that any accuracy above 50% is good and, certainly, that anything in the 90s must be amazing. Again, this is a fallacy that needs correcting.
Rare Disease Case
Let's imagine a simple, hypothetical scenario. There is a disease outbreak that infects lots of people. However, it is deadly for only 1% of those infected; for the remaining 99%, there is no impact on health whatsoever.
Our task is to build a classifier that can identify the 1% of people for whom the infection is deadly so that they can take a protective pill. We can't give the pill to everyone because of its side effects.
Let's assume we have training data on 10,000 people, of whom 100 sadly died. One completely useless classifier would be one that predicts that nobody will die from the infection. That classifier would still have an accuracy of 99%, because it is correct for 9,900 of the 10,000 people.
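Here is a minimal sketch of that arithmetic in code, where the "classifier" is simply a constant prediction that nobody dies:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 people, 100 of whom die (label 1); the rest survive (label 0).
y_true = np.zeros(10_000, dtype=int)
y_true[:100] = 1

# A "classifier" that always predicts that nobody will die.
y_pred = np.zeros(10_000, dtype=int)

print(f"accuracy = {accuracy_score(y_true, y_pred):.2%}")  # 99.00%
print(f"recall   = {recall_score(y_true, y_pred):.2%}")    # 0.00% -- misses every at-risk patient
```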
There is your answer to the opening question. In the above scenario, the classifier with 99% accuracy was just useless. We can’t tell if a classifier with 99% accuracy is good or bad unless the context is provided.
The rare disease scenario is not uncommon in the real world. There are several data science applications where the classes are imbalanced and at times, acutely imbalanced.
What Should You Do If Classes Are So Imbalanced?
Instead of using accuracy as a metric, you need to adopt alternative metrics that focus on the rare class. A precision/recall curve is appropriate in such cases. Precision measures the proportion of cases predicted to be positive that are actually positive (the positive predictive value). Recall measures the proportion of actual positive cases that are correctly identified (the same as sensitivity). A single summary metric that combines precision and recall is the F1 score (the harmonic mean of precision and recall).
There are several other approaches one can take; precision/recall is just one of many ways.
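As a minimal sketch of the precision/recall route, again using scikit-learn and assuming `y_test` and `scores` from the earlier snippets (where `scores` are the model's predicted probabilities for the rare, positive class):

```python
from sklearn.metrics import average_precision_score, f1_score, precision_recall_curve

# Assumption: `y_test` are the true labels and `scores` are the model's
# predicted probabilities for the rare (positive) class, as in the ROC sketch.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
print(f"average precision (area under the PR curve) = {average_precision_score(y_test, scores):.3f}")

# F1 at a chosen operating threshold (0.5 here, purely illustrative):
y_pred = (scores >= 0.5).astype(int)
print(f"F1 at threshold 0.5 = {f1_score(y_test, y_pred):.3f}")
```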
Final Thoughts
This post was not meant to provide an exhaustive list of performance metrics (and dwell on the use and misuse of each). Instead, it used the two most widely known ones to illustrate how performance metrics are misunderstood and misused.
The onus is on us, the data scientists, to pick the right performance metric in a given context and then elaborate on what the results imply and how they compare with any existing solution.
It’s infinitely better to use just one or two metrics that can then be contextualized rather than present a laundry list of performance metrics with no rhyme or reason.
We will come across people who may not necessarily understand the nuances of the data, and we may get away with cherry-picking a metric for the wrong reason, but… should we ever do that?
"Cowardice asks the question, ‘Is it safe?’ Expediency asks the question, ‘Is it politic?’ Vanity asks the question, ‘Is it popular?’ But, conscience asks the question, ‘Is it right?’ And there comes a time when one must take a position that is neither safe, nor politic, nor popular, but one must take it because one’s conscience tells one that it is right" Martin Luther King Jr.