Background – Simple on the Surface
The metrics used for gauging performance of classification models are fairly straightforward, at least from a mathematical standpoint. Nevertheless, I have observed that many modellers and data scientists encounter difficulty articulating these metrics, and even apply them incorrectly. This is an easy mistake to make, as these metrics appear simple on the surface, yet their implications can be profound depending on the problem domain.
This article serves as a visual guide to common classification model metrics. We will explore the definitions and use examples to highlight where these metrics are applied inappropriately.
A Brief Note on Visualisation
Each visualisation comprises ninety subjects, representing anything we might wish to classify. Blue subjects denote negative samples, whilst red subjects are positive samples. The purple box represents the model, which attempts to identify positive samples. Anything inside this box is what the model predicts as positive.
With that clarified, let’s delve into the definitions.
Precision & Recall
For many classification tasks there is a trade-off between precision and recall. It’s frequently the case that optimising for recall incurs a cost to precision. But what do these terms actually mean? Let’s begin with the mathematical definitions, and then move onto the visual representations.
Precision = TP/ (TP + FP)
Recall = TP/(TP + FN)
Where TP = Number of true positives, FP = Number of false positives, FN = Number of false negatives.
Let’s focus on the chart directly below, in which there are four positive samples. Remember, the model’s positive predictions are represented by the box on the chart. Observing the chart, we see the model correctly predicts all four positive samples, since all of them sit within the box. We can calculate the model’s recall from the chart by dividing the number of positive cases within the box (TP = 4) by the total number of positive cases (TP = 4 + FN = 0), giving a recall of 100%.
Note FN is 0 because there are no positive cases outside of the box.
Precision can be explained similarly. It is simply the number of positive cases in the box (TP=4) divided by the total number of cases in the box (TP = 4 + FP = 6). A straightforward calculation reveals the model’s precision to be just 40%.
You can observe that a model can have high recall but low precision, and vice versa. The chart below shows this, where recall is just 50% while precision is 100%. See if you can work out how to arrive at these numbers.
Here is a clue to help you: the number of false negatives is two, since there are two positive samples outside of the box.
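Once you have had a go yourself, the arithmetic is easy to sanity-check in code. Below is a minimal Python sketch; the TP, FP and FN counts are simply read off the two charts, so nothing here depends on any particular library.

```python
def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


# First chart: all four positives sit inside the box, along with six negatives.
print(precision(tp=4, fp=6))  # 0.4 -> 40% precision
print(recall(tp=4, fn=0))     # 1.0 -> 100% recall

# Second chart: two positives inside the box, no negatives inside it,
# and two positives left outside.
print(precision(tp=2, fp=0))  # 1.0 -> 100% precision
print(recall(tp=2, fn=2))     # 0.5 -> 50% recall
```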
False Positive Rates & True Negative Rates
The false positive rate (FPR) perhaps appears more intuitive, if only because of its name. However, let’s explore the concept in the same way we did for the other metrics. Mathematically, we express the FPR as follows:
FPR = FP/(FP + TN)
Here, TN represents the number of true negative subjects.
Examining the first image again, we can determine the FPR by dividing the number of negative samples inside the box (FP=6) by the total number of negative samples (FP=6 + TN=80). For our first image, the false positive rate is just 7%, and for the second, it’s 0%. Try figuring out why this is the case.
Remember, the subjects inside the box are those that the model predicts are positive. So by extension, the negative samples outside the box are those that the model has identified as negative.
The true negative rate (TNR) can be computed using the following formula:
TNR = TN/(TN + FP)
Notice that the TNR is always one minus the false positive rate, since TN/(TN + FP) and FP/(FP + TN) share the same denominator and sum to one.
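The same counting approach translates directly into code. Here is a minimal sketch using the counts from the first chart (FP = 6 negatives inside the box, TN = 80 negatives outside it):

```python
import math


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn)


def true_negative_rate(tn: int, fp: int) -> float:
    """TNR = TN / (TN + FP)."""
    return tn / (tn + fp)


fpr = false_positive_rate(fp=6, tn=80)
tnr = true_negative_rate(tn=80, fp=6)

print(round(fpr, 2))                 # 0.07 -> roughly 7%
print(round(tnr, 2))                 # 0.93
print(math.isclose(fpr + tnr, 1.0))  # True: the two rates always sum to one
```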
Accuracy
Accuracy is a term that is loosely thrown around in the context of model performance, but what does it actually mean? Let’s start with the mathematical definition:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Using the same logic we previously applied, we can calculate the model’s accuracy as 93% for the first image and 97% for the second (see if you can derive this for yourself). This might be raising red flags in your mind as to why accuracy can be a deceptive metric in some cases. We will explore this in greater detail next.
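As before, the numbers are easy to verify with a couple of lines of Python; the percentages quoted above are simply these fractions expressed as whole percentages:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# First chart: TP=4, TN=80, FP=6, FN=0 across the ninety subjects.
print(accuracy(tp=4, tn=80, fp=6, fn=0))  # 0.933... -> quoted as 93%

# Second chart: TP=2, TN=86, FP=0, FN=2 across the ninety subjects.
print(accuracy(tp=2, tn=86, fp=0, fn=2))  # 0.977... -> quoted as 97%
```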
Using Metrics Correctly
Why do we concern ourselves with these metrics? Because they equip us with ways to assess the performance of our models. Once we comprehend these metrics, we can even determine the commercial value associated with models. This is why it is important to have a good intuition around their appropriate (and inappropriate) use. To illustrate this, we will briefly investigate two common scenarios in classification tasks, namely balanced and imbalanced datasets.
Imbalanced Datasets
The diagrams depicted earlier are instances of imbalanced classification tasks. Put simply, imbalanced tasks have a low representation of positive subjects compared with negative subjects. Many commercial use cases for binary classification fall into this category, such as credit card fraud detection, customer churn prediction, and spam filtering. Selecting the incorrect metrics for imbalanced classification can lead you to have over-optimistic beliefs about the performance of your model.
The primary issue with imbalanced classification is that the number of true negatives can be very high while the number of false negatives remains low in absolute terms, which flatters any metric built on those counts. To illustrate, let’s consider another model and assess it on our imbalanced data. We can create an extreme scenario where the model simply predicts every subject as negative.
Let’s compute each of the metrics in this scenario.
- Accuracy: (TP=0 + TN=86)/(TP=0 + TN=86 + FP=0 + FN=4) = 95%
- Precision: (TP=0) /(TP=0 + FP=0) = undefined
- Recall: (TP=0) / (TP=0 + FN=4) = 0%
- FPR: (FP=0) / (FP=0 + TN=86) = 0%
- TNR: (TN=86) / (TN=86 + FP=0) = 100%
The issues with accuracy, FPR and TNR should start to become more apparent. When we are working with imbalanced datasets, we can produce a high-accuracy model that performs poorly upon deployment. In the previous example, the model has no capacity to detect positive subjects but still achieves an accuracy of 95%, a 0% FPR and a perfect TNR.
Now, imagine deploying such a model to conduct medical diagnostics or detect fraud; it would quite evidently be useless and perhaps even dangerous. This extreme example illustrates the problem of using metrics such as accuracy, FPR, and TNR to assess the performance of models working on imbalanced data.
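To make this failure mode concrete in code, here is a minimal scikit-learn sketch (not the notebook from my repo) that scores an "always predict negative" model on the same 86/4 split used above. Note that scikit-learn's precision_score needs to be told how to handle the undefined 0/0 case via zero_division:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Ninety subjects: 86 negatives (0) and 4 positives (1), matching the chart.
y_true = np.array([0] * 86 + [1] * 4)

# A degenerate model that predicts every subject as negative.
y_pred = np.zeros_like(y_true)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(accuracy_score(y_true, y_pred))                    # ~0.956 -> looks impressive
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 (really undefined: TP + FP = 0)
print(recall_score(y_true, y_pred))                      # 0.0 -> the model finds no positives
print(fp / (fp + tn))                                    # FPR = 0.0
print(tn / (tn + fp))                                    # TNR = 1.0
```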
Balanced Datasets
For balanced classification problems, the number of potential true negatives is significantly smaller than in the imbalanced case.
If we take our "non-discriminatory" model and apply it to the balanced case, we obtain the following results:
- Accuracy: (TP=0 + TN=45) / (TP=0 + TN=45 + FP=0 + FN=45) = 50%
- Precision: (TP=0) / (TP=0 + FP=0) = undefined
- Recall: (TP=0) / (TP=0 + FN=45) = 0%
- FPR: (FP=0) / (FP=0 + TN=45) = 0%
- TNR: (TN=45) / (TN=45 + FP=0) = 100%
While all of the other metrics remain the same, the model’s accuracy has declined to 50%, arguably a much more representative reflection of the model’s actual performance, although accuracy is still deceptive without precision and recall.
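Re-running the previous sketch with a 45/45 split reproduces these figures, with accuracy collapsing to 50% while precision and recall are unchanged:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Balanced case: 45 negatives and 45 positives, same degenerate model.
y_true = np.array([0] * 45 + [1] * 45)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))                    # 0.5
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 (undefined)
print(recall_score(y_true, y_pred))                      # 0.0
```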
ROC Curves vs. Precision-Recall Curves
ROC curves are a common approach used to evaluate the performance of binary classification models. However, when dealing with imbalanced datasets, they too can provide over-optimistic and not entirely meaningful results.
A brief overview of [ROC](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=An%20ROC%20curve%20(receiver%20operating,False%20Positive%20Rate) and Precision-Recall curves: an ROC curve plots the true positive rate (recall) against the false positive rate at different decision thresholds, while a precision-recall curve plots precision against recall. We commonly measure the area under the curve (or AUC) to give us an indication of the model’s performance. Follow the links to learn more about ROC and Precision-Recall curves.
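As a rough illustration of how these areas are computed in practice, here is a sketch on a synthetic imbalanced dataset (not the credit card data discussed below). I use average_precision_score as the usual stand-in for the area under the precision-recall curve:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 1% positive class.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

print(roc_auc_score(y_test, scores))            # ROC AUC: tends to look flattering
print(average_precision_score(y_test, scores))  # AUC-PR: usually noticeably lower here
```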
To illustrate how ROC curves can be over-optimistic, I have built a classification model on a credit card fraud dataset taken from Kaggle. The dataset comprises 284,807 transactions, of which 492 are fraudulent.
Note: The data is free to use for commercial and non-commercial purposes without permission, as outlined in the Open Data Commons license attributed to the data.
Upon examining the ROC curve, we might be led to believe the model’s performance is better than it actually is, since the area under this curve is 0.97. As we have previously seen, the false positive rate can be overly optimistic for imbalanced classification problems.
A more robust approach is to use the precision-recall curve, which gives a more realistic estimate of our model’s performance. Here we can see the area under the precision-recall curve (AUC-PR) is much more conservative at 0.71.
Taking a balanced version of the dataset where fraudulent and non-fraudulent transactions are 50:50, we can see that the AUC and AUC-PR are much closer together.
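For reference, a 50:50 version of such a dataset is typically produced by undersampling the majority class before training. The sketch below assumes the Kaggle data has been downloaded locally as creditcard.csv and uses its usual Class label column (1 = fraud); treat it as an illustration rather than the exact notebook code:

```python
import pandas as pd

# Assumes the Kaggle credit card fraud data is saved locally as creditcard.csv,
# with its usual `Class` column (1 = fraud, 0 = genuine).
df = pd.read_csv("creditcard.csv")

fraud = df[df["Class"] == 1]
genuine = df[df["Class"] == 0].sample(n=len(fraud), random_state=42)

# Concatenate and shuffle to get a balanced 50:50 dataset.
balanced = pd.concat([fraud, genuine]).sample(frac=1, random_state=42)
print(balanced["Class"].value_counts())  # equal counts for both classes
```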
The notebook for generating these charts is available in my GitHub repo.
There are ways to uplift the performance of classification models on imbalanced datasets; I explore these in my article on synthetic data.
Conclusion
Understanding classification model metrics goes beyond the mathematical formulae. You should also understand how each metric should be used, and its implications for both balanced and imbalanced datasets. As a rule of thumb, metrics that rely on the count of true negatives may be over-optimistic when they are applied to imbalanced datasets. I hope this visual tour has given you more of an intuition.
I found this visual explanation handy for articulating the approach to my non-technical stakeholders. Feel free to share or borrow the approach.