
The Receiver Operating Characteristic – Area Under the Curve (ROC-AUC) measure is widely used to assess the performance of binary classifiers. However, it is sometimes more appropriate to evaluate a classifier using the Area Under the Precision-Recall Curve (AUPRC) instead.
We will present a detailed comparison between these two measures, accompanied by empirical results and graphical illustrations. Scikit-learn experiments are also available in a corresponding notebook.
Preliminaries – Calculating the Curves
I’ll assume you’re familiar with precision and recall and the elements of the confusion matrix (TP, FN, FP, TN). In case you need it, the Wikipedia article is a nice refresher.
Now, let’s quickly review the calculations of the ROC curve and the PRC. We will use the illustration below, which greatly contributed to my understanding.
Assume we have a trained binary classifier that predicts probabilities: given a new example, it outputs its probability for the positive class. Next, we take a test set with 3 positives and 2 negatives, compute the classifier’s predicted probabilities, and order them in descending order in the figure below. Between adjacent predictions, we place a threshold and calculate the corresponding evaluation measures: TPR (which is equivalent to Recall), FPR, and Precision. Each threshold represents a binary classifier whose predictions are positive for the points above it and negative for the points below it – the evaluation measures are calculated with respect to this classifier. Putting all of the above in a figure:

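If you prefer code to figures, here is a minimal sketch of the same per-threshold calculations. The labels and probabilities below are made-up values that merely mimic the setup (3 positives, 2 negatives); the exact numbers in the figure may differ.

```python
# Toy test set, sorted by descending predicted probability.
# The probabilities are assumed values for illustration only.
y_true = [1, 1, 0, 1, 0]              # 3 positives, 2 negatives
y_score = [0.9, 0.8, 0.6, 0.4, 0.2]   # assumed predicted probabilities

P = sum(y_true)          # total positives (3)
N = len(y_true) - P      # total negatives (2)

# Place a threshold after each prediction: everything above it is predicted positive.
for k in range(1, len(y_true) + 1):
    tp = sum(y_true[:k])      # positives above the threshold
    fp = k - tp               # negatives above the threshold
    tpr = tp / P              # TPR == Recall
    fpr = fp / N
    precision = tp / k
    print(f"top-{k}: TPR={tpr:.2f}  FPR={fpr:.2f}  Precision={precision:.2f}")
```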
Using these calculations, we can plot the ROC curve and the PRC:

Calculating the area under each of these curves is now simple – the areas are shown in Figure 2. Note that the AUPRC is also called Average Precision (AP), a term coming from the field of Information Retrieval (more on this later).
In sklearn, these calculations are transparent to us and we can simply use [sklearn.metrics.roc_auc_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html) and [sklearn.metrics.average_precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html).
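For example, continuing with the same assumed toy scores as above:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Same toy example: true labels and (assumed) predicted probabilities.
y_true = np.array([1, 1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.4, 0.2])

print("ROC-AUC:    ", roc_auc_score(y_true, y_score))
print("AP (AUPRC): ", average_precision_score(y_true, y_score))

# The underlying curves, in case you want to plot them yourself:
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
```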
Comparing ROC-AUC and AUPRC
Let’s jump straight to the results and discuss the experiment afterward.
In Figure 3, we see two strong models (high AUC) with only a minor difference in their AUC scores – the orange model is slightly better.

However, in Figure 4, the situation is completely different – the blue model is substantially stronger.

Stronger in what sense? And is it actually interesting, given that the ROC curve tells us a different story? Before we answer these questions, let us describe our experiment.
The key here is the distribution of class labels:
- 20 positives
- 2000 negatives
That’s a severe imbalance. Given this data, we simulate the predictions of two models: the first model finds 80% of the positives in its top-20 predictions, while the second model finds 80% of the positives in its top-60 predictions, as illustrated in Figure 5. The rest of the positives are equally distributed among the remaining examples.

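Here is a hedged sketch of such a simulation, assuming all we need is a ranked list of labels with monotonically decreasing scores. The helper simulate_ranking, the seed, and the random placement of the "found" positives inside the top-k are my own illustrative choices – the article’s exact code lives in the repository linked at the end.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def simulate_ranking(n_pos=20, n_neg=2000, top_k=20, found_frac=0.8, seed=0):
    """Build a ranked list of labels in which `found_frac` of the positives
    land inside the model's top_k predictions and the remaining positives are
    spread evenly over the rest of the list. A simplified sketch, not the
    article's original code."""
    rng = np.random.default_rng(seed)
    n_total = n_pos + n_neg
    n_found = int(round(found_frac * n_pos))                    # 16 positives found early
    labels = np.zeros(n_total, dtype=int)
    labels[rng.choice(top_k, size=n_found, replace=False)] = 1  # inside the top-k
    labels[np.linspace(top_k, n_total - 1, n_pos - n_found).astype(int)] = 1  # the rest, evenly spread
    scores = np.linspace(1.0, 0.0, n_total)                     # rank 0 gets the highest "probability"
    return labels, scores

for top_k in (20, 60):
    y, s = simulate_ranking(top_k=top_k)
    print(f"80% of positives in top-{top_k}: "
          f"ROC-AUC={roc_auc_score(y, s):.3f}  AP={average_precision_score(y, s):.3f}")
```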
In other words, the difference between the models is how fast they find positives. Let’s see why that is an important property, and why ROC-AUC fails to capture it.
Explaining the Difference
The x-axis of the ROC curve is the FPR. On unbalanced data, the FPR changes slowly compared to the recall, and this factor drives the entire difference.
To understand this, return to our unbalanced dataset and consider the FPR after seeing the top 100 predictions. Among these 100 examples there are at most 100 negatives (false positives) and at least 80, since there are only 20 positives in total. With 2000 negatives overall, the FPR therefore lies in the interval [80/2000, 100/2000] = [0.04, 0.05]. In contrast, our models already achieve 80% recall at 100 examples, leaving little room for improvement in recall and resulting in a high AUC.
On the other hand, for the PRC, every false positive has a significant effect: the precision drops substantially each time we encounter one. Thus, the second model – the one that finds positives more slowly – exhibits poor performance. But why is precision the interesting quantity here?
Consider the tasks of fraud detection, disease identification, and YouTube video recommendation. They share a similar kind of data imbalance: positive examples are rare. At the same time, the users of our models save a lot of time when the positives are found faster. In other words, the ranks of the positives are critical, even when the difference is only a few spots higher in the ranked list of probabilities. AUPRC captures this requirement, while ROC-AUC fails to do so.
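To put concrete numbers on this, consider the point at which each model has recovered 80% of the positives (16 out of 20): the first model needs only its top-20 predictions to get there, while the second needs its top-60.

```python
# Precision and FPR at the 80%-recall point of each model (derived from the
# experiment's setup: 20 positives, 2000 negatives, 16 positives recovered).
n_pos, n_neg = 20, 2000
tp = 16                                 # 80% of the positives

for top_k in (20, 60):                  # model 1 vs. model 2
    fp = top_k - tp                     # everything else in the top-k is a false positive
    print(f"top-{top_k}: precision={tp / top_k:.2f}  "
          f"FPR={fp / n_neg:.3f}  recall={tp / n_pos:.2f}")
```

Precision falls from 0.80 to roughly 0.27 between the two models, while the FPR only moves from 0.002 to 0.022 – exactly the gap that the PRC exposes and the ROC curve barely registers.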
Explaining the Difference – ROC-AUC As a Probability
ROC-AUC has a nice probabilistic interpretation (additional equivalent interpretations are mentioned in [2]; a proof is available in [4] and [5]).

Namely, ROC-AUC is "the probability that a uniformly drawn random positive has a higher score than a uniformly drawn random negative".
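As a quick sanity check of this interpretation, we can compare roc_auc_score with the fraction of positive/negative pairs in which the positive outscores the negative. The synthetic normal scores and the seed below are arbitrary choices for illustration, not part of the article’s experiment.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)                        # arbitrary fixed seed

# Synthetic scores: positives tend to score higher than negatives.
pos_scores = rng.normal(loc=1.0, scale=1.0, size=20)
neg_scores = rng.normal(loc=0.0, scale=1.0, size=2000)

y_true = np.concatenate([np.ones(pos_scores.size, dtype=int),
                         np.zeros(neg_scores.size, dtype=int)])
y_score = np.concatenate([pos_scores, neg_scores])

# Fraction of (positive, negative) pairs in which the positive gets the higher
# score; ties are negligible for continuous scores.
pairwise = (pos_scores[:, None] > neg_scores[None, :]).mean()

print("ROC-AUC:            ", roc_auc_score(y_true, y_score))
print("P(score+ > score-): ", pairwise)
```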
Let’s think about this interpretation in the face of a severe data imbalance like the one above. When we uniformly draw a random negative, it is most likely a non-interesting or "easy" negative, as this is usually the reason for unbalanced data in the first place – negatives are much easier to collect. Thus, when we uniformly draw a random positive, it is trivial to assign it a higher score than such an "easy" negative, and the probability above will be high.
What interests us is how the positives are scored in comparison to the "hard" negatives – the few negatives that appear at the top of our predictions. ROC-AUC does not distinguish these negatives from the easy ones, but AUPRC does exactly that.
A Note on Ranking
Classification on unbalanced data may be posed as a positives-retrieval task (e.g., web document retrieval), a scenario where we only care about the top-K predictions of our classifier (or ranker). Evaluating the top-K predictions is usually done with Average Precision (AUPRC), as it is the standard measure for evaluating general-purpose retrieval systems [3]. Thus, if your unbalanced task resembles a retrieval task, considering AUPRC is highly recommended.
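For intuition, here is a hedged sketch of the classic (non-interpolated) IR definition of Average Precision – the mean of precision@k taken over the ranks k at which positives appear – next to sklearn’s average_precision_score, which matches it when all scores are distinct. The helper name and the toy ranking are made up for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def average_precision_at_ranks(y_true_sorted):
    """Non-interpolated AP: average precision@k over the ranks k at which the
    positives appear. Assumes the labels are ordered by descending score with
    no tied scores."""
    y = np.asarray(y_true_sorted)
    cum_tp = np.cumsum(y)
    ranks = np.arange(1, len(y) + 1)
    precision_at_k = cum_tp / ranks
    return precision_at_k[y == 1].mean()

# Toy ranking: 1 = relevant/positive, ordered by descending model score.
y_sorted = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
scores = np.linspace(1.0, 0.0, len(y_sorted))   # distinct, decreasing scores

print("AP (manual, IR-style):", average_precision_at_ranks(y_sorted))
print("AP (sklearn):         ", average_precision_score(y_sorted, scores))
```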
The Experiments in Code
To reproduce the results from this article, see the following repository. You can also play with the parameters and check how they affect the results.
[1danielr/rocauc-auprc](https://github.com/1danielr/rocauc-auprc) – Code for the corresponding Medium article
Conclusion
Even though ROC-AUC encapsulates a lot of useful information for evaluation, it is not a one-size-fits-all measure. We conducted experiments to support this claim and provided a theoretical justification using ROC-AUC’s probabilistic interpretation. From this, we concluded that AUPRC can provide substantially more information when dealing with data imbalance.
Overall, ROC-AUC is useful for evaluating general-purpose classification, while AUPRC is the superior measure when classifying rare events.
As a side note, we mentioned that classification in the presence of highly unbalanced data is sometimes better posed as a positives-retrieval task. In the next article, we will review ranking measures, which are tailored for such tasks – consider following me if you’re interested.
References
- [1] Davis, Jesse, and Mark Goadrich. "The relationship between Precision-Recall and ROC curves." ICML, 2006.
- [2] https://stats.stackexchange.com/questions/132777/what-does-auc-stand-for-and-what-is-it
- [3] Buckley, Chris, and Ellen M. Voorhees. "Evaluating evaluation measure stability." ACM SIGIR Forum, 2017.
- [4] https://stats.stackexchange.com/questions/180638/how-to-derive-the-probabilistic-interpretation-of-the-auc
- [5] https://stats.stackexchange.com/questions/190216/why-is-roc-auc-equivalent-to-the-probability-that-two-randomly-selected-samples
Thanks for reading! I would love to hear your thoughts and comments 😃