Unbalanced Data? Stop Using ROC-AUC and Use AUPRC Instead

Advantages of AUPRC when measuring performance in the presence of data imbalance — clearly explained

Daniel Rosenberg
Towards Data Science



The Receiver Operating Characteristic — Area Under the Curve (ROC-AUC) measure is widely used to assess the performance of binary classifiers. However, sometimes, it is more appropriate to evaluate your classifier based on measuring the Area Under the Precision-Recall Curve (AUPRC).

We will present a detailed comparison between these two measures, accompanied by empirical results and graphical illustrations. Scikit-learn experiments are also available in a corresponding notebook.

Preliminaries — Calculating the Curves

I’ll assume you're familiar with precision and recall and the elements of the confusion matrix (TP, FN, FP, TN). In case you need it, the Wikipedia article is a nice refresher.

Now, let’s quickly review the calculations of the ROC curve and the PRC. We will use the illustration below, which greatly contributed to my understanding.

Assume we have a trained binary classifier that predicts probabilities: given a new example, it outputs its probability of belonging to the positive class. Next, we take a test set with 3 positives and 2 negatives and compute the classifier’s predicted probabilities, sorted in descending order in the figure below. Between adjacent predictions, we place a threshold and calculate the corresponding evaluation measures: TPR (which is equivalent to Recall), FPR, and Precision. Each threshold represents a binary classifier whose predictions are positive for the points above it and negative for the points below it; the evaluation measures are calculated with respect to this classifier. Putting the above on a figure:

Figure 1: Calculating the ROC curve and the PRC, given probabilities and ground truths. The points are sorted by positive-class probability (the highest probability is at the top), and the colors green and red represent a positive or a negative label, respectively. Credit to my colleague.
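To make the threshold sweep concrete, here is a minimal sketch that walks a threshold down a ranked list and computes TPR, FPR, and Precision at each cut. The labels and probabilities below are illustrative stand-ins for the data in Figure 1 (the exact values in the figure may differ):

```python
import numpy as np

# Illustrative stand-in for the test set in Figure 1: 3 positives (1) and
# 2 negatives (0), already sorted by predicted probability, highest first.
# The exact probabilities and label order in the figure may differ.
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.4, 0.2])

n_pos = y_true.sum()
n_neg = len(y_true) - n_pos

# Sweep a threshold down the ranked list: everything above it is predicted positive.
for k in range(1, len(y_true) + 1):
    tp = y_true[:k].sum()   # positives among the top-k predictions
    fp = k - tp             # negatives among the top-k predictions
    print(f"top-{k}: TPR/recall = {tp / n_pos:.2f}, "
          f"FPR = {fp / n_neg:.2f}, precision = {tp / k:.2f}")
```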

Using these calculations, we can plot the ROC curve and the PRC:

Figure 2: Plotting the ROC curve and the PRC, given the data depicted in Figure 1.

Calculating the area under each of these curves is now simple — the areas are shown in Figure 2. Note that the AUPRC is also called Average Precision (AP), a term coming from the field of Information Retrieval (more on this later).

In sklearn, these calculations are handled for us: we can simply call sklearn.metrics.roc_auc_score and sklearn.metrics.average_precision_score.
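For the toy data above, both scores take only a couple of lines; the labels and probabilities are again the illustrative stand-ins from the sketch above. If you want to reproduce the plots, sklearn.metrics.roc_curve and sklearn.metrics.precision_recall_curve return the curve points themselves:

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             roc_curve, precision_recall_curve)

y_true = [1, 0, 1, 1, 0]            # illustrative labels, as in the sketch above
y_prob = [0.9, 0.8, 0.7, 0.4, 0.2]  # predicted probabilities for the positive class

print("ROC-AUC:   ", roc_auc_score(y_true, y_prob))
print("AUPRC / AP:", average_precision_score(y_true, y_prob))

# Curve points, in case you want to plot the ROC curve and the PRC yourself.
fpr, tpr, _ = roc_curve(y_true, y_prob)
precision, recall, _ = precision_recall_curve(y_true, y_prob)
```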

Comparing ROC-AUC and AUPRC

Let’s jump straight to the results and discuss the experiment afterward.

In Figure 3, we see two strong models (high AUC) with only a minor difference in their ROC-AUC scores; the orange model appears slightly better.

Figure 3: Two seemingly similar models, where the orange one (“Other Model”) displays a slight advantage.

However, in Figure 4, the situation is completely different — the blue model is substantially stronger.

Figure 4: Two models, where the blue one (“Preferred Model”) displays a substantial advantage.

Stronger in what sense? And does this difference actually matter, given that the ROC curve tells a different story? Before we answer these questions, let us describe our experiment.

The key here is the distribution of class labels:

  • 20 positives
  • 2000 negatives

That’s a severe imbalance. Given this data, we simulate the predictions of two models: the first finds 80% of the positives within its top-20 predictions, while the second needs its top-60 predictions to find 80% of the positives, as illustrated in Figure 5. The remaining positives are spread evenly among the rest of the examples.

Figure 5: Top 100 predictions of the models considered in Figures 3 and 4.
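To give a feel for how such an experiment could be set up, here is a minimal simulation sketch in the spirit of the description above. It is an assumption-laden stand-in for the notebook, not its exact code: the positions of the positives are placed at random, so the resulting scores will not exactly match Figures 3 to 5, but the qualitative gap between ROC-AUC and AP should be visible:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_pos, n_neg = 20, 2000
n_total = n_pos + n_neg
n_top = 16  # 80% of the 20 positives

def simulate_ranking(top_k):
    """Build labels for a ranked list (best score first) in which 16 of the
    20 positives fall inside the top_k predictions and the remaining 4 are
    spread evenly over the rest of the list."""
    y = np.zeros(n_total, dtype=int)
    y[rng.choice(top_k, size=n_top, replace=False)] = 1
    y[np.linspace(top_k, n_total - 1, num=n_pos - n_top, dtype=int)] = 1
    scores = np.linspace(1.0, 0.0, num=n_total)  # scores decrease with rank
    return y, scores

for name, top_k in [("Preferred Model", 20), ("Other Model", 60)]:
    y, s = simulate_ranking(top_k)
    print(f"{name}: ROC-AUC = {roc_auc_score(y, s):.3f}, "
          f"AP/AUPRC = {average_precision_score(y, s):.3f}")
```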

In other words, the difference between the models is how fast they find positives. Let’s see why that is an important property, and why ROC-AUC fails to capture it.

Explaining the Difference

The x-axis of the ROC curve is the FPR. On unbalanced data, the FPR changes slowly compared to the recall, and this factor drives the entire difference.

To understand this, return to our unbalanced dataset and consider the FPR after the top-100 predictions. Since there are only 20 positives in total, at most 100 and at least 80 of those predictions are negatives (false positives), so the FPR lies somewhere in the narrow interval [80/2000, 100/2000] = [0.04, 0.05]. In contrast, both of our models already achieve 80% recall within their top-100 predictions, leaving little room for improvement in recall and resulting in a high ROC-AUC for both.
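A quick sanity check of this arithmetic, using the class counts from our experiment:

```python
n_pos, n_neg, k = 20, 2000, 100

# Among the top-100 predictions, at least 100 - 20 = 80 and at most 100
# are negatives, so the FPR is squeezed into a narrow interval.
print("FPR range:", (k - n_pos) / n_neg, "to", k / n_neg)  # 0.04 to 0.05

# Meanwhile, both models have already found 16 of the 20 positives.
print("Recall already reached:", 16 / n_pos)               # 0.8
```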

On the other hand, for the PRC, a false positive has a significant effect: every false positive near the top of the list substantially decreases the precision. Thus, the “Other Model” exhibits much poorer performance. But why is precision the interesting quantity here?

Consider the tasks of fraud detection, disease identification, and YouTube video recommendations. They share a similar kind of data imbalance: positive examples are rare. However, the users of our models save a lot of time if the positives are found faster. In other words, how highly the positives are ranked is critical, and even moving a positive a few spots up the ranked list of probabilities matters. AUPRC captures this requirement, while ROC-AUC fails to do so.

Explaining the Difference — ROC-AUC As a Probability

ROC-AUC has a nice probabilistic interpretation (additional equivalent interpretations are mentioned in [2], and a proof is available in [4] or [5]).

Namely, ROC-AUC is “the probability that a uniformly drawn random positive has a higher score than a uniformly drawn random negative”.
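This interpretation is easy to verify numerically: repeatedly draw one positive and one negative at random, check which one receives the higher score, and the fraction of wins for the positive converges to the ROC-AUC. A small sketch with synthetic scores (the sizes mirror our 20-positive / 2000-negative setup, but the score distributions are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores for an imbalanced test set: positives tend to score higher.
pos_scores = rng.normal(loc=1.0, size=20)
neg_scores = rng.normal(loc=0.0, size=2000)

y_true = np.r_[np.ones(20), np.zeros(2000)]
y_score = np.r_[pos_scores, neg_scores]

# Monte Carlo estimate of P(random positive scores higher than random negative).
draws = 200_000
wins = rng.choice(pos_scores, draws) > rng.choice(neg_scores, draws)
print("Monte Carlo estimate:", wins.mean())
print("roc_auc_score:       ", roc_auc_score(y_true, y_score))
```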

Let’s think about this interpretation in the face of a severe data imbalance like the one above. When we uniformly draw a random negative, it is most likely an uninteresting or “easy” negative, since this is usually the reason for the imbalance in the first place: negatives are much easier to collect. Thus, when we uniformly draw a random positive, it is trivial to assign it a higher score than this “easy” negative, and the probability above will be high.

What interests us is how the positives are scored in comparison to the “hard” negatives: the few negatives that appear at the top of our predictions. ROC-AUC does not single out these negatives, but AUPRC does exactly that.

A Note on Ranking

Classification on unbalanced data may be posed as a positives-retrieval task (e.g., web document retrieval), a scenario where we only care about the top-K predictions of our classifier (or ranker). The quality of the top-K predictions is usually measured with Average Precision (AUPRC), as it is the standard measure for evaluating general-purpose retrieval systems [3]. Thus, if your unbalanced task resembles a retrieval task, AUPRC is highly recommended.
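If you do adopt the retrieval view, a top-K cut of the ranking is easy to compute. Below is a minimal precision@K sketch on a toy ranking (the helper and the ranking are hypothetical, for illustration only); Average Precision itself can be seen as averaging such precision values at the ranks where the positives occur:

```python
import numpy as np

def precision_at_k(y_true_sorted, k):
    """Fraction of positives among the top-k predictions; y_true_sorted must
    be ordered by decreasing classifier score."""
    return float(np.mean(y_true_sorted[:k]))

# Toy ranking: 1 = positive/relevant, 0 = negative, best-scored example first.
ranking = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])
for k in (3, 5, 10):
    print(f"precision@{k} = {precision_at_k(ranking, k):.2f}")
```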

The Experiments in Code

To reproduce the results from this article, see the corresponding repository. You can also play with the parameters and check how they affect the results.

Conclusion

Even though ROC-AUC encapsulates a lot of useful information, it is not a one-size-fits-all measure. We conducted experiments to support this claim and provided a theoretical justification using ROC-AUC’s probabilistic interpretation. From this, we concluded that AUPRC can provide substantially more information when dealing with data imbalance.

Overall, ROC-AUC is useful for evaluating general-purpose classification, while AUPRC is the better choice when classifying rare events.

As a side note, we mentioned that classification in the presence of highly unbalanced data is sometimes better posed as a positives-retrieval task. In the next article, we will review ranking measures, which are tailored for such tasks — consider following me if you're interested.

Thanks for reading! I would love to hear your thoughts and comments 😃


