
Demystifying ROC and precision-recall curves

Debunking some myths about the ROC curve / AUC and the precision-recall curve / AUPRC for binary classification with a focus on imbalanced…

The receiver operating characteristic (ROC) curve and the precision-recall (PR) curve are two visual tools for comparing binary classifiers. Related to this, the area under the ROC curve (AUC, aka AUROC) and the area under the precision-recall curve (AUPRC, aka average precision) are measures that summarize the ROC and PR curves in single numbers. In this article, we shed some light on these tools and compare them with a focus on imbalanced data (many more 0’s than 1’s). In particular, we present arguments that the folklore "the PR curve is preferred over the ROC curve for imbalanced data since the ROC can be misleading or uninformative" contains less truth than often assumed. Whether this is true depends on the specific application context and, in particular, on how these tools are applied. What’s more, the PR curve can equally well disguise important aspects of prediction accuracy and be misleading when there is class imbalance.

The confusion matrix and two types of errors

The ROC and the PR curves are both based on the confusion matrix. Assume that we have a binary classifier (an algorithm or a model), test data of sample size n, and we make predictions using the classifier on the test data. Every data point in the test data is either a 0 ("a negative") or a 1 ("a positive"). This is the ground truth. Further, every data point is predicted as either a 0 or a 1 by the classifier. This gives four combinations:

  • The true negatives (TN) are the 0’s which are correctly predicted as 0’s
  • The false positives (FP) are the 0’s which are wrongly predicted as 1’s
  • The false negatives (FN) are the 1’s which are wrongly predicted as 0’s
  • The true positives (TP) are the 1’s which are correctly predicted as 1’s

The confusion matrix is a (contingency) table that groups all instances in the test data into these four categories:

Figure 1: confusion matrix – Image by author

A classifier can make two types of errors: false positives ("predicting a 1 when in fact it is a 0") and false negatives ("predicting a 0 when in fact it is a 1"). Depending on the application, both types of errors can be equally important, or one of them can be more serious than the other. However, one usually does not report absolute numbers of these two types of errors, but rather relative ones. The main reason for this is that relative numbers are easier to interpret and compare. The question is then: "relative to what?" If false positives and false negatives are equally important and one does not want to distinguish between them, one can simply calculate the error rate = (FP + FN) / n, i.e., the total number of errors FP + FN divided by the total number of samples n.
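As a minimal sketch in Python (with hypothetical 0/1 labels and predictions), the four counts of the confusion matrix and the error rate can be computed as follows:

```python
import numpy as np

# Hypothetical 0/1 ground-truth labels and predictions of a classifier
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# The four cells of the confusion matrix
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives

# Error rate = (FP + FN) / n
error_rate = (fp + fn) / len(y_true)
print(tn, fp, fn, tp, error_rate)
```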

If one wants to distinguish between false positives and false negatives, the question remains to which baseline quantities these numbers should be compared. Arguably the most natural reference quantities are the total number of 0’s (= TN + FP) and the total number of 1’s (= TP + FN). One way of doing this is to consider the false positive rate = FP / (TN + FP) and the true positive rate = TP / (TP + FN). The false positive rate is the fraction of wrongly predicted 1’s among all true 0’s. The true positive rate, also called recall, is the fraction of correctly predicted 1’s among all true 1’s. Figure 2 illustrates this.

Figure 2: true positive rate (recall) and false positive rate – Image by author

An alternative to comparing the number of false positives to the total number of 0’s is to compare them to the total number of predicted 1’s (= FP + TP) using what is called the precision. The precision = TP / (FP + TP) is the fraction of correctly predicted 1’s among all predicted 1’s. In summary, the main difference between the false positive rate and the precision is to which reference quantity the number of false positives is compared: the number of true 0’s or the number of predicted 1’s. Note that, strictly speaking, the precision compares the true positives to the total number of predicted 1’s. But that’s just the other side of the same coin as 1 – TP / (FP + TP) = FP / (FP + TP). The same holds true for the true positive rate.

Figure 3: precision – Image by author
Figure 4: example of a confusion matrix – Image by author

Figure 4 shows an example of a confusion matrix. In this example, the error rate is 0.2, the true positive rate is 0.2, the false positive rate is 0.1, and the precision is 0.25.
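The rates above follow directly from the four counts of the confusion matrix. As a small sketch, the counts below are hypothetical and merely chosen so that the resulting values match the rates reported for Figure 4 (the actual counts in the figure may differ):

```python
# Hypothetical confusion-matrix counts (chosen to match the rates quoted for Figure 4)
tn, fp, fn, tp = 27, 3, 4, 1

error_rate = (fp + fn) / (tn + fp + fn + tp)  # 7 / 35 = 0.2
tpr = tp / (tp + fn)                          # 1 / 5  = 0.2  (recall)
fpr = fp / (tn + fp)                          # 3 / 30 = 0.1
precision = tp / (tp + fp)                    # 1 / 4  = 0.25
```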

False positive rate or precision? A toy example with imbalanced data

Figure 5: confusion matrices of two classifiers for imbalanced data which is easy to classify – Image by author

The main difference between the ROC and PR curves is that the former considers the false positive rate whereas the latter is based on the precision. That is why we first have a closer look at these two concepts for imbalanced data. When the number of (true) 0’s is much larger than the number of false positives, the false positive rate can be a small number, depending on the application. If interpreted wrongly, such small numbers can hide important insights. As a simple example, consider the two confusion matrices in Figure 5 for two classifiers applied to a data set with 1’000’000 points of which 1’000 are 1’s. The two classifiers have true positive rates of 0.8 and 0.85, respectively. Further, classifier I has 500 false positives, and classifier II has 2000 false positives. This means that the two classifiers have very small false positive rates of approx. 0.0005 and 0.002. In absolute terms, these two false positive rates are quite close together, despite the fact that classifier II has four times as many false positives. This is a consequence of the class imbalance in the data and the fact that the data is relatively easy to classify (= it is possible to have both high true positive rates and low false positive rates). However, classifier I has a precision of approx. 0.62 whereas classifier II has a precision of approx. 0.3. I.e., in terms of the precision, classifier I is clearly better than classifier II. The fact that small false positive rates can sometimes disguise differences among classifiers for imbalanced data is at the root of the arguments that favor the PR curve over the ROC curve. We will return to this later in the article.
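To make the arithmetic concrete, here is a small sketch that reproduces the rates quoted above from the counts given in Figure 5:

```python
# Figure 5 setup: 1,000,000 test points of which 1,000 are 1's
n_pos = 1_000
n_neg = 999_000

for name, tpr, fp in [("classifier I", 0.80, 500), ("classifier II", 0.85, 2_000)]:
    tp = tpr * n_pos              # true positives implied by the true positive rate
    fpr = fp / n_neg              # approx. 0.0005 vs. 0.002
    precision = tp / (tp + fp)    # approx. 0.62 vs. 0.30
    print(f"{name}: false positive rate = {fpr:.4f}, precision = {precision:.2f}")
```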

Figure 6: confusion matrices of two classifiers for imbalanced data which is difficult to classify – Image by author

Next, consider a similar data set which, however, is more difficult to classify. The confusion matrices of the two classifiers are shown in Figure 6. The two classifiers again have true positive rates of 0.8 and 0.85, respectively. Further, the false positive rate of classifier I is approx. 0.4 and that of classifier II is approx. 0.45. However, the precisions of the two classifiers are now almost equal at approx. 0.002. The reason why both precisions are so small is the class imbalance combined with the fact that the data is relatively difficult to classify. This shows that for data with class imbalance, the precision can also disguise important differences among classifiers. The conclusion of this small example is that whether the precision or the false positive rate is more informative depends on the specific application and not just on whether there is class imbalance.

ROC and PR curves

Why curves in the first place?

We could stop here and simply compare the error rate, false positive rate, true positive rate, precision, or any other summary measure based on the confusion matrix. However, in most situations, this is not a good idea. Why not? We have to take a step back and understand how a confusion matrix is created. First, a classifier usually computes a prediction score p for every test point. Often, this is a number between 0 and 1, and sometimes it can be interpreted as a probability. Second, one chooses a decision threshold δ and predicts all instances with p > δ as 1 and all others as 0. An example of such a threshold is δ = 0.5. However, in many applications, there is no compelling argument for using δ = 0.5, and better results can be obtained with other δ’s. Potential reasons for this are, among others, (i) that classifiers are often not calibrated (i.e., even though we might think of the output p as a probability, it does not match the actual probability that the event materializes) and (ii) that there is an asymmetry in the losses associated with false positives and false negatives.

For these reasons, one compares classifiers for several or all possible thresholds δ. This is what ROC and PR curves do. The lower (higher) one sets the threshold δ, the higher (lower) is the number of false positives and the lower (higher) is the number of false negatives. I.e., there is a trade-off between having false positives and false negatives.
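As a small illustration of this trade-off (with hypothetical prediction scores and labels), increasing the threshold δ reduces the number of false positives but increases the number of false negatives:

```python
import numpy as np

# Hypothetical ground-truth labels and prediction scores of a classifier
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1])
p_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.55, 0.9, 0.05, 0.45])

for delta in [0.3, 0.5, 0.7]:
    y_pred = (p_score > delta).astype(int)        # predict 1 whenever p > delta
    fp = np.sum((y_true == 0) & (y_pred == 1))    # false positives
    fn = np.sum((y_true == 1) & (y_pred == 0))    # false negatives
    print(f"delta = {delta}: FP = {fp}, FN = {fn}")
```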

The ROC curve and the AUC

The receiver operating characteristic (ROC) curve plots the true positive rate versus the false positive rate for all possible thresholds δ and thus visualizes the above-mentioned trade-off. The lower the threshold δ, the higher the true positive rate but also the higher the false positive rate. The closer a ROC curve is to the top-left corner, the better, and the diagonal line represents random guessing. Further, the area under the ROC curve (AUC, aka AUROC) summarizes this curve in a single number. The larger the AUC, the better. The AUC has an intuitive interpretation: an AUC of 0.8, for instance, means that a randomly chosen 1 receives a higher prediction score than a randomly chosen 0 with a probability of 80%.
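As a minimal sketch, the ROC curve and the AUC can be computed with scikit-learn as follows (the labels and prediction scores are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and prediction scores
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1])
p_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.55, 0.9, 0.05, 0.45])

fpr, tpr, thresholds = roc_curve(y_true, p_score)
auc = roc_auc_score(y_true, p_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], "k--", label="random guessing")  # diagonal reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```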

The precision-recall (PR) curve and the AUPRC

The precision-recall (PR) curve plots the precision versus the recall (= true positive rate) for all possible thresholds δ. The goal is to have both a high recall and a high precision. Again, there is a trade-off: the lower the threshold δ, the higher the recall but also the lower the precision. Further, the area under the precision-recall curve (AUPRC, aka average precision) summarizes this curve in a single number. The higher the AUPRC, the better. In contrast to the AUC, however, the AUPRC does not have a comparably intuitive interpretation.
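Analogously, a sketch of the PR curve and the AUPRC (computed here via scikit-learn's average precision):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# Hypothetical ground-truth labels and prediction scores
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1])
p_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.55, 0.9, 0.05, 0.45])

precision, recall, thresholds = precision_recall_curve(y_true, p_score)
auprc = average_precision_score(y_true, p_score)  # the "average precision" version of the AUPRC

plt.plot(recall, precision, label=f"AUPRC = {auprc:.2f}")
plt.xlabel("Recall (true positive rate)")
plt.ylabel("Precision")
plt.legend()
plt.show()
```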

"The precision-recall curve is to be preferred over the ROC curve for imbalanced data" – is it that simple?

There is a common folklore saying that "the PR curve and the AUPRC should be preferred over the ROC curve and the AUC for imbalanced data since the ROC and the AUC might be misleading or uninformative" (see, e.g., [here](https://towardsdatascience.com/precision-recall-curve-is-more-informative-than-roc-in-imbalanced-data-4c95250242f6), [here](https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/), or [here](https://towardsdatascience.com/why-you-should-stop-using-the-roc-curve-a46a9adc728), to name a few). **But is it that simple?** In the following, we shed some light on this.

Simulated data experiments

We will use simulated data to explore this argument in detail. Specifically, we simulate 2’000’000 data points with 1% of the data being 1’s (all results remain qualitatively the same when using another class imbalance ratio, e.g., 0.1% 1’s). The data is first generated in a way such that it is relatively easy to obtain good prediction results. We use half of the data for training and the other half for testing. For simplicity, we consider two classifiers, both logistic regression models that use different subsets of the predictor variables. The code for reproducing the simulated experiments of this article can be found here.
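The linked code is authoritative for the exact setup; the sketch below only illustrates this kind of experiment, and the simulation parameters (number of predictors, coefficients, intercept) are my own assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(1)
n = 2_000_000

# Three informative predictors; the intercept is tuned so that roughly 1% of the labels are 1's
X = rng.normal(size=(n, 3))
logits = -8.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + 2.0 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Half of the data for training, the other half for testing
X_train, X_test = X[: n // 2], X[n // 2 :]
y_train, y_test = y[: n // 2], y[n // 2 :]

# Two classifiers: logistic regressions using different subsets of the predictor variables
clf1 = LogisticRegression().fit(X_train[:, [0, 1]], y_train)
clf2 = LogisticRegression().fit(X_train[:, [1, 2]], y_train)

p1 = clf1.predict_proba(X_test[:, [0, 1]])[:, 1]
p2 = clf2.predict_proba(X_test[:, [1, 2]])[:, 1]

for name, p in [("classifier 1", p1), ("classifier 2", p2)]:
    print(name, roc_auc_score(y_test, p), average_precision_score(y_test, p))
```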

Figure 7 shows ROC and PR curves as well as AUCs and AUPRCs. According to the figure, the AUC is higher for classifier 1, whereas the AUPRC is higher for classifier 2. The situation is similar for the ROC and PR curves: according to the ROC curve, one might get the impression that classifier 1 is better, but the PR curve tells the opposite story. But which classifier is "better"?

Figure 7: example of ROC and PR curves for imbalanced data which is easy to predict – Image by author

There is an argument that classifier 2 is indeed better and that the AUC is thus misleading. The argument goes as follows. In the ROC plot, both curves relatively quickly attain a high true positive rate while keeping a low false positive rate. Likely, we are interested in the area with a small false positive rate, say below 0.2, which is highlighted with a green rectangle in Figure 7. Why only this area? When further decreasing the decision threshold, the true positive rate will increase only marginally whereas the false positive rate will increase considerably. This is a consequence of the fact that there is class imbalance and that the data is easy to classify. For the decision thresholds corresponding to a small false positive rate, classifier 2 is indeed better also when considering the ROC curve.

For imbalanced data, the PR curve and the AUPRC automatically tend to focus more on areas with small false positive rates, i.e., relatively high thresholds δ. That is why, according to the PR curve and the AUPRC, classifier 2 is better. However, given that we know a priori that we are interested in small false positive rates, we can also interpret the ROC curve accordingly by focusing only on small false positive rates, i.e., by "zooming in" on this area. The ROC curve in this area is then no longer misleading.
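As a rough sketch of this "zooming in" (reusing hypothetical labels and scores), one can simply restrict the plotted false positive rate range; scikit-learn's roc_auc_score also accepts a max_fpr argument that returns a standardized partial AUC over the restricted range:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels and prediction scores
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1])
p_score = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.55, 0.9, 0.05, 0.45])

fpr, tpr, _ = roc_curve(y_true, p_score)

# "Zoom in" on the region of interest, e.g., false positive rates below 0.2
plt.plot(fpr, tpr)
plt.xlim(0, 0.2)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()

# Alternatively, a partial AUC restricted to FPR <= 0.2 (standardized by scikit-learn)
partial_auc = roc_auc_score(y_true, p_score, max_fpr=0.2)
```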

On the other hand, we might in fact be willing to accept a high false positive rate in order to achieve a very high true positive rate. If this is the case, classifier 1, and not classifier 2, is better, and neither the ROC curve nor the AUC is misleading. It is the PR curve and the AUPRC that are misleading in this case. In summary, for exactly the same data, both the AUC and the AUPRC can be misleading, and which one gives a better picture depends on the specific application.

Next, let us look at the same imbalanced data with the only difference being that there is more label noise, which means that obtaining accurate predictions is more difficult. The results are reported in Figure 8. When looking at the ROC curve and the AUC, it is evident that classifier 1 is better than classifier 2 for this data. However, the PR curves and the AUPRCs of the two classifiers are almost indistinguishable. Since the data is difficult to classify, the number of false positives quickly becomes relatively large. This, together with the class imbalance, is the reason why the PR curve and the AUPRC fail to reveal important differences between the two classifiers.

Figure 8: example of ROC and PR curves for imbalanced data which is difficult to predict – Image by author

Further issues when comparing ROC and PR curves

1. Are there costs for double-checking predicted positives?

Apart from the points discussed above, another thing to consider in practice is that two types of costs can occur: costs caused by the false positives themselves and additional costs for double-checking predicted positives. In applications where false positives are costly but predicted positives do not trigger any additional work, the false positive rate is more important than the precision. Conversely, in applications where every predicted positive needs to be double-checked, costs occur for every predicted 1, and the precision is more important than the false positive rate.

An example of an application where predicted positives result in no additional costs is spam email detection, as this is a fully automated task without manual intervention. In fraud detection, on the other hand, additional checks, often involving human interaction, are usually triggered whenever a classifier predicts a "positive". In this situation, the fraction of false positives among all predicted positives (i.e., one minus the precision) is arguably quite important, as every predicted 1 directly results in costs.

2. AUC vs. AUPRC: is interpretability important?

When deciding between the AUC and the AUPRC, another question that needs to be answered is whether interpretability is important. If not, one can "blindly" use these two measures for comparing different classifiers and pick the one with the highest number. If one cares about interpretation, the situation is different. First, apart from the fact that higher AUPRCs are better, the AUPRC does not have an intuitive interpretation like the AUC (see above). Second, the PR curve and the AUPRC ignore the true negatives (see, e.g., Figure 3). This means that AUPRCs for different data sets cannot be compared since the AUPRC depends on the base rate, i.e., the ratio of 0’s to 1’s in the data. This is not the case for the AUC, and AUCs of different data sets are comparable.
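A small simulation can illustrate this base-rate dependence: for an uninformative classifier that outputs random scores, the AUC stays around 0.5 regardless of the class balance, whereas the AUPRC is roughly equal to the fraction of 1's in the data (the sample size and base rates below are assumptions for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 100_000

for base_rate in [0.5, 0.01]:
    y = rng.binomial(1, base_rate, size=n)   # labels with the given fraction of 1's
    p = rng.uniform(size=n)                  # random, uninformative prediction scores
    print(f"base rate {base_rate}: "
          f"AUC = {roc_auc_score(y, p):.2f}, "              # ~0.5 in both cases
          f"AUPRC = {average_precision_score(y, p):.3f}")   # ~ the base rate
```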

3. Are losses quantifiable?

If the losses one incurs when having false positives and false negatives can be quantified, one can use statistical decision theory to determine the optimal decision threshold δ. With a single optimal threshold, things simplify: one no longer needs curves such as ROC and PR curves but can use a measure such as the error rate, false positive rate, true positive rate, or precision. Unfortunately, in many situations, these two types of losses cannot be quantified.
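For illustration, here is a minimal sketch of this decision-theoretic reasoning, assuming calibrated prediction scores p and hypothetical costs: predicting 1 has expected loss (1 − p) · c_FP and predicting 0 has expected loss p · c_FN, so predicting 1 is optimal whenever p > c_FP / (c_FP + c_FN).

```python
import numpy as np

# Hypothetical prediction scores (assumed to be calibrated probabilities)
p_score = np.array([0.02, 0.07, 0.15, 0.40, 0.85])

# Assumed costs of a false positive and a false negative
cost_fp, cost_fn = 1.0, 10.0

# Expected loss of predicting 1 is (1 - p) * cost_fp, of predicting 0 it is p * cost_fn.
# Predicting 1 is optimal whenever (1 - p) * cost_fp < p * cost_fn,
# i.e., whenever p > cost_fp / (cost_fp + cost_fn).
delta_opt = cost_fp / (cost_fp + cost_fn)   # = 1/11, approx. 0.09 in this example

y_pred = (p_score > delta_opt).astype(int)  # [0, 0, 1, 1, 1]
```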

4. Both ROC and PR curves can come to the same conclusion

If the ROC curve of one classifier is always above the ROC curve of another classifier, the same also holds true for the PR curve, and vice versa (see, e.g., here for a justification of this). In this case, one classifier is better than the other for all thresholds in both the ROC and the PR space, and it usually does not matter whether one uses the ROC curve / AUC or the PR curve / AUPRC for comparing two classifiers. In general, however, a higher AUC does not imply a higher AUPRC and vice versa.

Conclusion

Saying that the ROC curve and the AUC are misleading or uninformative for imbalanced data implies that only a certain subset of all decision thresholds is of interest: those where the false positive rate is small. Whether this is indeed the case depends on the specific application. If it is, the AUC can indeed be misleading for imbalanced data that is easy to predict, but the ROC curve can simply be adjusted by zooming in on the area of interest. Further, the PR curve and the AUPRC can also be misleading or uninformative when there is class imbalance and the response variable is difficult to predict, as we have shown above.

The focus of this article was on comparing the ROC curve / AUC with the PR curve / AUPRC for evaluating binary classifiers. The conclusion from this should not be that only one of these tools is to be used, though. The ROC and the PR curves show different aspects, and there is rarely an argument against an additional point of view on a problem at hand (apart from the fact that deciding which classifier is better might become more difficult when different points of view disagree). Something else to keep in mind is that both the AUC and the AUPRC consider all possible decision thresholds δ while giving different weights to different subsets of these thresholds. However, for some applications, considering all thresholds δ can be unrealistic, as some thresholds can be ruled out a priori. Finally, note that both the ROC curve and the PR curve "only" assess the discriminatory ability of classifiers, i.e., how well they rank different samples. Sometimes, calibration (= "the predicted probabilities do indeed correspond to the probabilities that the predicted events materialize") is also important. If this is the case, other metrics need to be considered in addition.

