7 Things You Should Know about ROC AUC

Several caveats of the popular performance metric

Ron Itzikovitch
Towards Data Science



Models for different classification problems can be fitted by trying to maximize or minimize various performance measures. Since most measures address one aspect of a model’s performance but not others, it is important to understand their limitations so that we can make an informed decision and select the performance measures that best fit our design.

ROC AUC is commonly used in many fields as a prominent measure to evaluate classifier performance, and researchers might favor one classifier over another due to a higher AUC.

For a refresher on ROC AUC, a clear and concise explanation can be found here. If you are totally unfamiliar with ROC AUC you may find that this post digs into the subject a bit too deep, but I hope you will still find it useful or bookmark it for future reference.

Most of the material presented here is based on a paper by [Lobo et al., 2008] where the authors illustrate several issues regarding the usage of ROC AUC to evaluate the performance of classification models.

We will go over several concerns we should be aware of when using ROC AUC and look at some examples to gain a better understanding of them.

Apples are not oranges

At first glance, it seems that a single number (the ROC AUC), calculated using (among other things) each classifier’s decision function, can indeed be used to compare two classifiers. This idea is based on the implicit assumption that the AUC of both classifiers was derived in a way that is independent of the distribution of each classifier’s decision-function output (i.e., its scores).
However, in [Hand, 2009] the author shows that this is not the case:

The AUC evaluates a classifier using a metric which depends on the classifier itself. That is, the AUC evaluates different classifiers using different metrics.

And further provides the following analogy:

“It is as if one measured person A’s height using a ruler calibrated in inches and person B’s using one calibrated in centimeters, and decided who was the taller by merely comparing the numbers, ignoring the fact that different units of measurement had been used.”

In a nutshell — the AUC is an averaged minimum loss measure, where the misclassification loss is averaged over a cost ratio distribution which depends on the score distribution of the classifier in question.

In other words, we can calculate the AUC for classifier A and get 0.7 and then calculate the AUC for classifier B and obtain the same AUC of 0.7, but it does not necessarily mean that their performance is similar.

The curious reader is encouraged to read [Hand, 2009], which offers a very good intuitive explanation of the problem as well as a rigorous mathematical analysis, followed by a suggested solution.

Probability values are ignored

Let us compare two hypothetical binary classification models, fitted on the same sample from the data:

  • Model A predicts that many of the positive examples have a probability of ~0.55 to be positive and many of the negative examples have a probability of ~0.45.
  • Model B predicts that many of the positive examples have a probability of ~0.85 to be positive and many of the negative examples have a probability of ~0.25.

Both models can have a very similar AUC, but model B clearly does a much better job at separating the positive examples from the negative ones.
Choosing a different sample and refitting our models could produce different results, where model B’s superior ability to separate the classes would become more apparent.

If we rely only on AUC to assess model performance we might think model A and B are very similar, when in fact they are not.
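To make this concrete, here is a minimal sketch (synthetic data and hypothetical "models", not the ones discussed above) showing that two sets of scores with the same ranking but very different magnitudes produce the same ROC AUC, since the AUC depends only on how the examples are ordered:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)            # hypothetical labels
    latent = rng.normal(loc=y - 0.5, scale=1.0)  # a shared underlying ranking

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    scores_a = sigmoid(0.3 * latent)  # "model A": scores concentrated near 0.5
    scores_b = sigmoid(3.0 * latent)  # "model B": scores pushed towards 0 and 1

    print(roc_auc_score(y, scores_a))  # same value...
    print(roc_auc_score(y, scores_b))  # ...because only the ranking matters

Any strictly monotone rescaling of the scores leaves the AUC unchanged, so the AUC cannot tell us how confidently the two classes are separated.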

“Area under the curve” does not equal “Area of interest”

When evaluating ROC curves there are two regions which describe the model’s performance under “extreme” thresholds:

  • The left-hand side of the curve where we have a small true-positive rate, as well as a small false-positive rate.
  • The right-hand side of the curve where we have a large true-positive rate, as well as a large false-positive rate.

We would clearly not prefer a model just because it accumulates a large area under the curve in those regions, yet the AUC is a single number that includes the area in those regions as well.
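If only one region of the curve matters for our application, a partial AUC restricted to that region is more informative than the full AUC. As a minimal sketch (synthetic labels and scores, not a real model), recent versions of scikit-learn expose a max_fpr argument on roc_auc_score that computes a (standardized) partial AUC over the low false-positive region:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=1000)                     # hypothetical labels
    scores = rng.normal(loc=y.astype(float), scale=1.0)   # hypothetical model scores

    full_auc = roc_auc_score(y, scores)
    partial_auc = roc_auc_score(y, scores, max_fpr=0.2)   # only the FPR <= 0.2 region

    print(f"full AUC: {full_auc:.3f}, partial AUC (FPR <= 0.2): {partial_auc:.3f}")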

[Figure: ROC curves for two models, A and B, with very similar AUC. Image by author.]

Is model A better than B?

Both models have very similar AUC, but model A is more consistent in terms of true-positive rate vs. false-positive rate (for all thresholds), while for model B the ratio between the true-positive rate and the false-positive rate is highly dependent on the threshold selection — it is much better for lower thresholds.

Which is more important: the true-positive rate or the false-positive rate?

In some cases minimizing the false-positive rate is more important than maximizing the true-positive rate, and in other cases the opposite is true. It all depends on how our model will be used.

When the AUC is calculated, both false and true positive rates are equally weighted, and therefore it cannot help us select the model which fits our specific use case.

[Figure: ROC curves for another pair of models, A and B. Image by author.]

Which model is better, A or B?

This depends on our domain and the way we intend to use the model.

  • Considering the ROC curve for model A, if we decide we must have a true-positive rate of at least 60%, we will have to accept that the model will also have a false-positive rate of 30%.
  • Considering the ROC curve for model B we can achieve a true-positive rate of at least 60% and a false-positive rate of 20%.

If minimizing the false-positive rate is the most important measure in our case, then model B is preferable to A, even though they have a very similar AUC.
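As a minimal sketch of how one might read such operating points off a fitted model (synthetic data and a hypothetical constraint of a 60% true-positive rate), roc_curve from scikit-learn gives us the full list of thresholds to search over:

    import numpy as np
    from sklearn.metrics import roc_curve

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1000)                     # hypothetical labels
    y_score = rng.normal(loc=y_true.astype(float), scale=1.5)  # hypothetical model scores

    fpr, tpr, thresholds = roc_curve(y_true, y_score)

    # fpr and tpr are non-decreasing, so the first operating point that reaches
    # a 60% true-positive rate is also the one with the lowest false-positive rate.
    i = np.argmax(tpr >= 0.60)
    print(f"threshold={thresholds[i]:.3f}  TPR={tpr[i]:.2f}  FPR={fpr[i]:.2f}")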

Distribution of model errors over the features’ range

Let us compare two simple binary classification models that use a single feature x to predict a class y.
Assume both models achieve the same accuracy.

  • The misclassification errors of model A occur for small values of x as well as for large values of x.
  • The misclassification errors of model B occur mostly for low values of x and rarely for large values of x.

Both models can have a very similar AUC, but:

  • Model B is much better at predicting y for large values of x.
  • Model A has a more consistent performance over the range of x.

The AUC alone will not reveal such a difference in performance between the models, nor will it tell us that their errors are distributed differently.
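A quick conditional check will surface it, though. Here is a minimal sketch with a simulated "model B" whose errors are concentrated at low values of x (the data and error rates are made up for illustration):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 5000
    x = rng.uniform(0, 10, size=n)        # the single feature
    y_true = rng.integers(0, 2, size=n)   # hypothetical labels

    # Simulated "model B": flips ~30% of labels when x is small, ~2% when x is large
    flip = rng.random(n) < np.where(x < 5, 0.30, 0.02)
    y_pred = np.where(flip, 1 - y_true, y_true)

    # Accuracy per feature bin reveals where the errors live
    bins = pd.cut(x, bins=[0, 2.5, 5.0, 7.5, 10.0])
    print(pd.Series(y_pred == y_true).groupby(bins, observed=True).mean())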

Evaluating performance for imbalanced classes

More often than not, we encounter data where the classes are imbalanced.
Consider two confusion matrices obtained from fitting the same model on two different samples from the data:

Note: AUC is calculated using a single threshold of the model as described in [Sokolova & Lapalme, 2009]:
AUC = 0.5*(Sensitivity + Specificity).

[Figure: confusion matrices A and B. Image by author.]

Our confusion matrices are as follows:

  • Confusion matrix A is the result of evaluating the model on a sample where the positive examples constitute 10% of the data.
  • Confusion matrix B is the result of evaluating the same model on a sample where the positive examples constitute 3% of the data.
    We achieved this by doubling the number of negative examples and halving the number of positive examples.

The performance metrics are:

  • Confusion matrix A
    Precision: 0.5
    Recall: 1
    F1-score: 0.666
    AUC: 0.5625
  • Confusion matrix B
    Precision: 0.2
    Recall: 1
    F1-score: 0.333
    AUC: 0.5625

In both cases we obtain the same AUC, but the change in other measurements (e.g., the F1-score) shows that our model’s performance varies with the proportion of positive examples, while the AUC is invariant under the above conditions, namely multiplying the negative and positive rows of the confusion matrix by different scalars [Sokolova & Lapalme, 2009].
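A minimal sketch of this invariance, using hypothetical counts rather than the ones in the figure: scaling the positive and negative rows of a confusion matrix by different factors changes precision and the F1-score but leaves sensitivity, specificity, and therefore the single-threshold AUC untouched.

    def summarize(tp, fn, fp, tn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)             # sensitivity
        specificity = tn / (tn + fp)
        f1 = 2 * precision * recall / (precision + recall)
        auc = 0.5 * (recall + specificity)  # single-threshold AUC, as in [Sokolova & Lapalme, 2009]
        return dict(precision=precision, recall=recall, f1=round(f1, 3), auc=auc)

    # Hypothetical confusion matrix A: ~10% positives
    print(summarize(tp=90, fn=10, fp=90, tn=810))
    # Confusion matrix B: halve the positive row, double the negative row (~3% positives)
    print(summarize(tp=45, fn=5, fp=180, tn=1620))
    # Precision and F1 drop, while recall, specificity and the AUC stay the same.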

Once again, the AUC alone does not provide all the information we need to evaluate a model’s performance.

Small-sample precision

In [Hanczar et al., 2010] the authors perform a simulation study as well as an analysis of real data and find that ROC-related estimates (the AUC being one of them) are fairly bad estimators of the actual metrics. This is more prominent in small samples (50–200 examples), and it gets even worse when the classes are imbalanced.

In other words, when we evaluate a classifier’s performance using the AUC, we are trying to estimate the classifier’s true AUC when it will be used “in the wild” on real data (i.e., when our classifier is live in production). However, the AUC that we calculate (on small samples) is a bad estimator: it may be far from the true AUC, and we should be very careful not to trust it.

Just to be clear: if we calculate the AUC on a sample of 200 examples and obtain an AUC of 0.9, the true AUC could be 0.75, and we would have no way of knowing that (at least without confidence intervals or some other tool that lets us gauge the estimator’s variance).
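One such tool is a simple bootstrap over the evaluation set. The sketch below (synthetic data and a hypothetical model, not taken from the referenced study) resamples a small test set to get a rough 95% interval around the AUC estimate:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    n = 200                                      # a small evaluation set
    y = rng.integers(0, 2, size=n)               # hypothetical labels
    scores = rng.normal(loc=0.8 * y, scale=1.0)  # hypothetical model scores

    point_estimate = roc_auc_score(y, scores)

    # Bootstrap: resample the evaluation set with replacement and recompute the AUC
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, n, size=n)
        if y[idx].min() == y[idx].max():         # skip resamples with a single class
            continue
        boot.append(roc_auc_score(y[idx], scores[idx]))

    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"AUC = {point_estimate:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")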

How should we use ROC AUC to measure performance?

While some performance measures are more easily interpretable (Precision, Recall, etc.) ROC AUC is sometimes regarded as a magic number that somehow quantifies all we need to know about a model’s performance.

As we noted, ROC AUC has various issues which we need to be aware of if we choose to use it. If, after considering those issues, we still feel that we would like to use ROC AUC to evaluate classifier performance, we can do so. We can use any measure we want, as long as we are fully aware of its limitations and drawbacks. Just as we use Recall while remembering that it is invariant with respect to the number of false positives, we can also use ROC AUC while keeping its limitations in mind.

It is important to emphasize that one should not use only a single metric to compare classification models’ performance. In this regard, ROC AUC is no different from Precision, Recall or any of the other common metrics. To evaluate performance in a well-rounded manner we would do best to consider several metrics of interest, all the while being aware of their characteristics.

Summary

We covered several issues regarding ROC AUC:

  1. The AUC depends on intrinsic properties of the classifier, which makes it an inappropriate measure for comparing classifiers (in many common cases).
  2. AUC does not reflect the underlying probability values predicted by the classifier.
  3. AUC is calculated by considering all possible score thresholds, whether or not we would choose to use those thresholds.
  4. False-positive and true-positive rates are equally weighted, whether it suits us or not.
  5. AUC does not provide information about the classifier error distribution.
  6. AUC is invariant with respect to the rate of positive examples in the data.
  7. AUC is unreliable for small samples.

ROC AUC is similar to other measures in that it has its pros and cons. Since there are many resources which speak of its pros, this post focused on some of its other characteristics. These are properties of the ROC AUC measure, not necessarily cons. Each measure is variant under some conditions and invariant under others (as you can see in [Sokolova & Lapalme, 2009]), and whether those properties are beneficial or unfavorable is up to the user to decide.

It would be better to use several metrics to compare classifiers, not just a single one, and when using ROC AUC one could also look at the ROC curve itself, which can provide valuable information [Fawcett, 2004].

That’s it

Thank you for reading! I hope you found this post useful. If you have any questions or suggestions please leave a comment. All forms of feedback are most welcome!

Further reading

[Sokolova & Lapalme, 2009] provide an analysis of 24 performance measures used in the complete spectrum of Machine Learning classification tasks, and review the variance and invariance of those measures for 8 invariance properties that occur under changes of the confusion matrix.

For a very good paper on how to interpret ROC graphs, you can refer to [Fawcett, 2004]. The author explains the subject very thoroughly, from the bottom up.

If you are interested in confidence bands for the ROC curve [Macskassy & Provost, 2004] provide several options.

[Ferri et al., 2005] introduce a new probabilistic version of the AUC, called pAUC, which evaluates ranking performance while also taking the magnitude of the probabilities into account.

References

  1. Lobo, J. M., Jiménez‐Valverde, A., & Real, R. (2008). AUC: a misleading measure of the performance of predictive distribution models. Global ecology and Biogeography, 17(2), 145–151.
  2. Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine learning, 77(1), 103–123.
  3. Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information processing & management, 45(4), 427–437.
  4. Hanczar, B., Hua, J., Sima, C., Weinstein, J., Bittner, M., & Dougherty, E. R. (2010). Small-sample precision of ROC-related estimates. Bioinformatics, 26(6), 822–830.
  5. Macskassy, S., & Provost, F. (2004). Confidence bands for ROC curves: Methods and an empirical study. Proceedings of the First Workshop on ROC Analysis in AI. August 2004.
  6. Fawcett, T. (2004). ROC graphs: Notes and practical considerations for researchers. Machine learning, 31(1), 1–38.
  7. Ferri, C., Flach, P., Hernández-Orallo, J., & Senad, A. (2005, August). Modifying ROC curves to incorporate predicted probabilities. In Proceedings of the second workshop on ROC analysis in machine learning (pp. 33–40). International Conference on Machine Learning.
