Applications of Different Parts of an ROC Curve

Understanding the importance of different parts of an ROC curve and exploring variants of AUC for ML applications

Prince Grover
Towards Data Science



Introduction

The Receiver Operating Characteristic (ROC) curve is one of the most common graphical tools for diagnosing the ability of a binary classifier, independent of the underlying classification algorithm. ROC analysis has been used for decades in many fields, including medicine, radiology, biometrics, natural hazards forecasting, meteorology, and model performance assessment, and it is increasingly used in machine learning and data mining research [1]. If you are a Data Scientist, you might be using it on a daily basis.

An ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various model score thresholds. An ideal classifier would give a very high TPR at a very low FPR (i.e. it would correctly identify positives without mislabelling negatives). But an ideal model rarely exists, so the goal of trying out different models and machine learning experiments is to identify the one that gives the best cost (FPR) / benefit (TPR) trade-off. ROC analysis helps achieve that goal.
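To ground the definitions, here is a minimal sketch (toy labels and scores, purely illustrative) of how the TPR/FPR pairs and the full AUC are typically computed with scikit-learn:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Illustrative binary labels and model scores (not real data)
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.35, 0.4, 0.8, 0.2, 0.9, 0.55, 0.7])

# roc_curve sweeps the decision threshold and returns FPR/TPR pairs
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", auc(fpr, tpr))  # trapezoidal area under the full curve
```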

The total area under the ROC curve (AUC)

Generally (in “almost all” cases), we use the total area under the ROC curve (AUC) as the evaluation criterion to find the optimal classifier. AUC is also the evaluation criterion used in most Kaggle competitions. This is because AUC is a single number that summarizes the model’s average performance across all threshold settings. An interesting property of AUC is that it is independent of class distribution, i.e. whether the class distribution is 10%/90% or 50%/50%, the ranking of models based on AUC stays the same. (Note that I said “interesting”; it is not always a good property. Its usefulness depends on the problem at hand. In some cases this property is harmful, and Data Scientists switch to the area under the Precision-Recall curve instead. But let’s set that aside.) AUC is informative and useful in most cases, but there are many real-world problems where it fails to focus on the part of a model’s behaviour that needs to be understood before putting it in production. In many practical applications (examples discussed below), major parts of an ROC curve are of little to no use, yet AUC summarizes the whole curve, giving equal importance to all regions, including the irrelevant ones.

A commonly used workaround for the above challenges with AUC is to fix an FPR (or TPR) cut-off and find the model that maximizes TPR (or minimizes FPR) at that cut-off. But fixing a cut-off on either axis requires concrete requirements from the business side, and the resulting metric has higher variance. It is also not ideal because it ignores most of the operating range of the ROC curve, and it is difficult to compare models across a grid of such metrics, e.g. TPR @ 1% FPR, 2% FPR, 10% FPR, 20% FPR, etc.

In simple words: the full AUC has problems in some situations, and looking at a fixed set of cut-off points is not an ideal solution to those problems. The suggested solution lies somewhere in between, i.e. rather than aggregating over all possible FPR/TPR values, we aggregate over the set of points that matter. This can be done with the partial area under the ROC curve.

In the rest of this post, let’s learn about the different variants of partial area metrics proposed in the literature, how to calculate each variant, and when to use them. But first, let’s start with some examples where the full AUC fails and partial area metrics are needed.

Examples of problems with full AUC

  1. When the class distribution is highly imbalanced (e.g. 0.1%/99.9%; use-cases: fraud detection, cancer detection, etc.), different classifiers, even though ranked correctly, exhibit very small differences in AUC, making model comparisons difficult. For example, you might be comparing AUCs like 0.9967 vs. 0.9961 and conclude that the two models are pretty similar. In reality, they might differ significantly when used to make actual decisions.
  2. If there is only 1 positive and 999 negatives, even a mediocre model can catch 100% of the positives (i.e. that 1) while mis-classifying a small percentage of negatives (say 20% FPR). In such cases, comparing area in the high-FPR range (e.g. >20%) is useless because all models attain a TPR of 100% there, yet AUC still counts that uninformative region. In such scenarios, we ideally want models that identify a large percentage of positives while mis-classifying a very low percentage of negatives.
  3. In clinical practice, diagnostic tests with high FPR incur significant economic expense, as a large proportion of non-diseased candidates would exhaust the scarce resource of medical therapies. In addition, when diagnosing a lethal disease, failing to correctly identify severely diseased subjects (low TPR) has serious ethical consequences. As a result, FPR and TPR need to be simultaneously kept low and high respectively, so that uneconomical and unethical regions are ruled out of the AUC. [2]
  4. In the detection of benign vs. malignant cancer using mammograms, it is more important to identify malignant cancer than to avoid unnecessary biopsies for benign conditions, i.e. the high-TPR region (TPR ≥ 0.9) is the region of interest. The full AUC might not show significant differences between models even though they differ in that region.
  5. Another classic caveat of AUC is the comparison of two classifiers whose ROC curves cross each other. One has a higher TPR, the other a higher true negative rate (TNR), but both have the same AUC. In such a scenario, AUC fails to pick the better classifier. [6]

Side note: here is some interesting trivia about the origin of the term “receiver operating” in the ROC curve. ROC analysis was first used during WWII to analyze radar signals, specifically by the US Army to detect Japanese aircraft from their radar signals. Looking at a screen, radar receiver operators had to manually decide whether a blip represented an enemy aircraft or noise. Their ability to make these distinctions was called the Receiver Operating Characteristic.

ROC Curve (Image by author)
Different segments of an ROC curve (Image by author)

The ROC curves on the left represent the performance of a sample classifier trained on a binary outcome. The AUC of this classifier is 0.969.

Note: AUC ranges from 0 to 1 and is symmetric around 0.5, i.e. an AUC of 0.5 corresponds to a random classifier, the worst useful value. A classifier with an AUC between 0 and 0.5 is effectively predicting the opposite of the class being evaluated (flipping its scores would give an AUC above 0.5).
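As a quick sanity check of that symmetry, reversing the score ordering mirrors the AUC around 0.5 (assuming no tied scores); a tiny illustrative snippet:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 0, 1, 1, 0])
y_score = np.array([0.2, 0.7, 0.4, 0.9, 0.3, 0.5])

# Negating the scores reverses the ranking, so the AUC flips around 0.5
print(roc_auc_score(y_true, y_score))   # some value a
print(roc_auc_score(y_true, -y_score))  # 1 - a (assuming no tied scores)
```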

Now we will see how we can use different regions of the same ROC curve (like the ones highlighted on the left) to calculate different partial metrics.

Variants of partial area under the ROC curve

Partial area metrics are not new; some of the papers referenced at the end were published back in the 80s. My goal with this post is to revisit them and bring to your attention common scenarios you can come across when dealing with real production problems, especially outside the world of Data Science competitions.

1. pAUCn

Definition: Normalized area under ROC curve below certain FPR. [4]

pAUCn @ <10% FPR. Green area divided by red area (A/B). (Image by author)

Pros: Focuses on the need of high TPR in the leftmost part of an ROC curve; region that matters most for highly imbalanced data.

Cons: pAUCn lacks the symmetry around 0.5 that the full AUC enjoys. It also does not have a fixed lower bound, since the lower bound changes with the chosen FPR threshold.
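Here is a rough sketch of how pAUCn could be computed: integrate the empirical ROC curve up to the FPR cut-off and divide by the cut-off, which is my reading of the green-area/red-area normalization in the figure (the interpolation is a simplification, not the exact estimator from [4]):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def pauc_n(y_true, y_score, fpr_max=0.10):
    """Normalized partial AUC below an FPR cut-off: partial area / fpr_max."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # Interpolate TPR on a dense FPR grid up to the cut-off, then integrate
    grid = np.linspace(0.0, fpr_max, 1001)
    tpr_on_grid = np.interp(grid, fpr, tpr)
    return auc(grid, tpr_on_grid) / fpr_max  # red area = fpr_max * 1

# Toy labels/scores, purely illustrative
y = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
s = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.50, 0.90, 0.15, 0.70])
print(pauc_n(y, s, fpr_max=0.25))
```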

2. sPA

Definition: Standardized partial area under ROC curve above major diagonal. [7]

sPA @ <10% FPR. 0.5 * (1 + A/(A+B)) (Image by author)

Pros: Focuses on the need for high TPR in the leftmost part of an ROC curve. It is scaled to give a value of 0.5 for a random classifier, i.e. a fixed reference point irrespective of the FPR cut-off.

Cons: sPA also lacks symmetry and is not meaningful below 0.5. Its lower bound is not 0.
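If I am reading the standardization above correctly, it is the McClish correction, which scikit-learn already exposes through the max_fpr argument of roc_auc_score; a minimal illustrative sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
s = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.50, 0.90, 0.15, 0.70])

# Standardized partial AUC over FPR in [0, 0.1]:
# 0.5 for a random classifier, 1.0 for a perfect one in that range.
spa = roc_auc_score(y, s, max_fpr=0.10)
print(spa)
```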

3. PAI

Definition: Partial Area Index. Area under the ROC curve above a certain TPR cut-off; in other words, the average TPR over the FPR values between the point where TPR reaches the cut-off and 1. [3]

PAI @ >70% TPR. Green area divided by red area. (A/B) (Image by author)

Pros: Focuses on the need for high TPR irrespective of FPR. In some clinical settings, like early-stage cancer detection, it is OK to tolerate a high FPR if it means a better capture rate.

Cons: The lower bound is not fixed and varies with the TPR cut-off. It also lacks the symmetry between the [0, 0.5] and [0.5, 1] halves relative to other partial area measures.
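A rough sketch of one way to compute the A/B ratio described above: integrate the part of the curve that lies above the TPR cut-off and divide by the largest area that region could possibly have. This follows my reading of the figure, not necessarily the exact estimator from Jiang et al. [3]:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def partial_area_index(y_true, y_score, tpr_min=0.70):
    """Area between the ROC curve and the line TPR = tpr_min (green area),
    divided by the full band above that line (red area), per the A/B picture."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    grid = np.linspace(0.0, 1.0, 2001)
    tpr_on_grid = np.interp(grid, fpr, tpr)
    above = np.clip(tpr_on_grid - tpr_min, 0.0, None)  # only the part above the cut-off
    return auc(grid, above) / (1.0 - tpr_min)

y = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
s = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.50, 0.90, 0.15, 0.70])
print(partial_area_index(y, s, tpr_min=0.70))
```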

4. tpAUCn

Definition: Two-way partial area under the ROC curve, below a certain FPR and above a certain TPR. [2]

tpAUCn @ <10% FPR & >70% TPR. Green area divided by red area. (A/B) (Image by author)

Pros: Allows simultaneous control of TPR and FPR. Many use cases require high TPR and low FPR, e.g. diagnosis of a lethal disease that relies on expensive testing instruments.

Cons: The user needs to decide on both the TPR and FPR cut-offs.
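Analogously to the pAUCn sketch earlier, here is an illustrative way to compute the two-way version under my reading of the figure: keep only the region under the curve with FPR at or below its cut-off and TPR at or above its cut-off, and normalize by the corresponding rectangle (the red area, as I interpret it); not necessarily the estimator from [2]:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def two_way_pauc_n(y_true, y_score, fpr_max=0.10, tpr_min=0.70):
    """Area under the ROC curve restricted to FPR <= fpr_max and TPR >= tpr_min,
    divided by the fpr_max * (1 - tpr_min) rectangle (red area, as I read it)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    grid = np.linspace(0.0, fpr_max, 1001)
    above = np.clip(np.interp(grid, fpr, tpr) - tpr_min, 0.0, None)
    return auc(grid, above) / (fpr_max * (1.0 - tpr_min))

y = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 1])
s = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.80, 0.50, 0.90, 0.15, 0.70])
print(two_way_pauc_n(y, s, fpr_max=0.25, tpr_min=0.50))
```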

5. pAUCcn

Definition: Concordant partial area under the ROC curve. Half the sum of vertical partial area under ROC curve (pAUCn) and horizontal partial area under ROC curve (pAUCxn). [5]

pAUCcn @ >10% FPR & <20% FPR. Green area divided by red area (Image by author)

Pros: Scaled similarly to sPA, giving 0.5 for a random classifier. Unlike other partial area measures, it retains interpretations that the full AUC enjoys, such as its relation to the average TPR, the average FPR, and the c statistic.

Cons: Its lower bound is not 0.

[Post author’s note: the pros and cons of pAUCcn need to be updated based on the most recent paper by the authors of the concordant partial AUC. With the new normalization method, the cons above are no longer valid and pAUCcn has symmetry similar to AUC. I will make the necessary changes soon.]
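Here is a rough sketch of the half-vertical-plus-half-horizontal idea over an FPR window, following the definition as stated above. Mapping the FPR window to a TPR window via the curve is my assumption about how the horizontal piece is bounded, so treat this as illustrative rather than the exact estimator from Carrington et al. [5]:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def pauc_cn(y_true, y_score, fpr_lo=0.10, fpr_hi=0.20):
    """Half the sum of the normalized vertical and horizontal partial areas
    over an FPR window (and the TPR window the curve maps it to).
    Assumes the window is non-degenerate (the curve is not flat across it)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)

    # Vertical piece: average TPR over the FPR window
    f_grid = np.linspace(fpr_lo, fpr_hi, 1001)
    t_on_f = np.interp(f_grid, fpr, tpr)
    vertical_n = auc(f_grid, t_on_f) / (fpr_hi - fpr_lo)

    # Horizontal piece: average TNR (1 - FPR) over the corresponding TPR window
    tpr_lo, tpr_hi = t_on_f[0], t_on_f[-1]
    t_grid = np.linspace(tpr_lo, tpr_hi, 1001)
    f_on_t = np.interp(t_grid, tpr, fpr)  # tpr is non-decreasing along the curve
    horizontal_n = auc(t_grid, 1.0 - f_on_t) / (tpr_hi - tpr_lo)

    return 0.5 * (vertical_n + horizontal_n)
```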

6. HAUC

Definition: Half-AUCs: the AUC measured in two halves, where TPR > TNR and where TPR < TNR. Useful for comparing classifiers when it is known whether one needs superior sensitivity (TPR) or superior specificity (TNR). The sum of the two halves equals the full AUC. [6]

HAUCsp: A + 0.5*C, HAUCse: B + 0.5*C. HAUCsp + HAUCse = AUC (Image by author)

Pros: No decision on an FPR or TPR cut-off is required. The cut-off is automatically derived from the point on the ROC curve where FPR = 1 - TPR. The only decision required is whether you need high TPR or high TNR. Each half ranges from 0 to 0.5, with 0.25 for a random classifier.

Cons: Not useful when the cut-off point is known and differs from the point where FPR = 1 - TPR.

Comparison Study

I motivate the usefulness of partial AUCs with one comparison study on synthetic data. There are several other examples that I haven’t demoed in this post (to keep it short), including A) comparison of two classifiers with high AUCs but different performance at low FPRs, B) classifiers with the same AUC but different performance at high TPRs, C) crossing ROC curves with the same AUC but different levels of sensitivity or specificity, and more.

Two classifiers with high and similar AUCs but different performance at low FPRs

In the following example, I prepared a synthetic dataset using the make_classification method of sklearn. To emphasize the scenario of highly imbalanced labels, I kept the label ratio at 99%/1%. Then I trained two classifiers, one with LogisticRegression (LR) and the other with CatBoostClassifier (CB).
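Below is a minimal sketch of how such an experiment could be set up (it assumes the catboost package is installed); the sample sizes, feature counts, and model settings are illustrative guesses, not the exact configuration behind the plots in this post:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, auc
from catboost import CatBoostClassifier  # assumes catboost is installed

# Highly imbalanced synthetic data (~1% positives); parameters are illustrative
X, y = make_classification(n_samples=100_000, n_features=20, n_informative=10,
                           weights=[0.99, 0.01], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=42)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
cb = CatBoostClassifier(verbose=0, random_state=42).fit(X_tr, y_tr)

def pauc_n(y_true, y_score, fpr_max=0.01):
    """Normalized partial AUC below an FPR cut-off (same sketch as above)."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    grid = np.linspace(0.0, fpr_max, 1001)
    return auc(grid, np.interp(grid, fpr, tpr)) / fpr_max

for name, model in [("LR", lr), ("CB", cb)]:
    scores = model.predict_proba(X_te)[:, 1]
    print(name, "AUC:", roc_auc_score(y_te, scores),
          "pAUCn @ 1% FPR:", pauc_n(y_te, scores))
```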

ROC curves on linear and log scales. The log scale emphasizes the differences at low FPR. (Image by author)

We see that the AUCs of LR and CB are pretty close (99.78% vs. 99.96%). In fact, the ROC curves on a linear scale appear very similar. One might think that CB ranks well but not significantly better than LR. But viewing the same curves on a log scale (right side) emphasizes the differences at low FPRs (<1%). Let’s see what a partial area metric gives.

Emphasis on the partial area under the curve below a certain FPR. FPR on log scale. (Image by author)

pAUCn at a cut-off of 1% FPR shows a considerable difference in performance between LR and CB (89.9% vs. 97.9%), confirming that CB does much better.

An example of such a scenario (other than the previously mentioned cancer detection) is training a classifier for account compromise detection from online authentications. Think of a model that detects whether a sign-in comes from an attacker (compromised account) or from the original account holder. Given the huge scale of online activity, only a very small percentage of sign-ins would actually be compromised (e.g. 1 in 10,000). You could use the model to put high-risk sign-ins into a queue for human investigation. Due to the large imbalance, at an FPR as low as 1% you would be investigating roughly 100 good sign-ins in an effort to find that 1 compromise. Because investigations are expensive, your business requirements might cap you to investigating only X% of all sign-ins, rendering the area above ~X% FPR useless from a business perspective.

Notebook with code to create the plots and calculate the partial metrics

Conclusion

AUC is often used as a summary metric that indicates the overall performance of a binary classifier in terms of its accuracy at various TPR and FPR thresholds. In simpler words, it is the average TPR over all possible FPRs and, equivalently, the average TNR over all possible TPRs (it is symmetric in nature). AUC is informative and useful in most cases, but there are many real-world problems where large parts of the ROC curve are not of interest, and AUC fails to focus on the region that matters. Sometimes your region of interest is below a certain FPR, sometimes above a certain TPR, and other times bounded by both. Sometimes specificity (TNR, 1 - FPR) is more important than sensitivity (TPR), other times vice versa. In all those cases, partial AUC measures do the right thing by focusing on the regions of the ROC curve that matter. This post presented some of those partial area measures from the literature, helped visualize them, and summarized their pros and cons.

Review credits: Shikhar Gupta

References

The metrics mentioned above are largely drawn from several papers on partial AUCs written over the years.

  1. Wikipedia: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
  2. Yang H, Lu K, Lyu X, Hu F. Two-way partial AUC and its properties.
  3. Jiang Y, Metz CE, Nishikawa RM. A receiver operating characteristic partial area index for highly sensitive diagnostic tests.
  4. Walter SD. The partial area under the summary ROC curve.
  5. Carrington et al. A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithm.
  6. Bradley AP. Half-AUC for the evaluation of sensitive or specific classifiers.
  7. Ma et al. On use of partial area under the ROC curve for evaluation of diagnostic performance.



Applied Scientist at Amazon. Previously - Machine Learning Engineer at Manifold.ai, USF-MSDS and IIT-Roorkee Alumnus (Twitter: @groverpr4)