What is ROC-AUC and when not to use it

No metric is perfect.

Dhruvil Karani
Towards Data Science



Understand what a metric hides and promotes.

When a measure becomes a target, it ceases to be a good measure — Goodhart’s law

When you improve a metric, you primarily improve what the metric favours. Since no metric can measure everything, chasing one blindly can be dangerous. I want to explore what this means for ROC-AUC.

The receiver operating characteristic (ROC) curve is used to assess binary classifiers. A binary classifier outputs the probability of a data point belonging to the positive class. It is we who apply a threshold to this probability and assign a label: if the probability is above the threshold, the predicted class is positive (1), otherwise negative (0). The choice of threshold often depends on the use case. Where the cost of missing a positive example is greater than that of misclassifying a negative example as positive, one might choose a lower threshold, and vice versa.
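As a minimal sketch, here is how a threshold turns predicted probabilities into labels (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted probabilities from a binary classifier
probs = np.array([0.05, 0.40, 0.55, 0.90])

threshold = 0.5                             # a common default cut-off
labels = (probs > threshold).astype(int)    # 1 = positive, 0 = negative
print(labels)                               # [0 0 1 1]

# A lower threshold captures more positives, at the cost of more false positives
print((probs > 0.3).astype(int))            # [0 1 1 1]
```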

The ROC curve helps us visualize this trade-off across thresholds. The area under this curve (ROC-AUC) summarizes how well the model separates the positive and negative examples across different thresholds.

Visualizing ROC-AUC in action

The example below walks through the ROC-AUC calculation for a toy dataset.

[Figure: ROC curve for the toy dataset, with TPR/FPR marked at thresholds T1 to T4]

The figure above shows the ROC curve for a dummy dataset of 7 points, with four negative (blue) and three positive (red) examples. They are arranged in ascending order of the model’s predicted probabilities (0.1, 0.1, 0.2, 0.2, 0.3, 0.6, 0.9). We measure the True Positive Rate (TPR) and False Positive Rate (FPR) at four arbitrarily chosen thresholds, T1, T2, T3 and T4, with values 0, 0.15, 0.25 and 0.95, respectively.

TPR = (+ve examples with a predicted probability above threshold)/(total positive examples)
FPR = (-ve examples with a predicted probability above threshold)/(total negative examples)

As an example, consider T3 (= 0.25). The number of positive examples above T3 is 2 (out of 3), and the number of negative examples above T3 is 1 (out of 4). Hence, TPR = 2/3 and FPR = 1/4. We can calculate TPR and FPR for each threshold in the same way and plot the points on a 2-D plane, as shown above. The area under this curve is the ROC-AUC. The more thresholds we evaluate, the smoother the curve and the more accurate the metric.
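To make this concrete, here is a short sketch that reproduces these numbers. The article does not list which of the seven points are positive, so the labels below are an assumption chosen to be consistent with the worked numbers (TPR = 2/3 and FPR = 1/4 at T3):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Predicted probabilities in ascending order (from the toy example)
probs  = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.6, 0.9])
# Assumed labels: 3 positives, 4 negatives, consistent with the worked example
y_true = np.array([0,   0,   0,   1,   0,   1,   1])

for t in [0.0, 0.15, 0.25, 0.95]:                    # T1, T2, T3, T4
    pred_pos = probs > t
    tpr = (pred_pos & (y_true == 1)).sum() / (y_true == 1).sum()
    fpr = (pred_pos & (y_true == 0)).sum() / (y_true == 0).sum()
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")

# scikit-learn sweeps over all distinct scores, giving the exact area
print("ROC-AUC:", roc_auc_score(y_true, probs))
```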

Limit of AUC

The maximum achievable AUC is 1. What does this mean?

[Figure: ROC curve of a perfect classifier, AUC = 1]

According to the curve, for any non-zero FPR, the TPR is always 1. Conceptually, if you pick a threshold and even one negative data point crosses it, every positive data point also crosses it. This is possible only when there exists a threshold that perfectly separates the classes.
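A quick sanity check, with made-up scores, shows that ROC-AUC is 1 whenever every positive example is scored above every negative one:

```python
from sklearn.metrics import roc_auc_score

# Made-up example: a threshold of 0.5 separates the classes perfectly
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.6, 0.9]

print(roc_auc_score(y_true, scores))  # 1.0
```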

What happens in the case of an imbalanced dataset?

In the case of an imbalanced dataset, ROC-AUC tends to be more optimistic. Consider two sorted datasets (represented as densities):

[Figure: score densities for datasets D1 and D2]

As shown, dataset1 (D1) has 50 +ve examples and 100 -ve examples. Assume the overlap of the positive and negative examples is homogeneous, meaning any two slices of the overlap have the same proportions of positive and negative examples. The same applies to dataset2 (D2), which is evidently more imbalanced than D1, with roughly four -ve examples for every +ve one (consistent with the precision numbers discussed below).

For D1, at threshold = 0.5,
— Precision = 50/(50+50) = 0.5.
— Recall = 50/50 = 1.
— F1 score = (2 * Precision * Recall)/(Precision + Recall) = (2 * 0.5 * 1)/(0.5 + 1) = 2/3.

Similarly, for D2 at threshold = 0.75: precision = 0.5, recall = 1, and hence the same F1 score of 2/3.

Now let’s do something interesting. Let’s look at the AUCs.

[Figure: ROC curves for D1 and D2]

For D1, as you go from threshold 0 up to 0.5, the TPR stays constant at 1 while the FPR falls from 1 to 0.5. Similarly, for D2, TPR = 1 as the threshold goes from 0 to 0.75, while the FPR falls from 1 to 0.25.

Below an FPR of 0.5 for D1 (and 0.25 for D2), the TPR and FPR decrease together linearly, since the remaining examples are homogeneously mixed. Geometrically, D1's ROC curve is flat at TPR = 1 down to FPR = 0.5 and then drops linearly to the origin, giving AUC = 0.5 + 0.25 = 0.75; D2's curve stays flat down to FPR = 0.25, giving AUC = 0.75 + 0.125 = 0.875. So the AUC for D2 is greater than that of D1 even though the two have the same maximum F1 score. Observe that, over most of the range, the TPR of the more imbalanced dataset is higher at the same FPR.
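The sketch below reproduces this setup numerically. The exact composition of D2 is not stated in the text, so it is assumed here to be 50 positives and 200 negatives (the 1:4 ratio consistent with the 1/5 precision at threshold 0 mentioned in the next section); the scores are drawn so that all positives sit above the cut-off, homogeneously mixed with an equal number of negatives:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)

def make_dataset(n_pos, n_neg, cutoff):
    """Positives and an equal number of negatives are homogeneously mixed
    above `cutoff`; the remaining negatives fall below it."""
    pos_scores = rng.uniform(cutoff, 1.0, n_pos)
    neg_scores = np.concatenate([
        rng.uniform(cutoff, 1.0, n_pos),          # negatives mixed with the positives
        rng.uniform(0.0, cutoff, n_neg - n_pos),  # remaining negatives, below the cut-off
    ])
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    s = np.concatenate([pos_scores, neg_scores])
    return y, s

y1, s1 = make_dataset(50, 100, 0.50)   # D1: 50 +ve, 100 -ve, overlap above 0.5
y2, s2 = make_dataset(50, 200, 0.75)   # D2 (assumed): 50 +ve, 200 -ve, overlap above 0.75

print("D1  ROC-AUC ~", round(roc_auc_score(y1, s1), 3),
      "  F1 @ 0.50 =", round(f1_score(y1, (s1 > 0.50).astype(int)), 3))
print("D2  ROC-AUC ~", round(roc_auc_score(y2, s2), 3),
      "  F1 @ 0.75 =", round(f1_score(y2, (s2 > 0.75).astype(int)), 3))
# ROC-AUC lands near 0.75 for D1 and near 0.875 for D2,
# even though both peak at roughly the same F1 score of 2/3.
```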

This can be deceiving. So what’s the alternative?

Precision-Recall Curves

Precision-recall (PR) curves are like ROC curves, except the Y-axis is the precision and the X-axis is the recall. The area under the PR curve is computed the same way, by sweeping the threshold. Consider the PR curves for the same example discussed above.

For D1, any threshold above 0.5 has precision = 0.5, and for D2, so does any threshold above 0.75. For D1, any threshold below 0.5 has recall = 1, as does any threshold below 0.75 for D2. At threshold = 0, the precision is 1/3 for D1 and 1/5 for D2, but apart from that end of the curve, the two curves are essentially the same. The example we have used is hugely oversimplified. However, PR-AUCs are less sensitive to imbalance.
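Continuing the sketch from the ROC section (reusing y1, s1, y2, s2), average precision, scikit-learn's summary of the PR curve, puts the two datasets much closer together:

```python
from sklearn.metrics import average_precision_score

# Average precision as a PR-AUC summary, for the datasets generated above
print("D1  PR-AUC ~", round(average_precision_score(y1, s1), 3))
print("D2  PR-AUC ~", round(average_precision_score(y2, s2), 3))
# Both come out close to 0.5: the precision plateau at 0.5 dominates the curve,
# so the extra negatives in D2 do not inflate the summary the way they did for ROC-AUC.
```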

General Advice

It is important to understand how a metric behaves. Model evaluation almost never stops at a single-number metric. To understand what the model has learnt, ask: which examples did the model get wrong? Which examples was the model not confident about? Where are most of the gains in the metric coming from?

I love teaching! If you are someone who needs help with hard-to-understand concepts in data science, machine/deep learning or Python, feel free to request a tutorial class here.
