Performance Metrics: Receiver Operating Characteristic (ROC) - Area Under Curve (AUC)

Understanding the ROC-AUC performance metric with an example

Vaibhav Jayaswal
Towards Data Science



ROC Curve

The Receiver Operating Characteristic (ROC) curve is defined for binary classification, although it can be extended to multiclass problems (e.g., one-vs-rest).

In binary classification, when a model outputs probability scores, the simplest approach is to use 0.5 as the threshold. If the probability score of a query point is greater than 0.5, the model classifies it as class 1 (say, positive); otherwise as class 0 (negative). To measure the performance of the model, we can use accuracy, the confusion matrix, precision, recall, and the F1 score.
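As a minimal sketch (the labels, scores, and variable names below are hypothetical, not taken from this article's example), thresholding at 0.5 and computing those metrics looks like this:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical ground-truth labels and model probability scores
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.4, 0.65, 0.2, 0.55, 0.1, 0.8, 0.3])

# Simplest model: threshold the probability scores at 0.5
y_pred = (y_prob >= 0.5).astype(int)

print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
```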

A question that arises here: will a threshold of 0.5 give meaningful results for every model?

NO

Selecting a threshold that produces meaningful results is problem specific. For example, in a cancer detection problem, a low threshold makes the model predict more people as positive (including many false positives). Conversely, a very high threshold risks missing actual patients.

How should we decide the appropriate threshold value?

The ROC curve is one technique for finding an appropriate threshold value.

Let’s walk through a small example.

STEPS

  1. Take the unique probability scores (in descending order) as thresholds. If we have k unique probability scores, there will be k thresholds.
  2. For each threshold, predict the class labels for all the query points.

For each prediction, we calculate the true positive rate (TPR) and false positive rate (FPR).
Let’s look at the confusion matrix.

Confusion Matrix

The True Positive Rate (TPR) measures true positives out of the total actual positives: TPR = TP / (TP + FN).

The False Positive Rate (FPR) measures false positives out of the total actual negatives: FPR = FP / (FP + TN).
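In code, the two rates are just ratios of the confusion-matrix counts. A small helper sketch (the counts passed in are taken from prediction 1 below):

```python
def tpr_fpr(tp, fn, fp, tn):
    """Return (TPR, FPR) from the four confusion-matrix counts."""
    tpr = tp / (tp + fn)   # true positives out of all actual positives
    fpr = fp / (fp + tn)   # false positives out of all actual negatives
    return tpr, fpr

# Counts of prediction 1 below: TP=1, FN=2, FP=0, TN=2
print(tpr_fpr(1, 2, 0, 2))   # (0.333..., 0.0)
```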

If we have k unique probability scores, we will have k different predictions and hence k (TPR, FPR) pairs.
In the above example, we have 5 predictions, so there will be 5 (TPR, FPR) pairs.

PREDICTION 1 (threshold = 0.95)

Confusion matrix for prediction 1 (threshold = 0.95)

TPR1 = 1/(1+2) = 1/3
FPR1 = 0/2 = 0

PREDICTION 2 (threshold=0.92)

Confusion matrix for prediction 2 (threshold = 0.92)

TPR2 = 2/(2+1) = 2/3
FPR2 = 0/2 = 0

PREDICTION 3 (threshold = 0.7)

Confusion matrix for prediction 3 (threshold = 0.7)

TPR3 = 3/3 = 1
FPR3 = 0/2 = 0

PREDICTION 4 (threshold = 0.6)

Confusion matrix for prediction 4 (threshold = 0.6)

TPR4 = 3/3 = 1
FPR4 = 1/(1+1) = 1/2

PREDICTION 5 (threshold = 0.46)

Confusion matrix for prediction 5 (threshold = 0.46)

TPR5 = 3/3 = 1
FPR5 = 2/2 = 1
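The five pairs above can be reproduced with a short loop. The labels and scores below are a reconstruction consistent with the confusion matrices (three actual positives scored 0.95, 0.92, 0.70 and two actual negatives scored 0.60, 0.46); treat them as an assumption, since the raw data table is not reproduced here.

```python
import numpy as np

# Reconstructed example data (assumption): true labels and probability scores
y_true = np.array([1, 1, 1, 0, 0])
y_prob = np.array([0.95, 0.92, 0.70, 0.60, 0.46])

# Each unique score, taken in descending order, serves as a threshold
for thr in sorted(np.unique(y_prob), reverse=True):
    y_pred = (y_prob >= thr).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    print(f"threshold={thr:.2f}  TPR={tp / (tp + fn):.2f}  FPR={fp / (fp + tn):.2f}")
```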

  3. Plot TPR against FPR, with TPR on the y-axis and FPR on the x-axis.

ROC curve with 5 (FPR, TPR) pairs. The number in parentheses denotes the prediction number.

The blue line shows a random model, i.e., one whose output carries no information about the class of a query point. The appropriate threshold is the point where TPR is maximum and FPR is minimum, because we want as many true positives and true negatives, and as few false positives and false negatives, as possible.

For our example, the best threshold is the one used for prediction 3, i.e., 0.70.
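scikit-learn's roc_curve sweeps the thresholds for us. One common way to pick the "top-left" point described above is Youden's J statistic (maximize TPR - FPR); a sketch on the same reconstructed data:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([1, 1, 1, 0, 0])                 # reconstructed labels (assumption)
y_prob = np.array([0.95, 0.92, 0.70, 0.60, 0.46])  # reconstructed scores (assumption)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
best = np.argmax(tpr - fpr)                 # Youden's J: high TPR with low FPR
print("best threshold:", thresholds[best])  # 0.7 for this example
```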

AUC

The area under the TPR-FPR curve gives an idea of how effective the model is: the higher the AUC score, the better the model. AUC scores are commonly used to compare different models.
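A sketch of computing the score on the same reconstructed data, both as the trapezoidal area under the (FPR, TPR) points and directly from the scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

y_true = np.array([1, 1, 1, 0, 0])                 # reconstructed labels (assumption)
y_prob = np.array([0.95, 0.92, 0.70, 0.60, 0.46])  # reconstructed scores (assumption)

fpr, tpr, _ = roc_curve(y_true, y_prob)
print(auc(fpr, tpr))                  # trapezoidal area under the ROC curve -> 1.0
print(roc_auc_score(y_true, y_prob))  # same value computed directly from the scores
```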

The maximum value of AUC is 1. WHY?

Because both TPR and FPR lie between 0 and 1, the ROC curve is contained in the unit square, so the maximum possible area is 1.

An AUC of 1 means the classifier separates the two classes perfectly. In the above example, we have a perfect classifier, as the area under the orange curve is 1.

What can we infer if AUC = 0.5?

This means the classifier is unable to separate the two classes; its output is effectively random.

What can we infer if AUC < 0.5?

That means the model is ranking the classes the wrong way round: actual positives tend to be scored lower than actual negatives, so positives are predicted as negative and vice versa.
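A tiny illustration with hypothetical scores: if a model consistently scores actual negatives above actual positives, its AUC falls below 0.5, and flipping the scores flips the AUC to one minus its value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical model that ranks actual negatives above actual positives
y_true = np.array([1, 1, 1, 0, 0])
y_prob = np.array([0.10, 0.20, 0.30, 0.80, 0.90])

print(roc_auc_score(y_true, y_prob))      # 0.0 -- worse than random
print(roc_auc_score(y_true, 1 - y_prob))  # 1.0 -- inverting the scores inverts the AUC
```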

The image below shows the ROC curves of two models. How do we decide which one is better?

ROC curve of 2 models

Model 2's area under the curve (AUC) is larger than Model 1's. Therefore, Model 2 is the better classifier.

A limitation of AUC scores is that they can be misleading on heavily imbalanced data: because the FPR is computed over a very large pool of negatives, a weak model can still achieve a high AUC while producing many false positives relative to its true positives.

Conclusion

The Receiver Operating Characteristic (ROC) curve is used to determine an appropriate threshold for models that output probability scores in binary classification. Area Under the Curve (AUC) scores are used to compare different models, although they can be affected by imbalanced datasets.

Thanks for reading!
