Interpreting ROC Curve and ROC AUC for Classification Evaluation

How I wish I was taught ROC Curve when I first learned it

Vinícius Trevisan
Towards Data Science



One really strong way to evaluate the predictive power of a classifier is by plotting its ROC (Receiver Operating Characteristic) Curve.

This is well-known, but do you know how to interpret ROC Curves?

ROC Curve Intuition

This curve shows us the behavior of the classifier for every threshold by plotting two variables: the True Positive Rate (TPR) and the False Positive Rate (FPR).

The True Positive Rate is often known as Recall or Sensitivity and is defined as:

TPR = TP / (TP + FN)

While the False Positive Rate is defined as:

FPR = FP / (FP + TN)

In the image below we illustrate the output of a Logistic Regression model for a given dataset. When we set the threshold at 50%, no actual positive observations are classified as negative, so FN = 0 and TP = 11; however, 4 negative examples are classified as positive, so FP = 4, and 15 negative observations are classified as negative, so TN = 15.

Adapted from Google Developers

As we move the threshold to 75%, only positive observations will be classified as positive, so TP = 7 and FP = 0, while all negative observations will be classified as negative and TN = 19. We still have 4 positive observations classified as negative, so FN = 4.

We can calculate the TPR and FPR for each of these thresholds and compare them:
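To make the comparison concrete, here is a quick sketch that plugs the counts above into the two formulas (the helper function rates is just for illustration):

# Counts from the example above: 11 actual positives, 19 actual negatives
def rates(tp, fn, fp, tn):
    tpr = tp / (tp + fn)  # recall / sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    return tpr, fpr

print(rates(tp=11, fn=0, fp=4, tn=15))  # 50% threshold -> (1.0, ~0.21)
print(rates(tp=7, fn=4, fp=0, tn=19))   # 75% threshold -> (~0.64, 0.0)

So the 50% threshold catches every positive but flags some negatives, while the 75% threshold flags no negatives but misses some positives.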

The best threshold depends on the objective of the model. If it is more important to have all positives classified as positive, even if it means classifying some negatives as positive, the 50% threshold is better (see the example below).

1) Car breakdown prediction — High Recall, Low Precision

Assume that you work for a car manufacturer that collects data from cars, and your model tries to predict when a car will break, so that the customer is warned to make a visit to the repair shop for a check-up.

In this scenario you might want a high recall, which means that all owners of cars with potential flaws will be warned to check them. However, by maximizing the recall we might also send warnings for cars that are not likely to break soon (False Positives), thus reducing the precision. The owner of a False Positive car faces the minor inconvenience of going to the repair shop only to find out that the car is fine, but on the other hand, most cases of cars that might break (and perhaps even cause accidents) are covered.

We reduce FN (and raise the recall) but increase FP (and lower the precision).

Now, if we wish to have a model with high confidence in every observation classified as positive, even if it means misclassifying a few positive observations as negative, the 75% threshold is the best choice (see the stock picking example below).

2) Stock Picking prediction — Low Recall, High Precision

Here you are a stock market trader and wish to build a model to help you pick stocks. This model will classify as positive any stock with a high probability of yielding good returns.

In this scenario you want to buy only the best stocks, because your money is limited and you do not want to take on much risk. This is a case where you want to raise the precision and pick only the stocks most likely to yield returns, even if it means that some good ones may be left out (False Negatives).

By picking only the best ones we reduce the False Positives (and raise the precision) while accepting an increase in False Negatives (and a reduction in recall).
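In code, choosing a different threshold simply means comparing the predicted probabilities of class "1" against it instead of the default 0.5. A minimal sketch (the probability values here are made up for illustration):

import numpy as np

# Made-up probabilities of class "1" for five observations
y_proba_1 = np.array([0.10, 0.45, 0.62, 0.80, 0.95])

threshold = 0.75  # high-precision setting, as in the stock example
y_pred = (y_proba_1 >= threshold).astype(int)
print(y_pred)     # [0 0 0 1 1]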

Interpreting the ROC Curve

The intent of the ROC Curve is to show how well the model works for every possible threshold, as a relation of TPR vs FPR. So, to plot the curve, we calculate these two variables for each threshold and plot the resulting points on a plane.

On the plots below, the green line represents where TPR = FPR, while the blue line represents the ROC curve of the classifier. If the ROC curve is exactly on the green line, it means that the classifier has the same predictive power as flipping a coin.

Image by author

In the left plot the blue line is relatively close to the green one, which means the classifier is poor. The rightmost plot shows a good classifier, with the ROC curve hugging the left and top axes and the “elbow” close to the coordinate (0, 1). The middle one is a good enough classifier, closer to what is achievable with real-world data.

Another way to interpret the ROC curve is by thinking about the separation of the classes, and we can illustrate that with histograms, as below.

Image by author

The bad classifier (left) has too much overlap between the classes, so no threshold can separate them and the model is unable to make good predictions. As expected, the good classifier (right) has almost no overlap, so we can easily find a threshold that separates the predictions into their correct classes. Finally, the middle one sits in the middle ground: there is some overlap, but good results can still be achieved by setting the threshold accordingly.
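If you want to reproduce this view for your own model, one simple option is to plot a histogram of the predicted probability of class "1" for each true class (a minimal sketch, assuming y_test and y_proba produced by a fitted binary classifier, as in the code later in this article):

import seaborn as sns
import matplotlib.pyplot as plt

# Distribution of the predicted probability of class "1", split by true class
sns.histplot(x=y_proba[:, 1], hue=y_test, bins=25, stat="density", common_norm=False)
plt.xlabel("Predicted probability of class 1")
plt.show()

The more the two histograms overlap, the closer the ROC curve will be to the diagonal.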

ROC AUC

Now you know how useful ROC Curves are, but how do we evaluate them? The answer is the Area Under the Curve (AUC).

The Area Under the ROC Curve (AUROC), or simply the ROC AUC score, is a metric that allows us to compare different ROC Curves.

The green line is the lower limit, and the area under it is 0.5, while a perfect ROC Curve would have an area of 1. The closer our model’s ROC AUC is to 1, the better it is at separating the classes and making correct predictions.

We can use sklearn to easily calculate the ROC AUC:

from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_real, y_pred)
print(f"ROC AUC: {score:.4f}")

The output is:

ROC AUC: 0.8720

When using y_pred, the ROC Curve has only “1”s and “0”s to calculate the variables from, so it is a rough approximation. To avoid this effect and get more accurate results, it is advisable to use y_proba, the probabilities of class “1”, when calculating the ROC AUC:

score = roc_auc_score(y_real, y_proba[:, 1])
print(f"ROC AUC: {score:.4f}")

The output is:

ROC AUC: 0.9271

Plotting the ROC Curve from scratch

I believe the best way to understand a concept is by experimenting with it, so let’s learn how to plot the ROC Curve from scratch. Later on I will show how to easily do that with the sklearn library.

You can find the code available on my github repository, so feel free to skip this section.
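If you want to follow along without the repository, any binary-classification dataset works. For example, a synthetic one (this is just an assumption for illustration, not the dataset used in the article):

from sklearn.datasets import make_classification

# Hypothetical stand-in for the article's dataset: two overlapping classes
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_classes=2, random_state=42)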

First we need to train a classifier model on the dataset:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# Split train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
# Create the model object
model = GaussianNB()
# Fit the model to the training data
model.fit(X_train, y_train)
# Predict the classes on the test data
y_pred = model.predict(X_test)
# Predict the classes on the test data, and return the probabilities for each class
y_proba = model.predict_proba(X_test)

Then we define a function that calculates TPR and FPR from the true labels and the predicted classes, based on the equations presented before.

from sklearn.metrics import confusion_matrix

def calculate_tpr_fpr(y_real, y_pred):
    # Calculate the confusion matrix and recover each element
    cm = confusion_matrix(y_real, y_pred)
    TN = cm[0, 0]
    FP = cm[0, 1]
    FN = cm[1, 0]
    TP = cm[1, 1]
    # Calculate tpr and fpr
    tpr = TP / (TP + FN)      # sensitivity - true positive rate
    fpr = 1 - TN / (TN + FP)  # 1 - specificity - false positive rate

    return tpr, fpr
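As a quick sanity check, we can apply it to the hard predictions of the model trained above:

tpr, fpr = calculate_tpr_fpr(y_test, y_pred)
print(f"TPR: {tpr:.3f}, FPR: {fpr:.3f}")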

We want to evaluate TPR and FPR for every threshold, so we define a function that will create “n” thresholds and iterate over them calculating the variables and storing them in a list. Those will be the coordinates of the ROC Curve points.

The hard predictions of a binary classifier can only be “0” or “1”, so moving a threshold over them has no effect. To build the correct curve we need the probabilities of classifying each observation as class “1”, which we get with the model.predict_proba(X_test) method.

def get_n_roc_coordinates(y_real, y_proba, n = 50):
    tpr_list = [0]
    fpr_list = [0]
    for i in range(n):
        threshold = i / n
        y_pred = y_proba[:, 1] > threshold
        tpr, fpr = calculate_tpr_fpr(y_real, y_pred)
        tpr_list.append(tpr)
        fpr_list.append(fpr)
    return tpr_list, fpr_list

Lastly, we can use seaborn to plot the points and the curve, by passing the tpr and fpr lists to the function below:

import seaborn as sns
import matplotlib.pyplot as plt

def plot_roc_curve(tpr, fpr, scatter = True):
    plt.figure(figsize = (5, 5))
    if scatter:
        sns.scatterplot(x = fpr, y = tpr)
    sns.lineplot(x = fpr, y = tpr)
    sns.lineplot(x = [0, 1], y = [0, 1], color = 'green')
    plt.xlim(-0.05, 1.05)
    plt.ylim(-0.05, 1.05)
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")

The result is a rather good ROC Curve, using lines as approximations for the segments without calculated coordinates.

# Calculates 10 coordinates of the ROC Curve
tpr, fpr = get_n_roc_coordinates(y_test, y_proba, n = 10)
# Plots the ROC curve
plot_roc_curve(tpr, fpr)

Plotting the ROC Curve with Scikit-Learn

Surely you won’t build the ROC Curve from scratch every time you need that, so I will show how to plot it with scikit-learn.

Check how simple it is:

from sklearn.metrics import roc_curve
from sklearn.metrics import RocCurveDisplay

def plot_sklearn_roc_curve(y_real, y_pred):
    fpr, tpr, _ = roc_curve(y_real, y_pred)
    roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
    roc_display.figure_.set_size_inches(5, 5)
    plt.plot([0, 1], [0, 1], color = 'g')

# Plots the ROC curve using the sklearn methods - Good plot
plot_sklearn_roc_curve(y_test, y_proba[:, 1])
# Plots the ROC curve using the sklearn methods - Bad plot
plot_sklearn_roc_curve(y_test, y_pred)

The roc_curve function calculates all FPR and TPR coordinates, while the RocCurveDisplay uses them as parameters to plot the curve. The line plt.plot([0, 1], [0, 1], color = 'g') plots the green line and is optional.

If you use the output of model.predict_proba(X_test)[:, 1] as the parameter y_pred, the result is a beautiful ROC curve:

But if you pass the output of model.predict(X_test) directly, the method won’t have all the necessary information to build all the points, and the plot will be a rough approximation made of two line segments:
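As a final note, the coordinates returned by roc_curve can also be fed to sklearn.metrics.auc to compute the area from those same points (a small sketch reusing y_test and y_proba from before):

from sklearn.metrics import roc_curve, auc

# Reuses y_test and y_proba from the model fitted earlier in this article
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
print(f"ROC AUC from the curve coordinates: {auc(fpr, tpr):.4f}")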

Conclusion

The conclusion of this article is: ROC AUC is ultimately a measure of the separation between classes in a binary classifier. I wish this was how it had been explained to me at the start of my journey as a data scientist, and I hope it makes a difference for the readers of this article.


And if you like this subject, take a look on my article explaining the use of ROC Curves for multiclass classification:
