Hello there!
Today, we are delving into a specific metric used for evaluating model performance: the AUC score. But before we get into the specifics, have you ever wondered why such unintuitive scores are sometimes necessary to assess the performance of our models?
Whether our model handles a single class or multiple classes, the underlying objective remains constant: maximizing correct predictions while minimizing incorrect ones. To explore this basic objective, let's first look at the obligatory confusion matrix encompassing True Positives, False Positives, True Negatives, and False Negatives.

For any classification or prediction problem, each individual prediction has only two possible outcomes: it is either right (True) or wrong (False).
Consequently, every metric designed to gauge the performance of a prediction or classification algorithm is founded on these two measures. The simplest metric that accomplishes this is Accuracy.
Accuracy
In the context of classification and prediction, accuracy signifies the proportion of correctly predicted instances among the total. It's a very straightforward and intuitive measure of a model's predictive performance.
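As a quick, minimal sketch (using made-up confusion-matrix counts, not the table referenced below), accuracy is simply correct predictions divided by all predictions:

# Accuracy from hypothetical confusion-matrix counts (illustrative values only)
tp, tn, fp, fn = 85, 5, 5, 5
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f'Accuracy: {accuracy:.2f}')  # 0.90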

However, is accuracy truly sufficient?
While accuracy is a good general measure of a model's performance, its inadequacy becomes evident when we examine the table below, which we will frequently reference in this article. The table shows the performance metrics of four models, each with somewhat suboptimal results, yet all of them exhibit high accuracy. For instance, in the first and second cases there's a clear bias towards one class, resulting in dismal classification for the less common class, yet the accuracy is 90%, which is quite misleading.

This helps us conclude:
While accuracy is valuable, it can sometimes mislead, especially in scenarios with class imbalance or when certain errors carry substantial consequences.
For instance, in situations where the cost of missing positive cases (Type 2 error) or falsely identifying negatives (Type 1 error) is high, relying solely on accuracy might not provide a comprehensive assessment of a model’s effectiveness.
The strengths of accuracy lie in its simplicity and its applicability across classes.
Now, having considered accuracy, let's venture a little deeper into the realm of prediction and classification, where several questions emerge:
- What is our objective?
- Is our data balanced?
- Do we prioritize one class over the other?
- Do we lean towards avoiding False Positives (Type 1 error), or do we emphasize minimizing False Negatives (Type 2 error)?
After asking these questions, it seems a little naive to evaluate model performance based on accuracy alone, so we turn our attention to three other metrics for evaluating model performance, namely precision, recall, and F1-score.
Precision
Precision gauges the accuracy with which our model identifies a specific class; in a two-class scenario, that is usually the positive class. It measures the reliability of predictions for the said class.
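As a minimal sketch, assuming hypothetical confusion-matrix counts, precision is the share of predicted positives that are actually positive:

# Precision from hypothetical confusion-matrix counts (illustrative values only)
tp, fp = 40, 10
precision = tp / (tp + fp)            # how reliable the positive predictions are
print(f'Precision: {precision:.2f}')  # 0.80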
Consider a scenario where a Machine Learning algorithm predicts loan approvals based on borrower characteristics. While the occasional loan denial of an eligible candidate (False Negative) might be acceptable to the company, the primary concern is avoiding the unwarranted approval of loans to individuals who should not qualify (False Positive).

In that sense, precision aims to minimize Type 1 errors: instances where items are incorrectly accepted when they should be rejected.
Let’s demonstrate this by returning to our table:

We observe from cases 1 and 3 that a higher precision is achieved when the ratio of true positive to false positive predictions for a specific (positive) class is large, irrespective of the overall model performance. Thus,
For a specific class, high precision points towards a low Type 1 error.
Next, we have a counterpart to precision for Type 2 errors:
Recall, Sensitivity, or TPR
Recall, just like precision, centers on our predictive ability for a specific class. It quantifies how effectively we can accurately select instances belonging to a particular category from the entire pool.
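As a minimal sketch with hypothetical counts, recall is the share of actual positives that the model manages to capture:

# Recall from hypothetical confusion-matrix counts (illustrative values only)
tp, fn = 40, 20
recall = tp / (tp + fn)        # how many actual positives were captured
print(f'Recall: {recall:.2f}')  # ~0.67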
Consider a scenario where our model's aim is to prevent credit fraud. We can probably handle cases where a non-fraudulent activity is flagged as fraudulent (False Positive), but we do not want to miss an activity that might actually be fraudulent (False Negative).
In this context, aiming for high recall involves minimizing Type 2 errors: ensuring that you capture as many relevant instances as possible, even if it means flagging a few innocent ones along the way.

Let's return to our table for the third time:

We can see from cases 1 and 4 that recall excels when positive classifications are maximized. In these cases, even if false positives are present or the negative class's performance is subpar, recall remains high. Therefore,
High recall for a specific class ensures the minimization of Type 2 errors.
Now, what if we aim to minimize both types of errors for a class?
This is where the F1 score comes into play:
F1-Score
The F1-score is the harmonic mean of precision and recall. It essentially tries to find a balance between the precision and recall of a class and, in that manner, also attempts a balance between Type 1 and Type 2 errors.
In essence, the F1 score shows a classification model's overall effectiveness for a particular class.
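A minimal sketch of the harmonic-mean formula, reusing the illustrative precision and recall values from the sketches above:

# F1 as the harmonic mean of precision and recall (illustrative values only)
precision, recall = 0.80, 0.67
f1 = 2 * precision * recall / (precision + recall)
print(f'F1 score: {f1:.2f}')  # ~0.73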

Let’s return to our table for the fourth and final perspective:

This time, we observe that the F1 score performs well when both precision and recall excel, which is true for case 1. Note that the model is still performing sub-optimally in this scenario, but since the number of True Positives is high, and both False Positives and False Negatives are small in number, the F1 score is high. We can therefore infer,
Being more conservative, a high F1 score for a specific class minimizes both Type 1 and Type 2 errors.
Having examined these metrics, it's evident that each of them possesses certain limitations. Furthermore, unlike accuracy, F1, Precision, and Recall are not class-agnostic, while accuracy itself remains vulnerable to class imbalances.
Moreover, we never asked the question: "What’s the prediction threshold?" In other words, is there a clear separation between classes? Are positive instances assigned predictions around 0.8–0.9, or are they closer to 0.51?
This is where the ROC curve and AUC score come into play, albeit with a fair degree of unintuitiveness.
Understanding the AUC Score
The AUC score, short for Area Under the Curve, is measured by calculating the area under the Receiver Operating Characteristic (ROC) curve.
The ROC curve is a plot with Recall/True Positive Rate (TPR) on the y-axis and False Positive Rate (FPR) on the x-axis. The peculiar name of ROC stems from its origin in the field of Electrical Engineering.
In order to construct the ROC curve, FPR and TPR are calculated for various classification thresholds. A classification threshold is the prediction value, such as 0.5, above which instances are assigned to one class and below which they are assigned to the other.
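As a quick sketch, assuming a model has already produced probability scores, thresholding works like this:

import numpy as np
# Hypothetical predicted probabilities from a binary classifier
y_probs = np.array([0.20, 0.45, 0.55, 0.90])
threshold = 0.5
y_pred = (y_probs >= threshold).astype(int)  # array([0, 0, 1, 1])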

However, since creating the complete curve can be cumbersome and does not provide a quantifiable measure, the area under the curve is measured instead.
Below, I show the progression of creating a ROC curve by plotting FPR vs TPR values for various classification thresholds. I mark each newly added point in red and the prior ones in blue. We can see that, upon joining these independent points, the curve looks quite similar to the light blue curve in the above image and seems consistent with the model's AUC score of 0.91.

The figure also shows how changing thresholds affects other metrics like accuracy, F1, precision, and recall.
The code used to create the above is as follows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve, auc, f1_score, accuracy_score

# Synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Train/test data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# LR model training
model = LogisticRegression()
model.fit(X_train, y_train)

# Model predictions (probability of the positive class)
y_probs = model.predict_proba(X_test)[:, 1]

# Generating random thresholds for creating the ROC curve
num_thresholds = 9
random_thresholds = np.sort(np.random.rand(num_thresholds))

# Visualization
fig, axes = plt.subplots(3, 3, figsize=(16, 16))
axes = axes.flatten()

for i, threshold in enumerate(random_thresholds):
    ax = axes[i]
    # Plot one point per threshold seen so far; the newest point is red, the rest blue
    for t in random_thresholds[:i + 1]:
        y_pred = (y_probs >= t).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        fpr = fp / (fp + tn)
        tpr = tp / (tp + fn)
        color = 'red' if t == threshold else 'blue'
        label = f'Threshold {t:.2f}' if t == threshold else None
        ax.scatter(fpr, tpr, color=color, label=label, s=50)
    ax.plot([0, 1], [0, 1], color='gray', linestyle='--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate (FPR)')
    ax.set_ylabel('True Positive Rate (TPR)')
    ax.set_title(f'Points at Different Thresholds (New Point in Red)\nRandom Threshold {threshold:.2f}')
    ax.legend(loc="lower right")

    # Calculating the AUC score (threshold-independent)
    fpr, tpr, _ = roc_curve(y_test, y_probs)
    roc_auc = auc(fpr, tpr)

    # Calculating precision, recall, F1 score, and accuracy at the current threshold
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)

    # Displaying the metrics for the current classification threshold
    metrics_text = (f'AUC: {roc_auc:.2f}\nPrecision: {precision:.2f}\nRecall: {recall:.2f}\n'
                    f'F1 Score: {f1:.2f}\nAccuracy: {accuracy:.2f}')
    ax.text(0.5, 0.1,
            metrics_text,
            transform=ax.transAxes,
            fontsize=10,
            va='bottom',
            ha='center')

plt.suptitle('Progression of ROC Curve via estimating points at different Classification Thresholds', size=26, y=1)
# plt.savefig('ROC.png')
plt.tight_layout()
plt.show()
This code snippet showcases the process of creating an ROC curve and calculating the AUC score using Python. It involves generating a synthetic dataset, splitting it into training and testing sets, training a logistic regression model, and then plotting points on the ROC curve for different classification thresholds.
A tangent: would it matter if we instead plotted the True Negative Rate against the False Negative Rate?
Nope. Since TNR = 1 - FPR and FNR = 1 - TPR, the resulting curve is just a mirrored version of the ROC curve, and the area under it stays the same.
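A quick numeric sanity check of these identities, using made-up confusion-matrix counts:

# Verify TNR = 1 - FPR and FNR = 1 - TPR for arbitrary counts
tn, fp, fn, tp = 50, 10, 5, 35
fpr, tpr = fp / (fp + tn), tp / (tp + fn)
tnr, fnr = tn / (tn + fp), fn / (fn + tp)
assert abs(tnr - (1 - fpr)) < 1e-12
assert abs(fnr - (1 - tpr)) < 1e-12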

What about multiple classes?
Well, a ROC curve can only be computed for two classes at a time, so things get a bit more complex with multiple classes, but we can take a one-vs-all approach for every class, as sketched below.
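Here is a hedged sketch of one way to do this with scikit-learn's roc_auc_score in its one-vs-rest ('ovr') mode; the dataset and model below are purely illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Illustrative 3-class dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# One probability column per class, shape (n_samples, n_classes)
y_probs = model.predict_proba(X_test)

# One-vs-rest AUC, averaged across the three classes
ovr_auc = roc_auc_score(y_test, y_probs, multi_class='ovr')
print(f'One-vs-rest AUC: {ovr_auc:.2f}')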
Now that we've seen how it's constructed, let's go a little deeper into how to interpret the AUC score and what it means for the model.
Interpreting the AUC Score
We will look at four cases, and I demonstrate each via images where everything to the right of the classification threshold is predicted as positive and everything to the left is predicted as negative. The actual labels for the data are shown in the respective rectangles representing them.
0 AUC Score
A 0 AUC score would be quite hard to achieve and possibly points towards some serious human error. It means that the TPR is 0 at every classification threshold except when the FPR is 1; at FPR 1, the TPR goes from 0 to 1 across the remaining thresholds, making the area under this curve effectively 0. Visually, it means that there is perfect class separation but every label is reversed.

0.5 AUC Score
This is the case where the model didn't learn anything and the classification or prediction is just random. Here the ROC curve is a straight diagonal line where TPR equals FPR. Visually, it would seem that there is effectively no class separation at all.

0.5 < AUC Score < 1
These are the most common cases if we do things right in our model. An AUC score between 0.5 and 1 means that the model has at least learned something superior to a random classifier, but there are still instances in the data that overlap and the model can't separate the classes completely. The closer the AUC score is to one, the greater the separation achieved between the classes.

Perfect AUC Score
A perfect AUC score of 1.0 indicates that the model has perfect discrimination ability. However, achieving such a score can be a sign of potential overfitting and unrealistic model behavior, particularly when dealing with real-world datasets and scenarios.
The TPR is 1 at every classification threshold except the most extreme ones, where the FPR is 0 and the TPR climbs from 0 to 1, eventually leading to an area under the curve equal to 1. Visually, this means that there is perfect class separation between the two classes and the predictions are also correct.

It's important to note that real-world data is often noisy and contains inherent uncertainty. Expecting a model to produce perfect separation can be unrealistic, and even if it happens, it's probably a case of overfitting.
In cases where an unlikely AUC score of 1 is achieved, the model will probably fail to generalize and perform well in practical scenarios due to the high probability of overfitting.
A more balanced and robust model is one that achieves a reasonably high AUC score while also allowing for some level of uncertainty in its predictions.
Let’s look at some benefits of using AUC score.
Benefits of AUC Score
There are quite a few benefits to the AUC score compared to other metrics:
- Class Agnostic: In contrast to metrics like precision, recall, and F1 score, which are dependent on a chosen positive class, AUC provides a more global assessment of a model’s discriminative power, regardless of class distribution.
- Prediction Threshold Agnostic: One of the distinguishing features of AUC score is that it considers the model’s performance across different classification thresholds, providing a comprehensive view of its ability to discriminate between classes.
- Insensitive to Class Imbalance: Since AUC measures how well the model can rank positive and negative instances relative to each other, it is less prone to distortions caused by imbalanced class distributions.
- ROC Curve Threshold Selection: While AUC is threshold-agnostic, the ROC curve itself can help you visually choose a threshold that gives the best performance, as sketched below.
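As one possible sketch, a threshold can be picked from the ROC curve by maximizing TPR - FPR (Youden's J statistic); this assumes the y_test and y_probs arrays from the earlier code snippet:

import numpy as np
from sklearn.metrics import roc_curve

# fpr, tpr, and the corresponding thresholds across the full range of cutoffs
fpr, tpr, thresholds = roc_curve(y_test, y_probs)

# Youden's J: the threshold where TPR - FPR is largest
best_idx = np.argmax(tpr - fpr)
best_threshold = thresholds[best_idx]
print(f'Threshold maximizing TPR - FPR: {best_threshold:.2f}')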
Disadvantages of AUC Score
While there aren't many disadvantages and this isn't an exhaustive list, it's important to note that under extreme class imbalance the AUC score may be affected. Furthermore, AUC treats all misclassifications equally. In many real-world scenarios, the costs and benefits associated with different types of errors can vary; AUC doesn't take this into account and might not fully represent performance in cases where one type of error is more critical than another.
Wrapping Up
So let's wrap up by saying that although the AUC score is an excellent measure of model performance, each metric has its strengths in the right context.
As we come to an end, I hope this was a valuable read and that you walk away with a clear understanding of what an AUC score is and how to interpret it.
If you found this insightful let me know in the comments. 🙂
Other Resources for the AUC Score
Magician’s Corner: 9. Performance Metrics for Machine Learning Models