
Why We Care About the Log Loss

One of the most common metrics used in Kaggle competitions

The most critical part of a Machine Learning pipeline is performance evaluation. A robust and thorough evaluation process is required to understand the performance and shortcomings of a model.

When it comes to a classification task, log loss is one of the most commonly used metrics. It is also known as the cross-entropy loss. If you follow or join Kaggle competitions, you will see that log loss is the predominant choice of evaluation metric.

In this post, we will see what makes the log loss the number one choice. Before we start on the examples, let’s briefly explain what the log loss is.

Log loss (i.e. cross-entropy loss) evaluates the performance by comparing the actual class labels and the predicted probabilities. The comparison is quantified using cross-entropy.

Cross-entropy quantifies the difference between two probability distributions. In supervised learning tasks, we have a target variable that we are trying to predict. The actual distribution of the target variable and the distribution implied by our predictions are compared using cross-entropy. The result is the cross-entropy loss, also known as the log loss.
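As a concrete reference, here is a minimal sketch of the binary log loss written with NumPy (the function name binary_log_loss is mine, not from any library):

import numpy as np

def binary_log_loss(y_true, p_pred):
    # y_true: array of 0/1 labels; p_pred: predicted probability of class 1
    p_pred = np.clip(p_pred, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

For each observation, this takes the negative log of the probability assigned to the true class, then averages over all observations.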

When calculating the log loss, we take the negative of the natural log of the predicted probabilities. The more certain we are in the prediction, the lower the log loss (assuming the prediction is correct).

For instance, -log(0.9) is equal to 0.10536 and -log(0.8) is equal to 0.22314. Thus, being 90% sure results in a lower log loss than being 80% sure.
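You can verify these values directly:

-np.log(0.9)  # 0.10536...
-np.log(0.8)  # 0.22314...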

I explained the concepts of entropy, cross-entropy, and log loss in detail in a separate post if you’d like to read further. This post is more like a practical guide to show what makes the log loss so important.

In a classification task, models usually output a probability value for each class. Then the class with the highest probability is assigned as the predicted class. Traditional metrics like classification accuracy, precision, and recall evaluate the performance by comparing the predicted class with the actual class.

Consider the following case.

import numpy as np

# Actual class labels of 5 observations
y_true = np.array([1, 0, 0, 1, 1])

This is a binary classification task and there are 5 observations labeled as 0 or 1. Here are the outputs of two different models.

# Each row holds [P(class 0), P(class 1)] for one observation
y_pred1 = np.array(
    [[0.2, 0.8],    # predict 1
     [0.6, 0.4],    # predict 0
     [0.7, 0.3],    # predict 0
     [0.65, 0.35],  # predict 0
     [0.25, 0.75]]) # predict 1
y_pred2 = np.array(
    [[0.1, 0.9],    # predict 1
     [0.7, 0.3],    # predict 0
     [0.85, 0.15],  # predict 0
     [0.55, 0.45],  # predict 0
     [0.2, 0.8]])   # predict 1

Although the predicted probabilities are different, the predicted classes are the same when the class with the highest probability is selected.
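We can confirm this by taking the argmax of each row:

np.argmax(y_pred1, axis=1)  # array([1, 0, 0, 0, 1])
np.argmax(y_pred2, axis=1)  # array([1, 0, 0, 0, 1])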

Thus, if we compare these two models using classification accuracy, the result will be the same for both models.

from sklearn.metrics import log_loss, accuracy_score

# Predicted classes (the argmax of each row, as shown above)
y_pred1_classes = np.array([1, 0, 0, 0, 1])
y_pred2_classes = np.array([1, 0, 0, 0, 1])

accuracy_score(y_true, y_pred1_classes)  # 0.8
accuracy_score(y_true, y_pred2_classes)  # 0.8

Both models correctly predicted 4 out of 5 observations, so the accuracy is 0.8 in both cases. It seems like these models perform the same, but it would be a mistake to conclude so: there are significant differences between the predicted probabilities.

Let’s now compare them based on the log loss.

log_loss(y_true, y_pred1)  # 0.4856
log_loss(y_true, y_pred2)  # 0.3292

As you can see, there is a big difference. It’d be wrong to assume these two models perform the same.
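To see where these numbers come from, we can pick out the probability each model assigned to the true class and average the negative logs ourselves:

# Probability each model assigned to the true class of each observation
p_true1 = y_pred1[np.arange(len(y_true)), y_true]  # [0.8, 0.6, 0.7, 0.35, 0.75]
p_true2 = y_pred2[np.arange(len(y_true)), y_true]  # [0.9, 0.7, 0.85, 0.45, 0.8]
np.mean(-np.log(p_true1))  # 0.4856...
np.mean(-np.log(p_true2))  # 0.3292...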

Log loss rewards being more certain in a prediction when the prediction is correct. As shown above, the log loss of a correct prediction made with 90% confidence is lower than that of one made with 80% confidence.

On the other hand, it also penalizes high probabilities assigned to wrong predictions. Suppose the true class label of an observation is 1, and two models predict 0 with probabilities of 0.65 (65%) and 0.55 (55%). The log loss contribution of the 0.65 prediction is higher than that of the 0.55 prediction.
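A quick check (the true class is 1, so the loss term is the negative log of the probability assigned to class 1):

-np.log(1 - 0.65)  # -log(0.35) = 1.0498...
-np.log(1 - 0.55)  # -log(0.45) = 0.7985...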

Consider the following scenario: all predicted probabilities are the same as before, except for one observation, which is the only wrong prediction. The second model is more certain about that wrong prediction. Let's compute the log losses for this case.
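For instance, keep y_pred1 as it is and make the second model more confident on its one wrong prediction (the exact numbers in row 4 are just one possible choice, consistent with the outputs below):

y_pred2 = np.array(
    [[0.2, 0.8],    # predict 1, correct
     [0.6, 0.4],    # predict 0, correct
     [0.7, 0.3],    # predict 0, correct
     [0.85, 0.15],  # predict 0, wrong, and more certain than before (0.85 vs 0.65)
     [0.25, 0.75]]) # predict 1, correct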

log_loss(y_true, y_pred1)  # 0.4856
log_loss(y_true, y_pred2)  # 0.6550

The log loss of the second model is higher because it assigned a higher probability to a wrong prediction.


Log loss takes the predicted probabilities into account. It not only evaluates performance based on which predictions are correct but also penalizes wrong predictions more heavily the more confident they were.

This is why log loss is a robust and thorough evaluation metric, and why it is so widely used in the field of machine learning.

Thank you for reading. Please let me know if you have any feedback.

