Cross-Entropy, Negative Log-Likelihood, and All That Jazz

Two closely related mathematical formulations widely used in data science, and notes on their implementations in PyTorch

Remy Lau
Towards Data Science


TL;DR

  • Negative log-likelihood minimization is a proxy problem to the problem of maximum likelihood estimation.
  • Cross-entropy and negative log-likelihood are closely related mathematical formulations.
  • The essential part of computing the negative log-likelihood is to “sum up the correct log probabilities.”
  • The PyTorch implementations of CrossEntropyLoss and NLLLoss differ in the inputs they expect: CrossEntropyLoss takes raw prediction values (logits), while NLLLoss takes log probabilities.

Cross-Entropy == Negative Log-Likelihood?

When I first started learning about data science, I was under the impression that cross-entropy and negative log-likelihood were just different names for the same thing. That’s why, when I later started using PyTorch to build my models, I found it quite confusing that CrossEntropyLoss and NLLLoss are two different losses that do not spit out the same values. After more reading and experimenting, I now have a firmer grip on how the two are related as implemented in PyTorch.

In this blog post, I will first go through some of the math behind negative log-likelihood and show you that the idea is pretty straightforward computationally! You simply need to sum up the correct entries that encode log probabilities. Then, I will present a minimal numerical experiment that helped me better understand the differences between CrossEntropyLoss and NLLLoss in PyTorch.

If you only want to know the difference between the two losses, feel free to jump right to the very last section on the Numerical Experiment. Otherwise, let’s first get a…

Deep dive into the math!

Maximum Likelihood Estimation

Let us first consider the case of binary classification. Given a model f parameterized by \theta, the main objective is to find \theta that maximizes the likelihood of observing the data.
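
\hat{y}_i = \sigma\big(f_\theta(x_i)\big)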

where y_hat is the predicted probability of the positive class, and \sigma is some non-linear activation function that maps values from (-inf, inf) to (0, 1). A popular choice of non-linear activation is the sigmoid:
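
\sigma(z) = \frac{1}{1 + e^{-z}}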

Formally, the likelihood is defined as [1]:
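
L(\theta) = \prod_i \hat{y}_i^{\,y_i} \, (1 - \hat{y}_i)^{\,1 - y_i}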

Log-likelihood

Note that raising the predicted probabilities to the powers y_i and (1-y_i) is nothing more than a clever way to say that “we only want to count the prediction values associated with the true labels.”

In other words, to get a feeling for how good our predictions were, we can look at the predicted probabilities assigned to the correct labels. This is better illustrated by taking the log, which turns the product into a summation and results in the more commonly used log-likelihood:
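
\log L(\theta) = \sum_i \big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big]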

Because we are in a binary classification setting, y_i is either zero or one. Hence, for each index i, we are either adding the log of y_hat_i or the log of (1-y_hat_i).

  • y_hat_i: the predicted probability of the ith data point being positive
  • (1-y_hat_i): the predicted probability of the ith data point being negative

Summing up the correct entries (binary case)

The following animation further illustrates this idea of picking the correct entries to sum. It consists of the following steps:

  1. Start with the predicted probabilities of the positive class (y_hat). If we were given raw prediction values, apply the sigmoid to turn them into probabilities.
  2. Compute the probabilities for the negative class (1-y_hat).
  3. Compute the log probabilities.
  4. Sum up the log probabilities associated with the true labels.
The computation of binary negative log-likelihood, image by the author (produced with Manim)

In this example, the log-likelihood turns out to be -6.58.

Note that the final operation of picking out the correct entries in a matrix is also sometimes referred to as masking. The mask is constructed based on the true labels.
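
As a concrete illustration of this masking step, here is a minimal PyTorch sketch; the probabilities and labels below are made up, not the ones from the animation.

```python
import torch

y_hat = torch.tensor([0.8, 0.3, 0.6, 0.1])  # made-up predicted probabilities of the positive class
y = torch.tensor([1, 0, 1, 1])              # made-up true labels

# rows: log probabilities of class 0 (negative) and class 1 (positive)
log_probs = torch.stack([(1 - y_hat).log(), y_hat.log()])

# masking: for each data point i, pick the entry in the row given by its true label y[i]
log_likelihood = log_probs[y, torch.arange(len(y))].sum()
print(log_likelihood)  # the binary log-likelihood of this toy example
```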

Minimizing the Negative Log-Likelihood

Finally, because the logarithmic function is monotonic, maximizing the likelihood is the same as maximizing the log of the likelihood (i.e., the log-likelihood). Since “minimizing a loss” is the more natural framing, we instead take the negative of the log-likelihood and minimize that, resulting in the well-known negative log-likelihood loss:
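
\text{NLL}(\theta) = -\log L(\theta) = -\sum_i \big[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \big]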

To recap, our original goal was to maximize the likelihood of observing the data given some parametric setting \theta. Minimizing the negative log-likelihood is the “same” as that original objective in the sense that both are optimized by the same \theta, since the negative logarithm is strictly decreasing.

Cross-Entropy

In the discrete setting, given two probability distributions p and q, their cross-entropy is defined as
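
H(p, q) = -\sum_{x} p(x) \log q(x)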

Note that the negative log-likelihood above is exactly the cross-entropy between the true label distribution y (one-hot for each data point) and the predicted distribution y_hat, summed over the data points.

This similarity between the two losses caused my initial confusion. Why would PyTorch have two separate functions (CrossEntropyLoss and NLLLoss) if they are the same? As we will see later in a small numerical experiment using PyTorch, the two are indeed very similar. The key difference is that the implementation of CrossEntropyLoss implicitly applies a softmax activation followed by a log transformation (i.e., log-softmax) to its input, while NLLLoss does not.

Generalizing to Multiclass

Before going into the numerical experiment to see how some of the loss functions implemented in PyTorch are related, let’s see how the negative log-likelihood generalizes to the multiclass classification setting.

Keeping the “masking principle” in mind, we can rewrite the log-likelihood as
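
\log L(\theta) = \sum_i \log \hat{y}_i^{(y_i)}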

where the superscript is an index (not a power) indicating that, for each data point, we only look at the log probability associated with its true label. In the binary setting,
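
\hat{y}_i^{(1)} = \hat{y}_i, \qquad \hat{y}_i^{(0)} = 1 - \hat{y}_i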

Recall that y_hat_i is the sigmoid-activated f_\theta(x_i). For simplicity, we refer to f_\theta(x_i) as z_i.

Given the rewritten log-likelihood above, it is tempting to apply it directly to the multiclass setting (with C classes), where y now takes values from 0 up to C-1. This almost works out, except that we need to make sure y_hat_i defines a probability distribution over the classes, namely that 1) each entry is bounded between zero and one, and 2) the entries sum to one. In the binary setting, these two conditions were taken care of by the sigmoid activation and the implicit assumption that “not positive means negative.”

Softmax activation

It turns out that the softmax function is exactly what we are after:
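
\hat{y}_i^{(c)} = \frac{\exp\big(z_i^{(c)}\big)}{\sum_{c'=0}^{C-1} \exp\big(z_i^{(c')}\big)}, \qquad c = 0, \dots, C-1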

In this case, z_i is a vector of dimension C. One can check that this defines a valid probability distribution: each entry is bounded between zero and one, and the entries sum to one.

Furthermore, it is not hard to see that when C = 2, setting z_i^{(0)} (the prediction score for the “negative class”) to zero fully recovers the sigmoid function (try it out!).
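
Working this out, with C = 2 and z_i^{(0)} = 0, the softmax probability of the positive class reduces to

\hat{y}_i^{(1)} = \frac{e^{z_i^{(1)}}}{e^{0} + e^{z_i^{(1)}}} = \frac{1}{1 + e^{-z_i^{(1)}}} = \sigma\big(z_i^{(1)}\big)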

Summing up the correct entries (multiclass case)

Now we are ready to apply the masking strategy to the multiclass classification setting to compute the corresponding negative log-likelihood. Similar to before, the steps include

  1. Start with the raw prediction values z (not yet probabilities).
  2. Transform the values into class probabilities (y_hat) using softmax, then take the log probabilities (log y_hat).
  3. Sum up the log probabilities associated with the true labels.
The computation of multiclass negative log-likelihood, image by the author (produced with Manim)

In this example, the log-likelihood turns out to be -6.91.

Numerical experiment

To understand the difference between CrossEntropyLoss and NLLLoss (and BCELoss, etc.), I devised a small numerical experiment as follows.

In the binary setting, I first generate a random vector (z) of size five from a normal distribution and manually create a label vector (y) of the same shape with entries either zero or one. I then compute the predicted probabilities (y_hat) from z using the sigmoid function, and apply the formula for the negative log-likelihood derived in the earlier section to compute the expected value. Using BCELoss with y_hat as input and BCEWithLogitsLoss with z as input, I observe the same results as computed above.

In the multiclass setting, I generate z2 and y2, and compute yhat2 using the softmax function. This time, NLLLoss with the log probabilities (log of yhat2) as input and CrossEntropyLoss with the raw prediction values (z2) as input yield the same results as the formula derived earlier.
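
Here is a minimal sketch of the experiment; the seed, sizes, and labels below are placeholders of my own choosing rather than the ones in the gist, but in each setting the three printed values agree.

```python
import torch
from torch import nn

torch.manual_seed(0)

# ----- Binary setting -----
z = torch.randn(5)                         # raw prediction values (logits)
y = torch.tensor([0., 1., 1., 0., 1.])     # manually created binary labels
y_hat = torch.sigmoid(z)                   # predicted probabilities of the positive class

# negative log-likelihood from the binary formula derived earlier
# (mean reduction, matching PyTorch's default)
nll_binary = -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()

bce = nn.BCELoss()(y_hat, y)               # expects probabilities
bce_logits = nn.BCEWithLogitsLoss()(z, y)  # expects raw logits
print(nll_binary, bce, bce_logits)

# ----- Multiclass setting (C = 3) -----
z2 = torch.randn(5, 3)                     # raw prediction values
y2 = torch.tensor([0, 2, 1, 1, 0])         # class indices in {0, 1, 2}
yhat2 = torch.softmax(z2, dim=1)           # predicted class probabilities

# negative log-likelihood: mask out the log probability of the true class, then average
nll_multi = -yhat2.log()[torch.arange(5), y2].mean()

nll = nn.NLLLoss()(yhat2.log(), y2)        # expects log probabilities
ce = nn.CrossEntropyLoss()(z2, y2)         # expects raw prediction values
print(nll_multi, nll, ce)
```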

Screenshot of the results from the code snippet, image by the author.

For brevity, I only included a minimal set of comparisons here. Check out the full version of the above GitHub gist for a more comprehensive comparison. There, I have also included a comparison between NLLLoss and BCELoss. Essentially, to use NLLLoss in a binary setting, one needs to expand the prediction values into two classes, as illustrated in the first animation.

When and which loss to use?

As implemented in PyTorch, the loss functions usually take the form Loss(h, y), where h is either the prediction values or some transformed version of them, and y is the label. Considering only the simple cases where h is at most two-dimensional, the small experiment above leads to the following recommendations.

Use BCELoss and BCEWithLogitsLoss when

Both h and y are one-dimensional, and y takes either zero or one.

  • Use BCELoss if h is the probability of a data point being positive.
  • Use BCEWithLogitsLoss if h is the logits, i.e., if you want the sigmoid function to be applied to your raw prediction values to turn them into probabilities.

Use NLLLoss and CrossEntropyLoss when

h is two-dimensional and y is one-dimensional, taking values from zero up to C-1, where C is the number of classes.

  • Use NLLLoss if h encodes log probabilities (it essentially performs the masking step followed by a mean reduction).
  • Use CrossEntropyLoss if h encodes raw prediction values that need to be activated using the softmax function.

What about multilabel classification?

In the multilabel classification setting, a data point can be associated with multiple classes (or none at all), whereas in the multiclass case each data point is associated with exactly one class label. A common strategy here is to treat the problem as C independent binary classification problems, one per class. It turns out that BCELoss and BCEWithLogitsLoss work just fine in this case, so long as the shapes of h and y are consistent (see the sketch below).
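
As a minimal sketch (the shapes and labels below are hypothetical), the only change from the binary case is that the label tensor becomes a multi-hot matrix of the same shape as the predictions.

```python
import torch
from torch import nn

logits = torch.randn(4, 3)            # raw prediction values: 4 data points, 3 classes
labels = torch.tensor([[1., 0., 1.],
                       [0., 0., 0.],
                       [1., 1., 1.],
                       [0., 1., 0.]])  # multi-hot labels, same shape as logits

# one binary classification problem per class, averaged over all entries by default
loss = nn.BCEWithLogitsLoss()(logits, labels)
```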

Conclusion

In summary, we saw that negative log-likelihood minimization is a proxy problem for maximum likelihood estimation. It turns out that the cross-entropy between two probability distributions coincides with the negative log-likelihood. However, as implemented in PyTorch, CrossEntropyLoss expects raw prediction values while NLLLoss expects log probabilities.
