An Intuitive Explanation of Cross Entropy

Lili Jiang
Towards Data Science
5 min read · Jan 18, 2019


First, we need to understand the concept of entropy (largely borrowing from Khan Academy’s excellent explanation).

Let’s play games.

Game 1:

I will draw a coin from a bag of coins: a blue coin, a red coin, a green coin, and an orange coin. Your goal is to guess which color it is with the fewest yes/no questions.

One of the best strategies is to cut the possibilities in half with each question: for example, first ask "Is it blue or red?", and then one more question pins down the exact color. Each coin has a 1/4 probability of being chosen and takes 2 questions to guess right, so the expected number of questions to guess the coin is 2.

Game 2:

Now, I will draw a coin from a different bag of coins: 1/2 of them are blue, 1/4 are red, 1/8 are green, and 1/8 are orange. The previous strategy is no longer the best; because there is a good chance of drawing a blue coin, we should prioritize guessing the most likely outcome. The optimal strategy now asks "Is it blue?" first, then "Is it red?", then "Is it green?":

1/2 of the time it is blue, and it takes 1 question to guess it. 1/4 of the time it is red, and it takes 2 questions to guess it. By this logic, the expected number of questions to guess a coin is 1/2 × 1 (blue) + 1/4 × 2 (red) + 1/8 × 3 (green) + 1/8 × 3 (orange) = 1.75 questions.
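To make this concrete, here is a small Python sketch (the variable names and simulation setup are mine, not from the original post) that simulates the Game 2 strategy and checks that the average number of questions comes out around 1.75:

```python
import random

# A rough simulation of the Game 2 strategy described above:
# ask "Is it blue?", then "Is it red?", then "Is it green?".
# The colors and probabilities are from the post; everything else is illustrative.
COLORS = ["blue", "red", "green", "orange"]
PROBS = [1/2, 1/4, 1/8, 1/8]

# Number of yes/no questions the strategy needs for each color.
QUESTIONS = {"blue": 1, "red": 2, "green": 3, "orange": 3}

draws = random.choices(COLORS, weights=PROBS, k=100_000)
avg_questions = sum(QUESTIONS[c] for c in draws) / len(draws)
print(avg_questions)  # ~1.75, matching 1/2*1 + 1/4*2 + 1/8*3 + 1/8*3
```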

Game 3:

To contrast with an extreme example, I am now drawing from a bag of all blue coins. Trivially, it takes 0 questions to guess the color of my coin. Or… log(1) = 0 questions, if we write it in a funny way. Note that it takes 0 questions only if you know it is a bag of all blue coins.

Interestingly, a pattern emerges: a coin with probability p takes log(1/p) questions to guess right. For example, when p = 1/4, log(4) = 2 questions (all logarithms are base 2 in this post). So in total, the expected number of questions for this game is

\sum_i p_i \log(1/p_i) = 1/2 × log(2) + 1/4 × log(4) + 1/8 × log(8) + 1/8 × log(8) = 1.75.

And that is the expression for entropy. Intuitively, it is the expected number of questions to guess the color under the optimal strategy for this game. The more uncertain the setup is (Game 1 > Game 2 > Game 3), the higher the entropy is.
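As a quick sanity check, here is a minimal sketch of that sum for the Game 2 distribution, using base-2 logarithms as elsewhere in the post (the variable names are my own):

```python
from math import log2

# Entropy of the Game 2 distribution: the sum of p * log2(1/p) over all colors.
p = [1/2, 1/4, 1/8, 1/8]
entropy = sum(p_i * log2(1 / p_i) for p_i in p)
print(entropy)  # 1.75 -- the expected number of questions under the optimal strategy
```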

Now, let’s move on to cross entropy.

For Game 2, suppose we still use the strategy from Game 1. Then 1/8 of the time the coin is orange and it takes 2 questions to get it right; 1/2 of the time it is blue and it still takes 2 questions. On average, it takes 1/8 × 2 + 1/8 × 2 + 1/4 × 2 + 1/2 × 2 = 2 questions to get it right, instead of the 1.75 achieved by the optimal strategy we discussed earlier. So applying the Game 1 strategy to Game 2 is a worse strategy, and 2 is the cross entropy of using this strategy on this setup.

Thus, cross entropy for a given strategy is simply the expected number of questions to guess the color under that strategy. For a given setup, the better the strategy is, the lower the cross entropy is. The lowest cross entropy is that of the optimal strategy, which is just the entropy defined above. This is why in classification problems in machine learning, people try to minimize the cross entropy.

More formally, cross entropy is

H(p, \hat{p}) = \sum_i p_i \log(1/\hat{p_i}),

where p_i is the true probability (for example, 1/8 for orange and green, 1/4 for red, and 1/2 for blue) and \hat{p_i} is the wrongly assumed probability (for example, using the Game 1 strategy, we are assuming \hat{p} = 1/4 for all colors). It may be easy to confuse whether it is p or \hat{p} that goes in the log. Under this explanation, it is easy to remember: the log counts the number of questions under your strategy, so what is inside the log is your assumed probability, \hat{p}.
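The same formula, written out for using the Game 1 strategy on the Game 2 setup, looks like this (a small sketch; the variable names are mine):

```python
from math import log2

# Cross entropy of using the Game 1 strategy (assumed probability 1/4 for
# every color) on the Game 2 setup (the true probabilities).
p_true = [1/2, 1/4, 1/8, 1/8]   # Game 2: blue, red, green, orange
p_hat = [1/4, 1/4, 1/4, 1/4]    # what the Game 1 strategy implicitly assumes

cross_entropy = sum(p * log2(1 / q) for p, q in zip(p_true, p_hat))
print(cross_entropy)  # 2.0 -- higher than the entropy of 1.75
```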

So, in a decision tree, if your tree is not constructed in the best way, you are effectively assuming the wrong probability distribution over the outcomes, and the cross entropy is high.

Cross entropy is not just about decision trees, though; it applies to all classification problems. In a binary classification problem, the likelihood that the label y is 1 is your prediction \hat{y}; the likelihood that y is 0 is 1 - \hat{y}. So we can write the likelihood we want to maximize in a clever way:

\hat{y}^y (1 - \hat{y})^{1 - y}.

When y is 1, the second factor in the product is 1 and we want to maximize \hat{y}; when y is 0, the first factor is 1 and we want to maximize 1 - \hat{y}. This trick only works because y takes on the values 0 and 1 only.

Maximizing the log of the likelihood is equivalent to minimizing

-(y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})),

and this is just the expression for cross entropy, with the true distribution (y, 1 - y) playing the role of p and the prediction (\hat{y}, 1 - \hat{y}) playing the role of \hat{p}. This is why cross entropy is also called log loss: minimizing cross entropy maximizes the log likelihood. As an example, I have three data points in my classification problem; their true labels are 1, 1, 0, and my predictions \hat{y} are 0.8, 0.9, 0.3. Then the average cross entropy is

-1/3 × (log(0.8) + log(0.9) + log(0.7)) ≈ 0.33.

If I predict perfectly, \hat{y} = 1, 1, 0, then the cross entropy is 0. (Well, technically, log is not well defined at 0, but we can ignore that term.) Using the coin-game analogy, each prediction for a sample y is an individual coin-guessing game with the Game 3 setup. The first sample, y = 1, is like drawing a coin from a bag that contains only “y = 1” coins. Now, if you are a ninja guesser (a.k.a. a perfect algorithm), you also know it must be a bag of all ones, so your \hat{p} = 1. The same goes for the second and third samples. Thus, the cross entropy is 0 for a perfect algorithm.
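Here is a minimal sketch of that average log loss calculation, again using base-2 logarithms to stay consistent with the rest of the post (the helper function name is my own):

```python
from math import log2

# Average log loss for the three example points: true labels 1, 1, 0
# and predictions 0.8, 0.9, 0.3.
y_true = [1, 1, 0]
y_pred = [0.8, 0.9, 0.3]

def log_loss(y, y_hat):
    # Cross entropy of a single binary prediction.
    return -(y * log2(y_hat) + (1 - y) * log2(1 - y_hat))

avg = sum(log_loss(y, y_hat) for y, y_hat in zip(y_true, y_pred)) / len(y_true)
print(avg)  # ~0.33
```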

Originally published at www.quora.com.
