
Developing and Explaining Cross-Entropy from Scratch

Read on to understand the intuition behind cross-entropy and why machine learning algorithms try to minimize it.

Photo by Jørgen Håland on Unsplash

Cross-entropy is an important concept. It is commonly used in Machine Learning as a cost function – often our objective is to minimize the cross-entropy. But why are we minimizing the cross-entropy, and what does cross-entropy really mean? Let’s answer these questions.

First, we need an adequate understanding of the concepts of information and entropy. If you want a thorough understanding of these concepts, I wrote a detailed article about them here. To summarize, information is a measure of how surprising (i.e., low-probability) an event is. The lower the probability, the more surprising the event and the higher the information. Information can also be interpreted as the number of bits it takes to represent an event. The formula is I = -log_2(p) bits, where I is information and p is probability. Entropy is the average information over all possible events, so the formula for entropy is E = -∑ p_i * log_2(p_i).
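To make these formulas concrete, here is a minimal Python sketch of the two definitions (the function names are my own, chosen just for illustration):

```python
import math

def information(p):
    """Information of an event with probability p: I = -log2(p) bits."""
    return -math.log2(p)

def entropy(probs):
    """Entropy of a distribution: E = -sum(p_i * log2(p_i)) bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information(0.5))     # 1.0 -- a fair coin flip carries 1 bit
print(entropy([0.5, 0.5]))  # 1.0 -- average information of a fair coin
```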

With this context, we can move on to cross-entropy. The question cross-entropy asks is: what happens to the information/bit representation length when I substitute a different probability distribution in place of the true probability distribution? Let’s look at a specific example to see what this means.

Say we have a fair coin. The true probability distribution is ½ heads, ½ tails. The information of the heads and tails events is the same: I = -log_2(1/2) = 1 bit. The entropy is the average of the information of both events, so it is also 1 bit. Now let’s assume that for some reason we think the coin isn’t fair (even though it actually is). We think the probability distribution is ¼ heads, ¾ tails. From our perspective, heads is rarer than it actually is, and tails is more common. Therefore, we think heads carries more information than it really does, and tails carries less. The exact amounts from our perspective are I_heads = -log_2(1/4) = 2 bits and I_tails = -log_2(3/4) ≈ 0.42 bits.

However, the true probability distribution is still ½ heads, ½ tails. Because of this, the expected information from a flip, from our perspective, will be ½ · I_heads + ½ · I_tails = ½ · 2 + ½ · 0.42 ≈ 1.21 bits. This number is the cross-entropy: the average information calculated by assuming a different probability distribution than the true one. If the true probabilities are denoted p_i and the assumed probabilities are denoted q_i, the formula for cross-entropy is CE = -∑ p_i * log_2(q_i).
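Here is the same calculation as a small code sketch. It reuses the entropy helper from the earlier snippet and adds a cross_entropy function (again, a name of my own) for the formula we just derived:

```python
import math

def cross_entropy(p, q):
    """Cross-entropy -sum(p_i * log2(q_i)) in bits.
    p is the true distribution, q is the assumed one."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist    = [0.5, 0.5]    # the coin really is fair
assumed_dist = [0.25, 0.75]  # what we (wrongly) believe

print(cross_entropy(true_dist, assumed_dist))  # ~1.21 bits
print(entropy(true_dist))                      # 1.0 bit -- the entropy, which is lower
```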

Let’s look at this example from another angle to build more intuition. Instead of flipping a coin, consider an alphabet with just two letters: A and B. My friend and I both use this alphabet, but I like the letter A more, and he likes the letter B more. Let’s say my usage of A to B is 80/20, and my friend’s is 20/80. I want to come up with an efficient binary code to represent this alphabet. Because I use the letter A far more than B, I will represent A with fewer bits and B with more bits. My friend also comes up with a code, but since he uses B more than A, his code uses more bits for A and fewer bits for B.

Now, consider what happens when I try to use my code to represent what he says (his language). Because he says B a lot, I have to use my long representation for B many times, and I don’t get many chances to use my short representation for A. Therefore, my code’s average bit usage on my friend’s language will be much higher than that of his own code (which has a short B and a long A). Similarly, my friend’s code will be very inefficient at representing my language. So what’s the point of this example? The average number of bits my code needs per letter of my friend’s language is the cross-entropy. The average number of bits his own code needs is the entropy. As we see, the cross-entropy is higher than the entropy.
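We can put numbers on this intuition with the helpers defined earlier, treating -log_2(p) as the idealized code length per letter:

```python
mine   = [0.8, 0.2]  # my A/B usage
friend = [0.2, 0.8]  # my friend's A/B usage

# Average bits per letter when my friend's own (optimal) code encodes his language:
print(entropy(friend))              # ~0.72 bits (the entropy)
# Average bits per letter when my code, tuned for 80/20, encodes his language:
print(cross_entropy(friend, mine))  # ~1.92 bits (the cross-entropy)
```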

We can also change the letter usage distributions. For example, let’s say my usage is 60/40 A/B instead of 80/20, and my friend’s is 40/60 instead of 20/80. Now, since our distributions are closer, using my optimal code to represent my friend’s language isn’t as inefficient as before. In other words, the cross-entropy is smaller. You can see this for yourself by plugging the numbers into the cross-entropy formula above.
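Rerunning the same comparison with the closer distributions confirms this:

```python
mine   = [0.6, 0.4]
friend = [0.4, 0.6]

print(entropy(friend))              # ~0.97 bits
print(cross_entropy(friend, mine))  # ~1.09 bits -- only slightly above the entropy now
```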

This suggests an important use of cross-entropy: comparing two probability distributions. The closer two distributions are, the smaller the cross-entropy between them. Thus, if our goal is to match one probability distribution as closely as possible to another, we need to minimize the cross-entropy between them. How is this used? In machine learning, a common pattern is that we have a true distribution P and a model M that outputs another distribution Q. The goal is to find the parameters of M such that the cross-entropy between P and Q is as small as possible. One popular way to do this is to maximize the log-likelihood, which turns out to be exactly the same as minimizing the cross-entropy. You can see the proof here.
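As a small, simplified illustration of that equivalence, here is a toy binary-classification example with made-up labels and predicted probabilities. Note that machine learning libraries typically use the natural logarithm rather than log base 2, which only changes the units, not the minimizer:

```python
import math

# Hypothetical observed labels (the "true" distribution puts all its mass on the
# observed class) and a model's predicted probabilities for class 1.
y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.2, 0.7, 0.6]

# Average cross-entropy loss (in nats):
ce = -sum(y * math.log(q) + (1 - y) * math.log(1 - q)
          for y, q in zip(y_true, y_pred)) / len(y_true)

# Average log-likelihood of the observed labels under the model:
ll = sum(math.log(q) if y == 1 else math.log(1 - q)
         for y, q in zip(y_true, y_pred)) / len(y_true)

print(ce, -ll)  # identical: minimizing cross-entropy == maximizing log-likelihood
```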


In this article we’ve explained what cross-entropy is, explored some examples of cross-entropy to build intuition, and discussed its usage in machine learning. I hope everything was clear, and please leave any questions/comments you might have. If you are interested in topics like this, I plan to write some more posts about important theoretical machine learning concepts, so stay tuned!

