
It is common practice to use cross-entropy in the loss function when constructing a Generative Adversarial Network [1], even though the original formulation suggests the use of KL-divergence. This often creates confusion for people new to the field. In this article we go through the concepts of entropy, cross-entropy and Kullback-Leibler (KL) divergence [2] and see why minimizing cross-entropy in place of KL-divergence leads to the same result.
The concepts of entropy and KL-divergence come into play when we have more than one probability distribution and would like to compare how they fare against each other. In particular, we would like some basis for deciding why minimizing cross-entropy instead of KL-divergence results in the same output. Let us take two probability distributions, p and q, each obtained by sampling from a normal distribution. As shown in Figure 1, the two distributions are different; what they share is the fact that both are sampled from a normal distribution.
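To make this concrete, the snippet below is a minimal sketch of how such a pair of distributions can be constructed: we draw samples from two normal distributions and discretize them into normalized histograms. The means, standard deviations and bin settings here are illustrative choices, not necessarily the ones behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from two different normal distributions
# (illustrative parameters, not necessarily those behind Figure 1).
samples_p = rng.normal(loc=0.0, scale=1.0, size=10_000)
samples_q = rng.normal(loc=1.0, scale=1.5, size=10_000)

# Discretize both sample sets over a common set of bins and normalize
# the histograms so that p and q are valid probability distributions.
bins = np.linspace(-6.0, 6.0, 61)
p, _ = np.histogram(samples_p, bins=bins)
q, _ = np.histogram(samples_q, bins=bins)
p = p / p.sum()
q = q / q.sum()
```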
1. Entropy
Entropy is a measure of the uncertainty of a system. Intuitively, it is the amount of information needed to remove the uncertainty from the system. The entropy of a probability distribution p over the states x of a system can be computed as follows:
$$H(p) = -\sum_{x} p(x)\log p(x)$$
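As a quick illustration (using the natural logarithm, as NumPy's log does), a fair coin with p(heads) = p(tails) = 0.5 has

$$H(p) = -(0.5\log 0.5 + 0.5\log 0.5) = \log 2 \approx 0.693$$

which is the maximum entropy a two-state system can have.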
2. Cross-Entropy
The term cross-entropy refers to the average amount of information needed to identify an event drawn from the distribution p when we encode it using the distribution q instead. The cross-entropy of the distributions p and q can be formulated as follows:
$$H(p, q) = -\sum_{x} p(x)\log q(x)$$
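Continuing the coin illustration, if p is the fair coin above and q is a biased coin with q(heads) = 0.9 and q(tails) = 0.1, then

$$H(p, q) = -(0.5\log 0.9 + 0.5\log 0.1) \approx 1.204$$

which is larger than H(p) ≈ 0.693: encoding events from p with a code built for q always costs at least as much information as using p itself.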
3. KL-Divergence
The KL-divergence between two probability distributions measures how much one distribution differs from the other (it behaves like a distance, although it is not symmetric and therefore not a true metric). The KL-divergence for the probability distributions p and q is given by the following equation:
$$D_{KL}(p \,\|\, q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)} = \sum_{x} p(x)\log p(x) - \sum_{x} p(x)\log q(x)$$
Expanding the logarithm splits the KL-divergence into two terms: the first is the negative of the entropy of the distribution p, and the second is the cross-entropy of p and q (the expectation of −log q under p). In most real-world applications, p is the actual data/measurement while q is the hypothetical (model) distribution. In the case of GANs, p is the probability distribution of real images while q is the probability distribution of the generated (fake) images.
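With the coin example from above, this gives

$$D_{KL}(p \,\|\, q) = 0.5\log\frac{0.5}{0.9} + 0.5\log\frac{0.5}{0.1} = H(p, q) - H(p) \approx 1.204 - 0.693 \approx 0.511$$

exactly the cross-entropy minus the entropy computed earlier.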
4. Verification
Now let us verify that minimizing the KL-divergence is indeed the same as minimizing the cross-entropy for the distributions p and q. We compute the entropy, cross-entropy and KL-divergence in Python:
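The full implementation is in the repository linked at the end of the article; the snippet below is a minimal sketch of the computation on the discretized distributions p and q from the first snippet. The helper names and the small epsilon used to avoid taking the log of empty bins are illustrative choices.

```python
import numpy as np

def entropy(p, eps=1e-12):
    # H(p) = -sum_x p(x) log p(x); eps guards against log(0) in empty bins
    return -np.sum(p * np.log(p + eps))

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x)
    return -np.sum(p * np.log(q + eps))

def kl_divergence(p, q, eps=1e-12):
    # D_KL(p || q) = sum_x p(x) log(p(x) / q(x))
    return np.sum(p * np.log((p + eps) / (q + eps)))

# p and q are the normalized histograms built in the first snippet
print(f"H(p)       = {entropy(p):.4f}")
print(f"H(p, q)    = {cross_entropy(p, q):.4f}")
print(f"D_KL(p||q) = {kl_divergence(p, q):.4f}")
```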


And then we compare the two quantities as follows:
$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$
The second term on the right-hand side, the entropy of the distribution p, can be considered a constant. Therefore, minimizing the cross-entropy in place of the KL-divergence results in the same optimum, and the two objectives are equivalent up to this constant.
In other words, we aim to reach the uncertainty level of the distribution p upon termination of the optimization, and since machine learning optimizations are performed on a fixed dataset that is not expected to change during the experiment, the entropy of p remains constant.
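As a quick numerical check (continuing with the helper functions and the histograms p and q defined in the earlier snippets), the identity holds up to floating-point error:

```python
# D_KL(p || q) should equal H(p, q) - H(p) up to floating-point error,
# since the same eps smoothing is used in all three helper functions.
assert np.isclose(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
print("Verified: D_KL(p || q) == H(p, q) - H(p)")
```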
5. Conclusions
In this article, we went through the concepts of entropy, cross-entropy and KL-divergence, and then answered why the latter two are often used interchangeably in deep learning applications. We also implemented and verified the concepts in Python. For the full code, please refer to the GitHub repository: https://github.com/azad-academy/kl_cross_entropy.git
References:
[1] Goodfellow, I., et al., "Generative Adversarial Nets," Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
[2] https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence