But what is Entropy?

Kapil Sachdeva
Towards Data Science
12 min read · Jul 13, 2020


Source: https://pixabay.com/users/tumisu-148124/

This write-up re-introduces the concept of entropy from different perspectives with a focus on its importance in machine learning, probabilistic programming, and information theory.

Here is how the dictionaries define it, as per a quick Google search:

Source: Screenshot of Google Search

Based on this result, you can see that there are two core ideas here, and at first the connection between them is not obvious:

  • Entropy is the missing (or required) energy to do work as per thermodynamics
  • Entropy is a measure of disorder or randomness (uncertainty)

So what is it: missing energy, a measure, or both? Let me provide some perspectives that will hopefully help you make peace with these definitions.

Shit Happens!

Rephrasing this obnoxious title into something a bit more acceptable

Anything that can go wrong, will go wrong — Murphy’s Law

We have all accepted this law because we observe and experience it all the time, and the culprit behind it is none other than the topic of this write-up. Yup, you got it, it's Entropy!

So now I have confused you more — entropy is not only the missing energy and the measure of disorder but it is also responsible for the disorder. Great!

We cannot seem to make up our minds as far as the definition is concerned. The truth, however, is that all three of the above-mentioned perspectives are correct in the appropriate context. To understand these contexts, let's first check out disorder and its relation to entropy.

Disorder is the dominating force

I explain this with the help of examples from an article by James Clear (Author of Atomic Habits).

Source: Left Image (https://pixabay.com/illustrations/puzzle-puzzle-piece-puzzles-3303412/) Right Image (Photo by James Lee on Unsplash) + annotated by Author

Theoretically, both of these are possible, but the odds of them happening are astronomically small. Ok, fine, call it impossible 🤐! The main message here is the following:

There are always far more disorderly variations than orderly ones!

and borrowing the wisdom of the great Steven Pinker:

Steven Pinker — Johnstone Family Professor in the Department of Psychology at Harvard University (Image created by Author)

How do we fight back the tide of entropy?

By putting in the necessary effort. Effort implies that we ought to spend energy to fight the disorder created by entropy.

In thermodynamics, this perspective of entropy is dominant i.e. you inject the necessary (missing) energy into your system to bring it to equilibrium.

Since this is a machine learning journal, I will not dwell on this aspect (i.e., statistical thermodynamics); instead, I am providing a link to a fantastic article that explains it in a way I never could.

What you want to pay attention to is what the formula for entropy as a measure of disorder looks like, as shown on Boltzmann's tombstone.

Source: https://www.atlasobscura.com/places/boltzmanns-grave + annotated by Author
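For reference, the formula engraved there is Boltzmann's entropy formula, where S is the entropy, k is Boltzmann's constant, and W is the number of microscopic arrangements (microstates) consistent with the macroscopic state. Keep the shape of this expression in mind; it will reappear shortly.

$$ S = k \log W $$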

Next, let’s explore entropy from the perspective of information which is what machine learning intends to extract from the data.

Entropy is a measure of information

If you are thinking — earlier he said entropy is a measure of disorder or randomness (uncertainty) and now it has been morphed into a measure of information — then this means you are paying attention. Good Job! :)

Information and uncertainty are indeed related to each other. Be patient, you will see it soon.

The word “information” means knowledge that one does not have. It ain’t information if you already know it!

Source: Screenshot of Google Search

Now, information can also be seen as "surprise", albeit the amount by which you are surprised will vary. For example, if I tell you that the Sun will rise tomorrow, you would say: meh! No surprise, i.e., there is no information in this statement. But if I tell you that the world will end tomorrow, you would be very surprised (… and at least very sad, I hope! 😩).

The benefit of thinking in terms of surprise is that with "information" we tend to think in binary terms: either I have the information or I don't. "Surprise", on the other hand, brings in a notion of degree. Your surprise is inversely proportional to the probability (chance) of an event happening. The rarer the event, the more surprised you are! And what is probability? … it is a measure of uncertainty!

By the grace of Aristotle's logic, you can appreciate that we have established a relationship between information (surprise) and uncertainty (probability). Next, let's try to formulate it mathematically.

Source: Author

There is something bizarre about the above mathematical formulation. I do not know about you but I sure am not happy!

When I am given information (to surprise me… duh!) about two independent events (😲 & 🤔), the total surprise should be additive (😲 + 🤔) in nature and not multiplicative (😲 x 🤔).

Let’s fix this.

Source: Author

So, the surprise of an event is the logarithm of the inverse of its probability (equivalently, the negative logarithm of the probability). The logarithm is what brings the additive aspect of surprise home for us: our common sense says that the total surprise from independent events should be a sum, not a product, and taking the logarithm turns the product of the event probabilities into exactly such a sum.
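Here is a quick numerical check of that additivity, a minimal sketch with made-up probabilities for two independent events:

```python
import math

def surprise(p):
    """Self-information (surprise) of an event with probability p."""
    return -math.log(p)

# Made-up probabilities for two independent events
p_a, p_b = 0.5, 0.1

# Surprise of both events happening together (joint probability = p_a * p_b)
print(surprise(p_a * p_b))            # 2.9957...
# Sum of the individual surprises gives the same number
print(surprise(p_a) + surprise(p_b))  # 2.9957...
```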

The next quest for us is to formulate “How surprised I am going to be in the long run ?”

The reason we are interested in this quest (keyword: "long run") is that we have brought Random Variables (and hence uncertainty) into the mix, and therefore we need a measure of the central tendency of our surprise.

Some time back I wrote an article called “But what is a Random Variable?”. It may be a good idea to check it out if you are not comfortable with Random Variables & Distributions.

Expected value (or central tendency) of a Random Variable

When we deal with regular numbers, we often summarize them using statistics such as the mean, mode, and median. Let's focus only on the mean for a minute. The mean (or average) is an indicator of the central tendency of your numbers, and when you compute it you give equal weightage to all of them. The math is simple: add up all the numbers and divide the sum by the count of numbers.

In the case of Random Variables, we do not have all of the numbers (samples) belonging to the Random Variable available at any given moment. More importantly, Random Variables are all about fair contributions: unlike the traditional averaging of numbers, they do not give equal weightage to all possibilities; each possibility contributes according to how likely it is.

We can resolve this challenge by thinking in terms of "long-run averages", also called the Expected Value of a Random Variable. Here is how you compute the average (sorry… expected value) of a Random Variable.

Source: Author

Every possibility's (x_i) contribution to the central tendency (expected value) is weighted by the probability of its occurrence.
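As a minimal sketch, here is the expected value of a hypothetical fair six-sided die, computed as the probability-weighted sum of its outcomes:

```python
import numpy as np

# A fair six-sided die: each outcome has probability 1/6
outcomes = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.full(6, 1 / 6)

# Expected value: each outcome weighted by its probability of occurrence
expected_value = np.sum(outcomes * probabilities)
print(expected_value)  # 3.5
```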

Expected value (or central tendency) of the Surprise

If you understood the expected value of a Random Variable then you would not be surprised here (pun intended 😎 !)

The average or expected surprise should likewise be the probability-weighted sum of all the individual surprises!

Applying what we have learned earlier:

Source: Author

As you can see, after simplifying the various steps in the above illustration, we end up with the Expected Surprise looking exactly like the formula for entropy from thermodynamics. See the image of Boltzmann's tombstone with my annotations. I seriously hope his ghost is not going to haunt me for this!
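To recap the derivation from the illustration in one line: the expected surprise of a Random Variable X with outcomes x_i is

$$ H(X) = \mathbb{E}\big[-\log p(X)\big] = -\sum_i p(x_i)\, \log p(x_i) $$

which is exactly the quantity known as entropy.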

This formulation of the expected value of information (surprise) was done by Claude Shannon in 1948 in his seminal paper, A Mathematical Theory of Communication.

Claude Shannon: The Father of Information Theory (Image Source: Wikipedia)

Claude Shannon's quest was to transmit and receive information using the fewest bits possible. The formula I derived earlier with the help of the notion of surprise can also tell you the minimum number of bits needed to encode a message; to do this, you use log base 2. Let's do a few examples as well. These examples are taken from the book Elements of Information Theory and solved using TensorFlow Probability.

Source: Author
Source: Author
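If you want to reproduce this style of calculation yourself, here is a minimal, self-contained sketch in the same spirit (the two coins below are my own illustrative choices, not necessarily the book's examples). TensorFlow Probability reports entropy in nats, so we divide by log 2 to get bits:

```python
import numpy as np
import tensorflow_probability as tfp

tfd = tfp.distributions

def entropy_in_bits(dist):
    """TFP reports entropy in nats; convert to bits by dividing by ln(2)."""
    return dist.entropy().numpy() / np.log(2)

# A fair coin: two equally likely outcomes
fair_coin = tfd.Bernoulli(probs=0.5)
# A biased coin that lands heads 90% of the time
biased_coin = tfd.Bernoulli(probs=0.9)

print(entropy_in_bits(fair_coin))    # 1.0 bit  (maximum uncertainty for a coin)
print(entropy_in_bits(biased_coin))  # ~0.47 bits (less surprising on average)
```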

It is called entropy in the context of information theory because the measure (formula) looks just like the formula from thermodynamics. It is called Shannon's Entropy, or Information Entropy, and … really, just Entropy!

Shannon’s work also laid the foundation for the compression of files where the core idea is to use fewer bits to represent information without distorting the original message.

Using Entropy to choose Probability Distributions

First, the customary mention of Bayes' Theorem.

Source: Image From Wikipedia + Annotations by Author

Have you ever wondered how one should choose the distributions for the prior and the likelihood? A lot of statistics is conventionally done using the Normal (Gaussian) distribution, but very often the conventional choices are not the best choices.

Ignorance is preferable to error and he is less remote from truth who believes nothing than he who believes what is wrong — Thomas Jefferson

The wisdom of Thomas Jefferson could be our guiding force in making this choice. Translating it into our problem domain would look something like this:

The distribution that has the largest spread of probabilities is the one with the largest (maximum) entropy. This also means that it is the least informative and most conservative choice you can make. This is the core idea behind what is known as THE PRINCIPLE OF MAXIMUM ENTROPY. Following it helps us avoid introducing additional bias or assumptions into our calculations.

One rationale often cited for the ubiquitous choice of the Normal distribution is the Central Limit Theorem. The principle of maximum entropy can also be used to justify its usage: it can be proved analytically that, among all real-valued distributions with a given variance, the Normal distribution is the one with maximal entropy. If you are interested in the rigor of the proof, I recommend going through these notes: https://mtlsites.mit.edu/Courses/6.050/2003/notes/chapter10.pdf
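Here is a small numerical illustration of that claim, using TensorFlow Probability and arbitrarily chosen parameters: a Uniform distribution and a Normal distribution with the same variance, where the Normal comes out with the larger (differential) entropy.

```python
import numpy as np
import tensorflow_probability as tfp

tfd = tfp.distributions

# A Uniform distribution on [0, 1]
uniform = tfd.Uniform(low=0.0, high=1.0)

# A Normal distribution with the same variance, (1 - 0)^2 / 12
sigma = float(np.sqrt(1.0 / 12.0))
normal = tfd.Normal(loc=0.5, scale=sigma)

# Differential entropy in nats; the Normal has the larger entropy
print(uniform.entropy().numpy())  # 0.0
print(normal.entropy().numpy())   # ~0.18
```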

Relative Entropy (KL Divergence)

So far we have used entropy as an indicator of the informativeness of a probability distribution. The next question we should ask ourselves is: could we use it to compare two distributions?

Comparing distributions can be seen as measuring a distance (in some space). For distributions, however, this distance is not symmetric, i.e., given distributions p and q, the distance from p to q is not in general the same as the distance from q to p (unless, of course, they are the same distribution, in which case it is zero). This is why we use another term, divergence, to describe the dissimilarity.

There are a few different types of divergences; the most widely used and best known is the one created by Solomon Kullback and Richard Leibler to measure the relative entropy between distributions.

Solomon Kullback & Richard Leibler inventors of KL Divergence (Images from Wikipedia)

Clearly, entropy has a role to play in calculating this divergence but you have to wait a bit to see it.

Instead of throwing the formula for KL divergence at you, let's develop an intuition about its origin, and for that we take help from the concept of likelihood.

Pretend that by performing an experiment (e.g., throwing a die thousands of times) you have observed the true distribution; let's call it p. Now let's say we have another candidate distribution (model), call it q, that may also be suitable for describing our Random Variable. One way to compare these two distributions (models) is by looking at the ratio of their likelihoods. This is a form of hypothesis testing called the likelihood-ratio test (see Wikipedia; it is explained really well there). In practice, we take the natural log of the likelihood ratio.

We can take this log-likelihood ratio and weight it by the occurrence of data points under p, which gives us the expected value of the difference (distance) between the two distributions. Time to see it mathematically:

Source: Author

As you can see, our entropy formula appears again here, and this is why KL Divergence is also called Relative Entropy.

We can also understand KL Divergence with the help of surprise. The question you essentially ask is:

For a given Random Variable X, if you select distribution q instead of the reference distribution p, what is the relative change in surprise?
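Here is a minimal numerical sketch of this idea with two made-up categorical distributions; note how the divergence changes when you swap the roles of p and q:

```python
import tensorflow_probability as tfp

tfd = tfp.distributions

# Two made-up distributions over the same three outcomes
p = tfd.Categorical(probs=[0.9, 0.05, 0.05])   # "true" distribution
q = tfd.Categorical(probs=[1/3, 1/3, 1/3])     # candidate model

# KL(p || q): the log-likelihood ratio weighted by p
print(tfd.kl_divergence(p, q).numpy())  # ~0.70 nats
# Not symmetric: swapping the arguments gives a different number
print(tfd.kl_divergence(q, p).numpy())  # ~0.93 nats
```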

Approximating a distribution using a reference distribution

One of the main areas of machine learning where you find KL Divergence applied is variational inference, where it serves as a loss function.

Variational inference approximates an intractable posterior distribution (from Bayes' Theorem) with another, tractable distribution for which we have an analytical form available. As your neural network learns to minimize the KL Divergence loss, the relative entropy between the reference (p) and the approximating (q) distributions shrinks.
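As a toy sketch of this idea (not a full variational inference setup: the target here is deliberately a simple Normal so that the KL divergence has a closed form, and the parameter names are my own), we can fit the location and scale of an approximating Normal q by minimizing KL(q || p) with gradient descent:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# Stand-in for an intractable posterior: a simple Normal we pretend not to know
p = tfd.Normal(loc=3.0, scale=2.0)

# Trainable parameters of the approximating distribution q
loc = tf.Variable(0.0)
raw_scale = tf.Variable(0.0)  # passed through softplus to keep the scale positive

optimizer = tf.keras.optimizers.Adam(learning_rate=0.05)

for step in range(500):
    with tf.GradientTape() as tape:
        q = tfd.Normal(loc=loc, scale=tf.nn.softplus(raw_scale))
        loss = tfd.kl_divergence(q, p)  # KL(q || p), the usual VI direction
    grads = tape.gradient(loss, [loc, raw_scale])
    optimizer.apply_gradients(zip(grads, [loc, raw_scale]))

# The fitted parameters approach the target's loc=3.0 and scale=2.0
print(loc.numpy(), tf.nn.softplus(raw_scale).numpy())
```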

Cross-Entropy

I want to cover this aspect of entropy as well because it is widely used in machine learning especially for classification tasks. In some ways, we have already seen it hiding inside the KL divergence.

Source: Author

So H(p) is the entropy of a distribution p (its expected self-information), and H(p, q) is the cross-entropy between distributions p and q.

There is another way to read it and, more importantly, to appreciate the difference between Relative Entropy (KL Divergence) and Cross-Entropy: through Shannon's usage of entropy as the number of bits required to encode information.

Let's say that, for encoding a message, we want to use the candidate (target) distribution q, because of its analytical form, rather than the true distribution p (we discussed this in the earlier section on variational inference). However, the reality is that our approximation is not perfect. The cross-entropy is then the average number of bits needed to represent the information when encoding with q, while the KL divergence is the average number of extra bits needed because we encoded with q instead of p.

Below is an example code snippet showing the cross-entropy and KL divergence relationship.

Source: Author
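In case the snippet above does not render for you, here is a minimal sketch of the same relationship, H(p, q) = H(p) + KL(p || q), with arbitrary illustrative distributions:

```python
import numpy as np
import tensorflow_probability as tfp

tfd = tfp.distributions

# Arbitrary illustrative distributions over the same three outcomes
p_probs = np.array([0.7, 0.2, 0.1])
q_probs = np.array([0.4, 0.4, 0.2])

p = tfd.Categorical(probs=p_probs)
q = tfd.Categorical(probs=q_probs)

# Cross-entropy computed directly from its definition: -sum_i p_i * log(q_i)
cross_entropy = -np.sum(p_probs * np.log(q_probs))

# The decomposition H(p, q) = H(p) + KL(p || q)
decomposed = p.entropy().numpy() + tfd.kl_divergence(p, q).numpy()

print(cross_entropy)  # ~0.99 nats
print(decomposed)     # same value
```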

Concluding Remarks

Entropy can intimidate you because of its many different formulations and its widespread use across different areas of science, but as we saw in this article, these formulations are all connected to each other.

Key points to remember:

  • Entropy is a measure of information
  • Information is surprise
  • Entropy helps you choose an appropriate distribution (given constraints) for your domain problem
  • Approximating one distribution using another relies on the relative entropy between these distributions

If you have questions and require more clarifications please write them in the comments and I will be happy to try to answer them and update the article if needed.

References & Further Reading

[1] Entropy: Why Life Always Seems to Get More Complicated, by James Clear

[2] Blog post on KL Divergence from Will Kurt https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

[3] Satanic Science by M. Freiberger at https://plus.maths.org/content/satanic-science

[4] Elements of Information Theory — https://www.amazon.com/Elements-Information-Theory-Telecommunications-Processing/dp/0471241954

[5] Likelihood-ratio test — https://en.wikipedia.org/wiki/Likelihood-ratio_test
