An Intuitive Guide To Entropy

Understanding why entropy is a measure of chaos

Aayush Agarwal
Towards Data Science


Prerequisite: An Understanding of Expected Value of Discrete Random Variables.


Imagine a coin-tossing scenario with the following outcomes and corresponding probabilities.

|  Outcome  | Probability|
|-----------|------------|
| Heads (H) | 1 |
| Tails (T) | 0 |

These values indicate that the coin always shows up heads (H). Since we know the outcome will always be H, we experience zero “surprise” when we see the actual outcome. It is always H.

More generally, let p be the probability of outcome H. If we use X to denote a random variable that records the outcome of a coin toss, then X takes values in {H, T}, with Pr(X = H) = p and Pr(X = T) = 1 - p.

|     X     |    Pr(X)   |
|-----------|------------|
| H | p |
| T | 1 - p |

How do we now generalize the “surprise”?

The first thing to note is that the surprise is now potentially non-zero, since the outcome is not pre-determined. There are any number of ways we could quantify surprise, but intuition suggests some properties it must exhibit. For instance, when an outcome is unlikely, the surprise upon its occurrence should be high, and when the outcome is quite likely, the surprise should be low. In the extreme case where p = 1.0 and the outcome H is certain, the surprise associated with it must be zero.

For reasons that are outside the scope of this article, we will use log(1/p) to quantify the surprise associated with an outcome of probability p, where log denotes the natural logarithm. This gives zero surprise for guaranteed outcomes with p = 1.0, while outcomes with small values of p produce a large surprise, just as we want.
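To make this concrete, here is a minimal Python sketch of the surprise function (the name surprise is my own choice for illustration, not something from a library):

```python
import math

def surprise(p: float) -> float:
    """Surprise of an outcome with probability p, defined as log(1/p)."""
    return math.log(1 / p)

# A certain outcome carries no surprise; unlikely outcomes carry a lot.
print(surprise(1.0))   # 0.0
print(surprise(0.5))   # ~0.693
print(surprise(0.01))  # ~4.605
```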

Surprise vs. probability | Image by author

Given this formulation, over the course of many coin tosses, we experience a surprise S(H) = log(1/p) whenever the coin shows up heads, and a surprise S(T) = log(1/(1-p)) whenever it shows up tails.

|  X   |  Pr(X) |     S(X)      |
|------|--------|---------------|
| H | p | log(1/p) |
| T | 1 - p | log(1/(1-p)) |

What then is the expected or average surprise?

Average Surprise = Pr(H) * S(H) + Pr(T) * S(T)
                 = p * log(1/p) + (1-p) * log(1/(1-p))
                 = -{ p * log(p) + (1-p) * log(1-p) }

Using an alternative notation where pₕ represents the probability of heads and pₜ the probability of tails, we can rewrite this as:

Average Surprise = - Σ pᵢ * log(pᵢ),          i ∈ {h, t}
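In code, this expected-surprise calculation looks roughly like the following sketch (the function name entropy_coin is mine; zero-probability outcomes are skipped, since by convention they contribute nothing to the sum):

```python
import math

def entropy_coin(p: float) -> float:
    """Expected surprise (entropy) of a coin toss with Pr(H) = p."""
    total = 0.0
    for prob in (p, 1 - p):
        if prob > 0:  # an impossible outcome contributes no surprise
            total += prob * math.log(1 / prob)
    return total
```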

This average or expected surprise, inherent to the random variable’s possible outcomes, is its entropy.

In our coin-tossing example, what values does entropy take for different p?

  1. When p = 1, the outcome is always H. We know exactly what will happen and experience no surprise. The same holds for p = 0 and a guaranteed T outcome. The average surprise, a.k.a. entropy, is therefore 0.
  2. When p = 0.9, the coin shows H most of the time. We experience only a little surprise on seeing H because we expected it. When the outcome is T, we experience a bigger surprise, but it happens infrequently (only 10% of the time). The average surprise or entropy is -(0.9 * log(0.9) + 0.1 * log(0.1)) = 0.325.
  3. When p = 0.1, the coin shows T most of the time. The roles of H and T are reversed, but the average surprise remains the same as in #2, equal to 0.325.
  4. When p = 0.5, it’s difficult to predict the outcome. Neither outcome is more likely than the other, so there is some surprise associated with each, and both happen quite often. The entropy is -(0.5 * log(0.5) + 0.5 * log(0.5)) = 0.693.
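Reusing the entropy_coin sketch from above, a quick loop reproduces these numbers:

```python
for p in (1.0, 0.9, 0.1, 0.5):
    print(f"p = {p}: entropy = {entropy_coin(p):.3f}")

# p = 1.0: entropy = 0.000
# p = 0.9: entropy = 0.325
# p = 0.1: entropy = 0.325
# p = 0.5: entropy = 0.693
```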

To complete the picture, here is a plot of entropy against p.

Entropy vs. p | Image by author.

Now we have a definition for entropy and (hopefully) an intuition for how it varies with p in the coin-toss example.

You might, however, often hear statements such as “entropy is a measure of chaos in a system”. Such statements are sometimes hard to interpret, but we can develop a little intuition for them by generalizing from the coin-toss example.

Imagine that you have a book at home which could be on the bookshelf, in the kitchen, in the bathroom, or under the sofa, each with probability 0.25. Or an omelette pan which could likewise be found anywhere in the house. We intuitively understand such an environment to be a fairly ‘chaotic’ one. Now imagine another scenario where the book is nearly always on the bookshelf, with probability 0.95, and is only occasionally found in the other three places. Such an environment seems ‘orderly’ and less ‘chaotic’. For a few such scenarios, let us look at the entropy of a random variable whose value is the location of the book.

| Book location | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
|---------------|------------|------------|------------|------------|
| Bookshelf     | 0.25       | 0.95       | 1          | 0          |
| Kitchen       | 0.25       | 0.02       | 0          | 0          |
| Bathroom      | 0.25       | 0.02       | 0          | 0          |
| Under sofa    | 0.25       | 0.01       | 0          | 1          |
| **Entropy**   | 1.39       | 0.25       | 0          | 0          |
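The same calculation extends to any discrete distribution. Here is a small Python sketch (the helper name entropy and the scenario lists are just illustrative) that reproduces the entropy row of the table:

```python
import math

def entropy(probs):
    """Expected surprise of a discrete distribution.

    Zero-probability outcomes are skipped; they contribute nothing by convention.
    """
    return sum(p * math.log(1 / p) for p in probs if p > 0)

scenarios = {
    "Scenario 1": [0.25, 0.25, 0.25, 0.25],
    "Scenario 2": [0.95, 0.02, 0.02, 0.01],
    "Scenario 3": [1, 0, 0, 0],
    "Scenario 4": [0, 0, 0, 1],
}

for name, probs in scenarios.items():
    print(f"{name}: entropy = {entropy(probs):.2f}")

# Scenario 1: entropy = 1.39
# Scenario 2: entropy = 0.25
# Scenario 3: entropy = 0.00
# Scenario 4: entropy = 0.00
```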

Just as in the coin-toss example, the entropy is highest when the probabilities are distributed uniformly and the uncertainty about the location of the book is greatest. This is the most ‘chaotic’ scenario we considered.

If the book is always on the bookshelf, entropy is 0. Similarly, if the book is always under the sofa, entropy is 0, because we know that’s where the book is going to be and hence experience no surprise when we actually find it there. We might have experienced surprise if the book were found somewhere else, but that never happens. The average surprise, therefore, is 0. This is the most ‘ordered’ scenario.

We can see that the entropy values computed for various scenarios do reflect the degree of chaos or order pertaining to the location of the book in each scenario. Hence, entropy can be considered a measure of chaos.

That is all for An Intuitive Guide To Entropy. I hope you found it useful. In my next article I talk about cross-entropy and put it in context for Machine Learning applications. Continue reading with An Intuitive Guide To Cross Entropy.
