
Information Theory: A Gentle Introduction


This is the first in a series of articles about Information Theory and its relationship to data-driven enterprises and strategy. While there will be some equations in each section, they can largely be ignored by those less interested in the details and more interested in the implications.

The early and middle parts of the 20th century saw an explosion in telecommunication technologies and capabilities. With this, engineers and researchers were faced with numerous questions that had never really been rigorously tackled before, such as:

· How do I convert words and sounds into dots and dashes?

· How do I reconstruct electronic signals into words and sounds?

· What about when there is noise added to the electronic signals?

· How little information can I transmit and still reconstruct a message?

· What is information anyway?

In doing so, these researchers accidentally built one of the most powerful sets of tools applied math has given us for approaching and solving a myriad of otherwise ‘not so mathy’ problems. The questions above, and the adjunct questions they spawned, were clearly inspired by telecommunication; however, those very same challenges have analogs in the challenges that both data-driven enterprises and the modern analytics leader face:

· What information do I absolutely always need?

· What information can I ignore and when can I ignore it?

· How complex is my portfolio or strategy and does the data I have justify it?

· Should I trust the data or my experience more? (perhaps surprisingly we will see the answer is frequently ‘don’t trust the data’)

· Where is my project likely to fail?

· What systematic risk do I face in my financial exposure?

· Now that I have the data, what should I do?

Over the course of the first few articles we’ll introduce 5 tools – Entropy, Mutual Information, Huffman Codes, Kolmogorov Complexity and Fisher Information – for assessing these and other challenges in both personal and professional life. And while future posts will focus more on the principles and implications, first we have to introduce the building block of Information Theory: Entropy.

Entropy: a Measure of Randomness

If you talk to a quantitative professional and ask them how long they’ve been at it, the usual, diplomatic answer involves their number of years since finishing undergrad. The reality, though, for most people I have known in Data Science and its adjacent fields is that they have been steeped in the subject matter since at least their mid-teens. With each year the same material, more or less, is presented with an additional layer of precision, care and/or complexity. Usually definitions are additive. Gravity, as an example, is initially the thing that causes apples to fall, before being about the way massive bodies interact with each other, then having something to do with acceleration and distortions on manifolds. Each more complex, each more complete.

Not Entropy though. In my 20 years studying and practicing I have been given a half dozen definitions of Entropy. While each definition had some utility in its time and place, they were not additive and not easy. Here are a few of the worst ones:

Entropy is…

· The disorder or chaos in a system

· The number of degrees of freedom in a system

· A ‘force’ that obscures the initial conditions of a system

The first definition, probably the most common, is flat out wrong. The latter two, while reasonably correct, leave you with no intuition for what entropy actually means.

Here is Information Theory‘s first great triumph: a sensible and intuitive definition for the most misunderstood concept in science. Entropy is the degree of uncertainty – or randomness – in a system.[1] That’s a little better than just calling it disorder or chaos, but it implies something profound. Namely, different systems and different random events have different levels of randomness. Let’s check our intuition with statisticians’ two favorite toys, a coin and a dice. Is a coin toss or a dice roll more random? Let’s check in 2 ways. First we ask the question: if we guessed randomly (or, later, optimally) how often would we be right? For a coin toss it’s 50% of the time and for a dice roll it’s ~16.7% of the time. Second, let’s ask how many possible states (read: outcomes) could occur. For a coin it’s 2 (heads or tails) and for a dice it’s 6 (1, 2, 3, 4, 5, 6). In both cases it looks like the dice is more random than a coin. The notion that different events have different levels of randomness passes the sniff test.

For fair coins and fair dice it’s easy[2] to see which is more random, and our intuition quickly agrees. Now let’s change the comparison. Suppose we still have our fair coin, but now we want to compare it to a weighted dice. This weighted dice comes up ‘1’ 90% of the time, and 2% of the time for each remaining face (2% for 2, 2% for 3, etc.). Now which is more random? Random guessing is still 50% effective for the coin, and 16.7% for the dice[3]. Again, there are two states for the coin and six for the dice. But the loaded dice feels less random than the fair one; after all, it almost always comes up ‘1’. Now we need a more rigorous definition of entropy.

Entropy (H) is dependent on the number of possible states and their probabilities
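
The figure above rendered Equation 1. For reference, the standard Shannon entropy formula it depicts (per Cover & Thomas) is:

$$ H(X) = -\sum_{i} p(x_i)\,\log_2 p(x_i) \tag{1} $$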

This is usually the part where people tune out; big sigmas are intimidating. However, all Equation 1 says is something we already feel to be true: events with more outcomes tend to be more random than those with fewer, and events with more evenhanded outcomes tend to be more random than events heavily skewed to just a few outcomes. Equation 1 gives us a formal way to describe this by stating that for an event X with outcomes i, the entropy H is equal to the negative of the sum, over all outcomes, of each outcome’s probability multiplied by the logarithm base 2 of that probability. Why base 2? Because most complex scenarios can be decomposed into many 2-state events described by: yes or no, is or is not, true or false, equals or does not equal.

Now back to our scenario, the loaded dice vs the coin. Let’s start by calibrating our intuition by comparing the fair dice to the fair coin. The fair dice has an entropy of ~2.58, while our fair coin has an entropy of exactly 1. This certainly agrees with our intuition noted above. And the loaded dice? The loaded dice has an entropy of ~0.7, lower than the fair dice (agrees with our gut) and lower than the coin. Now that we have a better handle on randomness, entropy and how to use it to measure systems, events and processes, we can move on to how this relates to Information.
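
As a quick sanity check, here is a minimal Python sketch that reproduces these numbers from Equation 1 (the probability lists are the only inputs, and they come straight from the scenarios above):

```python
import math

def entropy(probs):
    """Shannon entropy (Equation 1) in bits: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_coin   = [0.5, 0.5]
fair_dice   = [1/6] * 6
loaded_dice = [0.9] + [0.02] * 5   # '1' comes up 90%, each other face 2%

print(entropy(fair_coin))    # 1.0
print(entropy(fair_dice))    # ~2.58
print(entropy(loaded_dice))  # ~0.70
```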

Entropy and Information

As a reminder a coin flip had an entropy of exactly 1. We’ve established that that means it’s less random than a dice roll, but it also means it contains less information. More precisely what it means is that a coin flip can be represented in 1 bit of information.

A bit is a unit of information that can exist in one of two states, a 1 or 0. The reason a coin flip can be represented as one bit is that a coin has two sides that come up equally likely. We map[4] the 0 to tails and 1 to heads and see that our coin flip can be represented in 1 bit.

Mapping a Bit to a Coin Toss

The case of the fair dice is a bit more complicated. It has an entropy of 2.58 and can therefore be written in 2.58 bits. But bits are discrete and there is no such thing as 0.58 bits. We can try to represent it in 2 or 3 bits though. First, with 3:

A 3-bit dice roll has additional capacity; we null out 2 possible states to account for this

With 3 bits we have excess capacity, or, put another way, with 3 bits we have a considerably more complex model of the scenario than we need. If we try to write the dice roll onto 2 bits, though, we get:

A 2-bit dice roll destroys information

We are not able to fully write the fair dice onto 2 bits. This is a lossy representation. We lose information – in this case the outcomes ‘5’ and ‘6’ – when we try to represent the event in too few bits.
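
To make the two encodings concrete, here is one possible bit assignment (the specific codes are my own illustration, not taken from the original figures):

```python
# A 3-bit code covers all six faces and leaves two codes unused ("nulled out").
three_bit = {1: "000", 2: "001", 3: "010", 4: "011", 5: "100", 6: "101"}
unused    = ["110", "111"]   # the excess capacity of the 3-bit model

# A 2-bit code only has four codewords, so two faces cannot be represented.
two_bit = {1: "00", 2: "01", 3: "10", 4: "11"}   # faces 5 and 6 are lost

roll = 5
print(three_bit[roll])    # '100' -- fully recoverable
print(two_bit.get(roll))  # None  -- the outcome cannot be expressed: information destroyed
```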

All three of these – the 1-bit coin flip, the 2-bit dice roll and the 3-bit dice roll – are models of events. As the model’s bits go up, so does its complexity. We can create a rule of thumb here: if we try to represent a system with too complex a model we are inefficient at best, and at worst overly complex and in danger of being very wrong. If we represent a system with too few bits, we are losing important information.

Finally, let’s look at our unfair dice. It had an entropy of ~0.7. Two things should strike you here. First, we would lose about the same amount of information with a model of our unfair dice that states ‘this dice only comes up as 1’ as we would with a 2-bit model of our fair dice. This leads us to another rule of thumb: even though the events are rare, removing outliers can remove huge amounts of information. Second, the unfair dice has much lower entropy and can fit in a much simpler model than the fair dice. Let’s next examine how the increased amount of information we had about the unfair dice led to this result.

Conditional Entropy and the Power of Information

Going back to the definition, entropy is a measure of uncertainty. If you want to understand – or, better yet, exploit – a system, having less of it is generally good. We’ve already examined one bad way to reduce the entropy of an event: make a lossy model that does not accurately depict the scenario. Here we will examine the best way to reduce uncertainty: finding useful information.

First, an aside: a quirk of probability is that if you know nothing about a scenario aside from the possible states, the best guess you can make is that all states are equally likely. This is true even if you are told a scenario is unfair in some non-specific way. For example, if you are told a coin is unfair but not by how much or in which direction, then the average of all the ways it could be unfair is for it to be fair.
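
One way to see this, under the assumption of a uniform (i.e., uninformative) prior over the unknown bias p of the coin, is that the expected bias averages out to fair:

$$ \mathbb{E}[p] = \int_0^1 p \, dp = \tfrac{1}{2} $$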

In the case of our weighted dice we are given a crucial piece of information. Notably, that the dice is heavily weighted to ‘1’ such that the probability of a ‘1’ appearing is 90%. With that we also infer that the probability of a 2, 3, 4, 5, or 6 appearing is 2% each. The result is conditional entropy. More formally:

Conditional Entropy (H(X|Y)) is entropy with more information; all else being equal, it is never greater than uninformed entropy
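
The figures above rendered Equations 2 and 3. The standard forms they depict are:

$$ H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x \mid y)\,\log_2 p(x \mid y) \tag{2} $$

$$ H(X \mid Y) \le H(X) \tag{3} $$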

Just like with Equation 1, these look more intimidating than they are. All Equation 2 says is that conditional entropy replaces the probability of an event occurring with the probability given some new information. In the weighted dice case the new information (Y) is that the probability of ‘1’ appearing is 90%. Equation 3 follows from Equation 2 and notes that conditioning on information never increases the uncertainty of a system. A little less formally: relevant information always reduces the uncertainty of a system.[5]

Let’s look at another scenario. Suppose you live in LA. It rains about 33 days per year there, so on any given day the chance of it raining is ~10%. Without any additional information LA’s precipitation schedule can be effectively modeled in ~170 bits. LA does have a rainy season though: ~90% of rainy days (30 of them) occur between Nov 1 and April 30. By knowing there is a rainy season and its impact on the likelihood of rain, the entropy of the precipitation schedule reduces to ~145 bits, a 15% reduction in uncertainty.
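
A rough sketch of where these figures come from, treating each day as a yes/no rain event. The exact bit counts depend on how the calendar is split and rounded, so this lands near, but not exactly on, the numbers quoted above:

```python
import math

def bernoulli_entropy(p):
    """Entropy in bits of a single yes/no event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Uninformed model: every one of 365 days has a ~10% chance of rain.
uninformed = 365 * bernoulli_entropy(0.10)   # ~171 bits (~170 in the text)

# Informed model: a ~181-day rainy season (Nov 1 - Apr 30) holds 30 of the 33 rainy days.
informed = 181 * bernoulli_entropy(30 / 181) + 184 * bernoulli_entropy(3 / 184)

print(round(uninformed), round(informed))        # ~171 vs ~140 bits
print(round(100 * (1 - informed / uninformed)))  # a roughly 15-20% reduction in uncertainty
```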

Portfolio Analysis: DJI is a Weird Index

As mentioned earlier, this article will be unique in the amount of time spent on toy problems and fundamentals. Getting a good technical and intuitive understanding of entropy will be critical as we encounter more complex topics. Now that we have a good understanding though we can leave behind toy problems and get to more compelling examples. For many of these examples the Dow Jones Industrial Average will serve as a whipping boy.

If you are looking for a system chock-full of uncertainty that people would like to exploit, equity markets are probably a good place to start. Suppose you wanted to build a portfolio to capture returns in the exchange space. The exchange space is dominated by 3 firms: The Intercontinental Exchange (ICE, $68B Market Cap, trading @ ~120), The London Stock Exchange Group (LDNXF, $53B Market Cap, trading @ ~105) and Nasdaq (NDAQ, $26B Market Cap, trading @ ~160). Future returns are uncertain, but we can leverage the principles covered thus far to construct a few sensible strategies for trying.

To begin, let’s note that a portfolio is little more than a model of where returns may be. If we have no other information we should pursue a strategy that maximizes portfolio entropy. This yields a portfolio made up of 1/3 ICE, 1/3 LDNXF and 1/3 NDAQ.

There are a few other sensible ways to construct a portfolio. If we want to replicate the current state of the industry we may pursue a market cap weighting strategy – 44% ICE, 39% LDNXF, 17% NDAQ – and we will cover why in the next article. Similarly if we have special knowledge of future returns we again may pursue a strategy based on our expectation of how returns are distributed across the three. That brings us to the Dow.

The Dow Jones Industrial Average is a price weighted portfolio. This means the elements are weighted by the price they are trading at. In our case of exchanges that would produce a portfolio of 31% ICE, 27% LDNXF, and 42% NDAQ. There is no information intuition for this strategy. It’s just weird.
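
Treating each set of weights as a probability distribution (my own interpretation, in the spirit of the ‘portfolio as a model’ framing above), we can compare the entropy of the three schemes:

```python
import math

def entropy(weights):
    """Entropy in bits of portfolio weights treated as a probability distribution."""
    return -sum(w * math.log2(w) for w in weights if w > 0)

equal_weight = [1/3, 1/3, 1/3]       # maximum-entropy portfolio
market_cap   = [0.44, 0.39, 0.17]    # market-cap weights quoted above
price_weight = [0.31, 0.27, 0.42]    # Dow-style price weighting

print(entropy(equal_weight))  # ~1.58 bits -- the maximum possible for three assets
print(entropy(market_cap))    # ~1.49 bits
print(entropy(price_weight))  # ~1.56 bits
```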

Trading and Benchmarking

One last example before we close out. Instead of constructing a portfolio, suppose we wanted to build a strategy to trade a particular stock or index. For the sake of simplicity our strategy can do one of 2 things on a given day: buy (or hold) the stock, or sell (or do nothing) and hold cash. We need a way to determine if any given strategy we devise is effective. Note that a strategy, just like a portfolio, is nothing more than a model of returns.

Since strategies involve making numerous sequential decisions, each decision should add value above a considerably simpler model. Recall that a simple model is one that can be represented in very few bits. There are 2 strategies here that can be written in 0 bits. The first is pretty trivial: just hold cash. I think anyone would agree that if your strategy can’t beat holding cash it’s a bad strategy; intuition check passed. The second 0-bit strategy is buy and hold, a strategy where you buy the stock on the first trading day and then hold it through the duration. There are a myriad of real-world reasons this is a good benchmark, including transaction costs and tax optimization. Our information intuition, though, is that a more complex model ought to provide improved returns. A 1-bit model should improve on the 0-bit strategy and a 2-bit model should improve even further.
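
A toy sketch of how one might benchmark strategies of increasing complexity against the 0-bit baselines. The synthetic returns and the particular 1-bit rule (in the market for the first half of the year, in cash for the second) are my own illustration, not a recommendation:

```python
import random

random.seed(0)
# Synthetic daily returns, for illustration only.
daily_returns = [random.gauss(0.0003, 0.01) for _ in range(252)]

def cumulative_return(in_market_flags):
    """Growth of $1 when we are only exposed to the market on flagged days."""
    wealth = 1.0
    for r, in_market in zip(daily_returns, in_market_flags):
        if in_market:
            wealth *= 1 + r
    return wealth - 1

# 0-bit strategies: no decisions are encoded at all.
hold_cash    = cumulative_return([False] * 252)   # always 0
buy_and_hold = cumulative_return([True] * 252)

# A hypothetical 1-bit strategy: a single decision splitting the year into two regimes.
one_bit = cumulative_return([True] * 126 + [False] * 126)

print(hold_cash, buy_and_hold, one_bit)
```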

Information Risk

A natural corollary from our assessment of benchmarking is that increased complexity should be rewarded with increased returns. This is our first encounter with a concept I am calling Information Risk. Information Risk is a kind of fragility born from moving too far from a robust, low-information strategy without sufficient relevant data to justify it. It’s related to concepts like overfitting, regularization, and cross-validation but has a slightly different flavor. How to use information in a robust and sensible way that reduces Information Risk will be the driving factor behind this series.


Next Article: Mutual Information as Prediction

[1] Elements of Information Theory, Cover & Thomas

[2] As an aside, from here on out things that are easy will be referred to as easy, things that are hard as trivial, and things that are impossible as ‘exercises left to the reader’.

[3] We will cover optimal guessing in the next article

[4] Mathematician speak for ‘substitute one thing for another’

[5] Again we will give this a more formal treatment when we introduce mutual information

Disclaimer: The views expressed herein are those of the author and not necessarily those of any employer, organization or affiliate thereof.

© 2021 Douglas Hamilton

