We use the word information a lot in daily life. After reading a good book, we say that it was informative, and after we watch boring commercials we complain that we wasted our time – we got no useful information. Because high information tends to be useful, people have tried to come up with a more rigorous definition of exactly what information means.
The most popular definition of information is that it measures how surprising something is. The more surprising an event is, the more information it contains. For example, consider two possible events: tomorrow there will be a hurricane, and tomorrow there will not be a hurricane. Almost always the second event will be true, as hurricanes are rare. Therefore, if you’re watching the weather report and the weatherman says "clear skies and sunny tomorrow", you probably won’t pay much attention and just go about your day. If he says there’s going to be a hurricane, you’ll immediately start planning, call all your friends and family, etc. That is the effect of rarity: the rare event carries high information. Our earlier example of good books vs. TV commercials also works with the surprise definition of information. Good books are rare – therefore when we read one it’s a big deal (high information). TV commercials are common, so we don’t care about them (low information).
The good thing about using surprise as the definition of information is that it has a very natural relationship to probability. Namely, surprising things have low probability, which implies that high information corresponds to low probability. Mathematically, this means that I increases with 1/p, where I is the information content and p is the probability of the event. What exactly does that relationship look like, though? We can answer this question with a coin-flipping thought experiment.
Let’s start simple: we have a fair coin and flip it once. There are two possible outcomes (heads/tails), so we need at least 1 bit (recall that a computer bit takes the values 0 or 1) to represent all possible outcomes. The probability of getting heads is ½. Now let’s think about what happens if we flip the coin twice. There are four possible outcomes, and we need at least 2 bits (00, 01, 10, 11) to represent them. The probability of both flips being heads is ¼.
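To see this pattern concretely, here is a small Python sketch (the function name and variable names are just for illustration) that enumerates the outcomes of n fair coin flips and counts how many bits are needed to label them all:

```python
import itertools
import math

def bits_needed(num_flips):
    """Enumerate all outcomes of num_flips fair coin flips and
    return (number of outcomes, bits needed to label each one)."""
    outcomes = list(itertools.product("HT", repeat=num_flips))
    # With 2^n equally likely outcomes, labeling them takes log2(2^n) = n bits.
    return len(outcomes), math.ceil(math.log2(len(outcomes)))

print(bits_needed(1))  # (2, 1): two outcomes, 1 bit
print(bits_needed(2))  # (4, 2): four outcomes, 2 bits
```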
Now we ask the key question: how much information does it take to represent the event of getting 1 head in a row (p = ½)? If we let bits be our unit of information, it takes 1 bit, as we just discussed. What about two heads in a row (p = ¼)? 2 bits. Extending the pattern, three heads in a row (p = 1/8) would take 3 bits, and n heads in a row (p = (1/2)^n) would take n bits. From this pattern we can see a relationship: I (number of bits) = log_2(1/p), or equivalently I = -log_2(p). This is the most widely used equation for information, and it is also called Shannon information in honor of Claude Shannon, who discovered it.
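As a rough sketch of this formula in code (assuming Python, with a hypothetical helper name `information_bits`), we can check it against the coin-flip probabilities:

```python
import math

def information_bits(p):
    """Shannon information content of an event with probability p, in bits.
    Equivalent to -log2(p)."""
    return math.log2(1 / p)

print(information_bits(1 / 2))  # 1.0 bit  (one head)
print(information_bits(1 / 4))  # 2.0 bits (two heads in a row)
print(information_bits(1 / 8))  # 3.0 bits (three heads in a row)
```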
We can use this formula to determine the number of bits we need to represent other things as well. Say we have an alphabet with three letters, A, B, and C, that appear with probability 50%, 25%, and 25% respectively. We want to find the most compact representation of our alphabet using bits. According to our formula, we need log_2(2) = 1 bit for A and log_2(4) = 2 bits for B and C. One possible representation would be A = 0, B = 10, and C = 11 (note that no codeword is a prefix of another, so a string of bits can be decoded unambiguously). Of course, these examples were lucky in the sense that the base-2 log of the inverse probability turned out to be a whole number. Most of the time in practice this won’t be the case. For instance, you might get something that according to the formula requires 3.56 bits. Even in these cases, however, the formula gives you a good guideline for roughly how many bits are optimal. In other words, if you know you need at least 3.56 bits, your representation that uses 20 bits could probably be improved.
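To make the alphabet example concrete, here is a small sketch (the `encode` helper and the exact codewords are just one valid choice for illustration) that encodes a message with the A = 0, B = 10, C = 11 code:

```python
# One possible prefix-free code matching the information content of each letter:
# A appears with probability 1/2 -> 1 bit; B and C with probability 1/4 -> 2 bits each.
CODE = {"A": "0", "B": "10", "C": "11"}

def encode(message):
    """Concatenate the codeword for each letter in the message."""
    return "".join(CODE[letter] for letter in message)

print(encode("AABC"))       # '0' + '0' + '10' + '11' = "001011"
print(len(encode("AABC")))  # 6 bits for 4 letters (1 + 1 + 2 + 2)
```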
So far we’ve come up with a definition of information: how surprising an event is, or more mathematically, the number of bits needed to represent the event. We might also be interested in the average information (or average number of bits) over all possible events. This is called the entropy. Going back to our coin example, consider the two-flip case. There are four possibilities, all equally likely, so the average is just the information of any single outcome, each of which has probability 25%: log_2(4) = 2 bits. In our ABC example, the entropy would be (1/2)(log_2(2)) + (1/4)(log_2(4)) + (1/4)(log_2(4)) = 1.5 bits. Therefore, two flips of a coin give us, on average, more information than one letter from ABC.
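Here is a brief sketch of the entropy computation (again in Python, with an illustrative function name `entropy_bits`) that reproduces both numbers above:

```python
import math

def entropy_bits(probabilities):
    """Average information in bits: the sum of p * log2(1/p) over all events."""
    return sum(p * math.log2(1 / p) for p in probabilities)

print(entropy_bits([1/4, 1/4, 1/4, 1/4]))  # 2.0 bits: two fair coin flips
print(entropy_bits([1/2, 1/4, 1/4]))       # 1.5 bits: the ABC alphabet
```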
From entropy, the next step is to develop the concept of cross-entropy, which is basically entropy (average information) computed when the bit representation for the events is derived from a probability distribution different from the true one. Cross-entropy is commonly used in machine learning algorithms. However, this topic is more complicated and will be explained in another post.
In this article we came up with a mathematical way to represent information and introduced the concept of entropy. Please leave any questions/comments you might have, and if you’re interested keep a lookout for the post on cross-entropy. Thanks for reading!