Thoughts and Theory

Implications of Information Theory in Machine Learning

A glance at what Information Theory is and how Machine Learning relies on it.

Preeyonuj Boruah
Towards Data Science
10 min read · Apr 12, 2021


Shannon Entropy: Explained
Photo by Janko Ferlic from Pexels

Information. This term gets thrown around in every possible scenario, and rightfully so: the world runs on information. So, what is it? The simplest way to explain it is through an example. Imagine you are grocery shopping and have picked up multiple items. You know the prices of those items; that is raw data for you. Later, when you check out at a counter, the cashier scans those items and gives you the summed cost. To elaborate, the cashier processes the number of items along with the cost of each item and gives you a single number, which you can pay. In a way, the cashier processed the raw data (individual item prices) and gave you information (the final bill amount). So, I can describe information as processed data that is contextually meaningful.

To take it further, here are two messages:
a) I didn’t go to work.
b) I didn’t go to work because I had a doctor’s appointment.
It is evident that the second message contains more information than the first. But how do I define what “more” means? How do I quantify it? Here’s where Information Theory comes in! The next section explores the field of Information Theory and the important concepts within it.

Information Theory

Researchers had pondered how to quantify information since the early 1900s, and in 1948, Claude Shannon published a phenomenal article called “A Mathematical Theory of Communication.” This paper birthed the field of Information Theory. Information Theory, by definition, is the study of the quantification, storage, and communication of information. But it is so much more than that: it has made significant contributions to statistical physics, computer science, economics, and other fields.

The primary focus of Shannon’s paper was the general communication system, as he was working at Bell Labs when he published the article. It established quite a few important concepts, such as information entropy and redundancy. In the present day, its core fundamentals are applied in lossless data compression, lossy data compression, and channel coding.

The techniques used in Information Theory are probabilistic in nature and usually deal with two specific quantities: Entropy and Mutual Information. Let’s take a deeper dive into these two terms.

Shannon Entropy (or just Entropy)

Entropy is the measure of the uncertainty of a random variable, or the amount of information required to describe it. Suppose X is a discrete random variable that can take any value in a set χ, which we assume is finite in this scenario. Its probability distribution is p(x) = Pr{X = x}, x ∈ χ. With this in mind, entropy can be defined as

H(X) = − Σ_{x ∈ χ} p(x) log₂ p(x)
Entropy

The unit of entropy is the bit (since we use a base-2 logarithm). If you observe the formula, entropy depends entirely on the probabilities of the random variable and not on the values of x themselves. Since p(x) ≤ 1, each log₂ p(x) is non-positive, so the negative sign in front of the sum makes entropy always non-negative. If the entropy is 0, there is no new information to be gained. I will demonstrate the formula through an example.

Consider the scenario of a coin toss. There are two possible outcomes, heads or tails, with equal probability. If we quantify that, X ∈ {heads, tails}, with p(heads) = 0.5 and p(tails) = 0.5. If we plug these values into the formula:

H(X) = − (0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit
Calculating entropy for a coin-toss event

Therefore, the entropy is 1 bit, i.e., the outcome of the coin toss can be expressed completely in 1 bit. To put Shannon entropy intuitively, it answers the question “how long does a message need to be, on average, to convey its value completely?” I want to dive a little deeper and discuss the concepts of Joint Entropy, Conditional Entropy and Relative Entropy.
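Before that, here is a minimal Python sketch of the entropy calculation we just did by hand; the helper name shannon_entropy is illustrative, not a library function.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy (in bits) of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Fair coin: two outcomes with probability 0.5 each
print(shannon_entropy([0.5, 0.5]))   # 1.0 bit
# Biased coin: less uncertainty, hence lower entropy
print(shannon_entropy([0.9, 0.1]))   # ~0.469 bits
```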

Joint and Conditional Entropy

Previously, I defined entropy for a single random variable; now I will extend it to a pair of random variables. It is a simple extension, as we can treat the pair (X, Y) as a single vector-valued random variable.

The joint entropy H(X, Y) of a pair of discrete random variables (X, Y) with a joint distribution p(x, y) is defined as

H(X, Y) = − Σ_x Σ_y p(x, y) log₂ p(x, y)
Joint Entropy

This can also be represented in terms of expected value.

H(X, Y) = − E[ log₂ p(X, Y) ]
Joint Entropy (expected-value form)

Similarly, for conditional entropy, H(Y|X) is defined as:

H(Y|X) = Σ_x p(x) H(Y | X = x) = − Σ_x Σ_y p(x, y) log₂ p(y|x)
Conditional Entropy

Intuitively, this is the average of the entropy of Y given X over all possible values of X. Considering the fact that (X, Y) ~ p(x, y), the conditional entropy can also be expressed in terms of expected value.

H(Y|X) = − E[ log₂ p(Y|X) ]
Conditional Entropy (expected-value form)

Let’s try an example to understand conditional entropy better. Consider a study where subjects were asked:
I) whether they smoked, drank, or did neither.
II) whether they had any form of cancer.
Now, I will represent the responses to these questions as two discrete variables drawn from a joint distribution.

Data Table

In the data table, 10 subjects answered the two questions. There are three possible values for the variable Activity (I will call it X). The second column represents whether the subject has or had cancer (variable Y), with two possibilities: Yes or No. Since we are not dealing with continuous variables yet, both variables are kept discrete. Let’s create a probability table that makes the scenario clearer.

Probability Table for the above example

Next, I will calculate the marginal probability p(x) for each possible value of X.

Marginal Probability of X

Based on the probability table, we can plug the values into the conditional entropy formula.

H(Y|X) = 0.8184 bits
Conditional entropy for the above example
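To make the mechanics concrete, here is a small Python sketch that computes H(Y|X) from a joint probability table. The joint probabilities below are hypothetical stand-ins, not the exact values from the article’s table.

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(Y|X) in bits, given a dict {(x, y): p(x, y)} whose values sum to 1."""
    # Marginal p(x)
    p_x = defaultdict(float)
    for (x, _), p in joint.items():
        p_x[x] += p
    # H(Y|X) = - sum over (x, y) of p(x, y) * log2( p(y|x) ), with p(y|x) = p(x, y) / p(x)
    return -sum(p * math.log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)

# Hypothetical joint probabilities for (Activity, Cancer); not the article's exact table
joint = {
    ("smoke", "yes"): 0.2, ("smoke", "no"): 0.1,
    ("drink", "yes"): 0.1, ("drink", "no"): 0.2,
    ("neither", "yes"): 0.1, ("neither", "no"): 0.3,
}
print(round(conditional_entropy(joint), 4))  # ~0.8755 bits for these made-up numbers
```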

Relative Entropy

Relative entropy is somewhat different, as it moves on from random variables to distributions. It is a measure of the “distance” between two distributions. A more intuitive way to put it: relative entropy, or KL-Divergence, denoted by D(p||q), is a measure of the inefficiency of assuming that the distribution is q when the true distribution is p. It can be defined as:

D(p||q) = Σ_x p(x) log₂ ( p(x) / q(x) )
Relative Entropy

Relative entropy is always non-negative and is 0 if and only if p = q. A point to note is that it is not a true distance, since it is not symmetric. Still, it is often thought of as a “distance” between distributions.

Let’s take an example to solidify this concept! Let X = {0, 1} and consider two distributions p and q on X. Let p(0) = 1 − r, p(1) = r, and let q(0) = 1 − s, q(1) = s. Then,

D(p||q) = (1 − r) log₂ ( (1 − r) / (1 − s) ) + r log₂ ( r / s )
Relative Entropy for p||q

I would also like to demonstrate the non-symmetric property, so I will also calculate D(q||p).

D(q||p) = (1 − s) log₂ ( (1 − s) / (1 − r) ) + s log₂ ( s / r )
Relative Entropy for q||p

If r = s, then D(p||q) = D(q||p) = 0. But I will take some different values, for instance, r = 1/2 and s = 1/4. Plugging these in gives D(p||q) ≈ 0.2075 bits and D(q||p) ≈ 0.1887 bits.

As you can see, D(p||q) ≠ D(q||p).
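A quick Python sketch confirms both values and the asymmetry (the kl_divergence helper is illustrative):

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

r, s = 0.5, 0.25
p = [1 - r, r]
q = [1 - s, s]
print(round(kl_divergence(p, q), 4))  # ~0.2075
print(round(kl_divergence(q, p), 4))  # ~0.1887 -> not symmetric
```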

Now that we have discussed the different types of entropy, we can move on to Mutual Information.

Mutual Information

Mutual Information is a measure of the amount of information that one random variable contains about another random variable. Alternatively, it can be defined as the reduction in uncertainty of one variable due to the knowledge of the other. The technical definition for it would be as follows:
Consider two random variables X and Y with a joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y). The mutual information I(X; Y) is the relative entropy between the joint distribution and the product distribution p(x)p(y).

I(X; Y) = Σ_x Σ_y p(x, y) log₂ ( p(x, y) / (p(x) p(y)) ) = D( p(x, y) || p(x) p(y) )
Mutual Information

Mutual Information can also be expressed in terms of entropy. The derivation is quite fun, but I will refrain from including it here, as it might clutter the article.

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y)
Mutual Information in terms of Entropy

From the above equation, we see that mutual information is the reduction in the uncertainty of X due to the knowledge of Y. There is a Venn diagram perfectly describing the relationship.

Relationship between Mutual Information and Entropy

Let’s take an example to understand it better. I will reuse the example relating smoking, drinking, or neither with having cancer from the entropy section. We saw that H(Y|X) = 0.8184 bits. To calculate the Mutual Information, I require one more term, H(Y), which in this case is:

Therefore, the Mutual Information is calculated as:

Calculation of Mutual Information

Alternatively, I can use H(X) and H(X|Y) to calculate the Mutual Information, and it will yield the same result. We can see how little knowing X reduces the uncertainty of variable Y. Let me reframe this example and lead you to how it all makes sense in Machine Learning. Suppose X is a predictor variable and Y is the target variable. The Mutual Information between them is a great precursor for checking how useful the feature will be for prediction; a rough sketch of this idea follows below, after which we will discuss the broader implications of Information Theory in Machine Learning.
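The snippet below estimates I(X; Y) from paired samples using the identity I(X; Y) = H(X) + H(Y) − H(X, Y) shown above. The activity and cancer samples are hypothetical, not the article’s data.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution over the observed values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples (in bits)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Hypothetical samples: activity vs. cancer status (not the article's exact data)
activity = ["smoke", "smoke", "drink", "drink", "neither", "neither", "smoke", "neither"]
cancer   = ["yes",   "yes",   "no",    "yes",   "no",      "no",      "yes",   "no"]
print(round(mutual_information(activity, cancer), 4))  # 0.75 bits for this toy sample
```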

Applications of Information Theory in Machine Learning

There are quite a few applications around, but I will stick with a few popular ones.

Decision Trees

Photo by Min An from Pexels

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A classic algorithm here is ID3, developed by Ross Quinlan. It employs a top-down greedy search and partitions the data into subsets of homogeneous data. The ID3 algorithm decides each partition by measuring the homogeneity of the sample using entropy: if the sample is perfectly homogeneous, the entropy is 0, and if the sample is split uniformly across classes, the entropy is at its maximum. But entropy alone does not directly drive the construction of the tree. The algorithm relies on Information Gain, the decrease in entropy after the dataset is split on an attribute. If you think about it intuitively, you will see that this is exactly the Mutual Information mentioned above: the reduction in uncertainty of one variable given the value of another. In a DT, we calculate the entropy of the target variable, split the dataset on an attribute, and subtract the weighted entropy of the resulting subsets from the previous entropy value. That difference is the Information Gain, and it is Mutual Information in play.
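Here is a minimal sketch of that information-gain computation; the outlook/play toy data and the information_gain helper are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    n = len(labels)
    h_before = entropy(labels)
    h_after = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        h_after += (len(subset) / n) * entropy(subset)  # weighted entropy of each branch
    return h_before - h_after  # equals the mutual information between feature and labels

# Hypothetical toy data: an 'outlook' feature vs. a 'play' label
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "overcast"]
play    = ["no",    "no",    "yes",      "yes",  "no",   "yes"]
print(round(information_gain(outlook, play), 4))  # ~0.667 bits of information gain
```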

Cross-Entropy

Cross-entropy is a concept closely related to relative entropy. Relative entropy measures how the approximating distribution q diverges from the true distribution p at each sample point, whereas cross-entropy compares the true distribution p with the approximating distribution q directly; the two are linked by H(p, q) = H(p) + D(p||q). Cross-entropy is heavily used in deep learning as a loss function that measures the performance of a classification model whose output is a probability between 0 and 1. The cross-entropy loss increases as the predicted probability diverges from the actual label.
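As a sketch, a plain-Python version of the binary cross-entropy loss might look like this; deep learning frameworks typically use the natural logarithm and clip probabilities to avoid log(0), as done here.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy; y_true holds 0/1 labels, y_pred holds predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 1]
print(binary_cross_entropy(y_true, [0.9, 0.1, 0.8, 0.7]))  # mostly correct -> low loss
print(binary_cross_entropy(y_true, [0.2, 0.9, 0.3, 0.4]))  # confidently wrong -> high loss
```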

KL-Divergence

K-L Divergence, or relative entropy, is also embedded in the deep learning literature, specifically in VAEs. A Variational Autoencoder encodes each input as a distribution (typically a Gaussian) over the latent space rather than as a single point. It is desirable for these distributions to be regularized so that they overlap within the latent space; the K-L divergence between each encoded Gaussian and a standard normal prior measures this and is added to the loss function.
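For a Gaussian encoder and a standard normal prior, this K-L term has a well-known closed form. Below is a NumPy sketch, where the mu and log_var arrays stand in for hypothetical encoder outputs.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims and averaged over the batch.
    mu and log_var have shape (batch, latent_dim), as produced by a VAE encoder."""
    return np.mean(-0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1))

# Hypothetical encoder outputs: batch of 4 samples, 2-D latent space
mu = np.array([[0.0, 0.0], [0.5, -0.5], [1.0, 1.0], [0.0, 2.0]])
log_var = np.zeros((4, 2))
print(gaussian_kl(mu, log_var))  # 0 only when the encoded distribution matches the prior
```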

K-L Divergence is also used in t-SNE, a dimensionality reduction technique mainly used to visualize high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data.
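A minimal usage sketch with scikit-learn’s TSNE follows; the digits dataset is just a convenient stand-in for any high-dimensional data.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Project the 64-dimensional digits data down to 2 dimensions.
# t-SNE minimizes the KL divergence between the pairwise-similarity
# distributions in the original space and in the embedding.
X, y = load_digits(return_X_y=True)
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (1797, 2)
```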

Calculating imbalance in target class distribution

Entropy can be used to measure imbalance in the target class distribution. If we treat the target variable as a random variable with two classes, a balanced set (a 50/50 split) has maximum entropy, as we saw in the case of the coin toss. But if the split is skewed and one class has 90% prevalence, there is less information to be gained from observing it, hence lower entropy. Computed over all classes, this gives a single quantified value for checking whether a multiclass target variable is balanced, albeit an average that masks the individual class probabilities.
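As a sketch, one simple way to get such a value is to normalize the entropy of the class distribution by its maximum, log₂ of the number of classes; the balance_score helper below is illustrative, not a standard library function.

```python
import math
from collections import Counter

def balance_score(labels):
    """Entropy of the class distribution divided by its maximum (log2 of the number of classes).
    1.0 means perfectly balanced; values near 0 mean heavily skewed."""
    counts = Counter(labels)
    n = len(labels)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts))

print(balance_score([0, 1] * 50))          # 1.0 -> perfectly balanced
print(balance_score([0] * 90 + [1] * 10))  # ~0.47 -> heavily skewed
```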

Summary

Information Theory is an exciting field and has made major contributions to multiple fields. Machine Learning, being one of them, has not fully exploited everything Information Theory has to offer. I feel there are numerous Information Theory concepts yet to be explored in the context of Machine Learning, and as a Data Scientist, that makes me quite enthusiastic. On that note, I would like to conclude my article.

I hope you liked my article, and in case you really, really liked it, here are my links:

  1. My website: https://preeyonujboruah.tech/
  2. Github Profile: https://github.com/preeyonuj
  3. Previous Medium Article: https://medium.com/analytics-vidhya/aptos-blindness-challenge-part-1-baseline-efficientnet-c7a256daa6e5?sk=d0e445f99daa71d79f0452665f1a59db
  4. LinkedIn Profile: www.linkedin.com/in/pb1807
  5. Twitter Profile: https://twitter.com/preeyonuj
