
Entropy and Gini Index

Understanding how these measures help us quantify uncertainty in a dataset

Introduction to Entropy and Gini Index

Can you tell which are the purest and impurest carts? (Source: Image by Author)

Entropy and Gini Index are important machine learning concepts that are particularly helpful in Decision Tree algorithms for determining the quality of a split. The two metrics are calculated differently, but both are ultimately used to quantify the same thing, i.e., uncertainty (or impurity) within a dataset.

The higher the Entropy (or Gini Index), the more random (mixed) the data is.

Let’s build an intuitive picture of impurity in a dataset and understand how these metrics can help measure it. (Impurity, uncertainty, randomness, heterogeneity – all can be used interchangeably in our context, and the goal is ultimately to reduce them to get better clarity.)

What is impurity – explained with an example

Imagine that you go to a supermarket with your friends, Alice and Bob, to buy fruit. Each of you grabs a shopping cart because none of you likes sharing your fruit. Let’s find out what you guys got (looks like you love apples!!):

Image by the Author

These three carts can be seen as three different data distributions. Don’t assume up front that every cart has the same two classes (apples and bananas) – with that assumption, the interpretations that follow would be incorrect. Rather, think of each cart as a different distribution: the first cart is a data distribution where all data points belong to a single class, while the second and third carts are data distributions with two classes.

Looking at the example above, it is easy to identify the carts with the purest and the most impure data distributions (class distributions, to be precise). But in order to have a mathematical quantification of purity in a dataset, so that an algorithm can use it to make decisions, Entropy and Gini Index come to the rescue.

Both of these measures look at the probability of occurrence (or presence) of each class in a dataset. In our example, we have a total of 8 data points (fruits) in each case, so we can compute our class probabilities for each of the carts as follows:

Image by the Author
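
As a minimal Python sketch of this computation (the cart contents below are illustrative – an all-apple cart and a hypothetical 50/50 cart – rather than taken from the figure):

    from collections import Counter

    def class_probabilities(labels):
        """Return the fraction of items belonging to each class."""
        counts = Counter(labels)
        total = sum(counts.values())
        return {label: count / total for label, count in counts.items()}

    your_cart = ["apple"] * 8                    # a cart with a single class
    mixed_cart = ["apple"] * 4 + ["banana"] * 4  # a hypothetical 50/50 cart

    print(class_probabilities(your_cart))   # {'apple': 1.0}
    print(class_probabilities(mixed_cart))  # {'apple': 0.5, 'banana': 0.5}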

Now we are equipped with everything we need to dive into formal definitions of Entropy and Gini Index!

As already discussed, both entropy and gini index are measures of the degree of uncertainty or randomness in data. While they aim to quantify the same fundamental concept, each has its own mathematical formulation and interpretation.

Entropy

Given a labeled dataset where each label comes from a set of n classes, we can compute entropy as follows. Here pᵢ is the probability that a randomly picked element belongs to class i.
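
In standard notation (with the logarithm taken to base 2, as noted later in the recap):

    Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)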

To determine the best split in a decision tree, entropy is used to compute information gain, and the feature contributing to the maximum information gain is selected at a node.
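
As a rough sketch in Python of how this works (the split shown is hypothetical, not one of the carts from our example): entropy is computed from the class counts, and information gain is the drop in entropy produced by a split.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (base 2) of a list of class labels."""
        total = len(labels)
        return -sum(
            (count / total) * math.log2(count / total)
            for count in Counter(labels).values()
        )

    def information_gain(parent, branches):
        """Entropy of the parent minus the size-weighted entropy of the branches."""
        total = len(parent)
        weighted = sum(len(branch) / total * entropy(branch) for branch in branches)
        return entropy(parent) - weighted

    # Hypothetical split of a 4-apple/4-banana dataset into two pure branches
    parent = ["apple"] * 4 + ["banana"] * 4
    print(entropy(parent))                                            # 1.0
    print(information_gain(parent, [["apple"] * 4, ["banana"] * 4]))  # 1.0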

Gini Index

Gini Index attempts to quantify randomness in a dataset by answering this question – what is the probability of incorrectly labeling a randomly picked element, if we label it at random according to the class distribution of the data?

Given a labeled dataset where each label comes from a set of n classes, the formula to calculate gini index is given below. Here, pᵢ is the probability that a randomly picked element belongs to class i.
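
Following that question – pick an element of class i with probability pᵢ, then mislabel it with probability (1 - pᵢ) – the formula in standard notation is:

    Gini = \sum_{i=1}^{n} p_i (1 - p_i)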

This formula is often reframed as follows:
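
    Gini = 1 - \sum_{i=1}^{n} p_i^2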

(Note: the two forms are equivalent because the sum of all class probabilities is 1.)

Gini index is an alternative to information gain that can be used in decision trees to determine the quality of a split. At a given node, it compares the gini index of the data before the split with the weighted sum of the gini indices of the branches after the split, and chooses the split with the highest difference (or gini gain). If this is unclear, don’t worry about it for now since it needs more context; the goal of this article is just to build a basic intuition behind the meaning of these metrics.
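
Still, as a rough sketch (reusing the same hypothetical split as in the entropy example above), gini gain can be computed like this:

    from collections import Counter

    def gini_index(labels):
        """Gini index: 1 minus the sum of squared class probabilities."""
        total = len(labels)
        return 1 - sum((count / total) ** 2 for count in Counter(labels).values())

    def gini_gain(parent, branches):
        """Gini index of the parent minus the size-weighted gini of the branches."""
        total = len(parent)
        weighted = sum(len(branch) / total * gini_index(branch) for branch in branches)
        return gini_index(parent) - weighted

    # Same hypothetical split: a 4-apple/4-banana dataset into two pure branches
    parent = ["apple"] * 4 + ["banana"] * 4
    print(gini_index(parent))                                  # 0.5
    print(gini_gain(parent, [["apple"] * 4, ["banana"] * 4]))  # 0.5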

Going back to our example

To make things easier to understand, let’s refer back to our shopping cart example. We have three datasets – C1, C2, and C3 – each of which has 8 records with labels coming from two classes – [Apple, Banana]. Using the probabilities calculated in the table above, let’s unroll both of these computations for Alice’s cart:
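
Assuming Alice’s cart (C2) holds four apples and four bananas, so that p(apple) = p(banana) = 0.5, the two computations unroll as:

    Entropy(C2) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(-0.5 - 0.5) = 1

    Gini(C2) = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5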

Similarly, we can compute these metrics for C1 and C3 as well and will get the following results:

Image by the Author

From the above table, we can draw some interesting takeaways about the range of values both entropy and gini index can take. Let’s call the lowest possible value the lower bound and the highest possible value the upper bound.

Lower Bound

The lower bound of both entropy and gini index is 0 when our data is purely homogeneous. Take a look at cart C1 for reference.
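
Unrolling the formulas for a pure cart like C1, where every element belongs to one class (so that class has probability 1):

    Entropy(C1) = -(1 \cdot \log_2 1) = 0

    Gini(C1) = 1 - 1^2 = 0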

Upper Bound

Entropy and Gini Index are 1 and 0.5, respectively, when the data has the highest uncertainty (take a look at cart C2 for reference, as C2 represents an example with the highest possible randomness).

One thing to note here is that these values for the upper bound only hold in the case of binary classification (because that’s what our two-class apple-banana example represents). In the case of n equally likely classes, each having probability 1/n, the upper bound becomes a function of n, as shown below.

Upper bound for entropy:

  • For binary classification, the upper bound of Entropy is 1.
  • For multi-class classification with each class having the same probability, the upper bound of Entropy will be:
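    \log_2(n)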

Upper bound for Gini Index:

  • For binary classification, the upper bound of Gini Index is 0.5.
  • For multi-class classification with each class having the same probability, the upper bound of Gini Index will be:
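    1 - \frac{1}{n}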

Recap

  • Entropy and gini index are used to quantify randomness in a dataset and are important to determine the quality of split in a decision tree. We can use the terms randomness, uncertainty, impurity, and heterogeneity interchangeably here.
  • High values of entropy and gini index mean high randomness in the data.

Entropy

Entropy aims to quantify how unpredictable a dataset is.

  • The formula to calculate entropy is restated after this list. Here pᵢ is the probability of choosing an element from a class labeled i, given n total classes.
  • If the data consists of elements belonging to a single class, it becomes highly predictable, so the entropy is at its minimum. The minimum value of entropy is 0.
  • When the data consists of elements belonging to n equally likely classes, each having probability 1/n, the entropy is at its maximum.
  • For binary classification (i.e., data with two classes), the value of entropy will never exceed 1.
  • For multi-class classification, the maximum value of entropy can be generalized as log₂(n).
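
For reference, the entropy formula once more:

    Entropy = -\sum_{i=1}^{n} p_i \log_2(p_i)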

Gini Index

Gini index aims to quantify the probability of incorrectly labeling an element chosen randomly from the data.

  • The formula is restated after this list.
  • If the data consists of elements belonging to a single class, the probability of incorrectly labeling a randomly chosen element is zero, so the gini index is at its minimum. The minimum possible value of gini index is also 0.
  • When the data consists of elements belonging to n classes with a balanced distribution, i.e., each class has equal probability 1/n, the gini index is at its maximum.
  • For binary classification (i.e., data with two distinct classes), the maximum value of gini index will never exceed 0.5.
  • For multi-class classification, the maximum value of gini index can be generalized as 1 - (1/n).
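
For reference, the gini index formula once more:

    Gini = 1 - \sum_{i=1}^{n} p_i^2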
