
Under the Hood: Using Gini impurity to your advantage in Decision Tree Classifiers



Photo by unsplash.com/@andrewtneel

This article will serve as the first part of a potentially ongoing series, looking at the mathematical concepts that drive key parameters in the machine learning algorithms employed in data science. My goal in these posts will be to express key concepts in as simple and non-technical a language as possible, while not sacrificing any critical precision to the concepts themselves. As any good data scientist knows, there is far more to understanding your data than importing an estimator from sklearn or statsmodels and typing .fit() in your script, even if sometimes the most experienced of data scientists can make it appear that simple. There is a lot of problem solving, critical thinking, and even a bit of artistry involved, with each and every data problem having its own unique approaches, limitations, and possibilities.

To be effective, especially in those frustrating and challenging moments when your model just isn’t performing to your hopes or expectations, it is necessary to be familiar and comfortable with the mathematical concepts "under the hood" in your machine learning algorithms. Adjusting a parameter or two at random, or grid searching blindly over them all, may sometimes get you where you need to go, but this is inefficient and can waste an enormous amount of time, especially when your models take hours or longer to train, or your deadline is closer than you’d like. There are also times when no amount of fine tuning will fix problems inherent in your project, and recognizing this ahead of time can save you several days of frustrating and fruitless work. In this article, I will dive into the default metric that decides how a decision tree classifier generates the nodes that it does – the Gini impurity.

Note: This article assumes that the reader is already familiar with the concept of Classification and Regression Trees (CART) in machine learning. If this is unfamiliar to you, here is a great introduction to them as a concept.

Gini Impurity – what is it?

First of all, the Gini impurity is a loss metric, which means that higher values are less desirable for your model (and for you) than lower values. Secondly, it is limited to the classifier variant of the decision tree models, as it requires discrete target values (or classes) in order to give a coherent value. We will get more into the details in a moment, but to summarize, the Gini impurity measures the probability that a randomly chosen sample from a subset of your data would be mislabeled if it were assigned a label at random according to the distribution of classes in that subset. The lower the probability, the "purer" the subset, and the higher the probability that a random assignment of class will be correct. For example, if a subset of your data contains only 1 class, the probability of mislabeling a sample from it is 0. In these ways, the Gini impurity is similar to many other loss metrics and functions that a data scientist may be familiar with, such as the mean squared error for standard regression models, or the log-loss metric in many standard classification problems.

The key difference, however, is the fact that the Gini impurity, as it functions in a decision tree, drives a greedy algorithm. What does this mean? In most machine learning cases, an algorithm initializes various weights at random, evaluates some loss metric, and then uses gradient descent to attempt to find the global minimum, or the weights at which the loss metric is at its lowest for the entire training data. This can be time consuming, but the decisions made when adjusting the weights of the model in question are global decisions that will hopefully generalize well to unseen data, always keeping in mind the end goal of minimizing the loss metric.

Gini impurity, being greedy, makes decisions at each node considering only the best result at that node, not necessarily the decisions at later nodes. In other words, the short-term decision will always be the "best" decision, but it may not be the "best" decision long term for later nodes in the tree. The two main upsides to this are a relatively fast training time (decision trees are often seen as among the fastest models to train), and the fact that your model will always find a "good" decision (in other words, you will not see issues such as weights failing to converge, as you might in other models such as logistic regression). The downside is that you almost certainly will not find the "best" decisions possible for the model (meaning that you will almost certainly not find the global minimum). In addition, the greedy "short term" approach is one of the biggest reasons that a single decision tree almost always overfits your training data, failing to generalize well to validation data. The locally best choices are very likely to pick up on noise as well as signal in your training sample, so different samples from the same population will likely not be predicted with similar accuracy.

How does it work?

The formula for Gini impurity is:

Gini = Σ p(i) × (1 – p(i)), summed over every class i from 1 to C

where C represents the total number of classes in the subset of data, and i is a selected class out of C. To those who don’t have a strong mathematics background, this formula may look slightly intimidating, but it is actually remarkably simple. The probability of selecting class i (this is p(i)) times the probability of not selecting class i (which is 1 – p(i)) is very similar to a typical binomial distribution with 2 trials (1 with event i, and 1 without event i). When you sum these for every i in your set of classes C, what you get is the total probability of mislabeling a randomly chosen sample from your subset if you assign it a class at random: or, in other words, a statistical expression of the "purity" of the classes in your subset of data.
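As a quick sketch, the summation above can be computed directly from per-class sample counts in a few lines of plain Python (no libraries needed; the function name gini_impurity is my own):

```python
def gini_impurity(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Sum p(i) * (1 - p(i)) over every class i
    return sum((c / total) * (1 - c / total) for c in counts)

# Two evenly distributed classes -> the maximum impurity for C = 2
print(gini_impurity([2, 2]))  # 0.5

# A node containing only one class is perfectly pure
print(gini_impurity([4, 0]))  # 0.0
```

The `[2, 2]` case is exactly the marble-bag experiment worked through below.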

Here’s a brief and simple example to help with understanding. Let’s say I have 4 marbles in a bag: 2 blue, and 2 green. The probability of picking a blue marble is 2 / 4, or 50%. I also have a 50% probability of not picking a blue marble. The probabilities of picking or not picking a green marble are exactly the same, so we would expect an impurity score of 0.5 (because there are two evenly distributed classes). Here, briefly, is the Gini impurity for this little statistical experiment:

Gini = p(B) × (1 – p(B)) + p(G) × (1 – p(G)) = 0.5 × 0.5 + 0.5 × 0.5 = 0.25 + 0.25 = 0.5

The takeaway here is, if I pull a marble out of the bag, and before looking at it state "This marble is green," I have a 50% probability of being correct. Now, that’s all well and good, but how do we interpret this? Is this a "good" impurity score, or a "bad" one?

To understand this, it’s important to know one more little equation, which is the key to properly interpreting the Gini impurity score and what it can tell you about your data. Since I defined the total number of classes as C above, the possible Gini impurity scores range from a minimum of 0 to a maximum of 1 – (1/C), where 0 means perfect purity of one class, and 1 – (1/C) corresponds to a completely uniform distribution across the classes, the least possible amount of purity. With those two classes, this means that the worst possible score I could have gotten was 1 – (1/2) = 1 – 0.5 = 0.5. So, this is an indication that my classes are not at all "pure" in this subset of my marble bag. If I had 3 classes instead, the least pure score possible would be 1 – (1/3) ≈ 0.6667. The more classes you have, the higher the possible impurity.
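To check that a uniform class distribution really does hit this 1 – (1/C) ceiling, here is a small verification in plain Python (the helper gini_impurity is my own name, computing the sum from the formula above):

```python
def gini_impurity(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(counts)
    return sum((c / total) * (1 - c / total) for c in counts)

# A uniform distribution over C classes achieves the worst case, 1 - (1/C)
for C in (2, 3, 4):
    uniform = [10] * C  # 10 samples of each of the C classes
    print(C, round(gini_impurity(uniform), 4), round(1 - 1 / C, 4))
# 2 0.5 0.5
# 3 0.6667 0.6667
# 4 0.75 0.75
```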

So, the way the decision tree works (keeping in mind that it is greedy) is that, at any given node, the model looks at only that subset of data, iterates over possible ways to "split" that data (the "decision" of that node), and selects the one whose resulting child nodes have the lowest sample-weighted Gini impurity. Then, it advances to the resulting child nodes and does the same, until each branch reaches an impurity of 0 or some other stopping criterion is met. Part of what makes the fitting time so fast for a typical decision tree is that, at each additional level of depth, the model is looking at ever smaller subsets of your data, as previous nodes have limited the subset that reaches a current node.
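As an illustrative sketch (the names best_split and gini below are my own, not sklearn’s), the greedy search at a single node over one numeric feature might look like this: try each candidate threshold, score the resulting children by their sample-weighted Gini impurity, and keep the winner:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return sum((labels.count(c) / total) * (1 - labels.count(c) / total)
               for c in set(labels))

def best_split(xs, ys):
    """Greedily pick the threshold with the lowest weighted child impurity."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[:-1]:  # candidate thresholds (exclude the max)
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Weight each child's impurity by its share of the samples
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if weighted < best[1]:
            best = (t, weighted)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))  # -> (3, 0.0): splitting at 3 is perfectly pure
```

Note that this is greedy in exactly the sense described above: the node commits to the locally best threshold without looking ahead to what later splits could achieve.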

What can this tell you about your data?

To start off, simply looking at the resulting tree can help you get a sense of what is happening with your data. This post offers many useful ways to visualize the decision tree that results from a sklearn model. In particular, sklearn’s plot_tree function is a very useful tool to see how many samples split into the resulting child nodes, which feature values are used to split, and how much the Gini impurity is reduced in these nodes. This can tell you what a good next step in model improvement might be.
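As a minimal, self-contained sketch of this kind of inspection (using the built-in iris dataset so it runs anywhere, and rendering off-screen to a file rather than a window):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so this sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

fig, ax = plt.subplots(figsize=(12, 8))
# Each node box shows the split rule, its Gini impurity, the sample count,
# and the per-class counts -- the quantities discussed above
plot_tree(clf, filled=True, feature_names=data.feature_names,
          class_names=list(data.target_names), ax=ax)
fig.savefig("tree.png")
```

Reading the rendered tree, you can follow how the impurity shrinks from the root toward the leaves, and spot nodes where a split barely reduces it.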

For example, if your decision tree has a large depth, with only small decreases in Gini impurity at each level of new nodes, this can be an indication that the target classes are not well separated by the features passed into the model. You can artificially limit the depth of the tree by passing a limiting argument max_depth, or limit how small a leaf node’s sample size can be by passing min_samples_leaf, which may simplify an overly complex model. In addition, the argument min_impurity_decrease will allow you to instruct the model to make only "valuable" splits, separating the data only if the Gini impurity is reduced by at least a given amount. If none of these adjustments lead to desirable results, returning to the EDA stage may be necessary to identify better features to engineer or pass into the model. If there is still difficulty in identifying strong predictors among the features you have available, your data may benefit at this stage from unsupervised learning, such as principal component analysis or k-means clustering, for the purpose of better separating your target classes.
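A brief sketch of those three arguments in use (again on the built-in iris data; the threshold values here are purely illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(
    max_depth=3,                 # cap the depth of the tree
    min_samples_leaf=5,          # no leaf may hold fewer than 5 samples
    min_impurity_decrease=0.01,  # split only if Gini drops by at least 0.01
    random_state=0,
).fit(X, y)

# The pruned tree is no deeper than the unpruned one
print(unpruned.get_depth(), pruned.get_depth())
```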

On the other hand, if a decision tree quickly reduces the Gini impurity with relatively small depth and few nodes, but the estimator tends to default to always predicting a dominant class (or if the model’s sensitivity or specificity is not acceptable for the project), the model is likely impaired by an inherent imbalance of the target classes. The best hyperparameter to adjust in this scenario is the class_weight argument, which acts as a form of regularization that penalizes the model more heavily for errors on a given class. For instance, if class 1 is underrepresented but you pass a weight larger than 1 for class 1, misclassifications of that class will be penalized more heavily, which can help improve the sensitivity. This could, for example, look like:

class_weight = {0:1, 1:5}

In this example, the weights for class 1 are higher (set to 5), which will more heavily penalize the model for misclassifying the positive class compared to the negative class. Beyond this, the best course of action is to acquire more data, or to resample your data to either oversample the smaller class, or undersample the dominating class, in order to better balance the classes in the first place.
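A small sketch of that adjustment on an imbalanced toy dataset; the 95/5 class split and the weight of 5 are illustrative choices, not tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic binary problem with roughly 95% negatives and 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(max_depth=3, random_state=0,
                                  class_weight={0: 1, 1: 5}).fit(X_tr, y_tr)

# Sensitivity (recall on the positive class) typically improves with weighting
print(recall_score(y_te, plain.predict(X_te)),
      recall_score(y_te, weighted.predict(X_te)))
```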

In conclusion

Knowing how to tune the hyperparameters of a model when your results are not satisfactory – or, for that matter, knowing when you have encountered limitations that are simply inherent to your data, and no amount of tuning will improve your scores until those limitations are addressed – is a very important skill for working efficiently on a data problem. A data scientist can have this kind of confidence in their response to an underperforming model only if they in turn have confidence in their knowledge of what that model is doing.

The takeaway here is that when you come across issues in the performance of a model, do not immediately resort to grid searching in the hopes of stumbling onto the best results. Take the time to think about what the performance of the model is telling you, and utilize the metrics and arguments of the estimator to make smart decisions. Take the time to familiarize yourself with the metrics and equations utilized by machine learning algorithms, and you will be better able to adapt to the unique problems you may encounter. By seeing what the performance of a model is telling you about your data, you will know which hyperparameters to adjust when your model needs tuning, or when it’s time to try a different approach. Learning these skills can make you far more efficient and effective as a data scientist, and can save you hours of frustration and wasted time in future projects.

