
Classification Decision Trees, Easily Explained

Decision Trees are one of the building blocks for more advanced machine learning models. When it comes to classification problems, how…

Photo by @szmigieldesign @ unsplash.com

No matter how experienced you are in Data Science, you’ve probably heard about Decision Trees. Loved by many, this simple explainable algorithm is the building block for many algorithms that have won a ton of machine learning competitions and steered projects towards success.

Random Forest, XGBoost or LightGBM are some of the algorithms that lean on this simple, yet effective idea of building an algorithm based on if-else rules.

How exactly is a Decision Tree built? How does it choose the variables and the thresholds that are going to compose its rules? If you always wanted to understand how this algorithm works, you are in the right place.


Let’s start with a simple role-play challenge.

We have boxes that contains two ostrich eggs – we need to ship these boxes overseas to other countries. We’ve done an experiment and sent 38 boxes to test if both eggs arrived safely at their destination – we then plot the weight of both eggs in a scatter plot, trying to understand if there is any relationship between these variables and the package outcome:

Weight of Eggs and Validity of Package

Blue points are boxes that arrived safely at their destination. For the red points, unfortunately, one or both of the eggs broke and the package was considered invalid.

(Let’s save some ostriches!)

After checking with the courier, you understand that when you send two eggs that are very heavy, they knock into each other and may break. When you send two eggs that are too light, they have too much room to move inside the box and crash against its walls. To avoid wasting a lot of boxes, you want to build an algorithm that will tell you which boxes should be valid for overseas travel.

It seems pretty easy for a human to come up with some if-else rules and draw a boundary in the example above. Something like this quick sketch:

If-Else rules to separate Valid and Invalid Packages

These quadrants produce the following if-else rules:

  • If weight of Egg 1 > 2.66 and weight of Egg 2 > 1.95, there is a high chance that the package will arrive broken.
  • If weight of Egg 1 < 2.66 and weight of Egg 2 > 1.95, there is a high chance that the package will arrive in good condition.
  • If weight of Egg 1 < 2.66 and weight of Egg 2 < 1.95, there is a high chance that the package will arrive broken.
  • If weight of Egg 1 > 2.66 and weight of Egg 2 < 1.95, there is a high chance that the package will arrive in good condition.

With these rules, we can easily set up a system that would let us decide if we should send the package overseas or not.
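Just to make the rules concrete, here is a tiny sketch of that system in Python (the 2.66 and 1.95 thresholds come straight from the hand-drawn boundaries above, so treat the function as illustrative only):

    def package_seems_valid(weight_egg_1, weight_egg_2):
        # Thresholds taken from the hand-drawn quadrants above (illustrative only)
        egg_1_heavy = weight_egg_1 > 2.66
        egg_2_heavy = weight_egg_2 > 1.95
        # Two heavy eggs or two light eggs tend to break; one of each tends to arrive safely
        return egg_1_heavy != egg_2_heavy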

Based on this system, let’s visualize our classifications:

So… if humans can do this on their own, why use an algorithm? The main problem is that most datasets are not as tidily arranged as the one above (I can assure you that 99.9% of real-life datasets don't look like this) – and what about drawing these boundaries with more than 3 variables? Practically impossible!

That’s why we need machines to do that work for us.

Enter Decision Trees!


Decision trees are actually pretty simple and can be summarized in one "simple" sentence: "decision trees are algorithms that recursively search the space for the best possible boundary, until we stop them from doing so". Let's cut the jargon and break this sentence down, mathematically.

Imagine that our first split is not an optimal one – Weight of Egg 1 greater than or less than 1.5:

Sub-optimal Split for our Tree – Weight of Egg 1 ≥ 1.5

This decision boundary doesn’t really divide the classes within "boxes" or quadrants. Using this rule would give us a sub-optimal classification because we can’t separate valid from broken packages well enough. Let’s draw a real tree and expand our argument, just for fun:

Tree Generated for split of Weight of Egg 1 ≥ 1.5 kilos (icon attribution: Stockio.com)

In the right branch (we'll get to the anatomy of a tree in a bit), we have 19 valid packages and 9 broken ones. When the weight of Egg 1 is higher, there is a higher probability (19/28 or around 68%) that the package will arrive in good condition. The probability that the package will arrive broken is, on the other hand, 9/28.

If you look at the left node, the probability of a valid or a broken package is exactly the same, 5/10 or 50%. It's harder for us to classify points below the 1.5 threshold because there is no difference between the number of examples in each class.

This is basically a coin toss – in "tree-speak", this means that this is a really impure node.

The key to understanding whether this is a good split is to quantify the impurity of both nodes produced by the split. Starting with the right node:

Right node of our Decision Tree with split – Weight of Egg 1 ≥ 1.5 (icon attribution: Stockio.com)

Impurity starts with probability, and we already know the following:

  • Probability of valid package – 19/28 = 67.86%
  • Probability of broken package – 9/28 = 32.14%

We now introduce a really important concept called Gini Impurity – this is what will let us attach a value to each split. If you've studied other algorithms, it behaves similarly to a cost function: we want to select the split with minimum Gini Impurity.

To make a final decision on the split, we need to calculate the partial impurity of each node it produces – for this node, the partial Gini impurity is given by the following formula (where we plug in the probabilities of a valid and a broken package):
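For a node with two classes, the Gini formula is one minus the sum of the squared class probabilities, so for our right node:

    1 - (19/28)² - (9/28)²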

This will yield a value of 0.43622. A lower partial impurity means a node that is better at separating the classes.

Ok, we have the values for the right node – let’s compute the partial impurity for the left node, starting with the probabilities:

Left node of our Decision Tree with split – Weight of Egg 1 < 1.5 (icon attribution: Stockio.com)
  • Probability of valid package – 5/10 = 50%
  • Probability of broken package – 5/10 = 50%

Now we can calculate the partial impurity for the left node:
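Using the same formula, now with the 50/50 probabilities:

    1 - (5/10)² - (5/10)²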

The partial impurity for the left node is 0.5. Notice how it is higher because we have a tougher time separating the classes here.

How do we combine both partial impurities? That is the easy part! We take a weighted average of the impurities, according to the number of examples in each node. Crunching this step by step:

  • 28 examples on the right node;
  • 10 examples on the left node;
  • Partial impurity of the right node = 0.43622;
  • Partial impurity of the left node = 0.5;

The impurity of the split as a whole is (drum roll!):
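Weighting each partial impurity by the share of the 38 boxes that falls in its node:

    (28/38) × 0.43622 + (10/38) × 0.5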

Uff, finally! The Gini impurity for this split is approximately 0.453 – we now have a single value that we can attach to this threshold and compare with other potential candidates.

By design, we will want to select the split that achieves minimum impurity – because that split translates into a better division of the classes.
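To make the whole calculation reproducible, here is a minimal Python sketch (the function and label names are my own, not from any particular library) that mirrors the numbers above:

    def gini_node(n_valid, n_broken):
        # Gini impurity of a single node: 1 minus the sum of squared class probabilities
        total = n_valid + n_broken
        p_valid = n_valid / total
        p_broken = n_broken / total
        return 1 - p_valid ** 2 - p_broken ** 2

    def gini_split(left_counts, right_counts):
        # Weighted average of the partial impurities, weighted by node size
        n_left = sum(left_counts)
        n_right = sum(right_counts)
        n_total = n_left + n_right
        return ((n_left / n_total) * gini_node(*left_counts)
                + (n_right / n_total) * gini_node(*right_counts))

    # Split at Weight of Egg 1 >= 1.5:
    # left node has (5 valid, 5 broken), right node has (19 valid, 9 broken)
    print(gini_node(19, 9))             # ~0.43622
    print(gini_split((5, 5), (19, 9)))  # ~0.453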


Let's test another split! For instance, Weight of Egg 1 ≥ 2.3:

Split for our Tree with Weight of Egg 1 ≥ 2.3.

This split generates the following tree:

Tree Generated with Weight of Egg 1 ≥ 2.3. (icon attribution: Stockio.com)

It seems that this isn’t a good split at all! Both classes seem balanced in both leaf nodes. "What’s a leaf?" – you ask.

Before calculating the Gini Impurity for this split, it is worth reviewing the anatomy of decision trees:

Anatomy of a Decision Tree (icon attribution: Stockio.com)
  • The Root Node is the base of our tree, where we have all our boxes of eggs before any split.
  • A Branch is each path that the boxes take after applying an if-else rule.
  • A Leaf Node is the node where a box ends up, depending on the branch above.
  • If the tree has more than one level of depth (we'll get there in a minute!), the leaf nodes that keep splitting become internal nodes.

Let’s crunch the numbers for this split and get to our Gini Impurity:

  • The partial impurity for the right node is 0.4753
  • The partial impurity for the left node is 0.455

Given this, our Gini impurity for this split is (drum roll!):
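The partial impurities above are consistent with 18 boxes falling on the right of the 2.3 threshold and 20 on the left (my own back-of-the-envelope reconstruction, since the exact counts aren't listed above), which would give:

    (18/38) × 0.4753 + (20/38) × 0.455 ≈ 0.465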

The value of the Gini Impurity for this split is worse than the first split we tested. This makes sense because the classes are fairly balanced in both nodes.

So… How do we find the best split?


Recursively searching the entire space of possible splits might be a good idea! That's actually what Decision Tree algorithms do, internally.
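Here is a rough sketch of that search for a single variable, assuming the data lives in two plain Python lists (the weights and the "valid"/"broken" labels; all names are illustrative): try a threshold between every pair of consecutive weights, compute the weighted Gini, and keep the lowest.

    def gini(labels):
        # Gini impurity of a list of "valid"/"broken" labels
        if not labels:
            return 0.0
        p_valid = labels.count("valid") / len(labels)
        return 1 - p_valid ** 2 - (1 - p_valid) ** 2

    def best_split(weights, labels):
        # Exhaustively scan candidate thresholds and return the purest one
        best_threshold, best_gini = None, float("inf")
        values = sorted(set(weights))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2   # midpoint between two observed weights
            left = [lab for w, lab in zip(weights, labels) if w < threshold]
            right = [lab for w, lab in zip(weights, labels) if w >= threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if weighted < best_gini:
                best_threshold, best_gini = threshold, weighted
        return best_threshold, best_gini

Real implementations are smarter about sorting and counting, but the idea is the same.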

If we calculate the Gini Impurity for all the possible splits in the "Weight of Egg 1" variable, we get a plot that looks like this:

"Cost Function" Plot for Gini Impurity across all possible thresholds in Weight of Egg 1

The plot above shows that the impurity is minimized when we choose "Weight of Egg 1 ≥ 1" as the threshold – let's check if that's true by visualizing our tree with that split:

Tree Generated with Weight of Egg 1 ≥ 1. (icon attribution: Stockio.com)
  • The partial impurity for the right node is 0.444;
  • The partial impurity for the left node is 0.48;

The Gini for this split can be calculated using:
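In general form (I'm leaving the node sizes symbolic, since the exact number of boxes on each side of the 1.0 threshold isn't listed above):

    (n_right/38) × 0.444 + (n_left/38) × 0.48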

This matches the minimum split that we achieved in the recursive search and should be our first split (should it?)!

Remember that we have two variables! What if Weight of Egg 1 is not the best variable to use to split our classes?

Let’s check the impurity plot for every split in the Weight of Egg 2 variable:

"Cost Function" Plot for Gini Impurity across all possible thresholds in Weight of Egg 2

We may have a candidate on Weight of Egg 2 – the interval between 0.95 and 1 may be a good first split.

Just to make this clear, what we are now testing is this:

Split for our Tree with Weight of Egg 2 ≥ 0.95.
Tree Generated with Weight of Egg 2 ≥ 0.95. (icon attribution: Stockio.com)
  • The partial impurity for the right node is 0.444;
  • The partial impurity for the left node is 0.48;

This is exactly the same pattern we saw for the best "Weight of Egg 1" split, so the final calculation yields the same Gini impurity again.

We have our starting point for our Decision Tree! Depending on the implementation you use to fit your tree to the data, one of these two splits might be chosen:

  • Weight of Egg 1 < 1 and Weight of Egg 1 ≥ 1
  • Weight of Egg 2 < 0.95 and Weight of Egg 2 ≥ 0.95

Why? Because these are the thresholds that minimize the Gini impurity – we have a tie in our example, something that is not common in real-world datasets.

But… where do we go from here?

We grow our tree!


Let’s focus on our left leaf node:

Tree Generated with Weight of Egg 2 ≥ 0.95. (icon attribution: Stockio.com)

The left node corresponds to all the examples below the threshold. If we recursively run the search on only these examples, what we find is a completely pure split – notice the new vertical threshold, in orange:

New Threshold with Weight of Egg 1 ≥ 2.9.

This corresponds to the following tree:

Tree Generated with two Levels of Thresholds (icon attribution: Stockio.com)

The leaf node that is a result of the branch "Weight of Egg 2 < 0.95" is now an internal node. Why? Because it will feed more branches going forward!

The new leaf nodes are the ones produced by the branch using "Weight of Egg 1" that you can see immediately below the internal node.

If you do the calculations, these new nodes will have:

  • Left node has partial impurity=0.
  • Right node has partial impurity=0.

What is the Gini impurity for this split? That’s right, 0!

This means that these are pure nodes: the split separates our classes perfectly, and we can't divide them any further after this point.
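As a quick sanity check: a node that contains a single class has Gini impurity 1 - 1² - 0² = 0, so a pure leaf has nothing left to gain from further splitting.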

What about the other branch? By iterating again on our Classification Tree we will reach a new threshold:

Tree Generated with two Levels of Thresholds (icon attribution: Stockio.com)

Another pure node – this time the left one! The details of the Gini Impurity for this split are:

  • Partial Impurity Left Node: 0
  • Partial Impurity Right Node: 0.4770408
  • Gini Impurity for split: 0.4047619

Now we can continue to go down our tree to produce more thresholds until we achieve node purity across all leaf nodes.
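Putting everything together, the growth process can be sketched as a small recursive function. This is a toy version, not how libraries such as scikit-learn implement it, and it assumes each box is a (weight_egg_1, weight_egg_2, label) tuple with "valid"/"broken" labels:

    def gini(labels):
        if not labels:
            return 0.0
        p_valid = labels.count("valid") / len(labels)
        return 1 - p_valid ** 2 - (1 - p_valid) ** 2

    def grow(rows, depth=0):
        # rows: list of (weight_egg_1, weight_egg_2, label) tuples
        labels = [row[2] for row in rows]
        if gini(labels) == 0:                    # pure node: stop and turn it into a leaf
            print("  " * depth + "leaf ->", labels[0])
            return
        best = None                              # (weighted gini, feature, threshold, left, right)
        for feature in (0, 1):                   # 0 = Weight of Egg 1, 1 = Weight of Egg 2
            values = sorted({row[feature] for row in rows})
            for low, high in zip(values, values[1:]):
                threshold = (low + high) / 2
                left = [r for r in rows if r[feature] < threshold]
                right = [r for r in rows if r[feature] >= threshold]
                score = (len(left) * gini([r[2] for r in left])
                         + len(right) * gini([r[2] for r in right])) / len(rows)
                if best is None or score < best[0]:
                    best = (score, feature, threshold, left, right)
        if best is None:                         # identical weights but mixed labels: give up here
            print("  " * depth + "leaf (mixed) ->", labels)
            return
        _, feature, threshold, left, right = best
        print("  " * depth + "split on Weight of Egg", feature + 1, "at", round(threshold, 2))
        grow(left, depth + 1)
        grow(right, depth + 1)

Calling grow on our 38 boxes keeps splitting until every leaf is pure – exactly the stopping rule we are using here (and one we will question in the final thoughts).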

The final thresholds generated are:

Notice how all points are grouped in their small "box". Here is how we are going to classify our points going forward:

If we receive a new set of boxes, we will put each one into one of these squares, and depending on the shade of the square we will know whether we should send our eggs for shipping or not – we've built our Classification Tree algorithm!

A challenge: Can you draw the tree generated by these thresholds on your own?


Final thoughts

  • Achieving node purity is not the only (and probably not the best) criterion for stopping a tree from growing. Achieving node purity comes with a high probability that your model will only fit the training data and not generalize well to the real world (and generalizing is the real goal of every predictive model).
  • Related to the point above, decision trees are really prone to overfitting. You can isolate almost every example in your sample with if-else rules, but that will, almost certainly, lead to an overfitted algorithm.
  • Sometimes, small changes in the input (for example, selecting a different sample from the original data) may cause a huge difference in the resulting tree.
  • The next thing to study regarding decision trees is hyperparameter tuning and Cross-Validation. They are key to guaranteeing that your decision tree will be robust when it faces a new sample – we will explore these concepts in the next blog post.
  • Decision trees have rarely been used in production since the introduction of Random Forests, because Random Forests are generally more robust – we will also cover Random Forests in a subsequent blog post.
  • Gini Impurity is not the only formula for deciding the best split (Entropy, for instance, may also be used as a criterion).
  • For Regression Trees the rationale is the same, but instead of Gini Impurity one would use an error metric from regression (normally, root mean squared error).

Thank you for taking the time to read this post! I hope you have enjoyed it and it helped you to finally understand the math and the logic behind Decision Trees.


Want to know more? Check out my R for Data Science Course on Udemy, where we will approach more algorithms and dive deeper into data science concepts.

This course is ready to be your first step as a Data Scientist – we approach all the concepts slowly, taking care to understand the basics and the theory behind the algorithms. This course is your risk-free (30-day refund policy) opportunity to enter the world of Data Science, and I would love to have you as a student!

