Table of Contents
- Introduction
- Example Dataset
- The Decision Tree
- How It Works
- Making Predictions
- Impurity
- Continuous or Multi-Categorical Variables
- What If We Can’t Get 100% Purity?
- Conclusions
Introduction
Tree-based models are some of the most widely used models today; they are very powerful, easy to implement and provide feature importances to help with interpretability. One of the most widely used tree-based models is the Random Forest, which is based on Decision Trees.
In order to understand the Random Forest, it is essential to know what its underlying model, the Decision Tree, is doing.
Despite its ease of use, it can be a tricky algorithm to explain to a non-technical audience. This guide explains the Decision Tree using a simple example. Some maths is included here, but you might want to omit this depending on your audience.
Decision Trees can be used for both Classification and Regression. We’ll focus on Classification here.
Example Dataset

From here on out, we are going to use a hypothetical dataset about apples, pears and grapes. I like using this example in explanations because it’s very visual, and the features you can pull from these fruits are nicely varied.
Here’s what data we might find in our example dataset…

- Shape is binary categorical
- Diameter is continuous
- Fruit is categorical (this is the target we want to predict)
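To make this concrete, here is a minimal sketch of what such a dataset might look like in pandas. The column names and values are hypothetical, chosen only to match the example (colour is included because we use it as a feature later).

```python
import pandas as pd

# Hypothetical example rows; the exact values are made up for illustration.
fruit_df = pd.DataFrame({
    "shape": ["round", "round", "tapered", "round", "tapered", "round"],
    "colour": ["red", "green", "green", "purple", "green", "red"],
    "diameter_cm": [7.0, 1.5, 8.0, 1.8, 7.5, 6.5],
    "fruit": ["apple", "grape", "pear", "grape", "pear", "apple"],  # the target
})

print(fruit_df)
```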
The Decision Tree

How It Works
The Decision Tree is a concept most people should be aware of, as it crops up in many other fields. Let’s dig in.
Imagine you are trying to separate the above data. You don’t know what fruit it is but you have the rest of the attributes. What feature would you use first to create the most separation?
I’m going to start with size. The blue box in the diagram below represents our Root Node: this is the first decision we make. The yellow circles are the options and the green boxes are leaves.
Here, we’ve arbitrarily chosen size as the feature and 4 cm as the threshold. There is actually some simple maths we can use to decide both of these, which we’ll get onto later.

As expected, choosing a size threshold of 4cm fully separates the grapes from the apples and pears. The grapes leaf can be described as pure because it only contains grapes.
In our dataset, only apples have the colour red. Let’s use this as our next decision boundary.

This creates another pure leaf on the yes side, but on the other side, there is still a mix of apples and pears that we need to classify.
Finally, let’s use the shape. In our dataset, this is binary: round or tapered.

And that’s it! We have 2 pure leaves and fully separated apples and pears.
Making Predictions
We now have a full tree that could be used on unseen data to classify the fruit.
The full tree is below; grab a piece of fruit and a ruler and give it a try!

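As a rough sketch (not the exact tree from the diagrams), this is how we might fit and query a Decision Tree in scikit-learn on data shaped like our example. The 0/1 encoding of colour and shape, and the toy values, are assumptions made for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical training data mirroring the example dataset.
X = pd.DataFrame({
    "diameter_cm": [7.0, 1.5, 8.0, 1.8, 7.5, 6.5],
    "is_red": [1, 0, 0, 0, 0, 1],    # colour encoded as a binary flag
    "is_round": [1, 1, 0, 1, 0, 1],  # shape: round (1) vs tapered (0)
})
y = ["apple", "grape", "pear", "grape", "pear", "apple"]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Classify a new fruit: 2 cm, not red, round -> lands in the grape leaf.
new_fruit = pd.DataFrame({"diameter_cm": [2.0], "is_red": [0], "is_round": [1]})
print(tree.predict(new_fruit))

# Print the learned rules as text, similar to reading the diagram above.
print(export_text(tree, feature_names=list(X.columns)))
```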
We can also visualise the decision boundaries. The chart below shows a theoretical sphericity measure, where the boundary between round and tapered sits at around 0.7. Colour would be the third dimension in this chart.
With two features, the splits are easy to see: the diameter boundary is the vertical line and the sphericity boundary is the horizontal one. These plots are a very useful way to visualise how the data is being split.

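If you’d like to recreate this kind of chart, here’s a minimal matplotlib sketch. The sphericity values are made up, and the 4 cm and 0.7 boundaries are the hypothetical ones from the example.

```python
import matplotlib.pyplot as plt

# Hypothetical diameter (cm) and sphericity values for each fruit.
fruits = {
    "grape": ([1.5, 1.8, 2.0], [0.90, 0.85, 0.95]),
    "apple": ([6.5, 7.0, 7.5], [0.80, 0.85, 0.90]),
    "pear":  ([7.0, 7.5, 8.0], [0.55, 0.60, 0.65]),
}
for name, (diameter, sphericity) in fruits.items():
    plt.scatter(diameter, sphericity, label=name)

# Decision boundaries from the example: diameter at 4 cm, sphericity at ~0.7.
plt.axvline(x=4, linestyle="--")
plt.axhline(y=0.7, linestyle="--")
plt.xlabel("Diameter (cm)")
plt.ylabel("Sphericity")
plt.legend()
plt.show()
```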
There are a few questions still outstanding here that we’ll tackle in the next few sections. These are:
- How do we decide which feature to use as a node?
- If that feature is continuous or has multiple categories, how do we pick the split point?
- What happens if our data isn’t easy to classify and we can’t get pure leaves?
Impurity
If a leaf is not pure (it contains a mixture of classes), we can measure its impurity by looking at the proportions of each class inside the leaf. The two most common ways to calculate impurity are the Gini and Entropy scores.
Variance is used instead for Regression.
The algorithm aims to minimise impurity at each decision boundary: at each node, it calculates the Gini score for every candidate split and selects the feature that produces the lowest score.
The Gini formula is as follows:

Gini = 1 - Σ (pᵢ)²

This is simply 1 minus the sum of the squared probabilities of each class, where the probability pᵢ of a class is just the number of observations of that class divided by the total in the leaf. Below, I have calculated the Gini impurity for each leaf, as if each of our original decisions had been used as the root node.

To get the Gini score for each candidate root node, we just take the weighted average of its leaves’ scores. This prevents differences in sample size between the leaves from causing an issue.
In the example above, for size, we would do the following:
- Yes = 0.50 * 12/18
- No = 0.00 * 6/18
Summing these gives a Gini score of 0.33 for the size split.
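Here is the same arithmetic as a short sketch, assuming the counts from the example (12 fruits on the ‘Yes’ side split 6/6 between apples and pears, and 6 grapes on the ‘No’ side).

```python
def gini(class_counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((count / total) ** 2 for count in class_counts)

# "Yes" side of the size split: 6 apples and 6 pears (impure).
gini_yes = gini([6, 6])   # 1 - (0.5**2 + 0.5**2) = 0.50
# "No" side: 6 grapes (pure).
gini_no = gini([6])       # 1 - 1**2 = 0.00

# Weighted average of the two leaves gives the score for the size split.
weighted = gini_yes * 12 / 18 + gini_no * 6 / 18
print(round(weighted, 2))  # 0.33
```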
We can repeat this for every candidate node, and the one with the minimum Gini score is the node we use. Note that Shape and Size happen to give the same result here.
Entropy is used less often than Gini. It uses a log formula, which can give different results in some cases.
The impurity score is calculated only at the current node (i.e. for just one split at a time). It does not look ahead to what the full tree would be if each feature were used first. This is known as a greedy algorithm.
Now we know how the decision is made for which feature to use at each node, but how do we decide what to use for continuous variables, or those with multiple categories?
Continuous or Multi-Categorical Variables
These are actually very simple. With continuous variables, the values are sorted and the Gini score is calculated at the midpoint between each pair of adjacent values. In the example below, we would test 1.1, 1.5 and so on, and take the boundary that produces the best (lowest) Gini score.

Multi-category variables (e.g. Red, Purple, Green) are also straightforward. We just test a split on each category and pick the one that produces the best Gini score!
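Here’s a small sketch of the continuous case, assuming a handful of made-up diameter values (the 1.1 and 1.5 candidates mentioned above fall out of the first few midpoints): sort the values, take the midpoint between each pair of neighbours as a candidate threshold, and keep the one with the lowest weighted Gini score.

```python
def gini(class_labels):
    """Gini impurity of a list of class labels."""
    total = len(class_labels)
    return 1 - sum((class_labels.count(c) / total) ** 2 for c in set(class_labels))

# Hypothetical sorted diameters (cm) and their fruit labels.
diameters = [1.0, 1.2, 1.8, 6.5, 7.0, 8.0]
labels = ["grape", "grape", "grape", "apple", "apple", "pear"]

best_threshold, best_score = None, None
for left, right in zip(diameters, diameters[1:]):
    threshold = (left + right) / 2  # midpoint between adjacent values, e.g. 1.1, 1.5, ...
    below = [lab for d, lab in zip(diameters, labels) if d <= threshold]
    above = [lab for d, lab in zip(diameters, labels) if d > threshold]
    # Weighted average Gini of the two sides for this candidate threshold.
    score = (gini(below) * len(below) + gini(above) * len(above)) / len(labels)
    if best_score is None or score < best_score:
        best_threshold, best_score = threshold, score

print(best_threshold, round(best_score, 2))  # the split with the lowest weighted Gini
```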
What If We Can’t Get 100% Purity?
By default, decision trees will continue to split the data until all leaves are pure. This means trees can get very deep if our data is difficult to separate: for example, if we had a particularly round pear, or a very small apple.
There are ways to prevent this: in scikit-learn we can limit the depth or the number of leaves in the tree to stop it getting too deep. If we do, we might end up with leaves that aren’t 100% pure.
In the case of impure leaves, the model will predict the majority class, but the probability will not be 100%.
For example, if we had a leaf containing 3 apples and 1 pear, then any new observation that meets the criteria for that leaf would be predicted as an apple with 75% probability.

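As a sketch, assuming scikit-learn and a toy tree limited to a single split so that one leaf holds 3 apples and 1 pear, the predicted probability comes straight from the leaf’s class proportions.

```python
from sklearn.tree import DecisionTreeClassifier

# Limiting the depth (or the number of leaves) can leave some leaves impure.
tree = DecisionTreeClassifier(max_depth=1)

# Hypothetical toy data arranged so one leaf ends up with 3 apples and 1 pear.
X = [[6.5], [6.8], [7.0], [6.9], [1.5], [1.8]]  # diameter in cm
y = ["apple", "apple", "apple", "pear", "grape", "grape"]
tree.fit(X, y)

print(tree.classes_)                # class order for the probabilities below
print(tree.predict([[6.7]]))        # majority class of the leaf -> 'apple'
print(tree.predict_proba([[6.7]]))  # leaf proportions -> [0.75, 0.0, 0.25]
```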
Conclusions
Nowadays, models are becoming more advanced and complex, so it’s more likely that you will want to use one of the many powerful tree-based models such as Random Forest, XGBoost or LightGBM. However, these models rely on concepts such as bagging or boosting applied to many Decision Trees. Understanding the Decision Tree provides a framework for understanding these powerful, widely used models and will enhance your ability to explain and interpret them.
Benefits of Decision Trees
- Decision Trees are very simple to implement using scikit-learn
- We can produce a plot showing the full tree
- Missing values don’t need to be imputed
- Features don’t need to be scaled
- We can extract feature importances
- Decision Trees are very easy to interpret
Issues With Decision Trees
The biggest issue with the Decision Tree is tree depth. As mentioned above, the tree will continue to grow until it achieves purity. Having very specific decision boundaries leads to overfitting. This is when our model has learnt the training data almost perfectly and doesn’t perform well on data it hasn’t seen before.
‘Pruning’ the tree by reducing the number of leaves does reduce overfitting, but it can also limit our model’s predictive power. So how can we improve on Decision Trees?
This is where the Random Forest model comes in; I cover it in a separate article.