3 decision tree-based algorithms for Machine Learning

In this article, I will discuss some of the most widely used decision tree-based algorithms for machine learning.

Akash Shastri
Towards Data Science

--

Decision Trees

What are they?

Decision trees are tree-structured algorithms that split the data based on a series of decisions. Look at the image below of a very simple decision tree. We want to decide whether an animal is a cat or a dog based on 2 questions.

  1. Are the ears pointy?
  2. Does the animal bark?

We can answer each question and depending on the answer, we can classify the animal as either a dog or a cat. The red lines represent the answer “NO” and the green line, “YES”.

This way the decision process can be laid out like a tree. The topmost question node is called the root node, further question nodes are decision nodes, and the answer/terminal nodes are called leaf nodes.

Simple Decision Tree. Image by the author.
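To make the picture concrete, here is a minimal sketch of that two-question tree written as plain Python conditionals. The field names (pointy_ears, barks) and the exact arrangement of the questions are my assumptions for illustration, not taken from the figure.

```python
# A hand-written toy version of the cat-vs-dog tree above.
# The field names and the order of the questions are assumptions for illustration.
def classify(animal):
    if animal["pointy_ears"]:       # root question: are the ears pointy?
        if animal["barks"]:         # second question: does the animal bark?
            return "dog"
        return "cat"
    return "dog"                    # no pointy ears -> dog in this toy tree

print(classify({"pointy_ears": True,  "barks": False}))  # cat
print(classify({"pointy_ears": False, "barks": True}))   # dog
```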

Which division comes first?

When we have a table of data with multiple columns, decision trees split the data based on column values. For the above decision tree, if we had a table like the one shown below, how do we decide which split comes first? Do we split first based on pointy ears or barking?

Cat or dog table. Image by the author.

We can use certain metrics to decide which column is the best classifier. There are 3 popular metrics that are used.

  1. Gini Impurity.
  2. Information Gain.
  3. Variance reduction.

Gini impurity is, as far as I have seen, the most popular, so let's take a look at it.
Classifying based on a single variable at a time, we get the following stumps (single-node trees).

Decisions and correctness. Image by the author.

Look at the above decisions and the number of correct and incorrect predictions they made. At each leaf node, the Gini impurity is calculated as below:

Gini at leaf = 1 - ("yes" probability)² - ("no" probability)²

So the Gini for the first node's leaves will be:
Leaf 1: 1 - (3/5)² - (2/5)² = 0.48
Leaf 2: 1 - (4/5)² - (1/5)² = 0.32

The Gini impurity for the entire node is the weighted sum of the Gini at each leaf node. So the Gini for the first node is:
(0.48)*(5/10) + (0.32)*(5/10) = 0.4

The weights for the sum are the same in this case because both leaves contain 5 elements each. This is not usually the case, and that is when the weighted sum of the Gini becomes important.

Similarly, the Gini impurity for the second node will be: 0

As the Gini Impurity is lower for the second node, that should be our first split. This way, we decide the order in which features are used to split data.
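As a quick sanity check of the arithmetic above, here is a small sketch that reproduces the leaf and node Gini values for the first stump (the 3/2 and 4/1 leaf counts are taken from the worked example):

```python
# Sketch of the Gini calculation from the worked example above.
def gini_leaf(yes, no):
    """Gini impurity of a single leaf from its class counts."""
    total = yes + no
    return 1 - (yes / total) ** 2 - (no / total) ** 2

def gini_node(leaves):
    """Weighted sum of leaf impurities, weighted by leaf size."""
    total = sum(yes + no for yes, no in leaves)
    return sum(gini_leaf(yes, no) * (yes + no) / total for yes, no in leaves)

# Counts from the first stump: leaf 1 has 3 correct / 2 incorrect predictions,
# leaf 2 has 4 correct / 1 incorrect.
print(gini_leaf(3, 2))              # ~0.48
print(gini_leaf(4, 1))              # ~0.32
print(gini_node([(3, 2), (4, 1)]))  # ~0.4
```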

How to split based on continuous variables?

How do we split on continuous variables or multi-category variables? Continuous variables are a little trickier. The key idea is to sort the data by the continuous variable, and at each data point take the average of the current and previous values as a candidate threshold, split the data at that threshold, and calculate the Gini impurity of the split. After going through the entire data, select the split with the lowest Gini impurity.

Example: Consider a list of heights and a classification as tall or short. At each data point (height), we split based on the average of the current and previous heights. Say the average at step i is 150 cm; we then assume all data less than 150 cm is short and everything greater than 150 cm is tall. We calculate the Gini impurity based on how many correct and incorrect predictions that split makes, then move on to the next candidate split and check its Gini impurity, until we find the split with the lowest impurity.
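Here is a rough sketch of that threshold scan; the heights and tall/short labels are made up for illustration:

```python
# Sketch of the midpoint-scanning idea for a continuous feature.
# The heights and "tall"/"short" labels are made-up toy data.
heights = [120, 135, 150, 162, 175, 190]
labels  = ["short", "short", "short", "tall", "tall", "tall"]

def gini(group_labels):
    """Gini impurity of a list of labels."""
    n = len(group_labels)
    if n == 0:
        return 0.0
    p_tall = group_labels.count("tall") / n
    return 1 - p_tall**2 - (1 - p_tall)**2

data = sorted(zip(heights, labels))
best_split, best_impurity = None, float("inf")
for i in range(1, len(data)):
    threshold = (data[i - 1][0] + data[i][0]) / 2   # midpoint between neighbours
    left  = [lab for h, lab in data if h < threshold]
    right = [lab for h, lab in data if h >= threshold]
    # Weighted Gini of the two sides of the candidate split.
    impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
    if impurity < best_impurity:
        best_split, best_impurity = threshold, impurity

print(best_split, best_impurity)  # 156.0 with impurity 0.0 for this toy data
```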

A similar strategy is used for categorical variables with more than 2 classes.

If this explanation is a little confusing, or you need further clarification, watch this amazing video by StatQuest on YouTube.

Decision Trees for Regression

So far we have only discussed decision trees for classification, but they can also be used for regression. The process for selecting decision-splits and important features is very similar to that of classification trees. The difference is how the impurity is calculated and how the output is predicted.

Consider a tree with 1 decision node (also called a stump) and 2 leaf nodes. After the data is split, the target values in each leaf node are averaged, and that average is the prediction for that leaf node.

The difference between the prediction of each data point and the actual value is called the residual. The sum of squares of all the residuals after the split is the impurity metric. The split with the lowest impurity or the split with the lowest sum of squared residuals is chosen.
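A minimal sketch of that split selection for regression, assuming a made-up one-feature dataset: each candidate threshold is scored by the sum of squared residuals around the two leaf means.

```python
# Sketch of choosing a regression split by sum of squared residuals (SSR).
# x and y are made-up toy data with one feature.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 0.9, 1.0, 3.2, 3.0, 2.9]

def ssr(values):
    """Sum of squared residuals around the mean of a leaf."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

pairs = sorted(zip(x, y))
best_threshold, best_score = None, float("inf")
for i in range(1, len(pairs)):
    threshold = (pairs[i - 1][0] + pairs[i][0]) / 2    # midpoint between neighbours
    left  = [yv for xv, yv in pairs if xv < threshold]
    right = [yv for xv, yv in pairs if xv >= threshold]
    score = ssr(left) + ssr(right)                     # impurity of this candidate split
    if score < best_score:
        best_threshold, best_score = threshold, score

# Each leaf simply predicts the mean of the targets that fall into it.
left  = [yv for xv, yv in pairs if xv < best_threshold]
right = [yv for xv, yv in pairs if xv >= best_threshold]
print(best_threshold, sum(left) / len(left), sum(right) / len(right))
# -> split at 3.5; the leaves predict ~1.0 and ~3.03
```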

Random Forests

The problem with decision trees is that they overfit the data. They learn splits that drive the training metric down, but they end up fitting the training data so closely that the model does poorly on unseen data.

There are 2 main ideas to fix the overfitting of Decision Trees.

  1. Bootstrapping
  2. Ensembling

Bootstrapping

Bootstrapping can be thought of as a sampling method. The main idea is to randomly sample a subset of the data (with replacement, in the standard bootstrap) and train the decision tree on that sample. This makes the decision tree overfit less (because it sees less data); however, it does nothing on its own to increase validation accuracy. This is only part of the solution.
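A bootstrap sample itself is simple to produce; here is a minimal sketch where a list of row indices stands in for the dataset:

```python
import random

# Sketch of a bootstrap sample: draw n rows *with replacement* from the data.
# "rows" is a made-up list of row indices standing in for the real dataset.
rows = list(range(10))
bootstrap_sample = random.choices(rows, k=len(rows))  # sampling with replacement
print(bootstrap_sample)  # e.g. [3, 3, 7, 0, 9, 1, 7, 2, 2, 5]: some rows repeat, others are left out
```

Because rows are drawn with replacement, each tree ends up seeing only about 63% of the distinct rows on average, so every tree gets a different view of the data.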

Ensembling

We reduced overfitting by bootstrapping, and now our tree fits only on a subset of data, but how does this help?

This is where the second part of random forests comes in. The idea of random forests is to accept that each decision tree is a weaker predictor, but to train many such predictors and take a central tendency of their outputs (the mean for regression, or the mode, i.e. the majority vote, for classification).

Random forests combine bootstrapping and ensembling. The data is randomly sampled, and a decision tree is trained on each of the sets of randomly sampled data. This way there are many decision trees each trained on a different random subset of the data.

As no single tree has seen the entire data, each one is less able to overfit it, and as there are a lot of trees, errors in individual predictions are driven lower by averaging (or voting over) the outputs of all the trees.

This way with ensembling and bootstrapping, random forests achieve far superior performance as compared to any single Decision tree.
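Putting bootstrapping and ensembling together, here is a rough sketch of the random-forest recipe using scikit-learn's DecisionTreeClassifier as the base learner and a made-up toy dataset (in practice you would reach for sklearn.ensemble.RandomForestClassifier, which also samples a random subset of features at each split):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

# Rough sketch of bootstrapping + ensembling with plain decision trees.
# X, y are a made-up toy dataset: 2 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap: rows sampled with replacement
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

def forest_predict(x):
    """Majority vote over all the trees in the ensemble."""
    votes = [int(t.predict(x.reshape(1, -1))[0]) for t in trees]
    return Counter(votes).most_common(1)[0][0]

print(forest_predict(np.array([1.0, 1.0])))    # expected: 1
print(forest_predict(np.array([-1.0, -1.0])))  # expected: 0
```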

Gradient boosted Decision Trees

Gradient boosting takes a similar approach to random forests in that it also uses many weak predictors, both to avoid overfitting and to achieve top-tier performance. Gradient boosted trees can be used for both classification and regression.

Mechanism

Consider a regression problem. Gradient boosted trees have the following approach.

Step one is always to predict the average of all the target values. This is the baseline prediction of the target. Once we have a baseline, we subtract it from each target value; this gives the residuals for step one.

residuals = target - prediction

Step two is to fit a decision tree to the residuals, where the output of each leaf node is the average of the residuals in that leaf. Now, to predict the target, we scale this output by a value called the learning rate (lr) and add the scaled value to the previous prediction. The sum is the new prediction, and we can repeat the above steps by calculating a new set of residuals.

Predᵢ = Predᵢ₋₁ + lr * output

where output is the average of the residuals in the leaf node.
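Putting steps one and two together, here is a rough sketch of the training loop for regression, assuming scikit-learn's DecisionTreeRegressor as the tree fit to the residuals; the data, learning rate, and number of rounds are made-up illustrative values:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch of the gradient-boosting loop described above (regression, squared error).
# X, y, lr and n_rounds are made-up illustrative values.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

lr, n_rounds = 0.1, 100
prediction = np.full_like(y, y.mean())   # step one: the baseline is the mean of the targets
trees = []

for _ in range(n_rounds):
    residuals = y - prediction                                   # residuals = target - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # step two: fit a tree to the residuals
    prediction += lr * tree.predict(X)                           # Pred_i = Pred_(i-1) + lr * output
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training error shrinks as more rounds are added
```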

The residual is analogous to the gradient of the loss function in neural networks: for a squared-error loss, the residual is, up to a constant, the negative derivative of the loss with respect to the prediction. As the residual is the difference between the target and the prediction, a positive residual means our prediction was lower than the target, and adding a scaled residual to our prediction will move it closer to the correct target. The same idea applies to negative residuals.

When multiple rounds of gradient boosting are stacked, the prediction gets remarkably close to the target, and the model performs very well on unseen data as well.

If you are a visual learner, I highly recommend StatQuest on YouTube. It is an amazing resource where Josh explains these algorithms and more with animations and quirky songs.

Also, the best way to practice is to download a Kaggle dataset, open a Jupyter notebook, dive in, and just explore :))
