
In my last article, I showed how you could have some fun growing a Random Forest using Sklearn’s DecisionTreeClassifier. You can check it out below:
Growing a Random Forest using Sklearn’s DecisionTreeClassifier
If you are new to Machine Learning (ML) or are in the process of becoming a Data Scientist or ML practitioner like me, you will probably get something out of it. In my case, preparing the material for the article helped me gain a good understanding of how Random Forests and Decision Trees work under the hood without getting into complex code.
In that article, I mentioned that there are many algorithms that can be used to build a Decision Tree. One of them is ID3 (Iterative Dichotomiser 3) and we are going to see how to code it from scratch using ONLY Python to build a Decision Tree Classifier.
All the code can be found in a public repository that I have attached below:
Modelling the nodes of the tree
A Decision Tree is formed by nodes: a root node, internal nodes and leaf nodes. We can create a Python class that will contain the information each node of the Decision Tree needs.
The class Node will contain the following information:
- value: Feature used to make the split (or the branch/leaf value)
- next: Next node
- childs: Branches coming off the decision nodes
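The original snippet is not reproduced here, so below is a minimal sketch of what such a class might look like, using only the three attributes listed above:

```python
# Minimal sketch of the Node class described above
class Node:
    """A single node of the Decision Tree."""
    def __init__(self):
        self.value = None   # feature to split on, branch value or predicted category
        self.next = None    # next node in the tree
        self.childs = None  # branches coming off a decision node
```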
Decision Tree Classifier Class
We now create our main class, called DecisionTreeClassifier, and use the __init__ constructor to initialise the attributes of the class and some important variables that are going to be needed.
Note that I have provided many annotations in the code snippets that help understand the code.
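The attribute names below are assumptions made for illustration (the training data X, the feature names, the target labels, the unique label categories and the entropy of the full training set), but they give an idea of what the constructor could initialise:

```python
class DecisionTreeClassifier:
    """Sketch of an ID3 Decision Tree Classifier; attribute names are illustrative."""

    def __init__(self, X, feature_names, labels):
        self.X = X                          # training instances (list of lists of feature values)
        self.feature_names = feature_names  # names of the features
        self.labels = labels                # target category of each instance
        self.labelCategories = list(set(labels))  # unique target categories
        self.node = None                    # root node of the tree, set when the tree is built
        # entropy of the whole training set; _get_entropy() is defined further below
        self.entropy = self._get_entropy([i for i in range(len(self.labels))])
```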
In order to calculate the entropy, note that we are using the private function _get_entropy().
We pass the instance ids or indexes to this function. To do this, we need to generate a unique number for each instance. Python's list comprehensions come in very handy for this task, as you can see.
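Here is a hedged sketch of how this method could be written; x_ids is assumed to be the list of instance ids generated with a list comprehension, as in the constructor sketch above:

```python
import math

# Sketch of the entropy helper; it lives inside the DecisionTreeClassifier class
def _get_entropy(self, x_ids):
    """Compute the entropy of the instances whose ids are in x_ids."""
    # labels of the selected instances
    labels = [self.labels[i] for i in x_ids]
    # number of instances in each category
    label_count = [labels.count(c) for c in self.labelCategories]
    # entropy: -sum(p_i * log2(p_i)), skipping empty categories to avoid log(0)
    return sum(
        -count / len(x_ids) * math.log(count / len(x_ids), 2)
        for count in label_count
        if count
    )
```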
We are going to code an ID3 algorithm that uses the information gain to find the feature that maximises it and makes a split based on that feature. The information gain is based on entropy.
As this article is about coding the ID3 algorithm, I am not going to go into the details, but entropy is a measure of the uncertainty about a random variable. If we minimise the entropy, then we increase the certainty about the variable. In other words, if the random variable can take only one value, the entropy reaches its minimum, whereas if all the values are equiprobable, the entropy is maximum.
The algorithm will try to minimise the entropy or, equivalently, maximise the information gain.
We can compute the entropy with the following formula:
E(S) = − Σᵢ pᵢ · log₂(pᵢ)
where pᵢ is the proportion of instances belonging to each category i = 1, …, N.
We said that we would compute the information gain to choose the feature that maximises it and then make the split based on that feature. The information gain is a concept based on entropy. It is defined as the total entropy minus the entropy we get if we choose a particular feature j.
IG(S, j) = E(S) − Σᵥ (|Sᵥ| / |S|) · E(Sᵥ)
where the sum runs over the values v that feature j can take and Sᵥ is the subset of S for which feature j takes the value v.
So we create another private function that computes the information gain:
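A minimal sketch of such a method, reusing the _get_entropy() helper from above (parameter names other than feature_id are assumptions):

```python
# Sketch of the information gain helper; also a method of DecisionTreeClassifier
def _get_information_gain(self, x_ids, feature_id):
    """Information gain obtained by splitting the instances in x_ids on feature_id."""
    # total entropy of the instances before the split
    total_entropy = self._get_entropy(x_ids)
    # values taken by the selected feature for these instances
    x_features = [self.X[i][feature_id] for i in x_ids]
    feature_vals = list(set(x_features))
    # instance ids that take each value of the feature
    feature_vals_ids = [
        [x_ids[i] for i, x in enumerate(x_features) if x == val]
        for val in feature_vals
    ]
    # weighted entropy after the split: sum(|S_v| / |S| * E(S_v))
    entropy_given_feature = sum(
        len(ids) / len(x_ids) * self._get_entropy(ids)
        for ids in feature_vals_ids
    )
    # information gain = total entropy - entropy given the feature
    return total_entropy - entropy_given_feature
```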
_get_information_gain() takes the instance ids and the id of the selected feature to be evaluated. It then calculates the total entropy, the entropy if we selected the feature specified in feature_id and, finally, the information gain.
This ID3 algorithm chooses the feature that maximises the information gain at each split. That is, if the amount of information in the feature is large, the entropy given that feature will be small, and therefore the second term of the information gain formula will also be small, increasing the information gain.
_get_information_gain() only calculates the information gain for a given set of instances and a given feature but, at each split, we need to figure out which feature maximises the information gain. So we need to create another function to find that feature.
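A sketch of how this could look, assuming the list of candidate feature ids is passed in alongside the instance ids:

```python
# Sketch of the helper that picks the best feature; method of DecisionTreeClassifier
def _get_feature_max_information_gain(self, x_ids, feature_ids):
    """Return the name and id of the feature that maximises the information gain."""
    # information gain of each candidate feature
    features_info_gain = [self._get_information_gain(x_ids, f_id) for f_id in feature_ids]
    # id of the feature with the largest information gain
    max_id = feature_ids[features_info_gain.index(max(features_info_gain))]
    return self.feature_names[max_id], max_id
```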
Ok… so we have a private function (_get_feature_max_information_gain()) that finds the feature that maximises the information gain, which uses another private function (_get_information_gain()) that calculates the information gain, which in turn uses another private function (_get_entropy()) that computes the entropy. We are all set! We can start coding the ID3 algorithm that will create our ID3 Decision Tree for classification problems.
We create a function that initialises the algorithm and then uses a private function to call the algorithm recursively to build our tree.
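A minimal sketch of that entry point; the public name id3() is an assumption, but it shows how the instance and feature ids can be generated with list comprehensions before handing over to the recursive helper:

```python
# Sketch of the public method that kicks off the algorithm; method of DecisionTreeClassifier
def id3(self):
    """Build the Decision Tree and store its root node in self.node."""
    # one unique id per training instance and per feature
    x_ids = [i for i in range(len(self.X))]
    feature_ids = [i for i in range(len(self.feature_names))]
    # build the tree recursively and keep the resulting root node
    self.node = self._id3_recv(x_ids, feature_ids, self.node)
```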
Note how we store the result of calling the _id3_recv() function in self.node. This function returns an instance of the class Node with the information of all the nodes (decisions) in the Decision Tree.
_id3_recv() is the trickiest function to code, so let's spend some time understanding what it does.
Let’s generate some data
I have generated a small toy dataset so we can work on a small example and actually visualise the resulting tree. This will help us understand the algorithm as well.
Let’s imagine that we love surfing waves and would like to have a model to predict if there will be good waves in our favourite spot.
Our features are:
- Wind direction
- Swell size
- Tide
And our target is simply a binary categorical variable that tells us if there were/will be good waves based on the information provided in our features:

Let’s generate some synthetic data using random sampling:
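The exact dataset used here isn't reproduced in this text, so the snippet below is only a sketch of how such data could be sampled; the feature values are assumptions based on the description above:

```python
import random

random.seed(42)  # fix the seed so the toy dataset is reproducible

# possible values for each feature (illustrative)
wind_directions = ['North', 'South', 'East', 'West']
swell_sizes = ['Small', 'Medium', 'Large']
tides = ['Low', 'High']

n_samples = 30
feature_names = ['Wind direction', 'Swell size', 'Tide']
X = [
    [random.choice(wind_directions), random.choice(swell_sizes), random.choice(tides)]
    for _ in range(n_samples)
]
# binary target: were there good waves that day?
y = [random.choice(['Yes', 'No']) for _ in range(n_samples)]
```

With the data in place, we can instantiate the classifier, for example as DecisionTreeClassifier(X, feature_names, y) if we follow the constructor sketched earlier.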

_id3_recv() function
Now that we have some data to work with, it will be easier to understand the _id3_recv() function. Let's walk through what it does piece by piece.
The function takes two lists, containing the instance ids and the feature ids, and an instance of the class Node. First, it checks whether the node argument is None and, if it is, initialises a new instance of the class Node.
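As a reference for the chunks discussed below, here is a sketch of how the function could start (parameter names are illustrative):

```python
# Sketch of the recursive ID3 builder; method of DecisionTreeClassifier
def _id3_recv(self, x_ids, feature_ids, node):
    """Recursively build the subtree for the instances in x_ids using the features in feature_ids."""
    if not node:
        node = Node()  # initialise the node if it has not been created yet
    # target categories of the instances in this branch
    labels_in_features = [self.labels[i] for i in x_ids]
    ...
```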
Then, at each split, the algorithm checks whether any of the following conditions are met:
- Are all the instances of the same category? If so, it assigns that category to the node value and returns the variable node. This check is handled in the sketch after this list.
- Are there any features left to compute the information gain? If there are not, it assigns the most probable category (the mode) to the node value and returns the variable node.
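Continuing the sketch from above, both checks could look roughly like this inside _id3_recv():

```python
    # base case 1: all instances in this branch belong to the same category
    if len(set(labels_in_features)) == 1:
        node.value = labels_in_features[0]
        return node

    # base case 2: no features left to split on -> assign the most probable category (the mode)
    if len(feature_ids) == 0:
        node.value = max(set(labels_in_features), key=labels_in_features.count)
        return node
```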
If none of the above conditions are met, it chooses the feature that maximises the information gain and stores it in the node value. In this case, the node value is the feature with which it is making the split.
The ID3 algorithm creates a branch for each value of the selected feature and finds the instances in the training set that take that branch. Note that each branch is represented with a new instance of the class Node that also contains the next node.
For the next node, it computes the information gain for the remaining features and instances and chooses the one that maximises it to make another split, until all the instances have the same class or there are no more features to compute, in which case it returns the most probable category. All of this is done in the last part of the function.
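The last part could be sketched as follows, reusing _get_feature_max_information_gain() and the Node class from before:

```python
    # choose the feature that maximises the information gain and store it in the node value
    best_feature_name, best_feature_id = self._get_feature_max_information_gain(x_ids, feature_ids)
    node.value = best_feature_name
    node.childs = []

    # one branch per value taken by the chosen feature
    feature_values = list(set(self.X[i][best_feature_id] for i in x_ids))
    for value in feature_values:
        child = Node()
        child.value = value  # the branch value
        node.childs.append(child)
        # instances of the training set that take this branch
        child_x_ids = [i for i in x_ids if self.X[i][best_feature_id] == value]
        if not child_x_ids:
            # no instances left: predict the most probable category
            child.next = max(set(labels_in_features), key=labels_in_features.count)
        else:
            # drop the chosen feature and keep splitting recursively
            remaining_feature_ids = [f for f in feature_ids if f != best_feature_id]
            child.next = self._id3_recv(child_x_ids, remaining_feature_ids, child.next)
    return node
```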
Resulting Decision Tree
And here we have the Decision Tree Classifier that the ID3 algorithm has built using our synthetic data:

It seems that our "synthetic" beach only works well when the wind blows from the South. If there is a large swell, it is likely that there will be good waves regardless of the tide but if the swell is medium, we will usually find better surfing conditions during low tide.
I would like to give credit to the Professors of the MSc in Machine Learning, Deep Learning and AI by Big Data International Campus for providing a version of the source code that inspired me to write this article.
References
- Text notes of MSc in Machine Learning, Deep Learning and AI by Big Data International Campus and UCAM