
Decision trees are one of the most fundamental machine learning tools, used for both classification and regression tasks. In this post, I will cover:
- The decision tree algorithm, with Gini impurity as the criterion to measure splits.
- Applying a decision tree to classify real-life data.
- Creating a pipeline and using GridSearchCV to select the best parameters for the classification task.
GitHub links for all the code and plots will be given at the end of the post.
Let’s begin…
Decision Tree:

The idea of the decision tree (hereafter DT) algorithm is to learn a set of if/else questions that lead to a decision. Here is a very naive example of classifying a person: if he/she claps for a blog post after reading it, the person is considered awesome; otherwise, a little less awesome. This tree is based on yes/no questions, but the idea remains the same for numeric data. A decision tree can combine both numeric and categorical data. Some of the terminology used in decision trees is shown in the picture below –

Here we see how the nodes are named depending on their position in the DT. First we need to learn how to choose the root node, and for that we need one of the criteria used to decide the splits: Gini impurity.
Gini Impurity:

Gini impurity is named after the Italian statistician Corrado Gini. It can be understood as a criterion to minimize the probability of misclassification: for a node, the Gini impurity is 1 − Σ p_k², where p_k is the fraction of samples in the node belonging to class k (this is the definition shown in the figure). It follows directly from the definition that a node containing only one class has a Gini impurity of 0. In building up the decision tree, the idea is to choose the feature with the lowest Gini impurity for the root node, and so on… To see exactly how this works, let's get started with a very simple data-set, where, depending on various weather conditions, we decide whether to play an outdoor game or not.
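Before diving into the data-set, here is a minimal sketch of the Gini impurity computation itself (the function name gini_impurity is my own, not from the original post):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k**2,
    where p_k is the fraction of samples in the node belonging to class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(['yes', 'yes', 'yes', 'yes']))  # 0.0 -> a pure node
print(gini_impurity(['yes', 'no']))                 # 0.5 -> maximally mixed (two classes)
```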

Here we see that, depending on 4 features (Outlook, Temperature, Humidity, Wind), a decision is made on whether to play tennis or not. So which feature should be at the root node? To decide, we will make use of Gini impurity. Let's start with the feature 'Outlook'. It is important to note that when 'Outlook' is overcast, we always go out to play tennis; that node contains only one class of samples (as shown in the figure below).

Since these are categorical variables, if we want to apply a decision tree classifier and fit it to the data, we first need to create dummy variables.
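A minimal sketch of how the dummy variables could be created with pandas (the three rows below are purely illustrative; the column naming is what matters):

```python
import pandas as pd

toy = pd.DataFrame({
    'Outlook':     ['Sunny', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Mild', 'Cool'],
    'Humidity':    ['High', 'Normal', 'High'],
    'Wind':        ['Weak', 'Strong', 'Weak'],
})

# each categorical column becomes a set of 0/1 columns, e.g. 'Outlook_Overcast'
print(pd.get_dummies(toy).columns.tolist())
```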

Here we can be sure of one thing: once we create a decision tree, the root node will definitely be the feature 'Outlook_Overcast'. Let's see the decision tree (shown in the figure below). When 'Outlook_Overcast' ≤ 0.5 is False, i.e. when 'Outlook_Overcast' is 1, we have a leaf node of pure samples with a Gini impurity of 0.


For the root node, let's calculate the Gini impurity. Since we have 9 ones ('yes') and 5 zeroes ('no'), the Gini impurity is 1 − (9/14)² − (5/14)² ≈ 0.459. The next node is 'Humidity_High', as that feature gives us the least Gini impurity. For a small data-set like this one, we can always use Pandas data-frame operations to check this and calculate the Gini impurity for each feature. Once we have 'Outlook_Overcast' as the root node, we get 4 samples ('yes') in a leaf node; among the remaining 10 samples we have 5 each of 'yes' and 'no', so the Gini impurity of that node is 0.5. Then 'Humidity_High' is selected as the next feature, and so on…
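As a quick sanity check, here is a sketch that reproduces these numbers with pandas. I use the classic 14-row play-tennis table (with Temperature omitted for brevity), which matches the counts quoted above (9 'yes', 5 'no', and a pure overcast node); the exact rows in the post's figure may differ.

```python
import pandas as pd

# classic play-tennis table: 14 samples, 9 'yes' / 5 'no'
data = pd.DataFrame({
    'Outlook':    ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
                   'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Humidity':   ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal',
                   'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind':       ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong',
                   'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
                   'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'],
})

def gini(labels):
    """Gini impurity: 1 - sum_k p_k**2."""
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

X = pd.get_dummies(data.drop(columns='PlayTennis'))
y = data['PlayTennis']

print(round(gini(y), 3))                       # 0.459 -> impurity of the root node
for col in X.columns:
    split = y.groupby(X[col]).apply(gini)      # impurity of each child node
    weights = X[col].value_counts(normalize=True)
    print(col, round((split * weights).sum(), 3))
# 'Outlook_Overcast' has the lowest weighted impurity (~0.357), so it becomes the root
```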

Gini impurity can have a small advantage over entropy, in the sense that it may take a little less time to build a decision tree for a large data-set, since entropy requires computing a logarithm [1] [2]. Sebastian Raschka, author of the book 'Python Machine Learning', has a fantastic blog post on why we use entropy to build decision trees instead of classification error [3]. Check that out! Let's move to the next section and implement the decision tree algorithm on a realistic data-set.
Bank Term Deposit Data-Set:
Here I'll be using the Bank Marketing Data-Set, available in the UC Irvine Machine Learning Repository. The abstract of the data-set, as stated on the website, is:
Abstract: The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).
Let's load the data-set using Pandas –
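A minimal loading sketch (the file name bank.csv is a placeholder for wherever you saved the data; some copies of the UCI files are semicolon-separated, in which case pass sep=';'):

```python
import pandas as pd

df = pd.read_csv('bank.csv')   # placeholder path; add sep=';' if your copy needs it
print(df.shape)                # the post works with 11162 samples: 16 features + the label
print(df.head())
```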

So there are 16 features, including categorical and numerical variables, and the total number of samples is 11162. First we check how the labels ('yes', 'no') are distributed. We can use a Seaborn countplot, as below –
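A sketch of the count plot. I call the target column 'deposit' here; it is named 'y' in some copies of the data-set, so adjust accordingly:

```python
import matplotlib.pyplot as plt
import seaborn as sns

label_col = 'deposit'                 # target column; 'y' in some copies of the data-set
sns.countplot(x=label_col, data=df)   # df is the frame loaded above
plt.title("Distribution of the labels ('yes' / 'no')")
plt.show()
```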

The data-set is slightly biased towards rejections ('no'). So, later on, when the data-set is split into train and test sets, we will use stratified sampling. We can also check the distribution of some numeric variables using Matplotlib histograms, as shown below –
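A sketch of the histograms. Which numeric columns the original figure shows is not certain; the four below are numeric columns present in the data-set and are chosen for illustration:

```python
import matplotlib.pyplot as plt

numeric_cols = ['age', 'balance', 'campaign', 'pdays']   # illustrative choice
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.hist(df[col], bins=30)   # df is the frame loaded earlier
    ax.set_title(col)
plt.tight_layout()
plt.show()
```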


A correlation plot (plotted using a Seaborn heatmap) of the numerical variables shows very little correlation between the features. Once we have played around enough with the data-set, let's prepare it for the DT algorithm. Since there are several categorical variables, we need to convert them to dummy variables. I dropped the feature 'duration' because, as mentioned in the description of the data-set, it strongly determines the target variable (when duration=0, y='no').
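A sketch of this preparation step, reusing df and label_col from the earlier snippets:

```python
import pandas as pd

# drop the leaky 'duration' column, keep the label aside, and one-hot encode the rest
features = pd.get_dummies(df.drop(columns=['duration', label_col]))
print(features.shape)   # the post reports 46 feature columns at this stage
```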

The next step is to select the features and labels –
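Continuing from the snippet above, and assuming the positive class is the string 'yes':

```python
X = features.values                                 # feature matrix
y = (df[label_col] == 'yes').astype(int).values     # 1 = subscribed, 0 = did not
```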

The next step is to split the data-set into train and test sets –
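A sketch of the split. The test_size=0.2 is my assumption, but it is consistent with the 8929 training samples mentioned later; stratify=y preserves the label ratio in both sets:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape)
```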

Applying Decision Tree Classifier:
Next, I created a pipeline of StandardScaler (to standardize the features) and a DT classifier (see the note below regarding standardization of features). We can import the DT classifier from Scikit-Learn with from sklearn.tree import DecisionTreeClassifier. To determine the best parameters (split criterion and maximum tree depth) for the DT classifier, I also used grid search cross-validation. The code snippet below is self-explanatory.
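Here is a sketch of the pipeline and grid search; the exact parameter values searched in the original post are not shown here, so the grid below is an assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),                       # see the note on scaling below
    ('clf', DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    'clf__criterion': ['gini', 'entropy'],              # split criterion
    'clf__max_depth': [3, 4, 5, 6, 7, 8],               # maximum tree depth
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print('test accuracy:', grid.score(X_test, y_test))
```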

Next, I applied 3-, 4-, and 5-fold cross-validation to determine the best parameters –
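One way to repeat the search with different numbers of folds, reusing pipe and param_grid from the sketch above:

```python
from sklearn.model_selection import GridSearchCV

for k in (3, 4, 5):
    grid_k = GridSearchCV(pipe, param_grid, cv=k, scoring='accuracy', n_jobs=-1)
    grid_k.fit(X_train, y_train)
    print(f'{k}-fold CV: best params = {grid_k.best_params_}, '
          f'best score = {grid_k.best_score_:.3f}')
```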

Here we have seen how to apply a decision tree classifier within grid search cross-validation to determine and optimize the best-fitting parameters. Since this particular example has 46 features, it is very difficult to visualize the tree here on a Medium page. So, I simplified the data-frame by dropping the 'month' feature (since it creates the largest number of dummy variables, 12) and ran the fitting procedure one more time, with the number of features now 35.

Let's plot the decision tree with a maximum depth of 6 and 'gini' as the criterion. Visualizing the tree with Scikit-Learn needs some coding –
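One way to do it with recent versions of scikit-learn (0.21+) is plot_tree; the original post most likely exported the tree with export_graphviz and rendered it with graphviz instead. For simplicity I refit directly on the feature matrix built earlier (the post first drops the 'month' dummies to make the tree smaller), and I omit the scaler since, as noted at the end, trees are scale-invariant:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# refit a single tree with the chosen parameters
tree_clf = DecisionTreeClassifier(criterion='gini', max_depth=6, random_state=42)
tree_clf.fit(X_train, y_train)

plt.figure(figsize=(24, 12))
plot_tree(tree_clf, feature_names=list(features.columns),
          class_names=['no', 'yes'], filled=True, fontsize=6)
plt.show()
```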

Let’s see the root and first few nodes of the tree in more detail –

Here we see that 'contact_unknown' is selected as the feature for the root node. The total number of training samples is 8929, and the Gini impurity is ~0.5. At the next depth we see that a numerical variable, 'pdays', is selected as the attribute to split the samples, and so on… With so many features, and especially widely distributed numerical features, it is extremely difficult to build such a tree manually; compare this with the much simpler play-tennis data-set used before. We can also plot which features are important in building the tree using the feature_importances_ attribute of the DecisionTreeClassifier class. The figure is shown below –
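A sketch of the importance plot, reusing tree_clf and features from the previous snippets:

```python
import numpy as np
import matplotlib.pyplot as plt

importances = tree_clf.feature_importances_
order = np.argsort(importances)[::-1]            # sort features by decreasing importance

plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[order])
plt.xticks(range(len(importances)), features.columns[order], rotation=90, fontsize=7)
plt.ylabel('Feature importance')
plt.tight_layout()
plt.show()
```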

As expected from the tree, 'contact_unknown', which is the root node, has the highest importance, and many features play an almost zero or negligible role. A feature with low or zero importance could mean that another feature (or features) encodes the same information.
Note: another relevant point is that, even though I have used standardization here, the DT algorithm is completely invariant to the scaling of the data [4]. Since every feature is processed separately, scaling the features has no effect on the splits.
So, to conclude, we have learned the basics of building a DT using Gini impurity as the splitting criterion. We also implemented grid search cross-validation to select the best parameters for our model to classify a realistic data-set. Hopefully, you have found all of this a little helpful!
Stay Strong and Cheers!
References:
[1] Binary Classification with Decision Trees; Sebastian Raschka.
[2] Decision Tree; Carlos Guestrin, University of Washington, lecture notes.
[3] Why Use Entropy to Grow Decision Tree?; Sebastian Raschka.
[4] Introduction to Machine Learning with Python; A. Müller, S. Guido.