
Decision Tree - Data Scientist’s magic bullet for Hamletian Dilemma

Prasad Patil
Towards Data Science
Jun 2, 2018

Decision trees belong to the family of supervised machine learning algorithms and are considered a panacea for all data science problems. Data scientists are often caught making witty remarks such as, “whenever a problem statement puts you in a Hamletian dilemma and you can’t think of any algorithm (irrespective of the situation), use a decision tree!”.

More often than not, be it in industry or in Kaggle competitions, decision trees, or at least the algorithms that evolved from them (bagging and boosting ensembles), are practiced religiously.

The decision tree is a versatile machine learning method capable of performing both regression and classification tasks. Almost all real-world problems are non-linear in nature, and decision trees handle that non-linearity in the data well. The algorithm is very intuitive, simple to understand and can be interpreted visually, something every business would want in the first place.

What else can one desire from a model? Simply magical ✨

Decision trees are drawn upside down, meaning the root is at the top and the leaves are at the bottom. At its core, a decision tree follows the divide-and-conquer rule.

Basic Terminology

Let’s look at the basic terminology used with Decision trees:

  1. Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
  2. Splitting: The process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
  5. Pruning: Removing the sub-nodes of a decision node is called pruning.
  6. Branch/Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

Intuition

There are two types of decision trees:

A. Classification Decision Trees & B. Regression Decision Trees

  • Classification trees help you classify your data and hence work on categorical targets, for example loan status (approved/not approved), spam/not spam, etc.
  • Regression trees are designed to help you predict an outcome that is a real number, for example a person’s income, the sale price of a house, etc.
source: algobeans.com — Classification Tree illustration

Suppose we have two features, X and Y, and in the right panel you can observe several scattered data points. Green leaf and grey leaf are the two classes of the dependent variable. What a decision tree basically does is cut the entire data set into slices over several iterations. As shown, there is split 1 at X = 0.5, split 2 at Y = 0.5 and split 3 at X = 0.25.

Splits are orchestrated to maximize the count of a single category within each slice, meaning the decision tree tries to achieve a homogeneous distribution at each node. From the right panel, you can notice that the separation of the green leaf and grey leaf categories ends up forming a homogeneous structure in every compartment.

Math behind the algorithm

There are various methods adopted by decision trees for splitting nodes. The most commonly followed are the Gini index, entropy and chi-square.

I. Gini Index

  • According to it, if we select two items from a population at random, they must be of the same class, and this probability is 1 if the population is pure.
  • It is a measure of impurity. Therefore, the lower the value of the Gini index, the higher the homogeneity.
  • Mathematically represented as

Gini Index = 1-[(P)²+(1-P)²]

where P is the proportion of positive samples in that node.

  • A Gini index of 0 indicates that the node is pure, which implies no further splitting is needed.
  • Steps involved -

Calculate the Gini index for each sub-node using the formula above.

Calculate the Gini of the split as the weighted Gini score of each node of that split.
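
A minimal Python sketch of those two steps (the split counts below are hypothetical, purely to illustrate the arithmetic):

```python
def gini(positives, total):
    """Gini Index = 1 - [P^2 + (1 - P)^2], where P is the proportion
    of positive samples in the node."""
    p = positives / total
    return 1 - (p ** 2 + (1 - p) ** 2)


def gini_of_split(nodes):
    """Gini of a split: each sub-node's Gini weighted by the share of
    samples that falls into that sub-node."""
    n = sum(total for _, total in nodes)
    return sum((total / n) * gini(pos, total) for pos, total in nodes)


# Hypothetical split: left sub-node holds 8 positives out of 10 samples,
# right sub-node holds 2 positives out of 10 samples.
split = [(8, 10), (2, 10)]
print(gini(8, 10))           # 0.32 for the left sub-node
print(gini_of_split(split))  # 0.32; the lower the value, the better the split
```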

II. Chi Square

  • It helps find the statistical significance of the differences between the sub-nodes and the parent node.
  • We measure it as the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
  • Mathematically represented as

Chi-square = √((Actual − Expected)² / Expected)

  • It is a measure of purity. Therefore, the higher the value of chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
  • Steps involved -

Calculate the chi-square for each individual node by computing the deviation for both Success and Failure.

Calculate the chi-square of the split as the sum of the chi-square values for Success and Failure across all nodes of the split.
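
A rough sketch of that calculation, assuming the expected Success/Failure counts in each sub-node follow the parent node’s success rate (the counts here are hypothetical):

```python
import math


def chi_square_of_split(parent_success_rate, nodes):
    """Chi-square of a split: for every sub-node, compare the observed
    Success and Failure counts with the counts expected under the parent
    node's success rate, and sum sqrt((Actual - Expected)^2 / Expected)."""
    total = 0.0
    for successes, size in nodes:
        failures = size - successes
        expected_success = size * parent_success_rate
        expected_failure = size * (1 - parent_success_rate)
        total += math.sqrt((successes - expected_success) ** 2 / expected_success)
        total += math.sqrt((failures - expected_failure) ** 2 / expected_failure)
    return total


# Hypothetical parent node with a 50% success rate, split into two
# sub-nodes of 10 samples each (8 and 2 successes respectively).
print(chi_square_of_split(0.5, [(8, 10), (2, 10)]))  # ~5.37; higher means a more significant split
```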

III. Entropy

  • It is a measure of the randomness in the information being processed.
  • Higher the entropy, the harder it is to draw any conclusions from that information.
  • Mathematically represented as

Entropy = -p*log(p) - q*log(q)

  • Here p and q are the probabilities of success and failure, respectively, in that node.
  • The log is taken to base 2.
  • Steps involved -

Calculate the entropy of the parent node.

Calculate the entropy of each individual node of the split, then take the weighted average across all sub-nodes in the split.
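
A short sketch of those steps; the split whose weighted entropy drops the most below the parent’s entropy (i.e. the one with the largest information gain) is preferred. The counts below are again hypothetical:

```python
import math


def entropy(p):
    """Entropy = -p*log2(p) - q*log2(q), with q = 1 - p."""
    q = 1 - p
    if p == 0 or q == 0:  # a pure node carries no randomness
        return 0.0
    return -p * math.log2(p) - q * math.log2(q)


def entropy_of_split(nodes):
    """Weighted average entropy over the sub-nodes of a split."""
    n = sum(size for _, size in nodes)
    return sum((size / n) * entropy(pos / size) for pos, size in nodes)


# Hypothetical parent: 10 positives out of 20 samples, split into
# two sub-nodes with 8/10 and 2/10 positives.
parent = entropy(10 / 20)                        # 1.0
children = entropy_of_split([(8, 10), (2, 10)])  # ~0.72
print("information gain:", parent - children)    # ~0.28
```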

IV. Reduction in Variance

  • All the methods mentioned above pertain to classification decision trees. In the case of a regression decision tree, where the target variable is continuous, the reduction in variance method is followed.
  • It uses the standard formula of variance to choose the best split. The split with the lower variance is selected as the criterion to split the population.
  • Mathematically represented as Variance = Σ(X − X̄)² / N, where X denotes the actual values, X̄ the mean and N the number of observations.
  • Steps involved -

Calculate variance for each node.

Calculate the variance of each split as the weighted average of the variances of its nodes.
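
A minimal sketch of those steps for a continuous target (the values are made up for illustration):

```python
def variance(values):
    """Variance = sum((X - mean)^2) / N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_of_split(nodes):
    """Variance of a split: each node's variance weighted by its share of samples."""
    n = sum(len(node) for node in nodes)
    return sum((len(node) / n) * variance(node) for node in nodes)


# Hypothetical continuous target split into two candidate sub-nodes.
left, right = [10, 12, 11, 13], [30, 28, 31, 29]
print(variance(left + right))            # 82.25 for the unsplit parent
print(variance_of_split([left, right]))  # 1.25; the large drop marks a good split
```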

Let’s take a concrete example to get a firm grip on the topic.
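
For instance, take a hypothetical node holding 10 samples, 7 of one class and 3 of the other. Plugging these proportions into the measures above gives:

Gini Index = 1 − [(0.7)² + (0.3)²] = 1 − 0.58 = 0.42

Entropy = −0.7·log₂(0.7) − 0.3·log₂(0.3) ≈ 0.88

A perfectly pure node (all 10 samples of one class) scores 0 on both measures, so a split that separates the two classes drives both values towards 0.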

Visualization

As claimed earlier, decision trees are interpretable and can be visualized. The Graphviz library comes to the rescue of data scientists working in Python. HERE is the guide to download Graphviz.

I used the IRIS data set to visualize a decision tree, and the diagram displayed below was obtained as a result. This is achieved using code for which you don’t even have to download Graphviz on your machine.

Find the hands-on code for visualizing a decision tree HERE.
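
For reference, here is a minimal sketch along those lines using scikit-learn’s built-in plot_tree, which renders the fitted tree with matplotlib so no Graphviz installation is needed (the code in the linked notebook may differ in its details):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Fit a small classification tree on the IRIS data set.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(iris.data, iris.target)

# Draw the tree without any Graphviz dependency.
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()
```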

What’s the catch?

Well, our magic bullet certainly sounds like a life saver, but it is too good to be true, isn’t it?! The biggest peril while modeling decision trees is their propensity to overfit. If no limit is set on a decision tree, it will give you 100% accuracy on the training set, because in the worst case it ends up making one terminal node for each observation.

Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in two ways:

A. Setting constraints on tree size (Hyper-parameter Tuning) & B. Tree pruning

Illustration of different Hyper-parameters used in Decision Tree
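
As an illustrative sketch (not the original illustration), these are the scikit-learn hyper-parameters most commonly tuned to constrain tree size, together with cost-complexity pruning; the values shown are arbitrary placeholders, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Constraining tree size keeps the model from growing one leaf per observation.
clf = DecisionTreeClassifier(
    max_depth=5,           # maximum depth the tree is allowed to reach
    min_samples_split=20,  # a node needs at least 20 samples before it can be split
    min_samples_leaf=10,   # every leaf must keep at least 10 samples
    max_features="sqrt",   # number of features considered when searching for a split
    max_leaf_nodes=25,     # hard cap on the total number of leaves
    ccp_alpha=0.01,        # cost-complexity pruning strength (0 means no pruning)
)
```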

Further Reading

Refer to the links below for a deeper understanding of the concept:

  1. Analytics vidhya
  2. Scikit Documentation
  3. Stack-overflow

Congratulations on mastering one of the age-old algorithms in machine learning :)

Other Published Articles:

If you liked my post, then spare a few minutes for my other blog posts too:

  1. What is Exploratory Data Analysis?
  2. Last Minute Revision: Part I — Machine Learning & Statistics
  3. How to create a PDF report from Excel using Python
