
Algorithms From Scratch: Decision Tree

Detailing and Building a Decision Tree model from Scratch

Photo by Johann Siemens on Unsplash

Algorithms From Scratch

Those of you familiar with my earlier writings will recall that I once wrote an overview of the Random Forest algorithm. A solid foundation in Decision trees is a prerequisite to understanding the inner workings of Random Forest: the Random Forest builds multiple Decision trees and, for regression problems, outputs the average of the predictions of each tree; for classification problems, it outputs the class chosen by the relative majority of the trees.

Random Forest Overview
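To make the aggregation described above concrete, here is a minimal sketch with made-up predictions from three imaginary trees (the numbers and labels are purely illustrative):

# a minimal sketch of how a Random Forest aggregates its trees' predictions
import numpy as np

# hypothetical predictions from three trees for the same three observations
regression_preds = np.array([[210.0, 190.5, 330.0],
                             [205.5, 200.0, 310.0],
                             [215.0, 185.0, 325.0]])

# regression: average the trees' predictions per observation
print(regression_preds.mean(axis=0))  # [210.16666667 191.83333333 321.66666667]

classification_preds = np.array([["cat", "dog", "dog"],
                                 ["cat", "cat", "dog"],
                                 ["dog", "cat", "dog"]])

# classification: take the majority vote per observation
def majority_vote(column):
    values, counts = np.unique(column, return_counts=True)
    return values[np.argmax(counts)]

print(np.apply_along_axis(majority_vote, 0, classification_preds))  # ['cat' 'cat' 'dog']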

To build on the above story, I will focus here on the Decision tree learning algorithm, as it is a fundamental algorithm in the Machine Learning space. Many models base their structure on the Decision tree, such as the Random Forest and Gradient Boosted Trees. Additionally, I will implement this algorithm from scratch in Python to further expand our intuition about what is happening inside it.

Defining terms

  • Parametric Models: Make inferences about population parameters by assuming a fixed functional form for the data; those inferences are only valid if the model’s assumptions are met.
  • Non-Parametric Models: Do not assume that the data or population have any particular characteristic structure.

Decision Tree (CART)

Popular for its intelligibility and simplicity, the Decision tree is one of the easiest algorithms to visualize and interpret, which is handy when presenting results to a non-technical audience, as is often required in industry. If we simply picture a tree as a flowchart, running from root to leaves, where the path from the root to a leaf defines decision rules on the features, then we already have a good level of the intuition required to understand Decision tree learning.

Unlike the first two algorithms we covered in the Algorithms from Scratch series (Linear Regression and Logistic Regression), the Decision tree algorithm is a non-parametric algorithm, meaning that it makes no assumptions about the data or the population. This does affect our model: during training we trade lower bias for higher variance, which makes the Decision tree much more prone to overfitting.
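In practice this is usually mitigated with hyperparameters that constrain how far the tree can grow. Below is a rough sketch using scikit-learn (which we rely on later in this article); the values are illustrative, not tuned:

from sklearn.tree import DecisionTreeClassifier

# an unconstrained tree grows until every leaf is pure (low bias, high variance)
unconstrained_tree = DecisionTreeClassifier()

# capping depth and leaf size trades a little bias back for lower variance
constrained_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)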


In the field of Machine Learning there are two main Decision tree models. The one we use depends on the type of target variable we are attempting to predict:

Classification Tree: A tree model employed to predict a target variable that takes discrete values. Here, each leaf node represents a class and the branches represent conjunctions of the features that lead to those class labels.

Regression Tree: A tree model employed to predict a target variable that takes continuous values. Contrary to the classification tree, in the Regression tree each leaf node contains a continuous value (e.g. a house price); the branches represent conjunctions of the features that lead to each of those values.

Note: An umbrella term to refer to both procedures is Classification and Regression Tree (CART), first introduced by Breiman et al. in 1984.

In Figure 1 we can see the structure followed by CART algorithms. Though this structure is shared by both trees, there are some subtle differences between classification and regression trees, such as the output of each tree: the classification tree returns the modal (most frequent) class of the samples in a leaf node, while the regression tree returns their mean.

Figure 1: Decision tree diagram.
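To make that difference concrete, here is a small sketch of what each type of tree would return for a handful of made-up samples that land in the same leaf:

import numpy as np

# hypothetical samples that ended up in one leaf node
leaf_classes = np.array(["versicolor", "versicolor", "virginica", "versicolor"])
leaf_prices = np.array([310_000.0, 295_000.0, 330_000.0, 305_000.0])

# classification tree: predict the modal (most frequent) class in the leaf
values, counts = np.unique(leaf_classes, return_counts=True)
print(values[np.argmax(counts)])  # versicolor

# regression tree: predict the mean of the target values in the leaf
print(leaf_prices.mean())  # 310000.0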

Another significant difference between the two algorithms is the criterion that we try to minimize when partitioning the feature space. Generally, we want to select the feature, j, and split-point, s, that best splits the feature space into 2 regions, but how this is measured in a regression tree and classification tree differs as is shown in Figure 2.

Figure 2: Impurity formulas used in sklearn. [Source: Stacey Ronaghan - The Mathematics of Decision Trees, Random Forest and Feature Importance in Scikit-learn and Spark]
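For reference, here is a minimal sketch of two of the criteria shown in Figure 2 (gini impurity for classification and mean squared error for regression), written from the standard formulas rather than lifted from sklearn's source:

import numpy as np

def gini(class_counts):
    # gini impurity: 1 minus the sum of squared class proportions
    proportions = np.asarray(class_counts) / np.sum(class_counts)
    return 1.0 - np.sum(proportions ** 2)

def mse(target_values):
    # regression criterion: mean squared deviation from the node mean
    target_values = np.asarray(target_values, dtype=float)
    return np.mean((target_values - target_values.mean()) ** 2)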

Chunking the Algorithm

Note: We will be building a Decision tree classifier with gini impurity as the criterion for the split.

  1. Consider all possible splits of feature j and split point s
  2. Having found the best split, partition the data into 2 resulting regions
  3. Repeat 1 and 2 until stopping criterion is reached

The pseudocode above demonstrates a phenomenon known in computer science as recursion: a method of solving a problem where the solution depends on solutions to smaller instances of the same problem (Source: Wikipedia). Combined with binary splitting, this is why steps 1 and 2 are often referred to as recursive binary splitting.

Implementation

For this implementation we will be leveraging the following frameworks:

  • NumPy (linear algebra and data manipulation)
  • Pandas (data manipulation)
  • Sci-kit Learn (machine learning)
  • Graphviz (graph visualization software)

View the full code here…

kurtispykes/ml-from-scratch

import numpy as np 
import pandas as pd 
import graphviz
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split

The dataset we will use is the iris dataset from Scikit learn – See Documentation

# loading the data set
dataset = load_iris(as_frame=True)
df= pd.DataFrame(data= dataset.data)

# adding the target and target names to dataframe
target_zip= dict(zip(set(dataset.target), dataset.target_names))
df["target"] = dataset.target
df["target_names"] = df["target"].map(target_zip)

print(df.shape)
df.head()
(150, 6)
Figure 3: Iris Dataset (first 5 rows)
# Separating into X and y
X = df.iloc[:, :4]
y = df.iloc[:, -1]

# splitting training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, shuffle=True, random_state=24)
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
DecisionTreeClassifier()
dot_data = export_graphviz(dt, out_file=None, 
                           feature_names=X.columns,  
                           class_names=dataset.target_names,  
                           filled=True, rounded=True,  
                           special_characters=True)  
graph = graphviz.Source(dot_data)  
graph
Figure 4: Trained Decision tree visualization – output of the code cell above.
sklearn_y_preds = dt.predict(X_test)
print(f"Sklearn Implementation:nACCURACY: {accuracy_score(y_test, sklearn_y_preds)}")
Sklearn Implementation:
ACCURACY: 0.9473684210526315

My Implementation

We are going to need to split our data into true and false indices in response to the decision rule at a specific node. If the condition of the decision rule is met, we say that branch is true (which we will denote as the left branch); otherwise it is false (denoted as the right branch).

Partitioning the Data

def partition(data, column, value): 
    """
    Partition the data into left (indicating True) and right (indicating false).

    Inputs
    data: The data to partition
    column: The feature column to split on
    value: The threshold value for the split

    Outputs
    left: index of rows that meet the condition
    right: index of rows that fail to meet the condition
    """
    left = data[data[column] <= value].index
    right = data[data[column] > value].index

    return left, right

To check if our function works correctly, we will perform a split on all of the data and pass it the best column and value manually to see if our data is separated with respect to the graph above.

# performing a split on the root node
left_idx, right_idx = partition(X_train, "petal length (cm)", 2.45)

print("[petal length (cm) <= 2.45]")

# print results --> left_idx = 38 setosa | right index = 42 versicolor, 32 virginica 
print(f"left_idx: {dict(zip(np.unique(y_train.loc[left_idx], return_counts=True)[0], np.unique(y_train.loc[left_idx], return_counts=True)[1]))}n
right_idx: {dict(zip(np.unique(y_train.loc[right_idx], return_counts=True)[0], np.unique(y_train.loc[right_idx], return_counts=True)[1]))}")
[petal length (cm) <= 2.45]
left_idx: {'setosa': 38}
right_idx: {'versicolor': 42, 'virginica': 32}

Perfect!

Next, we need a criterion that we want to minimize. In this implementation we will be minimizing the gini impurity using a function called gini_impurity. Without getting too technical, gini impurity simply measures how mixed our data is at a node. To help grasp this concept, think of gold: when gold is impure, it contains a mixture of other substances, whereas when it is pure we can say there are 0 impurities (this isn’t exactly true, as even refined gold is only up to 99.99% pure, so technically there are still some impurities).

The ideal is to have nodes that are pure, meaning that the target labels are fully separated into distinct nodes. – To go into the technical details of gini impurity, see here

Gini Impurity

def gini_impurity(label, label_idx):
    """
    A measure of how often a randomly chosen element from the set would
    be incorrectly labelled if it was randomly labelled according to the
    distribution of labels in the subset (Source: Wikipedia)

    Inputs
    label: The class labels
    label_idx: The index of the rows at the current node

    Outputs
    impurity: The gini impurity of the node
    """
    # the unique labels and counts in the data
    unique_label, unique_label_count = np.unique(label.loc[label_idx], return_counts=True)

    impurity = 1.0
    for i in range(len(unique_label)):
        p_i = unique_label_count[i] / sum(unique_label_count)
        impurity -= p_i ** 2 
    return impurity

If you scroll back up to Figure 4, you will see that the impurity at the root node is 0.663. Therefore, to determine whether our gini_impurity function is working correctly, we should see this number in the output.
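We can also verify this number by hand from the class counts at the root (38 setosa, 42 versicolor and 32 virginica, as we saw when partitioning above):

# worked example: gini impurity at the root from the class counts
counts = [38, 42, 32]
total = sum(counts)
impurity_by_hand = 1.0 - sum((c / total) ** 2 for c in counts)
print(round(impurity_by_hand, 4))  # 0.6626 -- matches the root impurity in Figure 4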

# Gini impurity of the first node
impurity = gini_impurity(y_train, y_train.index)
impurity
0.6626275510204082

Great!

To split on a feature (and value), we need a way of quantifying which split would produce the best outcome. Information gain is a useful way to quantify which feature and feature value to split on at each node. For each node of the tree, the information value "represents the expected amount of information that would be needed to specify whether a new instance should be classified yes or no, given that the example reached that node". (Source: Wikipedia)

Information Gain

def information_gain(label, left_idx, right_idx, impurity): 
    """
    For each node of the tree, the information gain "represents the
    expected amount of information that would be needed to specify whether
    a new instance should be classified yes or no, given that the example
    reached that node. (Source: Wikipedia)

    Inputs
    label: The target labels
    left_idx: The index of the rows that met the condition at the current node
    right_idx: The index of the rows that failed to meet the condition at the current node
    impurity: The gini impurity (uncertainty) at the current node

    Outputs
    info_gain: The information gain at the node
    """

    p = float(len(left_idx)) / (len(left_idx) + len(right_idx))
    info_gain = impurity - p * gini_impurity(label, left_idx) - (1 - p) * gini_impurity(label, right_idx)
    return info_gain

The best first split is the one that provides the most information gain. This process is repeated for each impure node until the tree is complete. (Source: Wikipedia)

Based on the above statement, we can now see why petal length (cm) with the value 2.45 was selected as the first split.
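Before running the function, here is the same calculation done by hand for the root split we tested above (38 setosa go left; 42 versicolor and 32 virginica go right), which should match the output of the test below:

# worked example: information gain for the root split
parent_impurity = 0.6626  # gini at the root, computed above

def gini_from_counts(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

p_left = 38 / (38 + 74)  # proportion of samples sent to the left branch
gain_by_hand = (parent_impurity
                - p_left * gini_from_counts([38])            # left branch is pure
                - (1 - p_left) * gini_from_counts([42, 32]))  # right branch is mixed
print(round(gain_by_hand, 4))  # ~0.3383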

# testing info gain of the first split at root node
info_gain = information_gain(y_train, left_idx, right_idx, impurity)
info_gain
0.33830322669608387
# testing a random feature and value to see the information gain
left_idx, right_idx = partition(X_train, "petal width (cm)", 1.65)

impurity = gini_impurity(y_train, y_train.index)

info_gain = information_gain(y_train, left_idx, right_idx, impurity)
info_gain
0.25446843371494937

The above helper functions now come into play. Up to this point we had to manually select the feature and value to split on. The next function automatically searches the feature space and finds the feature and feature value that best split the data.

Finding the Best Split

def find_best_split(df, label, idx):
    """
    Splits the data on the best column and value 

    Input
    df: the training data
    label: the target label 
    idx: the index of the data

    Output: 
    best_gain: the maximum information gain
    best_col: the column that produced the best information gain
    best_value: the value of the column that produced the best information gain

    """

    best_gain = 0 
    best_col = None
    best_value = None

    df = df.loc[idx] # restricting the training data to the rows at this node
    label_idx = label.loc[idx].index # getting the index of the labels at this node

    impurity = gini_impurity(label, label_idx) # determining the impurity at the current node

    # go through the columns and store the unique values in each column (no point testing the same value twice)
    for col in df.columns: 
        unique_values = set(df[col])
        # loop through each value and partition the data into true (left_idx) and false (right_idx)
        for value in unique_values: 

            left_idx, right_idx = partition(df, col, value)
            # skip the split if either side is empty (no rows satisfied, or all rows satisfied, the decision rule)
            if len(left_idx) == 0 or len(right_idx) == 0: 
                continue 
            # determine the info gain at the node
            info_gain = information_gain(label, left_idx, right_idx, impurity)
            # if the info gain is higher than our current best gain, it becomes the new best gain
            if info_gain > best_gain:
                best_gain, best_col, best_value = info_gain, col, value

    return best_gain, best_col, best_value
find_best_split(X_train, y_train, y_train.index)
(0.33830322669608387, 'petal length (cm)', 1.9)

Great, we have all the components we need for our algorithm to work. (Note that our implementation splits at the observed value 1.9 rather than the midpoint threshold of 2.45 that sklearn reports; both thresholds induce exactly the same partition of the training data, which is why the information gain matches.) However, the above functions only perform one split on our training data (the stump/root).

Additional Helper Functions

# helper function to count values
def count(label, idx):
    """
    Function that counts the unique values

    Input
    label: target labels
    idx: index of rows 

    Output
    dict_label_count: Dictionary of label and counts

    """
    unique_label, unique_label_counts = np.unique(label.loc[idx], return_counts=True)
    dict_label_count = dict(zip(unique_label, unique_label_counts))
    return dict_label_count
# check counts at the first node to confirm they align with scikit-learn
count(y_train, y_train.index)
{'setosa': 38, 'versicolor': 42, 'virginica': 32}

Here are some classes that we will use to store specific data from our Decision tree and print our tree.

# https://github.com/random-forests/tutorials/blob/master/decision_tree.ipynb
class Leaf:
    """
    A Leaf node classifies data.

    This holds a dictionary of class (e.g., "Apple") -> number of times
    it appears in the rows from the training data that reach this leaf.
    """

    def __init__(self, label, idx):
        self.predictions = count(label, idx)
# https://github.com/random-forests/tutorials/blob/master/decision_tree.ipynb
class Decision_Node:
    """
    A Decision Node asks a question.

    This holds a reference to the question, and to the two child nodes.
    """

    def __init__(self,
                 column,
                 value,
                 true_branch,
                 false_branch):
        self.column = column
        self.value = value
        self.true_branch = true_branch
        self.false_branch = false_branch
# https://github.com/random-forests/tutorials/blob/master/decision_tree.ipynb
def print_tree(node, spacing=""):
    """
    World's most elegant tree printing function.

    Input
    node: the tree node
    spacing: used to space creating tree like structure
    """

    # Base case: we've reached a leaf
    if isinstance(node, Leaf):
        print (spacing + "Predict", node.predictions)
        return

    # Print the col and value at this node
    print(spacing + f"[{node.column} <= {node.value}]")

    # Call this function recursively on the true branch
    print (spacing + '--> True:')
    print_tree(node.true_branch, spacing + "  ")

    # Call this function recursively on the false branch
    print (spacing + '--> False:')
    print_tree(node.false_branch, spacing + "  ")

Recursive Binary Splitting

For the algorithm to work we will need the splits to happen recursively until we meet a stopping criterion – in this case it’s until each leaf node is pure.

def build_tree(df, label, idx): 
    """
    Recursively builds the tree until each leaf is pure.

    Input
    df: the training data
    label: the target labels
    idx: the indexes of the rows at the current node

    Output
    A Decision_Node holding the best column, the split value and references to
    the true and false branches, or a Leaf once no further information gain is
    possible.
    """
    best_gain, best_col, best_value = find_best_split(df, label, idx)

    if best_gain == 0: 
        return Leaf(label, label.loc[idx].index)

    left_idx, right_idx = partition(df.loc[idx], best_col, best_value)

    true_branch = build_tree(df, label, left_idx)

    false_branch = build_tree(df, label, right_idx)

    return Decision_Node(best_col, best_value, true_branch, false_branch)
my_tree = build_tree(X_train, y_train, X_train.index)
print_tree(my_tree)
[petal length (cm) <= 1.9]
--> True:
  Predict {'setosa': 38}
--> False:
  [petal width (cm) <= 1.6]
  --> True:
    [petal length (cm) <= 4.9]
    --> True:
      Predict {'versicolor': 40}
    --> False:
      [sepal length (cm) <= 6.0]
      --> True:
        [sepal width (cm) <= 2.2]
        --> True:
          Predict {'virginica': 1}
        --> False:
          Predict {'versicolor': 1}
      --> False:
        Predict {'virginica': 2}
  --> False:
    [petal length (cm) <= 4.8]
    --> True:
      [sepal width (cm) <= 3.0]
      --> True:
        Predict {'virginica': 3}
      --> False:
        Predict {'versicolor': 1}
    --> False:
      Predict {'virginica': 26}
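As an aside (this step isn’t part of the original notebook), scikit-learn also ships a text renderer, export_text, which is handy if you want to eyeball our printed tree against the DecisionTreeClassifier we fitted earlier:

from sklearn.tree import export_text

# text rendering of the sklearn tree fitted earlier, for a side-by-side comparison
print(export_text(dt, feature_names=list(X.columns)))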

Super! Now you’ve seen how to implement a Decision tree from scratch, and we have trained it on our training data. It does not stop there, though; the purpose of building the algorithm in the first place is to automate the classification of new observations. The next section is dedicated to inference…

Inference

def predict(test_data, tree):

    """
    Classify unseen examples

    Inputs
    test_data: Unseen observation
    tree: tree that has been trained on training data

    Output
    The prediction of the observation.
    """

    # Check if we are at a leaf node
    if isinstance(tree, Leaf): 
        # return the most frequent class at the leaf (the leaves here are pure, so this is the single class)
        return max(tree.predictions, key=tree.predictions.get)

    # the current feature_name and value 
    feature_name, feature_value = tree.column, tree.value

    # pass the observation through the nodes recursively
    if test_data[feature_name] <= feature_value: 
        return predict(test_data, tree.true_branch)

    else: 
        return predict(test_data, tree.false_branch)

To check if our function is operating correctly I will use one example observation.

# taking one instance to test function 
example, example_target = X_test.iloc[6], y_test.iloc[6]
example, example_target
(sepal length (cm)       5.3
 sepal width (cm)        3.7
 petal length (cm)       1.5
 petal width (cm)        0.2
 predictions          setosa
 Name: 48, dtype: object,
 'setosa')

When we apply our predict function to the example, we should hope that the observation traverses the tree accordingly and outputs setosa. Let’s check…

# if working correctly should output setosa
predict(example, my_tree)
'setosa'

Supreme!! However, that is just one example. If we want to apply this function to every observation in our test set, we can use df.apply – see Documentation

# create a new col of predictions
X_test["predictions"] = X_test.apply(predict, axis=1, args=(my_tree,))

Okay, here comes the moment of truth. We need to check whether our algorithm returns the same predictions as the scikit-learn model, as a way of verifying that we have implemented our algorithm correctly. We do this by simply evaluating sklearn_y_preds == X_test["predictions"], which returns a boolean array for each observation – in our case they are all true.
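A quick check along these lines makes that explicit:

# element-wise comparison of the two sets of predictions; True only if every prediction matches
(sklearn_y_preds == X_test["predictions"]).all()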

print(f"Sklearn Implementation:nACCURACY: {accuracy_score(y_test, sklearn_y_preds)}nn
My Implementation:nACCURACY: {accuracy_score(y_test, X_test['predictions'])}")
Sklearn Implementation:
ACCURACY: 0.9736842105263158

My Implementation:
ACCURACY: 0.9736842105263158

Pros

  • Simple and easy to interpret
  • Able to handle both numerical and categorical data
  • Requires little data preparation
  • Performs well with large datasets
  • Built-in feature selection

Cons

  • Instability of trees (small changes in the data can produce a completely different tree)
  • Lack of smoothness (Particular issue for regression problems)
  • Prone to overfitting

Wrap Up

Establishing a good foundation in Decision trees will go a long way towards understanding many other important Machine Learning algorithms. It is a very powerful algorithm that is often used within ensemble models to win various Data Science competitions. Although it’s very easy to conceptualize, the Decision tree is quite tricky to construct from scratch, which is why I’d always advocate using established Machine Learning frameworks whenever possible.

Thank you for reading to the end of the article! If you’d like to keep in contact with me, I am most accessible on LinkedIn.

Kurtis Pykes – AI Writer – Towards Data Science | LinkedIn

