Demystifying the Random Forest

Deconstructing and Understanding this Beautiful Algorithm

Siddarth Ramesh
Towards Data Science


In classical Machine Learning, Random Forests have been a silver bullet type of model.

The model is great for a few reasons:

  • Requires less preprocessing of data compared to many other algorithms, which makes it easy to set up
  • Acts as either a classification or regression model
  • Less prone to overfitting
  • Easily can compute feature importance

In this post, I want to better understand the components that make up a Random Forest. To accomplish this, I am going to deconstruct the Random Forest into its most basic components and explain what is going on at each level of computation. By the end, we will have a much deeper understanding of how Random Forests work and a better intuition for working with them. The examples we use will focus on classification, but many of the principles apply to regression as well.

Running a Random Forest

Let’s start by invoking a classic Random Forest pattern. This is the highest level of abstraction, and it is what many people do when training Random Forests in Python.

Simulated data. Image by Author
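The exact simulated dataframe isn’t reproduced here, so as an assumption, here is one way to build a df with a similar shape (a few numeric features plus a binary target column) so that the snippets below run end to end:

import pandas as pd
from sklearn.datasets import make_classification

# simulate a small classification dataset with numeric features and a binary target
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4, random_state=0)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df['target'] = y
df.head()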

If I wanted to run a Random Forest to predict my target column, I would just do the following:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=0)

# Train and score Random Forest
simple_rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
simple_rf_model.fit(X_train, y_train)
print(f"accuracy: {simple_rf_model.score(X_test, y_test)}")

# accuracy: 0.93

Running the Random Forest classifier was super simple. I just defined the n_estimators parameter and set random_state to 0. I can tell you from personal experience that a lot of people would just look at that 0.93, be happy, and deploy that thing in the wild. We won’t be doing that today.

Let’s re-examine this innocuous line:

simple_rf_model = RandomForestClassifier(n_estimators=100, random_state=0)

The random_state parameter is a feature of most Data Science models; it seeds the randomness so that someone else can reproduce your work. We won’t worry too much about that parameter.

Let’s look deeply at n_estimators. If we look at the scikit-learn docs, the definition states:

The number of trees in the forest.

Investigating the Number of Trees

At this point, let’s define a Random Forest a little more specifically. A Random Forest is an ensemble model that is a consensus of many Decision Trees. The definition is probably incomplete, but we will come back to it.

Many trees talk to each other and arrive at a consensus. Image by Author

That might make you think that if you broke it down into something like the following, you might arrive at a Random Forest:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Create decision trees, all trained on the same training data
tree1 = DecisionTreeClassifier().fit(X_train, y_train)
tree2 = DecisionTreeClassifier().fit(X_train, y_train)
tree3 = DecisionTreeClassifier().fit(X_train, y_train)

# predict each decision tree on X_test
predictions_1 = tree1.predict(X_test)
predictions_2 = tree2.predict(X_test)
predictions_3 = tree3.predict(X_test)
print(predictions_1, predictions_2, predictions_3)

# take the majority vote: round the average of the 3 binary predictions
final_prediction = np.round((predictions_1 + predictions_2 + predictions_3) / 3)
print(final_prediction)

In the above example, we trained 3 decision trees on X_train, which means that n_estimators = 3. After training the 3 trees, we had each tree predict on the same test set and then took whichever prediction at least 2 out of 3 trees picked.

That sort of makes sense, but this doesn’t look totally right. If all the decision trees were trained on the same data, wouldn’t they mostly reach the same conclusion, thereby negating the advantage of an ensemble?

Demystifying Sampling With Replacement

Let’s add a word to our definition from earlier: A Random Forest is an ensemble model that is a consensus of many uncorrelated Decision Trees.

The Decision Trees can become uncorrelated in 2 ways:

  1. You have a dataset large enough that you can give each decision tree its own unique slice of the data. This is not as popular and often requires a huge amount of data.
  2. You can leverage a technique called sampling with replacement. Sampling with replacement means that a sample that is drawn from a population is returned to the population before the next sample is drawn.

To explain sampling with replacement, let’s say I had 5 marbles with 3 colors, so my population looks like this:

blue, blue, red, green, red

If I wanted to sample some marbles, I’d normally pluck out a couple and perhaps end up with:

blue, red

This is because once I picked up red, I didn’t return it to my original stack of marbles.

However, if I were sampling with replacement, I could actually pick up the same marble twice. Because red was returned to my stack, I had a chance of picking it up again:

red, red
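If you want to see the difference in code, here is a tiny sketch using Python’s built-in random module:

import random

marbles = ["blue", "blue", "red", "green", "red"]

# sampling WITHOUT replacement: each marble can be drawn at most once
print(random.sample(marbles, 2))

# sampling WITH replacement: the same marble can be drawn more than once
print(random.choices(marbles, k=2))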

In Random Forests, each tree is trained on a bootstrap sample of the training data. In scikit-learn, the default bootstrap sample has just as many rows as the original training data, but because of the replacement it only contains roughly 2/3 of the unique rows; if my original train data was 1,000 rows, each tree would still be fed about that many rows, but only around 630 of them would be unique. That said, the sampling fraction is a useful hyperparameter to tune as you construct your Random Forest.
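In scikit-learn, that sampling fraction is exposed as the max_samples parameter; a minimal sketch:

from sklearn.ensemble import RandomForestClassifier

# draw bootstrap samples containing ~67% as many rows as the training set
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, max_samples=0.67, random_state=0)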

The below code, unlike the last snippet, is much closer to that of a Random Forest where n_estimators = 3.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# rejoin the training features and labels so we can sample whole rows with replacement
train_df = X_train.copy()
train_df['target'] = y_train

# take 3 bootstrap samples (sampling with replacement) from the training data, one per tree
df_sample1 = train_df.sample(frac=.67, replace=True)
df_sample2 = train_df.sample(frac=.67, replace=True)
df_sample3 = train_df.sample(frac=.67, replace=True)

# split each sample back into features and labels
X_train_sample1, y_train_sample1 = df_sample1.drop('target', axis=1), df_sample1['target']
X_train_sample2, y_train_sample2 = df_sample2.drop('target', axis=1), df_sample2['target']
X_train_sample3, y_train_sample3 = df_sample3.drop('target', axis=1), df_sample3['target']

# create the decision trees, one per bootstrap sample
tree1 = DecisionTreeClassifier().fit(X_train_sample1, y_train_sample1)
tree2 = DecisionTreeClassifier().fit(X_train_sample2, y_train_sample2)
tree3 = DecisionTreeClassifier().fit(X_train_sample3, y_train_sample3)

# predict each decision tree on the same held-out X_test
predictions_1 = tree1.predict(X_test)
predictions_2 = tree2.predict(X_test)
predictions_3 = tree3.predict(X_test)

# take the majority vote across the 3 trees
final_prediction = np.round((predictions_1 + predictions_2 + predictions_3) / 3)
preds = pd.DataFrame([predictions_1, predictions_2, predictions_3, final_prediction, y_test.values]).T.head(20)
preds.columns = ["tree1", "tree2", "tree3", "final", "label"]
preds

We sample with replacement, feed those samples to trees, produce results, and achieve consensus. Image by Author.

Bagging Classifiers

The earlier architecture was actually a Bagging Classifier. Image by Author.

We’re going to introduce a new algorithm at this point called Bootstrap Aggregation, also known as Bagging, but rest assured that this will tie back to Random Forests. The reason we are introducing this new concept is that, as the figure above shows, everything we just did up to now is actually what a BaggingClassifier does!

In the code below, the BaggingClassifier has a parameter called bootstrap which executes the sampling-with-replacement step we just did manually. The same parameter exists in the scikit-learn Random Forest implementation. If bootstrap were False, each classifier would be trained on the entire training set.

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Number of trees to use in the ensemble
n_estimators = 3

# Initialize the bagging classifier with a decision tree as the base estimator
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=n_estimators, bootstrap=True)

# Fit the bagging classifier on the training data
bag_clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = bag_clf.predict(X_test)
pd.DataFrame([y_pred, y_test.values]).T

BaggingClassifiers are awesome because you can use them with estimators other than Decision Trees! You can plug in many different algorithms, and Bagging turns each of them into an ensemble solution. The Random Forest algorithm actually extends the Bagging algorithm (if bootstrap = True), because it partially leverages bagging to form uncorrelated decision trees.
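As a quick illustration, here is a bagged ensemble that uses LogisticRegression purely as an example of a non-tree estimator:

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# bagging is not tied to trees: bootstrap samples feed an ensemble of logistic regressions
bag_lr = BaggingClassifier(LogisticRegression(max_iter=1000), n_estimators=10, bootstrap=True)
bag_lr.fit(X_train, y_train)
print(bag_lr.score(X_test, y_test))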

However, even if bootstrap = False, Random Forests take one extra step to really make sure the trees are not correlated: feature sampling.

Demystifying Feature Sampling

Feature sampling means that not only are the rows sampled, but the columns are too. Unlike the rows, the columns are sampled without replacement, which means we won’t have duplicate columns training a single tree.

There are many ways to sample the features. You can specify a fixed maximum number of features to sample, take the square root of the total number of features, or take the log2 of the total number of features. Each of these approaches has tradeoffs and will depend on your data and use case.
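In scikit-learn, this choice is exposed through the max_features parameter (note that scikit-learn samples the candidate features at every split within a tree rather than once per tree, but the decorrelating effect is similar). A few illustrative settings:

from sklearn.ensemble import RandomForestClassifier

# a few common ways to control feature sampling
rf_sqrt = RandomForestClassifier(max_features="sqrt")  # consider sqrt(n_features) features per split
rf_log2 = RandomForestClassifier(max_features="log2")  # consider log2(n_features) features per split
rf_four = RandomForestClassifier(max_features=4)       # consider a fixed number of features
rf_half = RandomForestClassifier(max_features=0.5)     # consider a fraction of the features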

Bagging gets extended with Feature Sampling. Image by Author.

The below snippet samples the columns using the sqrt technique, samples the rows, trains 3 Decision Trees, and uses majority rules to make the prediction. We first perform sampling with replacement, then sample the columns, train our individual trees, let our trees predict on our test data, and then take the majority-rules consensus.

import numpy as np
import pandas as pd
import math
from sklearn.tree import DecisionTreeClassifier

# rejoin the training features and labels so we can bootstrap whole rows again
train_df = X_train.copy()
train_df['target'] = y_train

# take 3 bootstrap samples (with replacement) from the training data, one per tree
df_sample1 = train_df.sample(frac=.67, replace=True)
df_sample2 = train_df.sample(frac=.67, replace=True)
df_sample3 = train_df.sample(frac=.67, replace=True)

# split off the features and labels for each sample
X_train_sample1, y_train_sample1 = df_sample1.drop('target', axis=1), df_sample1['target']
X_train_sample2, y_train_sample2 = df_sample2.drop('target', axis=1), df_sample2['target']
X_train_sample3, y_train_sample3 = df_sample3.drop('target', axis=1), df_sample3['target']

# sample the features for each tree using sqrt; note how replace=False now
num_features = len(X_train.columns)
X_train_sample1 = X_train_sample1.sample(n=int(math.sqrt(num_features)), replace=False, axis=1)
X_train_sample2 = X_train_sample2.sample(n=int(math.sqrt(num_features)), replace=False, axis=1)
X_train_sample3 = X_train_sample3.sample(n=int(math.sqrt(num_features)), replace=False, axis=1)

# create the decision trees, this time on sampled rows AND sampled columns
tree1 = DecisionTreeClassifier().fit(X_train_sample1, y_train_sample1)
tree2 = DecisionTreeClassifier().fit(X_train_sample2, y_train_sample2)
tree3 = DecisionTreeClassifier().fit(X_train_sample3, y_train_sample3)

# predict each decision tree on X_test, restricted to the columns that tree was trained on
predictions_1 = tree1.predict(X_test[X_train_sample1.columns])
predictions_2 = tree2.predict(X_test[X_train_sample2.columns])
predictions_3 = tree3.predict(X_test[X_train_sample3.columns])

# take the majority vote across the 3 trees
final_prediction = np.round((predictions_1 + predictions_2 + predictions_3) / 3)
preds = pd.DataFrame([predictions_1, predictions_2, predictions_3, final_prediction, y_test.values]).T.head(20)
preds.columns = ["tree1", "tree2", "tree3", "final", "label"]

When I ran this code, I found that my decision trees started predicting different things, which indicated that we had removed a lot of the correlations between the trees.

My trees no longer are agreeing with each other all the time! Image by Author.

Decision Tree Basics

Up to this point, we’ve deconstructed how data is fed into a multitude of Decision Trees. In the previous code examples, we used the DecisionTreeClassifier() to train Decision Trees, but we will need to deconstruct Decision Trees in order to completely understand Random Forests.

A Decision Tree, true to its name, looks like an upside-down tree. At a high level, the algorithm tries to ask questions to split data into different nodes. The below image shows an example of what a decision tree looks like.

Example Tree. Image by author

A decision tree asks a series of questions. For every question it asks, there might be multiple answers, and we visualize that as splitting the node. The answer to the previous question determines the next question the tree will ask, and at some point, after asking enough questions, you arrive at the answer.

But how do you know your answer is accurate, or that you have asked the right questions? You can actually evaluate your decision trees in a couple different ways, and we will of course break down those approaches as well.

Entropy and Information Gain

At this point, we need to discuss a new term called entropy. At a high level, entropy is one way to measure the level of impurity or randomness in a node. As an aside, there is another popular way to measure the impurity of a node called Gini impurity, but we won’t deconstruct that method in this post since it overlaps with a lot of the concepts around entropy, albeit with a slightly different computation. The general idea is that the higher the entropy or Gini impurity, the more variance there is in the node, and our goal is to reduce that uncertainty.

Decision trees try to minimize entropy by splitting the nodes they ask about into smaller, more homogeneous nodes. The actual formula for entropy is entropy = -Σ pᵢ * log2(pᵢ), where pᵢ is the fraction of samples in the node that belong to class i.

To break down entropy, let’s go back to that marble example:

Let’s say I have 10 marbles. 5 of them are blue and 5 of them are green. The entropy of my collective dataset would be 1.0, and the code to calculate entropy is below:

from collections import Counter
from math import log2

# my classes are 0 or 1: 0 is a blue marble, 1 is a green marble
data = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
# get the number of labels
len_labels = len(data)

def calculate_entropy(labels, len_labels):
    # count how many samples fall into each class
    counts = Counter(labels)
    # calculate the class fractions; the output should be [.5, .5] for this example
    probs = [count / len_labels for count in counts.values()]
    # the actual entropy calculation
    return - sum(p * log2(p) for p in probs)

calculate_entropy(data, len_labels)

If the data was completely filled with green marbles, the entropy would have been 0, and the entropy would increase the closer we got to that 50% split.
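We can sanity-check both extremes with the calculate_entropy function above:

print(calculate_entropy([1] * 10, 10))           # -0.0, i.e. zero: a pure node has no entropy
print(calculate_entropy([0] * 5 + [1] * 5, 10))  # 1.0: a 50/50 node is maximally impure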

Every time we reduce entropy, we gain some information about the dataset, since we have reduced the randomness. Information gain tells us which feature’s split reduced our entropy the most relative to the alternatives. The way you calculate information gain is:

entropy(parent) - [weighted_average_of_entropy(children)]

In this case, the parent is the original node and the children are the results of splitting the node.

Splitting a node. Image by Author

To calculate the information gain, we perform the following operations:

  • Calculate the entropy of the parent node
  • Split the parent node into child nodes
  • Create the weight for each child node. This is measured by number_of_samples_in_child_node / number_of_samples_in_parent_node
  • Calculate the entropy of each child node
  • Create [weighted_average_of_entropy(children)] by calculating weight_1*entropy_of_child_1 + weight_2*entropy_of_child_2
  • Subtract this weighted entropy from the entropy of the parent node

The below code implements a simple information gain for a parent node being split into 2 child nodes

def information_gain(left_labels, right_labels, parent_entropy):
    """Calculate the information gain of a split into a left and a right child node"""
    # calculate the weight of the left node
    proportion_left_node = float(len(left_labels)) / (len(left_labels) + len(right_labels))
    # calculate the weight of the right node
    proportion_right_node = 1 - proportion_left_node
    # compute the weighted average entropy of the child nodes, reusing calculate_entropy from above
    weighted_average_of_child_nodes = (
        proportion_left_node * calculate_entropy(left_labels, len(left_labels))
        + proportion_right_node * calculate_entropy(right_labels, len(right_labels))
    )
    # return the parent node entropy minus the weighted entropy of the child nodes
    return parent_entropy - weighted_average_of_child_nodes
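As a quick check, splitting the 10-marble node from earlier perfectly by colour should recover all of the parent’s entropy:

# the 10-marble node from earlier, split perfectly: all blue marbles go left, all green marbles go right
parent = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
left, right = [0, 0, 0, 0, 0], [1, 1, 1, 1, 1]

parent_entropy = calculate_entropy(parent, len(parent))
print(information_gain(left, right, parent_entropy))  # 1.0: the split removed all of the uncertainty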

Deconstructing the Decision Tree

With these concepts in mind, we are ready to implement a small decision tree!

Without any guidance, a decision tree will keep splitting nodes until all the final leaf nodes are pure. The idea of controlling the complexity of the tree is called pruning, and we can either prune the tree after it is completely built or pre-prune it before the growing phase with certain parameters. Some ways to pre-prune tree complexity are to control the number of splits, limit the max depth (the longest distance from the root node to a leaf node), or require a minimum information gain before splitting.
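For reference, scikit-learn’s DecisionTreeClassifier exposes these pre-pruning controls directly; the specific values here are purely illustrative:

from sklearn.tree import DecisionTreeClassifier

# the same pre-pruning ideas, as exposed by scikit-learn's decision tree
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # limit the longest root-to-leaf path
    min_samples_split=10,        # do not split nodes with fewer than 10 samples
    min_impurity_decrease=0.01,  # require a minimum impurity decrease before splitting
)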

The below code ties all these concepts back together

  • Start with a dataset where you have a target variable to predict
  • Compute the entropy (or Gini impurity) of the original dataset (the root node)
  • Look at each feature and compute its information gain
  • Choose the feature with the best information gain, which is the same as the feature that decreases the entropy the most
  • Keep growing until our stopping conditions are met; in this case, that is our max depth limit or a node having an entropy of 0

import pandas as pd
import numpy as np
from math import log2

def entropy(data, target_col):
    # calculate the entropy of the target column for this subset of the data
    values, counts = np.unique(data[target_col], return_counts=True)
    return np.sum([-count / len(data) * log2(count / len(data)) for count in counts])

def compute_information_gain(data, feature, target_col):
    # calculate the information gain for splitting on a given feature
    parent_entropy = entropy(data, target_col)
    split_values = np.unique(data[feature])
    # initialize at 0
    weighted_child_entropy = 0
    # compute the weighted entropy; remember that the weight is the fraction of points in each new node
    for value in split_values:
        sub_data = data[data[feature] == value]
        node_weight = len(sub_data) / len(data)
        weighted_child_entropy += node_weight * entropy(sub_data, target_col)
    # same calculation as before: subtract the weighted child entropy from the parent node entropy
    return parent_entropy - weighted_child_entropy

def grow_tree(data, features, target_col, depth=0, max_depth=3):
    # we set a max depth of 3 to "pre-prune" or limit the tree complexity
    if depth >= max_depth or len(np.unique(data[target_col])) == 1:
        # stop growing if the maximum depth is reached or all labels are the same
        # (all labels being the same means the entropy is 0); return the majority label
        values, counts = np.unique(data[target_col], return_counts=True)
        return values[np.argmax(counts)]
    # we compute the best feature (or best question to ask) based on information gain
    gains = [compute_information_gain(data, feature, target_col) for feature in features]
    best_feature = features[np.argmax(gains)]

    # record which feature this node splits on so the tree can be traversed later
    node = {"split_feature": best_feature}
    for value in np.unique(data[best_feature]):
        sub_data = data[data[best_feature] == value]
        node[value] = grow_tree(sub_data, features, target_col, depth + 1, max_depth)

    return node


# simulate some data and make a dataframe; note how we have a target column
data = {
    'A': [1, 2, 1, 2, 1, 2, 1, 2],
    'B': [3, 3, 4, 4, 3, 3, 4, 4],
    'C': [5, 5, 5, 5, 6, 6, 6, 6],
    'target': [0, 0, 0, 1, 1, 1, 1, 0]
}
df = pd.DataFrame(data)

# define our features and label
features = ["A", "B", "C"]
target_col = "target"

# grow the tree
tree = grow_tree(df, features, target_col, max_depth=3)
print(tree)

To predict with this tree, you traverse the grown tree with your new data until you arrive at a leaf node. The value at that final leaf node is the prediction.
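As a minimal sketch of that traversal, assuming the node dictionary layout produced by the grow_tree function above (where internal nodes store their split feature under the 'split_feature' key):

def predict_row(tree, row):
    # leaves are plain labels; internal nodes are dicts that remember their split feature
    while isinstance(tree, dict):
        value = row[tree["split_feature"]]
        if value not in tree:
            # unseen feature value; a fuller implementation might fall back to a majority class
            return None
        tree = tree[value]
    return tree

# predict every row of the toy dataframe we grew the tree on
print([predict_row(tree, row) for _, row in df.iterrows()])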

Some Interesting Things About Random Forests

Everything we discussed in the last section was how an individual decision tree makes its decisions. The below image connects those concepts with the earlier concepts we discussed around sampling for Random Forests.

The Random Forest Architecture with a Deconstructed Decision Tree. Image by Author

Because a Decision Tree literally checks the information gain of each feature, you can calculate feature importances in a Random Forest. A feature’s importance is usually computed as the average decrease in impurity that the feature produces across all the trees. Random Forests are not as interpretable as models like Logistic Regression, so feature importances give us at least a little insight into how our trees grew.
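For example, with the simple_rf_model trained at the start, scikit-learn exposes these impurity-based importances directly:

import pandas as pd

# impurity-based feature importances of the forest we trained earlier
importances = pd.Series(simple_rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))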

And finally, there are a couple of ways you can test your trained Random Forest. You can always take the classic Machine Learning approach and use a held-out test set to measure how well your model generalizes to unseen data, but that requires a separate evaluation step. Random Forests have a unique property called out-of-bag error, or OOB error. Remember how we only sample part of our dataset to build each tree? You can actually use the samples each tree never saw (its out-of-bag samples) to validate it at training time, and this is really only possible because of the ensemble nature of the algorithm. This means that in one shot, we can understand how well our model generalizes to unseen data.
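In scikit-learn this is a one-flag change; a minimal sketch (oob_rf_model is just an illustrative name):

from sklearn.ensemble import RandomForestClassifier

# oob_score=True scores each training row using only the trees that never saw it
oob_rf_model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
oob_rf_model.fit(X_train, y_train)
print(f"OOB score: {oob_rf_model.oob_score_}")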

Conclusion and Final Thoughts

To summarize what we learned:

  • Random Forests are actually an ensemble of uncorrelated Decision Trees making predictions and reaching a consensus. This consensus is an average of predictions for regression problems and majority rules for classification problems
  • Random Forests mitigate correlation by leveraging bagging and feature sampling. By leveraging both techniques, each individual decision tree sees a different slice of our data and makes predictions based on different factors.
  • Decision Trees are grown by splitting data on the feature that produces the highest information gain. Information gain is measured as the decrease in impurity caused by a split, where impurity is usually measured via entropy or Gini impurity.
  • Random Forests are able to achieve a limited level of interpretability via feature importance, which is a measure of the average decrease in impurity attributable to a feature.
  • Random Forests also have the ability to perform a form of validation at training time, a unique technique known as OOB error. This is possible because of the way the algorithm samples data upstream.

Here is that original training code one more time:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'], test_size=0.2, random_state=0)

# Train and score Random Forest
simple_rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
simple_rf_model.fit(X_train, y_train)
print(f"accuracy: {simple_rf_model.score(X_test, y_test)}")

# accuracy: 0.93

When looking at the original code to train a Random Forest, it’s amazing to me just how many different computations and evaluations happen in these few lines of code. There are a multitude of considerations taken to prevent overfitting, evaluate both on a tree and forest level, and achieve some basic level of interpretability — plus it is very easy to set up thanks to all the frameworks out there.

I hope next time you train a Random Forest model, you are able to check out the scikit-learn documentation page on Random Forests and better understand all the options you have. While there are some intuitive defaults set, it should be clear just how many different tweaks you can make and how many of these techniques can extend to other models as well.

I had a lot of fun writing this post, and personally learned a ton about how this beautiful algorithm works. I hope that you take something away too!
