A Visual Guide to Random Forests

An intuitive visual guide and video explaining a powerful ensembling method

Cheshta Dhingra
Towards Data Science


One of the most deceptively simple questions in machine learning is “are more models better than fewer models?” The area that studies this question is called model ensembling. Model ensembling asks how to combine multiple models into an aggregate that improves test accuracy while keeping the costs of storing, training, and running inference on all of those models under control.

We will explore a popular ensembling method applied to decision trees: Random Forests.

To illustrate this, let’s take an example. Imagine we’re trying to predict what caused a wildfire given its size, location, and date.

The basic building blocks of the random forest model are decision trees, so if you want to learn how they work, I recommend checking out my previous post. As a quick refresher, decision trees perform the task of classification or regression by recursively asking simple True or False questions that split the data into the purest possible subgroups.
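If you’d like to see what that recursive splitting looks like in code, here’s a minimal sketch of a single decision tree trained with scikit-learn on a synthetic stand-in for the wildfire data (the feature names and the labeling rule below are invented purely for illustration):

```python
# A minimal sketch of a single decision tree on a synthetic stand-in for the
# wildfire data; the feature names and the labeling rule are invented here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.random((200, 3))                    # columns: size, location, day of year (scaled to 0-1)
y = (X[:, 0] + X[:, 2] > 1).astype(int)     # 0 = natural cause, 1 = human cause (made-up rule)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["size", "location", "day_of_year"]))
```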

Now back to random forests. In this method of ensembling, we train a bunch of decision trees (hence the name “forest”) and then take a vote among the different trees. One tree, one vote.

In the case of classification, each tree spits out a class prediction and then the class with the most votes becomes the output of the random forest.

In the case of regression, a simple average of the individual trees’ predictions becomes the output of the random forest.
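To make the voting and averaging concrete, here’s a minimal sketch of both aggregation rules, assuming we already have every tree’s prediction for a single example (the class labels and numbers are made up):

```python
# A minimal sketch of how a forest combines its trees' outputs for one example.
import numpy as np
from collections import Counter

# Classification: majority vote over the trees' class predictions.
class_votes = ["lightning", "campfire", "lightning", "arson", "lightning"]
forest_class = Counter(class_votes).most_common(1)[0][0]
print(forest_class)       # "lightning" wins with 3 of 5 votes

# Regression: simple average of the trees' numeric predictions.
tree_preds = np.array([12.1, 9.8, 11.5, 10.7])
print(tree_preds.mean())  # 11.025
```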

The key idea behind random forests is that there is wisdom in crowds. Insight drawn from a large group of models is likely to be more accurate than a prediction from any one model alone.

Simple, right? Sure, but why does this work? What if all our models learn the exact same thing and vote for the same answer? Isn’t that equivalent to just having one model make the prediction?

Yes, but there’s a way to fix that.

But first, we need to define a term that helps explain why: uncorrelatedness. We need our decision trees to be different from each other; we want them to disagree about where to split and what to predict. As long as the trees don’t all make the same mistakes, their individual errors tend to cancel out in the vote, so a large group of uncorrelated trees working together will outperform any of its constituent trees. In other words, the forest is shielded from the errors of individual trees.

There are a few different methods to ensure our trees are uncorrelated:

The first method is called “bootstrapping”. Bootstrapping means creating new datasets by sampling from our training set. With a normal decision tree, we feed the entire training set to the tree and let it learn from all of it. With bootstrapping, each tree instead trains on a random sample of the training data drawn with replacement, which results in different trees. Because we sample with replacement, some observations may appear multiple times in a given sample while others are left out. Often the bootstrap sample is the same size as the original dataset, but it is also possible to sample smaller subsets for the sake of computational efficiency. Using bootstrapping to create uncorrelated models and then aggregating their results is called bootstrap aggregating, or bagging for short.
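Here’s a minimal sketch of bagging done by hand, assuming a generic synthetic (X, y) training set; a library implementation hides these details, but the loop shows exactly where the randomness comes from:

```python
# A minimal sketch of bagging: each tree fits its own bootstrap sample.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.random((500, 3))                    # synthetic features
y = (X[:, 0] > 0.5).astype(int)             # synthetic binary labels

n_trees = 10
trees = []
for _ in range(n_trees):
    # Draw row indices with replacement; some rows repeat, others are left out.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate the bagged trees with a majority vote.
votes = np.stack([t.predict(X) for t in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```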

The second way to introduce variation in our trees is by restricting which features each tree can split on. This method is called feature randomness. Remember, with a basic decision tree, when it’s time to split the data at a node, the tree considers every feature and picks the one that leads to the purest subgroups. With random forests, we instead limit the set of features each tree is even allowed to consider. Some libraries apply this randomization at the split level rather than the tree level; the distinction disappears if the trees are decision stumps, meaning there is only one split (max depth = 1). Either way, the goal is the same: restrict the pool of candidate features in order to decorrelate the individual trees.
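Here’s a minimal sketch of feature randomness applied at the tree level on synthetic data. Note that scikit-learn applies it at the split level through the max_features parameter of its tree and forest estimators, so the hand-rolled version below is purely illustrative:

```python
# A minimal sketch of tree-level feature randomness: each tree only ever
# sees a random subset of the columns.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.random((500, 6))                    # synthetic data with 6 features
y = (X[:, 0] + X[:, 3] > 1).astype(int)

n_trees, n_sub = 10, 2                      # each tree sees only 2 of the 6 features
forest = []
for _ in range(n_trees):
    feats = rng.choice(X.shape[1], size=n_sub, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, feats], y)
    forest.append((feats, tree))            # remember which columns this tree was trained on

# At prediction time each tree only looks at its own feature subset.
votes = np.stack([tree.predict(X[:, feats]) for feats, tree in forest])
pred = (votes.mean(axis=0) >= 0.5).astype(int)
```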

Because each individual tree only trains on a subset of the training data and of the feature set, training is cheap, so we can afford to grow hundreds or even thousands of trees. Random Forests are widely used in both academia and industry. Now that you understand the concept, you’re almost ready to implement a random forest model in your own projects! Stay tuned for the Random Forests coding tutorial and for a new post on another ensembling method: Gradient Boosted Trees!
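Until that tutorial is out, here’s a minimal sketch of what the finished product can look like with scikit-learn’s built-in RandomForestClassifier, which bundles bagging, feature randomness, and voting behind a single estimator (the data is the same kind of synthetic stand-in used above):

```python
# A minimal sketch of an end-to-end random forest with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 3))                   # synthetic stand-in features
y = (X[:, 0] + X[:, 2] > 1).astype(int)     # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,        # number of trees that get a vote
    max_features="sqrt",     # features considered at each split (feature randomness)
    bootstrap=True,          # each tree trains on a bootstrap sample (bagging)
    n_jobs=-1,               # train trees in parallel
    random_state=0,
).fit(X_train, y_train)

print(forest.score(X_test, y_test))         # test accuracy of the ensemble
```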

Check out the video below to see everything you learned in action!

https://youtu.be/cIbj0WuK41w
