Enchanted Random Forest

Jose Marcial Portilla
Towards Data Science
8 min read · Jul 9, 2015


A quick guide to Decision Trees and Random Forests.

by Jose Marcial Portilla

If you enjoy this article and wish to learn more about how to implement machine learning with Python, check out my online course!

This post will take you through a basic explanation of Decision Trees and Random Forests, starting with simple analogies and slowly adding math along the way.

Analogy to Reality

Let’s start off with a quick story so we can get a feel for the framework of decision trees and ensemble methods. Throughout the story, the analogous machine learning terms are presented in parentheses.

Imagine you are at the library and are trying to decide on a book to read. Luckily, your friend Frank is with you to help you decide. Frank begins by asking you for a list of books you’ve read and whether or not you enjoyed them (a training set with labels). Then you pick a book from the shelf and Frank begins asking you questions to determine whether or not you’ll enjoy it. Frank asks “yes” or “no” questions, similar to the game 20 questions, such as “Is the book by Ray Bradbury?” or “Was it written before 1960?”. Frank asks the most informative questions first (maximizing information gain) and then finally gives you his yes or no recommendation on the book (classifying the test set).

Here, we can treat Frank as a decision tree.

However, Frank is only one person, and his questions sometimes home in too much on particular books (overfitting). To get a better idea of whether or not you’ll like the book from the shelf, you ask several friends to go through the same process. They each give a vote on the book, and you go with the majority opinion (an ensemble classifier).

If you have a similar circle of friends, they may all end up asking the exact same questions, so to avoid them all giving the exact same answer, you’ll want to give each of them a different sample from your list of books. You cut up your list and place the pieces in a bag, then randomly draw from the bag, tell your friend whether or not you enjoyed that book, and place the slip back in the bag. This means you are randomly drawing a subsample from your original list with replacement (bootstrapping your original data). Some books get more emphasis (a particular book may be drawn several times for one friend) and some get less (a book may never be drawn from the bag at all). Each friend then gives their own recommendation based on a slightly different view of your book preferences.

Now your friends form a bootstrapped aggregated forest.
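The “draw from the bag with replacement” step maps directly onto a couple of lines of NumPy. Here is a minimal sketch; the book titles beyond the two Bradbury books mentioned later are just placeholders:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
books = ["Fahrenheit 451", "The Martian Chronicles", "Dune", "1984", "Emma"]

# One bootstrap sample per "friend": draw with replacement, so some titles
# repeat for a given friend and others never come out of the bag.
for friend in range(3):
    sample = rng.choice(books, size=len(books), replace=True)
    print(f"Friend {friend + 1}: {sample.tolist()}")
```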

We still have one issue to resolve though. Imagine you enjoyed both Fahrenheit 451 and The Martian Chronicles by Ray Bradbury. Several things link these two books: the same author, a similar genre, and they were both written in the 1950s. One of your friends may conclude that you enjoyed both books because you enjoy books written in the 1950s, when in reality Bradbury might be the only author from the 50s you like. So how do we fix this? We introduce some randomness into the questions your friends get to ask: you force each friend to choose at random from a limited set of questions (at each node of the decision tree, you randomly select the subset of attributes considered for the split). It is important to note that previously you were introducing randomness into your data by bootstrapping; now you are introducing it into the model itself. This large group of friends, each asking a random subset of questions on a bootstrapped sample, is our analogy for the random forest.

Decision Trees

Now that we have a conceptual understanding, let’s start diving deeper into how we construct Decision Trees.

Let’s start by looking at a simple example of a decision tree based on a few splits of data from passengers on the Titanic.
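The original figure isn’t reproduced here, but a minimal sketch of how such a tree could be fit and printed with scikit-learn looks like the following. The data frame below is hypothetical, Titanic-style data made up purely for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical Titanic-style data: one categorical feature (sex)
# and one continuous feature (age).
df = pd.DataFrame({
    "sex":      ["female", "male", "female", "male", "male", "female"],
    "age":      [29, 40, 2, 8, 35, 58],
    "survived": [1, 0, 1, 1, 0, 1],
})
X = pd.get_dummies(df[["sex", "age"]])   # one-hot encode the categorical column
y = df["survived"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```

On this toy data the printed tree splits on sex first and then on an age threshold, mirroring the example described below.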

Here we can see a Decision Tree splitting passengers on both categorical and continuous features to predict whether they survived the sinking. The first node splits the passengers by gender, and this simple tree decides that female passengers survived. Continuous features are handled by choosing a threshold to split on, such as age or number of siblings. So now an important question emerges.

How do we decide which features to split on?

Information Gain

In order to pick which feature to split on, we need a way of measuring how good the split is. This is where information gain and entropy come in.

We would like to choose questions that give a lot of information about the tree’s prediction. For example, if there is a single yes/no question that accurately predicts the outputs 99% of the time, then that question allows us to “gain” a lot of information about our data. In order to measure how much information we gain, we introduce entropy.

Entropy is a measure of the uncertainty associated with our data. Intuitively, if a data set has only one label (e.g. every passenger survived), it has low entropy. So we would like to split our data in a way that minimizes the entropy of the resulting subsets. The better the splits, the better our prediction will be. The equation for entropy is:

H(S) = -\sum_{x} p(x) \log_2 p(x)

The entropy equation, introduced by Claude Shannon in 1948.

Here, p(x) is the proportion of the set that belongs to a given class and H is the entropy.
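As a quick sanity check, here is a minimal NumPy sketch of that formula; the `entropy` helper defined here is not part of any library, just a small function for illustration:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A perfectly mixed group has maximal entropy (1 bit for two classes),
# while a nearly pure group has entropy close to zero.
print(entropy(["survived", "died", "survived", "died"]))  # 1.0
print(entropy(["survived"] * 99 + ["died"]))              # ~0.08
```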

If you have a collection of data points, the entropy will be large when they are evenly distributed across the classes and small when they are mostly the same class. Here’s a graph to demonstrate what entropy looks like:

Entropy versus Probability of belonging to a class.

Intuitively this makes sense: imagine that a split on your tree left the data evenly divided, with a 50% probability of belonging to either class. This results in the highest possible entropy because the data is evenly distributed across the classes. Conversely, as said before, entropy will be small when the data mostly belongs to the same class. In our plot, this is shown by the low entropy when the data has either close to a 0% or 100% chance of belonging to a class.
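If you want to reproduce a curve like the one described above, a short matplotlib sketch will do it, assuming the binary case so the curve is H(p) = -p log2(p) - (1-p) log2(1-p):

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                 # probability of one class
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)     # binary entropy in bits

plt.plot(p, H)
plt.xlabel("Probability of belonging to a class")
plt.ylabel("Entropy")
plt.show()
```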

So we would like our decision tree to make splits that minimize entropy. We use information gain to determine the best split. Information gain is calculated by the following equation:

IG(S, D) = H(S) - \sum_{V \in D} \frac{|V|}{|S|} H(V)

Information gain.

Here, S is the original set and D is the splitting of the set (a partition); each V in D is a subset of S, the V’s are disjoint, and together they make up S. So the information gain is just the entropy of the data before the split, H(S), minus the sum of the entropies of the subsets, each weighted by the fraction |V|/|S| of the data that falls into that subset.
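Continuing the sketch from above, information gain for a candidate split could be computed like this (the `information_gain` helper is again just an illustrative function, not a library call):

```python
import numpy as np

def entropy(labels):
    # Same helper as in the earlier sketch.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, splits):
    """Entropy of the parent minus the weighted entropy of each child subset.

    `splits` is a list of label collections that together partition the parent.
    """
    n = len(parent_labels)
    weighted = sum(len(subset) / n * entropy(subset) for subset in splits)
    return entropy(parent_labels) - weighted

# A perfect split recovers the full 1 bit of entropy of the parent:
print(information_gain(["yes", "yes", "no", "no"],
                       [["yes", "yes"], ["no", "no"]]))  # 1.0
```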

So now that we know what Decision Trees are and how we construct them, what are their pros and cons?

For the pros, Decision Trees are easily interpretable and can handle missing values and outliers. They can also handle discrete and continuous data types, along with irrelevant features.

For the cons, Decision Trees can be very easy to overfit, and while they are computationally cheap for prediction, training the decision tree can be computationally expensive.

Now that we have gotten a grasp on Decision Trees, let’s explore Random Forests.

Random Forests

Random Forests are an example of an ensemble method, in which we combine multiple machine learning models to obtain better predictive performance. We run multiple models on the data and aggregate their predictions, which tends to perform better than any single model alone.

Random Forest is one of the most common ensemble methods and consists of a collection of Decision Trees. Random Forest was developed by Leo Breiman, and I highly suggest you check out his webpage on them.

The idea behind a Random Forest is actually pretty simple: we repeatedly sample from the data set (with replacement) and build a Decision Tree on each new sample. It is important to note that since we are sampling with replacement, many data points will be repeated and others won’t be included at all. This will matter when we talk about measuring the error of a Random Forest. Another important feature of the Random Forest is that each node of each Decision Tree only considers splits on a random subset of the features.

In the case of classification with Random Forests, we get a prediction from each tree in the forest, and the label with the most votes becomes the predicted class for that data point. The usual parameters when building a forest (the standard defaults in the scikit-learn library) are 10 trees, with each split considering only the square root of n features, where n is the total number of features.
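As a concrete sketch, here is how that looks with scikit-learn’s RandomForestClassifier; the synthetic dataset, sizes, and seeds are arbitrary stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the shapes and seed are arbitrary.
X, y = make_classification(n_samples=500, n_features=16, random_state=42)

# n_estimators=10 and max_features="sqrt" mirror the defaults described above
# (more recent scikit-learn releases default to 100 trees instead).
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt",
                                random_state=42)
forest.fit(X, y)
print(forest.predict(X[:5]))  # majority vote across the trees
```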

Determining Error

If we want to know how well our Random Forest performs, we could use the standard approach of splitting the data into a training and a testing set, then comparing the predictions to the actual values. However, as mentioned earlier, each tree never sees all of the training data, so the data a tree did not see can be used to validate that tree individually.

When drawing each bootstrap sample, about one third of the data is left out (the mathematics behind this is discussed in further detail here). So essentially every data point can be tested with about one third of the trees. The proportion of these predictions that are incorrect gives the out-of-bag error.

For simple cases, this out-of-bag error is sufficient for judging the performance of the Random Forest model.
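In scikit-learn, the out-of-bag estimate is exposed through the oob_score flag. A minimal sketch on the same kind of synthetic data as before (all names and numbers here are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=42)

# oob_score=True asks scikit-learn to score every sample using only the trees
# whose bootstrap samples did not contain it; using plenty of trees keeps the
# estimate stable.
forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=42).fit(X, y)
print("Out-of-bag error:", 1 - forest.oob_score_)
```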

Conclusions

Hopefully you’ve gained an intuitive understanding of the basics behind Decision Trees and Random Forests. There is still a lot more to cover (such as pruning, boosting, and much more), and hopefully I’ll get to it in a future post. For more information, check out the extensive Wikipedia pages on the topics covered here. Thanks for reading!
