
Random forests – An ensemble of decision trees

This is how decision trees are combined to make a random forest

Photo by Filip Zrnzević on Unsplash

The Random Forest is one of the most powerful machine learning algorithms available today. It is a supervised machine learning algorithm that can be used for both classification (predicting a discrete-valued output, i.e. a class) and regression (predicting a continuous-valued output) tasks. In this article, I describe how it can be used for a classification task with the popular Iris dataset.

The motivation for random forests

First, we discuss some of the drawbacks of the Decision Tree algorithm. This will motivate you to use Random Forests.

  • Small changes to training data can result in a significantly different tree structure.
  • A single tree is prone to overfitting (the model fits the training data very well but fails to generalize to new input data) unless you tune the max_depth hyperparameter.

So, instead of training a single decision tree, it is better to train a group of decision trees which together make a random forest.

How random forests work behind the scenes

The main two concepts behind random forests are:

  • The wisdom of the crowd – a large group of people is collectively smarter than individual experts
  • Diversification – a set of uncorrelated trees

A random forest consists of a group (an ensemble) of individual decision trees; hence the technique is called Ensemble Learning. A large group of uncorrelated decision trees can produce more accurate and stable results than any of the individual trees.

When you train a random forest for a classification task, you actually train a group of decision trees. Then you obtain the predictions of all the individual trees and predict the class that gets the most votes. Although some individual trees produce wrong predictions, many can produce accurate predictions. As a group, they can move towards accurate predictions. This is called the wisdom of the crowd. The following diagram shows what actually happens behind the scenes.

Image by author
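
As a toy illustration of this voting step (the per-tree predictions below are made up), the ensemble simply picks the class that receives the most votes:

```python
import numpy as np

# Hypothetical class predictions from 7 individual trees for one observation
tree_predictions = np.array([2, 1, 2, 2, 0, 2, 1])

# Majority vote: the class that gets the most votes wins
votes = np.bincount(tree_predictions)
ensemble_prediction = np.argmax(votes)

print(ensemble_prediction)  # 2 -> virginica in the Iris encoding used later
```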

To maintain a low correlation (high diversification) between individual trees, the algorithm automatically considers the following things.

  • Feature randomness
  • Bagging (bootstrap aggregating)

Feature randomness

In a normal decision tree, the algorithm searches for the very best feature among all the features when it splits a node. In contrast, each tree in a random forest searches for the best feature among a random subset of the features. This adds extra randomness when growing the trees inside a random forest. Because of feature randomness, the decision trees in a random forest are uncorrelated.

Image by author
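
In Scikit-learn, this behaviour is controlled by the max_features hyperparameter. A minimal sketch (the value "sqrt", meaning sqrt(n_features) candidate features per split, is an assumption consistent with the library's defaults for classification):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A single decision tree evaluates all features at every split
tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Each tree in the forest evaluates only a random subset of features per split;
# "sqrt" means sqrt(n_features) candidate features are considered at each split
forest = RandomForestClassifier(max_features="sqrt", random_state=42).fit(X, y)
```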

Bagging (bootstrap aggregating)

In a random forest, each decision tree is trained on a different random sample of the training set. When sampling is done with replacement, the method is called bagging (bootstrap aggregating). In statistics, resampling with replacement is called bootstrapping. The bootstrap method reduces the correlation between decision trees. In a decision tree, small changes to the training data can result in a significantly different tree structure; the bootstrap method takes advantage of this to produce uncorrelated trees. We can demonstrate the bootstrap method with the following simple example. The same idea applies inside a random forest.

Imagine that we have a training set of 10 observations which are numbered from 1–10. Out of these observations, we perform sampling using the bootstrap method. We want to consider:

  • Sample size – In Machine Learning, it is common to use a sample size that is the same as the training set. In this example, the sample size is 10.
  • The number of samples – This is equal to the number of decision trees in the random forest.

To create the first sample, we randomly choose an observation from the training set. Let’s say it is the 5th observation. This observation is returned to the training dataset and we repeat the process until we make the entire sample. After the entire process, imagine that we make the first sample with the following observations.

Sample_1 = [5, 4, 6, 6, 5, 1, 3, 2, 10, 9]

Then we train a decision tree with this sample. Because of the replacement, some observations may appear more than once in the sample. Also, note that some observations do not appear in the sample at all. Those observations are called out-of-bag (oob) observations. The oob observations for the first sample are:

oob_1 = [7, 8]

The decision tree corresponding to sample 1 never sees those oob observations during the training process. So, this set of oob observations can be used as a validation set for that decision tree. We can evaluate the entire ensemble by averaging out the oob evaluations of each decision tree. This is called the out-of-bag evaluation which is an alternative to cross-validation.

Let’s create another sample.

Sample_2 = [5, 4, 4, 5, 5, 1, 3, 2, 10, 9]

oob_2 = [6, 7, 8]

Likewise, we create a number of samples that is equal to the number of decision trees in the Random Forest.

Image by author
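
A tiny NumPy sketch of this resampling and out-of-bag bookkeeping (the seed and the number of trees here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
training_set = np.arange(1, 11)          # observations numbered 1-10
n_trees = 3                              # one bootstrapped sample per tree

for i in range(n_trees):
    # Sample with replacement; the sample size equals the training set size
    sample = rng.choice(training_set, size=len(training_set), replace=True)
    # Out-of-bag (oob) observations never appear in this sample
    oob = np.setdiff1d(training_set, sample)
    print(f"Sample_{i + 1} = {sample.tolist()}, oob_{i + 1} = {oob.tolist()}")
```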

Feature importance in a random forest

Another great advantage of a random forest is that it allows you to get an idea about the relative importance of each feature based on a score computed during the training phase. For this, the Scikit-learn RandomForestClassifier provides an attribute called feature_importances_. It returns an array of values which sum to 1. The higher the score, the more important the feature. The score is calculated based on the Gini impurity, which measures the quality of a split (the lower the Gini, the better the split). Features whose splits produce a greater mean decrease in Gini impurity are considered more important.

By looking at the feature importance scores, you can decide which features to drop because they do not contribute enough to the model. This is important for the following reasons.

  • Removing the least important features can increase the accuracy of the model, because unnecessary features add noise.
  • By removing unnecessary features, you reduce the risk of overfitting.
  • Fewer features also mean shorter training times.

Enough theory! Let’s get our hands dirty by writing some Python code to train a random forest for our Iris dataset.

About the Iris dataset

The Iris dataset (download here) has 150 observations and 4 numeric attributes. The target column (species) consists of the classes for each observation. There are 3 classes (0 – setosa, 1 – versicolor, 2 – virginica).

First 5 rows of the Iris dataset (Image by author)

The dataset has no missing values and all the features are numerical. This means that the dataset is ready to use without any pre-processing!

Iris dataset info (Image by author)
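
A minimal sketch that loads the dataset and reproduces these two views (loading from Scikit-learn rather than a downloaded CSV file is an assumption):

```python
from sklearn.datasets import load_iris

# Load the Iris data as a pandas DataFrame and name the target column "species"
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})

print(df.head())   # first 5 rows
df.info()          # 150 entries, 4 numeric features, no missing values
```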

Train a random forest classifier for the Iris dataset

After running the following code, you will get the model accuracy score of 0.97.
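
A minimal sketch of that training code (the 80/20 split, the random_state values, and the use of max_depth=3 are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)

# 80/20 train/test split (the split ratio and random_state are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=100,   # 100 trees -> 100 bootstrapped samples
    max_depth=3,        # value obtained via grid search (see the last section)
    oob_score=True,     # needed for the out-of-bag evaluation below
    random_state=42,
)
rf.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```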

There are 100 trees in our random forest. This is because we have set n_estimators=100. So, the number of bootstrapped samples is also 100.

Out-of-bag (oob) evaluation

In random forests, each decision tree is trained using a bootstrapped subset of observations. Therefore, every tree has a separate subset of out-of-bag (oob) observations. We can use oob observations as a validation set to evaluate the performance of our random forest.
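
Because the classifier in the sketch above was created with oob_score=True, the oob accuracy is available right after fitting:

```python
# Aggregated accuracy of each tree on the observations it never saw during training
print("OOB score:", rf.oob_score_)
```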

This value is close to the model accuracy score which is 0.97.

Visualizing feature importance

By looking at the feature importance scores (plotted below), we can decide to drop the sepal width (cm) feature because it does not contribute enough to the model.
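
A minimal sketch of one way to inspect and plot the scores, continuing from the fitted model above (the seaborn bar plot is an assumption about the original visualization):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Pair each feature name with its Gini-based importance score
importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values()

sns.barplot(x=importances.values, y=importances.index)
plt.xlabel("Feature importance (mean decrease in Gini impurity)")
plt.tight_layout()
plt.show()
```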

Final thoughts

Tree-based models such as DecisionTreeClassifier and RandomForestClassifier are among the most widely used machine learning algorithms for classification tasks. If you want to interpret why the model predicts a particular class, it is better to use a single decision tree instead of a random forest, because a single decision tree is easily interpretable. But keep in mind that, as discussed earlier, the decision tree algorithm has some drawbacks.

When you use the Random Forest algorithm, follow these steps in the specified order.

  1. First, pre-process the data by handling the missing values and converting categorical values into numeric ones.
  2. Then, split the dataset into train and test sets – Never use the same data for both training and testing. Doing so would allow the model to memorize the data rather than learn any pattern.
  3. Set the model hyperparameters of the RandomForestClassifier as described below. Always consider the balance between the performance of the algorithm and the training speed. For example, including more trees in the forest improves performance but slows down training.
  4. Then train your model and visualize feature importances.
  5. Remove less-important features (if any) and retrain the model using the selected features.
  6. Test your model using the test set and get the accuracy score.

Selecting the model hyperparameters

  • n_estimators: The number of trees in the forest. The default is 100. You may use a number that is equal to the number of observations in your training set.
  • max_depth: The maximum depth of the tree. The default is None. You may first train a DecisionTreeClassifier and tune max_depth for it. After you obtain the best value through cross-validation and grid search (I have done this and obtained the value of 3; see the sketch after this list), you can use that value for max_depth in the RandomForestClassifier.
  • bootstrap: The default is True. Use this default to perform bootstrap sampling to get uncorrelated trees.
  • oob_score: The default is False. Set this to True if you want to perform Out-of-bag (oob) evaluation which is an alternative to cross-validation.
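
A minimal sketch of that max_depth search (the parameter grid and cv=5 are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search a small range of depths with 5-fold cross-validation
param_grid = {"max_depth": [1, 2, 3, 4, 5, None]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # e.g. {'max_depth': 3}
```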

To access the Scikit-learn official documentation for RandomForestClassifier, simply execute help(RandomForestClassifier) after you import the class as from sklearn.ensemble import RandomForestClassifier.

Thanks for reading!

This tutorial was designed and created by Rukshan Pramoditha, the Author of Data Science 365 Blog.

Technologies used in this tutorial

  • Python (High-level programming language)
  • pandas (Python data analysis and manipulation library)
  • matplotlib (Python data visualization library)
  • seaborn (Python advanced data visualization library)
  • Scikit-learn (Python machine learning library)
  • Jupyter Notebook (Integrated Development Environment)

Machine learning used in this tutorial

  • Random forest classifier

2020–10–29

