
Random Forest Explained

Understanding & Implementation of Decision Tree & Random Forest

Image from Unsplash

Decision Trees and Random Forests are robust algorithms commonly used in industry because of their interpretability and performance. One of their strongest attributes is that they allow users to see which features contribute most to a prediction, with a feature's importance reflected by its depth in the tree.

This article provides a conceptual understanding of the decision tree and random forest algorithms. Although these algorithms can handle both classification and regression problems, this article focuses on classification examples. You can apply a similar thought process to regression problems, but there the tree-based algorithms will be outperformed by algorithms (like linear regression) which specifically focus on those tasks. I will also show how to implement both the random forest and decision tree algorithms in Python using the sklearn library on the iris dataset for flower classification. Below I’ve outlined the structure this article will follow.

Table of Contents

  • Decision Tree
  • Implementation
  • Advantages
  • Disadvantages
  • Random Forest
  • Implementation
  • Advantages
  • Disadvantages
  • Summary
  • Resources

Decision Tree

Decision Trees are attractive models for businesses whose focus is on the interpretability of results. As the name suggests, a decision tree is a model that breaks down the given input data by asking a series of questions. Imagine a directed acyclic graph where each node represents a decision and each edge connects a pair of decisions.

Example

This is outlined in the image below. In the following example, the problem is to determine whether or not an individual should purchase a new laptop. This is a classification example where buy and don't buy are the labels.

Decision tree classifier example on determining whether a user should purchase a new laptop or not (Image provided by author)

Based on the features we train the tree on, the model learns a series of questions to infer the class labels of the given data. The same concept applies to non-categorical data like integers. The algorithm begins at the root node (the topmost node, in our case Is the Current Laptop Broken) and splits the data on the feature which results in the largest information gain. There are various approaches to splitting the tree; another common one is minimizing the Gini impurity. We can repeat the splitting process iteratively at each child node until it is impossible to split further. As you can imagine, this process is prone to overfitting, since it can result in a very deep tree with many nodes. The typical remedy is to prune the tree; in most scenarios the easiest way is to set a limit on its maximum depth. A small sketch of both splitting criteria follows.
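As a minimal illustration of the two criteria (the gini, entropy, and information_gain helpers are illustrative names, not sklearn functions), Gini impurity and information gain can be computed from the class labels of a node and a candidate split:

import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -sum(p_k * log2(p_k)) over class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) \
                           - (len(right) / n) * entropy(right)

# a perfectly pure split of a 50/50 node yields the maximum gain of 1 bit
parent = np.array([0] * 5 + [1] * 5)
print(gini(parent))                                      # 0.5
print(information_gain(parent, parent[:5], parent[5:]))  # 1.0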

Implementation

For more details on the decision tree classifier and the functions used from the sklearn library, refer to the documentation [3].
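A minimal sketch of such a script, assuming the standard sklearn iris loader and a DecisionTreeClassifier capped with max_depth as a simple form of pruning:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix, classification_report

# load the iris dataset and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# fit a decision tree; max_depth caps how deep the tree can grow
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# output 1: a visual representation of the fitted tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True)
plt.show()

# output 2: confusion matrix and per-class precision, recall, and f1 score
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))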

This script will yield two outputs. The first is a visual representation of the decision tree formed to classify the iris dataset, and the second is the confusion matrix, which outlines the accuracy of the classifier. The following article provides a strong understanding of how accuracy, precision, recall, and the f1 score are used [4].

Advantages

  • Easy to understand
  • Easy to implement
  • Strong performance for classification-based problems
  • Non-parametric

Disadvantages

  • High variance: a small change in the data can result in a large change in the structure of the tree and the decisions being made
  • Prone to overfitting
  • Although it performs well for classification, performance deteriorates drastically on regression-based problems
  • Training time increases drastically with the number of classes and the size of the dataset

Random Forest

Random forests have gained massive popularity in machine learning over the past decade because of their strong classification performance, ease of use, and scalability. A random forest is an ensemble of decision trees. Ensemble learning is a method which uses multiple learning algorithms to boost predictive performance [1]. This approach allows the model to generalize better and makes it less prone to overfitting.

Algorithm

  1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).
  2. Grow a decision tree from the bootstrap sample. At each node:
     a. Randomly select d features without replacement.
     b. Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain.
  3. Repeat steps 1 and 2 k times.

The forest then aggregates the predictions of the k trees and assigns the class label by majority vote, as sketched below.
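As a rough illustration of these steps (not sklearn's actual implementation), this sketch grows k trees on bootstrap samples, restricts each split to a random subset of features via max_features, and predicts by majority vote; fit_forest and predict_forest are illustrative names:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, k=100, d="sqrt", random_state=0):
    """Grow k trees, each on a bootstrap sample of size n,
    considering d random features at every split."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(k):
        # step 1: bootstrap sample of size n, drawn with replacement
        idx = rng.integers(0, n, size=n)
        # step 2: grow a tree; max_features=d restricts each split
        # to a random subset of d features
        tree = DecisionTreeClassifier(max_features=d)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Aggregate the trees' predictions by majority vote
    (assumes non-negative integer class labels, as in iris)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (k, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)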

Random forest classification visualization – ensemble of many decision trees to generate a prediction using majority vote (Image provided by author)

Implementation

For more details on the random forest classifier and the functions used from the sklearn library, refer to the documentation [2].
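A minimal sketch of such a script, assuming sklearn's RandomForestClassifier with largely default settings on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# load the iris dataset and hold out a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# fit a random forest of 100 trees
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# confusion matrix plus per-class precision, recall, and f1 score
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=iris.target_names))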

This script will output the confusion matrix associated with the model, along with per-class metrics like precision, recall, and f1 score that outline its accuracy.

Advantages

  • Strong performance for classification tasks
  • Unlikely to overfit the data (does not require pruning or much hyper-parameter tuning)

Disadvantages

  • Slow to train – when dealing with large datasets, the computational cost of training the model is very high
  • Harder to interpret than a single decision tree

Summary

In summary, this article outlined that the decision tree algorithm can be viewed as a model which breaks down the given input data by asking a series of questions, whereas the random forest algorithm uses ensemble learning to combine a group of decision trees and generate a prediction by majority vote. One important takeaway is that both models perform well on classification problems.

Resources

[1] https://en.wikipedia.org/wiki/Ensemble_learning

[2] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

[3] https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

[4] https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/


Other articles I’ve written which you might be interested in:

Recommendation Systems Explained

Link Prediction Recommendation Engines with Node2Vec

Text Summarization in Python with Jaro-Winkler and PageRank

Monte Carlo Method Explained

Markov Chain Explained

Random Walks with Restart Explained

K Nearest Neighbours Explained

