
Mastering Random Forests: A comprehensive guide


Random Forests are one of the most powerful algorithms that every data scientist or machine learning engineer should have in their toolkit. In this article, we will take a code-first approach towards understanding everything that sklearn’s Random Forest has to offer!

Photo by Kunal Shinde on Unsplash

Decision Trees

To understand Random Forests, it is essential to understand what they are made of. Decision trees are the foundational building blocks of all tree-based algorithms; every other tree-based algorithm is a sophisticated ensemble of decision trees. Understanding how decision trees behave is therefore a good place to start.

A decision tree trained on "medical-appointment dataset" [Image By Author]

Since decision trees can grow to any depth (if the depth hasn’t been explicitly specified), they tend to fit every single training point. This often results in 100% training accuracy, but the model does not generalize to data it has not seen during training, so there will be a large gap between training and validation accuracy.

Perfect accuracy on the training dataset [Image By Author]
Poor accuracy on the test dataset [Image By Author]
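
This gap is easy to reproduce. Below is a minimal sketch on a synthetic dataset (standing in for the medical-appointment data shown above): an unconstrained tree typically scores perfectly on the training split and noticeably worse on the held-out split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# No max_depth specified, so the tree keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)

print("Train accuracy:", tree.score(X_train, y_train))  # typically 1.0
print("Test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```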

The underlying idea of a decision tree is solid, but on its own a single tree tends to perform poorly: left unconstrained it overfits, and heavily pruned it underfits. Every tree-based algorithm attempts to solve this problem by adding additional layers of complexity on top of plain decision trees.

Random Forests

Just like how a forest is a collection of trees, a Random Forest is just an ensemble of decision trees. Let’s briefly talk about how random forests work before we go into their relevance in machine learning.

Let’s say we are building a random forest classifier with 15 trees. To make a prediction, the random forest runs the data point through all 15 trees.

The prediction of each tree can be considered a ‘vote’, and the class with the maximum number of votes is the prediction of the random forest. Sounds pretty simple, right? Yet this is one of the most powerful and widely used machine learning algorithms out there.
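
Here is a minimal sketch of that voting idea on a synthetic dataset. (Strictly speaking, sklearn averages the per-tree class probabilities rather than counting hard votes, but for fully grown trees the result usually matches the simple majority vote described above.)

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=500, random_state=0)

forest = RandomForestClassifier(n_estimators=15, random_state=0)
forest.fit(X, y)

sample = X[:1]  # one data point to classify

# Each fitted tree is available in forest.estimators_ and casts one "vote".
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("Votes from the 15 trees:   ", votes)
print("Random forest's prediction:", forest.predict(sample)[0])
```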

How to improve the performance of random forests:

The random forest model provided by the sklearn library has around 19 parameters. The most important ones to tweak during hyperparameter tuning are:

  • n_estimators: The number of decision trees in the random forest.
  • max_depth: The maximum depth each decision tree is allowed to grow to, i.e. how many successive splits can be made along any path. If the depth is too low, the model underfits the data, and if it is too high, the model overfits. Generally, we go with a max depth of 3, 5, or 7.
  • max_features: The number of columns considered when looking for the best split. Because this subset is drawn at random, the specific features used can vary from tree to tree (and from split to split).
  • bootstrap: When bootstrap is enabled, each decision tree is trained on a random sample of rows drawn with replacement rather than on the full dataset. This makes the model less prone to overfitting the data.

    Note: With bootstrap enabled, the rows sampled for a tree form its in-bag set and the remaining rows form its out-of-bag (OOB) set. The OOB rows can be used to estimate generalization performance, which removes the need for a separate validation dataset (see the sketch after this list).

  • max_samples: If bootstrap is set to True, max_samples controls the number (or fraction) of rows drawn to train each decision tree.
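
As a quick illustration of the out-of-bag idea mentioned in the note above, here is a minimal sketch on a synthetic dataset; the specific parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=1000, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    max_features="sqrt",
    bootstrap=True,   # each tree sees a bootstrap sample of the rows
    max_samples=0.8,  # ...drawn from at most 80% of the rows
    oob_score=True,   # score each tree on the rows it did not see
    random_state=0,
)
forest.fit(X, y)

print("Out-of-bag accuracy:", forest.oob_score_)
```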

A suggested parameter grid that you can plug into GridSearchCV for hyperparameter tuning (a sketch; the exact values are a judgment call and should be adapted to your dataset):
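
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# One reasonable starting grid; adjust the values to your data and compute budget.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "max_features": ["sqrt", "log2", 0.5],
    "bootstrap": [True],
    "max_samples": [0.6, 0.8, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)

# search.fit(X_train, y_train)   # X_train, y_train: your own training data
# print(search.best_params_)
```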

Alright, now that we know how to build a random forest, let’s get into how we can use these models to their full potential!

Uses of Random Forests:

Unlike many other machine learning algorithms, Random Forests can be used for a lot more than just their predictive ability.

  • Ease Of Building:

Random Forests do not have as many model assumptions as regression-based algorithms or support vector machines. This allows us to quickly build a random forest to establish a baseline score to build on.

Furthermore, random forests often give strong accuracies even without hyperparameter tuning, and the tuning process is much less tedious than it is for more complex models such as XGBoost.
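
As a rough sketch of that workflow (using a synthetic dataset as a stand-in for your own data), an untuned forest already gives a usable baseline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# An untuned forest as a quick baseline score to improve on later.
baseline = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(baseline, X, y, cv=5)

print("Baseline CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```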

  • Feature Importance

The Random Forest classes provided by the sklearn library let you get the feature importance of each column in the dataset. This is particularly useful while working with business clients. Let’s look further into this with an example:

Let’s assume that our client is an imaginary real estate company (XYZ Inc.). The company approaches us with a relatively small dataset of all the sales that have occurred in their city over the last 5 years, with a request to help focus their marketing campaigns in the right areas and to predict property prices with reasonable certainty. The features in the dataset are: age of the buyer, location of the property, land area of the property, age of the house/structure, and cost of the property (the dependent variable).

A great way to approach this would be through random forests. We would start by building a random forest and then tuning the hyperparameters mentioned above.

Following this, we can calculate feature importance through the random forest and easily find out which features contribute the most to the model’s predictions. Let’s say the random forest tells us that the most important feature is the age of the buyer. With this information, the real estate company can now target its marketing campaigns at a very specific niche of people based on their age.
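
A minimal sketch of what this looks like in code, using synthetic stand-in data for the XYZ Inc. dataset (the column names and the generated relationship are assumptions for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in for the XYZ Inc. sales data described above.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "buyer_age": rng.integers(20, 80, n),
    "location": rng.integers(0, 5, n),       # e.g. five districts, label-encoded
    "land_area": rng.uniform(50, 500, n),
    "structure_age": rng.integers(0, 60, n),
})
# An assumed relationship, purely so the example runs end to end.
price = 2000 * df["land_area"] - 500 * df["structure_age"] + rng.normal(0, 1e4, n)

forest = RandomForestRegressor(n_estimators=300, random_state=0)
forest.fit(df, price)

# feature_importances_ gives one score per column, summing to 1.
importances = pd.Series(forest.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False))
```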

This whole process takes a matter of hours with random forests, whereas it would be far more tedious with other feature selection approaches such as Boruta, or with comparing the individual coefficients of a regression model.

  • Feature Selection

This can be considered an extension of the previous point. By calculating feature importance, we can drop the less important features and thereby reduce the dimensionality of the model, which often improves accuracy and reduces training time.
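
One convenient way to turn those importances into an actual column filter is sklearn’s SelectFromModel; here is a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 30 columns, only 5 of which carry signal.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                           random_state=0)

# Keep only the columns whose importance is above the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold="mean",
)
X_reduced = selector.fit_transform(X, y)

print("Columns kept:", X_reduced.shape[1], "of", X.shape[1])
```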

Another way of performing feature selection is to shuffle individual features in the dataset, one at a time, so that the information provided by that column is destroyed. The model is then evaluated on the modified dataset to see how the score is impacted; the more important the feature, the more profound its impact on the score.
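
This shuffling approach is known as permutation importance, and sklearn provides it directly; a minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Shuffle each column several times and measure how much the test score drops.
result = permutation_importance(forest, X_test, y_test, n_repeats=10,
                                random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```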

Conclusion

In this article, we have studied Random Forests in depth: their parameters, hyperparameter tuning, and, with the help of an example, the reasons why random forests are still very relevant in business use cases.
