
In this post we will explain what a Random Forest model is, what its strengths are, how it is built, and what it can be used for.
We will go through the theory and intuition of Random Forest, seeing the minimum amount of maths necessary to understand how everything works, without diving into the most complex details.
Lastly, before we start, here are some additional resources to skyrocket your Machine Learning career:
Awesome Machine Learning Resources:
- For learning resources go to How to Learn Machine Learning!
- For professional resources (jobs, events, skill tests) go to AIgents.co - A career community for Data Scientists & Machine Learning Engineers.
1. Introduction
In the Machine Learning world, Random Forest models are a kind of non-parametric model that can be used for both regression and classification. They are one of the most popular ensemble methods, belonging to the specific category of Bagging methods.
Ensemble methods combine many learners to achieve better performance than any single one of them could individually. They can be described as techniques that use a group of weak learners (models that on average achieve only slightly better results than random guessing) together, in order to create a stronger, aggregated one.
In our case, Random Forests are an ensemble of many individual Decision Trees. If you are not familiar with Decision Trees, you can learn all about them here:
One of the main drawbacks of Decision Trees is that they are very prone to overfitting: they do well on training data, but do not generalise as well to unseen samples. While there are workarounds for this, like pruning the trees, pruning reduces their predictive power. Generally they are low-bias, high-variance models, but they are simple and easy to interpret.
If you are not very confident with the difference between bias and variance, check out the following post:
Random Forest models combine the simplicity of Decision Trees with the flexibility and power of an ensemble model. In a forest of trees, we forget about the high variance of a specific tree, and are less concerned about each individual element, so we can grow nicer, larger trees that have more predictive power than pruned ones.
Although Random Forest models don't offer as much interpretability as a single tree, their performance is a lot better, and we don't have to worry so much about perfectly tuning the parameters of the forest as we do with individual trees.
Okay, I get it: a Random Forest is a collection of individual trees. But why the name Random? Where is the randomness? Let's find out by learning how a Random Forest model is built.
2. Training and Building a Random Forest
Building a Random Forest has three main phases. We will break down each of them and clarify each of the concepts and steps. Let's go!
2.1 Creating a Bootstrapped Data Set for each tree
When we build an individual decision tree, we use a training data set and all of the observations. This means that if we are not careful, the tree can adjust very well to this training data, and generalise badly to new, unseen observations. To solve this issue, we stop the tree from growing very large, usually at the cost of reducing its performance.
To build a Random Forest we have to train N decision trees. Do we train the trees using the same data all the time? Do we use the whole data set? Nope.
This is where the first random feature comes in. To train each individual tree, we pick a random sample of the entire data set, as shown in the following figure.

From looking at this figure, various things can be deduced. First of all, the size of the data used to train each individual tree does not have to be the size of the whole data set. Also, a data point can be present more than once in the data used to train a single tree (as in tree number two).
This is called Sampling with Replacement or Bootstrapping: each data point is picked randomly from the whole data set, and a data point can be picked more than once.
By using different samples of data to train each individual tree we reduce one of the main problems that they have: they are very fond of their training data. If we train a forest with a lot of trees and each of them has been trained with different data, we solve this problem. They are all very fond of their training data, but the forest is not fond of any specific data point. This allows us to grow larger individual trees, as we do not care so much anymore for an individual tree overfitting.
If we use a very small portion of the whole data set to train each individual tree, we increase the randomness of the forest (reducing over-fitting) but usually at the cost of a lower performance.
In practice, by default most Random Forest implementations (like the one from Scikit-Learn) pick the sample of the training data used for each tree to be the same size as the original data set (however it is not the same data set, remember that we are picking random samples).
This generally provides a good bias-variance compromise.
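To make the bootstrapping step concrete, here is a minimal sketch in Python of how such a sample could be drawn (the function name bootstrap_sample is just illustrative, not part of any library):

```python
import numpy as np

def bootstrap_sample(X, y, rng=None):
    """Draw a sample with replacement, the same size as the original data set."""
    rng = np.random.default_rng(rng)
    n_samples = X.shape[0]
    # Sampling with replacement: some rows will appear several times,
    # others will not appear at all.
    indices = rng.integers(0, n_samples, size=n_samples)
    return X[indices], y[indices]
```

Each tree in the forest would get its own call to this function, so every tree sees a slightly different version of the training data.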
2.2 Train a forest of trees using these random data sets, and add a little more randomness with the feature selection
If you recall, when building an individual decision tree, at each node we evaluated a certain metric (like the Gini index, or Information Gain) and picked the feature of the data that minimised or maximised this metric to split on at that node.
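As a quick refresher on that metric, here is a minimal sketch of the Gini impurity of a node (gini_impurity is just an illustrative helper, not a library function):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

print(gini_impurity(np.array([1, 1, 1, 1])))  # 0.0 -> perfectly pure node
print(gini_impurity(np.array([0, 1, 0, 1])))  # 0.5 -> maximally impure 50/50 node
```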
This worked decently well when training only one tree, but now we want a whole forest of them! How do we do it? Ensemble models like Random Forest work best if the individual models (individual trees in our case) are uncorrelated. In Random Forest, this is achieved by randomly selecting certain features to evaluate at each node.

As you can see from the previous image, at each node we evaluate only a subset of all the initial features. For the root node we take into account E, A and F (and F wins). In Node 1 we consider C, G and D (and G wins). Lastly, in Node 2 we consider only A, B, and G (and A wins). We would carry on doing this until we built the whole tree.
By doing this, we avoid including features that have a very high predictive power in every tree, while creating many uncorrelated trees. This is the second source of randomness: we do not only use random data, but also random features when building each tree. The greater the tree diversity, the better: we reduce the variance and get a better-performing model.
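To make this concrete, here is a minimal sketch of how the candidate features for a single node could be drawn (features_to_consider is an illustrative name; using roughly the square root of the number of features is a common heuristic, and Scikit-Learn's default for classification):

```python
import numpy as np

def features_to_consider(n_features, rng=None):
    """Pick a random subset of feature indices to evaluate at one node."""
    rng = np.random.default_rng(rng)
    n_subset = max(1, int(np.sqrt(n_features)))
    # Without replacement: a feature is evaluated at most once per node.
    return rng.choice(n_features, size=n_subset, replace=False)

print(features_to_consider(9))  # e.g. [2 7 4]: only 3 of the 9 features compete at this node
```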
2.3 Repeat this for the N trees to create our awesome forest.
Awesome, we have learned how to build a single decision tree of the forest. Now, we would repeat this for the N trees, randomly selecting, at each node of each tree, which variables enter the contest for being picked as the feature to split on.
In conclusion, the whole process goes as follows:
- Create a bootstrapped data set for each tree.
- Create a decision tree using its corresponding data set, but at each node use a random subsample of variables or features to split on.
- Repeat these steps hundreds of times to build a massive forest with a wide variety of trees. This variety is what makes a Random Forest way better than a single decision tree.
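Putting the whole training process together, here is a minimal sketch using Scikit-Learn's RandomForestClassifier (the data set and parameter values are just illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(
    n_estimators=400,     # N: the number of trees in the forest
    max_features="sqrt",  # random subset of features evaluated at each node
    bootstrap=True,       # each tree is trained on a bootstrapped sample
    random_state=42,
)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on unseen data
```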
Once we have built our forest, we are ready to use it to make awesome predictions. Let's see how!
3. Making predictions using a Random Forest
Making predictions with a Random Forest is very easy. We just have to take each of our individual trees, pass the observation for which we want to make a prediction through them, get a prediction from every tree (adding up to N predictions), and then obtain an overall, aggregated prediction.
Bootstrapping the data and then using an aggregate to make a prediction is called Bagging, and how this prediction is made depends on the kind of problem we are facing.
For regression problems, the aggregate decision is the average of the decisions of every single decision tree. For classification problems, the final prediction is the most frequent prediction done by the forest.

The previous image illustrates this very simple procedure. For the classification problem we want to predict if a certain patient is sick or healthy. For this we pass their medical record and other information through each tree of the random forest, and obtain N predictions (400 in our case). In our example 355 of the trees say that the patient is healthy and 45 say that the patient is sick, so the forest decides that the patient is healthy.
For the regression problem we want to predict the price of a certain house. We pass the characteristics of this new house through our N trees, getting a numerical prediction from each of them. Then, we calculate the average of these predictions and get the final value of $322,750.
Simple, right? We make a prediction with every individual tree and then aggregate these predictions using the mean (average) or the mode (most frequent value).
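Here is a minimal sketch of that aggregation step, with made-up per-tree predictions for a single observation:

```python
import numpy as np

# Predictions of the individual trees for one new observation (made-up values).
tree_votes = np.array([0, 0, 1, 0, 0, 0])  # classification: 0 = healthy, 1 = sick
tree_prices = np.array([310_000.0, 330_000.0, 325_500.0, 326_500.0])  # regression

# Classification: the most frequent prediction (the mode) wins.
print(np.bincount(tree_votes).argmax())  # 0 -> the forest says "healthy"

# Regression: the final prediction is the average of the individual ones.
print(tree_prices.mean())  # aggregated price estimate
```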
4. Conclusion and other resources
In this post we have seen what a Random Forest is, how it overcomes the main issues of Decision Trees, how it is trained, and how it is used to make predictions. Random Forests are very flexible and powerful Machine Learning models that are widely used in commercial and industrial applications, along with Boosting models and Artificial Neural Networks.
In future posts we will explore tips and tricks of Random Forests and how they can be used for feature selection. Also, if you want to see precisely how they are built, check out the following video by StatQuest, it's great:
That is it! As always, I hope you enjoyed the post.
For further resources on Machine Learning and Data Science check out the following repository: How to Learn Machine Learning! For career resources (jobs, events, skill tests) go to AIgents.co – A career community for Data Scientists & Machine Learning Engineers.
Thank you very much for reading, and have a great day!