Random Forest and Decision Trees

You have probably already heard about algorithms used to predict diseases, recognize images and faces, or even anticipate our behavior. But do you know any of these algorithms and how they work?

Éverton Bin
Towards Data Science
6 min read · Oct 18, 2020


One of the most popular (and powerful) algorithms widely used to solve problems in Data Science is the one called Random Forest. Needless to say, there's a lot of math and computer programming behind the scenes, but let's keep it simple so you can follow along even if you have never heard of it before.

Random Forest derives from the idea of another algorithm known as the Decision Tree, and it basically uses a structure that resembles a tree and its branches to make decisions that lead to a final prediction.

But how does this actually work? The first thing you need to know is that there is no magic. To train an algorithm, there must be historical data available: it's from this data that the algorithm learns. Suppose we want to create a model capable of identifying objects given some specific characteristics such as shape, color, height, and so on.

To simplify, let’s assume that these objects are specifically three different fruits: bananas, oranges, and lemons. How would a Decision Tree algorithm face this task?

To understand the procedure, let's imagine that we were given the task of teaching a two-year-old kid to tell the same three types of fruit apart. A good first approach would be to tell the kid to pay attention to the shape of the fruit: while oranges and lemons are rounded, bananas have an elongated shape that clearly differentiates them from the other two.

However, oranges and lemons look more alike at first sight, so we would have to consider other characteristics to complete our mission of teaching the kid to precisely identify all three fruits.


A good parameter to differentiate lemons from oranges could be their color. Assuming lemons are usually green and oranges usually… orange, that would be a good second parameter that would almost always lead our clever kid to identify the fruits correctly. And in case our orange is not quite ripe and still somewhat green, we could compare sizes: if it's bigger than a certain measure, we say it's an orange; otherwise, we classify it as a lemon.
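To make that decision path concrete, here is a minimal sketch of those three rules written as plain Python. The feature names and the 7 cm threshold are made-up values for illustration, not measurements from real fruit.

```python
def classify_fruit(shape, color, size_cm):
    """Hand-written decision path: shape first, then color, then size."""
    if shape == "elongated":
        return "banana"    # elongated shape separates bananas right away
    if color == "orange":
        return "orange"    # round and orange-colored: almost certainly an orange
    # Round and green: fall back to size (7 cm is an arbitrary, made-up cut-off).
    return "orange" if size_cm > 7 else "lemon"

print(classify_fruit("elongated", "yellow", 18))  # banana
print(classify_fruit("round", "green", 5))        # lemon
```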

It may look like a silly example, but that's exactly how a Decision Tree algorithm proceeds. Among all the features available to describe our fruits, it dives into the data and chooses the feature that best separates one class from the others, then follows the same logic with the remaining features until it can reliably decide which fruit it is.

That is the reason why Decision Tree algorithms can also be used as a tool for feature selection, which in this example means identifying the characteristics that matter most to our identification and excluding the ones that contribute little or nothing to our primary goal.

For example, if we knew who bought each of the fruits in our examples, that information would probably be irrelevant for deciding which fruit we are talking about, and tree structures are pretty good at spotting that.
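We can see this in code. The sketch below trains a single Scikit-Learn DecisionTreeClassifier on a small synthetic fruit dataset that includes a buyer_id column; the data is invented for illustration, and the point is that the tree's feature_importances_ attribute assigns essentially no weight to the irrelevant column.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 300

# Synthetic fruit data: three informative characteristics plus an irrelevant one.
fruit = rng.choice(["banana", "orange", "lemon"], size=n)
X = pd.DataFrame({
    "elongated": (fruit == "banana").astype(int),
    "greenness": np.where(fruit == "lemon", 0.9, 0.2) + rng.normal(0, 0.05, n),
    "size_cm":   np.where(fruit == "banana", 19.0,
                 np.where(fruit == "orange", 8.0, 5.0)) + rng.normal(0, 0.5, n),
    "buyer_id":  rng.integers(0, 1000, n),   # who bought the fruit: irrelevant
})

tree = DecisionTreeClassifier(random_state=0).fit(X, fruit)

# The importances of the useful features dominate; buyer_id lands near zero.
for name, importance in zip(X.columns, tree.feature_importances_):
    print(f"{name:10s} {importance:.3f}")
```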


What if we could put together different trees that observe those fruits from several different angles to better understand how they differ from each other? That's exactly the idea behind the Random Forest algorithm.

It creates a series of decision trees, each one looking at a slightly different composition of the same fruit examples and also looking at them from a different perspective: for example, one decision tree may be built using color as the first parameter, while another uses height.

This only makes sense if we guarantee that no tree is equal to another, or else we wouldn't benefit from the different 'points of view', and the computational power required to create a set of trees would not be justified. To guarantee a set of distinct trees, some parameters can be controlled (assuming we are using a framework available in libraries like Scikit-Learn and not building it from scratch).

First of all, the algorithm will select a different set of examples for each tree. Let's say our original data had 1,000 fruit examples available: the algorithm will randomly select 1,000 examples, with replacement, for each one of the trees, meaning that some examples may be selected more than once in the same set while others may be missing entirely.
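This resampling with replacement is usually called bootstrap sampling. The snippet below is a small NumPy sketch of the idea: drawing 1,000 indices with replacement from 1,000 available examples typically leaves roughly a third of the originals out of any single tree's sample.

```python
import numpy as np

rng = np.random.default_rng(42)
n_examples = 1000

# One tree's training set: 1,000 example indices drawn with replacement.
bootstrap_indices = rng.integers(0, n_examples, size=n_examples)

n_distinct = len(np.unique(bootstrap_indices))
print("distinct examples in this tree's sample:", n_distinct)     # about 632
print("examples this tree never sees:", n_examples - n_distinct)  # about 368
```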

Besides that, we can control parameters like max_features. This parameter sets a number, let's say 2, and then, at each decision point, the algorithm considers only two randomly chosen characteristics and picks whichever of them best distinguishes the different fruits.

These random factors virtually guarantee that all the decision trees will be different. Another important parameter is n_estimators, which allows the user to choose the number of decision trees to be created.
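In Scikit-Learn, both knobs are exposed directly as parameters of RandomForestClassifier. The sketch below uses make_classification to generate stand-in data playing the role of our fruit examples; the specific values chosen for n_estimators and max_features are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data: 3 "fruit" classes described by 4 numeric characteristics.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# 50 trees; each split considers at most 2 randomly chosen features;
# each tree is trained on its own bootstrap sample of the examples.
forest = RandomForestClassifier(n_estimators=50, max_features=2,
                                bootstrap=True, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))  # 50 individual decision trees inside the forest
```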


Intuitively, a set of decision trees working together will perform better than a single decision structure, and that is what makes the Random Forest algorithm powerful. But once we have built a forest, how do we make the different decision trees work together?

If we want our model to predict which fruit it is, given a new set of characteristics, each one of the trees will make a prediction, and this prediction will be associated with a probability for each class. For example, one tree could conclude that the given characteristics are 80% likely to describe a banana, 12% an orange, and 8% a lemon, while the other trees would produce different probabilities.

In the end, these probabilities are combined, through a kind of vote, to reach a final conclusion. That is the basic procedure for a classification task like our fruit example.
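In Scikit-Learn's implementation, that combination is a simple average: the forest's predict_proba is the mean of the individual trees' predicted probabilities, and the class with the highest average wins. The sketch below demonstrates this on the same kind of stand-in data as above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data again: 3 classes playing the role of banana, orange and lemon.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

new_fruit = X[:1]  # one "new" example to classify

# Each individual tree produces its own class probabilities...
per_tree = np.array([t.predict_proba(new_fruit)[0] for t in forest.estimators_])

# ...and the forest averages them before picking the most likely class.
print("average of the trees:", per_tree.mean(axis=0))
print("forest predict_proba:", forest.predict_proba(new_fruit)[0])
print("final prediction:    ", forest.predict(new_fruit)[0])
```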

Random Forests can also be used in regression tasks, when we want to predict a continuous number or measure, such as the price of a house or the number of products that will be sold in a specific period of time. The procedure in this case is basically the same, but the final prediction is given not through a vote, as in classification tasks, but by computing the mean of the individual tree predictions.
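For regression, Scikit-Learn offers RandomForestRegressor, whose final prediction is the mean of the individual trees' predictions. Below is a minimal sketch on synthetic data standing in for, say, house features and prices; the dataset is generated purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for house characteristics and their prices.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

new_house = X[:1]
per_tree = np.array([t.predict(new_house)[0] for t in forest.estimators_])

# The forest's prediction is the mean of what the individual trees predict.
print("mean of the trees' predictions:", per_tree.mean())
print("forest prediction:             ", forest.predict(new_house)[0])
```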


Just don't forget that if you show a remote control to our fruit classifier, it will definitely think it's a banana. These models learn precisely what we tell them to learn, and the quality of their predictions depends on the quality of the examples we give them.

In this article, we built a better understanding of how one of the most popular machine learning algorithms works behind the scenes. Now that the abstract idea of algorithms mixed up with movie scenes in which machines take control of everything has left your mind, how about trying to write your first line of code?

Since most of the articles available on Data Science are written in English, I created a project called Tech4All-br with the purpose of creating content in Portuguese. If you want to read this material in Portuguese, access this link.
