Random Forest — The Democratic Leader of The Data Science World (2/2)

Jay Trivedi
5 min read · Jun 12, 2017

This is part two of a two-part series. In part 1, we briefly explored how decision trees work and where they fall short. In part 2, we look at how random forests overcome these shortcomings.

Quick Recap:

In part 1, we explored the following problems with decision trees using H2O (with its Python 3.6 client) on two datasets, one for a classification task and one for a regression task:

  • Overfitting, shown by the widening gap between training and validation performance as model complexity (tree depth) increases.
  • Instability, shown by the significant changes in tree structure that occur with slight variations in the inputs.

Random forests can overcome these problems through ensembling. We test this hypothesis using the same two datasets, Titanic and House Prices.

Random Forest Resists Overfitting:

I trained 30 random forests, each with 200 decision trees (ntrees), with max_depth varying from 1 to 30. Below are the depth versus performance graphs for both datasets. Each graph shows the performance of a single decision tree as well as that of the random forest.

Titanic:

As we can see:

  • For the random forest, the training-set accuracy stays consistently lower than that of a single decision tree as tree depth increases.
  • The accuracy gap between the training set and the validation set for a single decision tree (the vertical distance between the red and purple lines) grows rapidly as tree depth increases, in contrast to the gap for the random forest (the vertical distance between the blue and green lines).

Here is the code snippet to reproduce the results.
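A minimal sketch of the experiment, assuming a local copy of the Kaggle Titanic training file and its usual column names (adjust the path and predictor list to your setup). The single decision tree is emulated as a one-tree forest that sees all rows and all columns:

```python
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

h2o.init()

# Hypothetical local path to the Kaggle Titanic training data.
titanic = h2o.import_file("titanic_train.csv")
titanic["Survived"] = titanic["Survived"].asfactor()   # classification target
train, valid = titanic.split_frame(ratios=[0.8], seed=42)

response = "Survived"
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

results = []
for depth in range(1, 31):
    # Random forest: 200 trees, depth capped at `depth`.
    rf = H2ORandomForestEstimator(ntrees=200, max_depth=depth, seed=42)
    rf.train(x=predictors, y=response,
             training_frame=train, validation_frame=valid)

    # "Single decision tree": one tree, all rows, all columns at every split.
    dt = H2ORandomForestEstimator(ntrees=1, sample_rate=1.0,
                                  mtries=len(predictors),
                                  max_depth=depth, seed=42)
    dt.train(x=predictors, y=response,
             training_frame=train, validation_frame=valid)

    results.append({
        "depth": depth,
        "rf_train_acc": rf.accuracy(train=True)[0][1],
        "rf_valid_acc": rf.accuracy(valid=True)[0][1],
        "dt_train_acc": dt.accuracy(train=True)[0][1],
        "dt_valid_acc": dt.accuracy(valid=True)[0][1],
    })
```

Plotting the four accuracy columns of results against depth reproduces the gaps described above.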

House Prices:

Similarly,

  • The error gap between the training set and the validation set for a single decision tree (the vertical distance between the red and purple lines) grows rapidly as tree depth increases, in contrast to the error gap for the random forest (the vertical distance between the blue and green lines).

Code snippet:
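The regression version follows the same pattern as the snippet above; only the data, the response column, and the metric change. The file path and column names are assumptions based on the Kaggle House Prices data:

```python
houses = h2o.import_file("house_prices_train.csv")   # hypothetical path
h_train, h_valid = houses.split_frame(ratios=[0.8], seed=42)

h_response = "SalePrice"
h_predictors = [c for c in houses.columns if c not in ("Id", "SalePrice")]

errors = []
for depth in range(1, 31):
    rf = H2ORandomForestEstimator(ntrees=200, max_depth=depth, seed=42)
    rf.train(x=h_predictors, y=h_response,
             training_frame=h_train, validation_frame=h_valid)

    dt = H2ORandomForestEstimator(ntrees=1, sample_rate=1.0,
                                  mtries=len(h_predictors),
                                  max_depth=depth, seed=42)
    dt.train(x=h_predictors, y=h_response,
             training_frame=h_train, validation_frame=h_valid)

    # RMSE replaces accuracy, since SalePrice is numeric.
    errors.append({"depth": depth,
                   "rf_gap": rf.rmse(valid=True) - rf.rmse(train=True),
                   "dt_gap": dt.rmse(valid=True) - dt.rmse(train=True)})
```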

Takeaway:

For both datasets, the observations indicate that a random forest resists overfitting better than a single decision tree as tree depth increases.

Random Forests Are More Stable:

Next, let’s see whether random forests tend to be more stable than a single decision tree. Here, we change the value of seed in each iteration so that a different portion of the sample is selected every time a model is trained.

Following are the observations for both datasets.

Titanic:

Titanic: Consistent variable importance in random forest
Titanic: Inconsistent variable importance in decision tree

The first table shows the top 3 most important features as estimated by the random forest. In every iteration, the random forest comes up with the same three features, in the same order of importance.

The second table shows the top 3 most important features for a single decision tree. The variables vary significantly from tree to tree (as each tree trains on a slightly different subset of the data), not only in their order of importance but also in which features appear at all.

Code Snippet:
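A sketch of the stability check, continuing the first snippet. The number of iterations, the tree depth, and the single tree's sample_rate of 0.7 (the value the Do It Yourself section refers to) are illustrative assumptions; the seed is the only thing that changes between iterations:

```python
top_rf, top_dt = [], []

for seed in range(1, 11):   # e.g. ten iterations with different seeds
    rf = H2ORandomForestEstimator(ntrees=200, max_depth=10, seed=seed)
    rf.train(x=predictors, y=response, training_frame=train)

    # One tree on a 70% row sample, so the seed decides which rows it sees.
    dt = H2ORandomForestEstimator(ntrees=1, sample_rate=0.7,
                                  mtries=len(predictors),
                                  max_depth=10, seed=seed)
    dt.train(x=predictors, y=response, training_frame=train)

    # varimp(use_pandas=True) returns a DataFrame sorted by importance.
    top_rf.append(rf.varimp(use_pandas=True)["variable"].head(3).tolist())
    top_dt.append(dt.varimp(use_pandas=True)["variable"].head(3).tolist())
```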

House Prices:

House Prices: Consistent variable importance in random forest
House Prices: Inconsistent variable importance in decision trees

For the House Prices dataset as well, the random forest tends to be more stable than a single decision tree.

However, here the random forest appears a little less stable than on the Titanic dataset, as the variable in the third position changes quite often across iterations.

We could tackle this with hyper-parameter tuning, but that is out of scope for this discussion.

Code Snippet:
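The same loop, re-run on the House Prices frame defined earlier; only the predictors, response, and training frame change:

```python
top_rf_h, top_dt_h = [], []

for seed in range(1, 11):
    rf = H2ORandomForestEstimator(ntrees=200, max_depth=10, seed=seed)
    rf.train(x=h_predictors, y=h_response, training_frame=h_train)

    dt = H2ORandomForestEstimator(ntrees=1, sample_rate=0.7,
                                  mtries=len(h_predictors),
                                  max_depth=10, seed=seed)
    dt.train(x=h_predictors, y=h_response, training_frame=h_train)

    top_rf_h.append(rf.varimp(use_pandas=True)["variable"].head(3).tolist())
    top_dt_h.append(dt.varimp(use_pandas=True)["variable"].head(3).tolist())
```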

Why (and How) Random Forests Outperform Decision Trees:

Random forests outperform decision trees because of two techniques: bagging and subspace sampling. Both help the random forest realise a single philosophy, the ‘wisdom of crowds’.

Rather than depending on just one learner, a random forest ensembles the predictions of many weak learners. A weak learner is a model whose predictive power is at least slightly better than random chance. Since a less complex model (which is what a weak learner is) is computationally cheaper to train, many such learners can be trained and combined into a collectively strong learner.
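A toy, library-free illustration of the ‘wisdom of crowds’ idea: 201 weak voters that are each right only 60% of the time are almost always right when combined by majority vote, provided their mistakes are independent, which is exactly what bagging and subspace sampling try to encourage. This sketches the ensembling intuition, not the random forest algorithm itself:

```python
import random

random.seed(42)
n_voters, n_questions = 201, 1000
majority_correct = 0

for _ in range(n_questions):
    # Each voter is independently correct with probability 0.6.
    votes_for_truth = sum(random.random() < 0.6 for _ in range(n_voters))
    if votes_for_truth > n_voters / 2:
        majority_correct += 1

# Close to 1.0: the crowd is far more accurate than any single voter.
print(majority_correct / n_questions)
```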

Moreover, it is necessary to make sure that the decision trees in a random forest have some skill but at the same time remain sufficiently weak. Trees can be constrained in a number of ways, by controlling the maximum depth, the maximum number of nodes, the minimum sample size per node and so on, or through pruning. Maybe we will discuss these in detail some other time!
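As a rough sketch, here is how such constraints map onto H2O DRF parameters, continuing the earlier snippets; the specific values are illustrative assumptions:

```python
constrained_rf = H2ORandomForestEstimator(
    ntrees=200,
    max_depth=5,                  # keep each tree shallow
    min_rows=20,                  # every leaf must cover at least 20 rows
    min_split_improvement=1e-4,   # a split must improve the error by at least this much
    seed=42,
)
constrained_rf.train(x=predictors, y=response,
                     training_frame=train, validation_frame=valid)
```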

Do It Yourself:

Let’s try making some changes in our random forest model:

  • Set the value of sample_rate to 0.5 as opposed to 0.7 or 1. This means that each tree of the random forest will train on an even smaller sub-sample, and training a single tree on a smaller sub-sample typically makes it less stable.
  • Set the value of col_sample_rate_per_tree to 0.4, which means that each decision tree in the random forest will only consider 40% of the features. Using only a fraction of the available features may make an individual decision tree more prone to overfitting.

I would encourage you to try these changes yourself and observe the difference in performance (a sketch of the setup follows below). Did you see any changes? Can you explain why they happened? You can mention them in the comments.
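Continuing the earlier snippets, a minimal sketch of these two changes; the depth and seed are illustrative assumptions:

```python
diy_rf = H2ORandomForestEstimator(
    ntrees=200,
    max_depth=10,                  # assumed; use whatever depth you settled on above
    sample_rate=0.5,               # each tree trains on 50% of the rows
    col_sample_rate_per_tree=0.4,  # each tree sees only 40% of the columns
    seed=42,
)
diy_rf.train(x=predictors, y=response,
             training_frame=train, validation_frame=valid)
print(diy_rf.accuracy(valid=True)[0][1])
```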

Final Thoughts:

In this post, I illustrated how random forests overcome the overfitting and instability issues of single decision trees. I hope you found it useful.

I am planning to write more blog posts along these lines. Any constructive comments regarding content and structure are most welcome.

I would like to thank Soumendra for his editorial inputs.

Until next time.
