Random Forest — Mystery Revealed

Blake Lawrence
Towards Data Science
4 min read · Apr 29, 2018


Image by Jens Lelie on unsplash.com

Selecting the ‘right’ machine learning algorithm for your application is one of the many challenges of applying machine learning appropriately in the real world. The right algorithm can sometimes be the one that is the simplest, the most robust and the most easily understood by your end users. Random Forest will regularly tick all of these boxes but, as others have written, is often overlooked. Hopefully this post provides you with a solid base understanding of how the algorithm works, so you are more confident when conducting your own research or applying it within tools like R or Python.

Random Forest is a supervised learning algorithm commonly applied to both classification and regression problems. As a brief introduction, the algorithm is an ensemble model: it creates a ‘forest’ of many decision ‘trees’, with the number of trees defined by the user. Each decision tree is built from a random subset of attributes (columns) and observations (rows) from the original data set. The trees are grown using the training data set and then applied to the test data set. The final classification returned by the model is the one provided by the greatest number of individual decision trees.
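To make this concrete, here is a minimal sketch using scikit-learn. The dataset and parameter values are illustrative assumptions, not part of the original post: `n_estimators` sets the number of trees, `max_features` controls the attribute subset drawn at each split, and bootstrap sampling supplies the row subset each tree is grown on.

```python
# Minimal sketch of a random forest classifier with scikit-learn.
# Dataset and parameter choices here are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the 'forest'
    max_features="sqrt",  # random subset of attributes considered at each split
    bootstrap=True,       # each tree sees a random sample of the observations
    random_state=0,
)
forest.fit(X_train, y_train)         # grow the trees on the training set
print(forest.score(X_test, y_test))  # majority-vote accuracy on the test set
```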

Because a random forest’s output is simply the classification returned by the largest number of trees, this post will focus on decision trees themselves, which sit at the core of the algorithm.

What are Decision Trees?

A decision tree is a flowchart-like structure in which each attribute in the data is represented as an ‘internal node’, each ‘branch’ represents the outcome of a test, and the ‘leaves’ represent the classifications made. Each decision tree within the algorithm is created using a different, ‘random’ subset of attributes and observations from the original training data set.

[Figure: example decision tree for a loan application, splitting on salary > $60k and work experience > 2 years] [1]

This next section can be confusing (it was for me at least!), so hopefully my explanation is clear. Decision trees are created in Disjunctive Normal Form, where the decision is an ‘OR’ of ‘ANDs’. This can alternatively be phrased as a disjunction of one or more conjunctions of one or more literals [2]. The decision tree example above shows the first test: is salary above or below $60k? This is also the root internal node in the structure of the tree. The branch representing ‘No’ is one outcome of this salary test, where ‘Reject’ is the classification. The second test is whether work experience is greater than 2 years. Again we have a ‘Reject’ classification, but in this instance we also have an ‘Approve’. To reach the ‘Approve’ classification, work experience must be above 2 years AND (a conjunction) the outcome of the salary > $60k test must also be true/Yes. A short video discussing disjunctive statements is embedded below. Wikipedia also has an article on logical disjunction which includes Venn diagrams to highlight the ‘OR’ property.
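To illustrate, the loan decision above can be written directly as an OR of ANDs. The thresholds come from the example; the function name is purely for illustration.

```python
def classify_application(salary, years_experience):
    """Classify a loan application following the example decision tree.
    In Disjunctive Normal Form, each class is an OR of ANDs:
    'Reject' is (salary <= 60k) OR (salary > 60k AND experience <= 2 years),
    while 'Approve' is the single conjunction (salary > 60k AND experience > 2)."""
    if salary > 60_000 and years_experience > 2:
        return "Approve"
    return "Reject"

print(classify_application(75_000, 3))  # Approve
print(classify_application(75_000, 1))  # Reject (fails the experience test)
print(classify_application(50_000, 5))  # Reject (fails the salary test)
```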

Attribute Orders and Splits

As mentioned, each level, or disjunction, within the decision tree represents an attribute in the random data segment used for that individual tree. The ordering of these attributes matters for both the accuracy of the final model and the computational effort associated with the tree. In other words, attributes should be ordered so as to provide the most efficient and informative structure for classifying future observations. In decision trees, attributes are prioritised by the ‘information gain’ that results from splitting on an attribute.

The information gain from splitting a parent node is equal to the entropy (level of disorder) of the parent node minus the weighted average entropy of the resulting child nodes. The weighting is based on the number of observations sent down each branch. Entropy is zero when there is perfect separation in the classification, i.e. all ‘True’ outcomes result in a ‘Good’ classification and all ‘False’ outcomes result in a ‘Bad’ classification. The highest possible entropy of 1 occurs when both the ‘True’ and ‘False’ outcomes return equal numbers of ‘Good’ and ‘Bad’ classifications. The second case provides no information gain: the classifications are effectively split at random between the two possible outcomes, so the tree has learned nothing. Such an attribute would therefore be a poor choice for the ‘root’ of the tree.
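The same idea can be written as a small calculation. The counts below are made-up numbers, chosen only to contrast a perfect split (entropy 0, maximum gain) with an even 50/50 split (entropy 1, zero gain).

```python
from math import log2

def entropy(good, bad):
    """Entropy of a node holding `good` and `bad` classifications:
    0 for perfect separation, 1 for an even 50/50 mix."""
    total = good + bad
    result = 0.0
    for count in (good, bad):
        if count:
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the children,
    weighted by the number of observations sent down each branch."""
    total = sum(g + b for g, b in children)
    weighted = sum((g + b) / total * entropy(g, b) for g, b in children)
    return entropy(*parent) - weighted

# Perfect split: all 'Good' go one way, all 'Bad' the other -> gain of 1.0
print(information_gain((5, 5), [(5, 0), (0, 5)]))  # 1.0
# Uninformative split: each branch is still a 50/50 mix -> gain of 0.0
print(information_gain((5, 5), [(3, 3), (2, 2)]))  # 0.0
```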

Entropy is a term which describes the order or disorder in a system: high entropy means high disorder, and vice versa. Below are two short videos from Udacity covering the concepts of entropy and information gain. They form part of a free series which provides an introduction to machine learning.

To keep this post short I will end here, before covering concepts such as ‘pruning’ of decision trees and overfitting, which goes hand in hand with pruning. I will cover both of these in a later post, but hopefully the summary to this point has pulled back the curtain on the powerful Random Forest algorithm.

[1] https://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/

[2] Mendelson, E. Introduction to Mathematical Logic, 4th ed. London: Chapman & Hall, 1997, p. 27.
