An Introduction to Random Forest

Illustration, interpretation, biases, and usage for outlier detection and clustering

Houtao Deng
Towards Data Science

--

Last updated: 2018-12-09

Random forests are popular in both data science competitions and practical applications. They are often accurate, do not require feature scaling or categorical feature encoding, and need little parameter tuning. They can also be more interpretable than other complex models such as neural networks.

The content is organized as follows.

  • What is a random forest
  • Interpreting a random forest
  • Bias towards features with more categories
  • Handling redundant features
  • Outlier detection
  • Clustering

What is a random forest

A random forest consists of multiple random decision trees. Two types of randomness are built into the trees. First, each tree is built on a random sample of the original data. Second, at each tree node, a subset of features is randomly selected to generate the best split.

We use the dataset below to illustrate how to build a random forest tree. Note that Class = XOR(X1, X2), and X3 is made identical to X2 (for illustrative purposes in later sections).

An illustrative dataset with two classes colored in red and blue.
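To make this concrete, here is a minimal sketch in R of how such a dataset can be generated and a random forest fitted to it, using the randomForest package (the sample size, seed, and number of trees are arbitrary choices for illustration, not the exact setup behind the figures):

library(randomForest)

set.seed(1)
n  <- 400                                  # arbitrary sample size
X1 <- runif(n); X2 <- runif(n)
X3 <- X2                                   # X3 is an exact copy of X2
Class <- factor(xor(X1 > 0.5, X2 > 0.5))   # Class = XOR(X1, X2) after thresholding at 0.5
d  <- data.frame(X1, X2, X3, Class)

# The two sources of randomness: each tree is grown on a bootstrap sample of the rows,
# and at each node only mtry randomly chosen features are considered for the best split.
rf <- randomForest(Class ~ ., data = d, ntree = 100, mtry = 2)
rf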

The figure below demonstrates how to build a random forest tree.

The process of building a random forest tree.

The same process is applied to build multiple trees. The figure below illustrates the flow of applying a random forest with three trees to a test instance.

The flow (highlighted in green) of predicting a testing instance with a random forest with 3 trees.
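As a rough sketch of this flow (reusing the rf and d objects from the snippet above; the test instance is made up), predict.all = TRUE exposes the votes of the individual trees as well as the majority-vote prediction of the forest:

new_x <- data.frame(X1 = 0.9, X2 = 0.2, X3 = 0.2)   # a hypothetical test instance

pred <- predict(rf, new_x, predict.all = TRUE)
pred$aggregate               # the forest's majority-vote prediction
table(pred$individual[1, ])  # how the individual trees voted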

Interpreting a random forest

Feature importance

A feature’s importance score measures the feature’s contribution to the model. It is based on the reduction in class impurity attributable to splits on that feature.

Let’s add an irrelevant feature X4 to the illustrative dataset. The importance scores are plotted below. Clearly, X1 has the largest score and X4 the smallest. X2 and X3 are identical, so they split the importance between them.
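A sketch of that experiment in R, again with the randomForest package (X4 is plain random noise; the exact scores will differ from the figure):

set.seed(2)
d$X4 <- runif(nrow(d))       # irrelevant feature: random noise unrelated to Class

rf2 <- randomForest(Class ~ ., data = d, ntree = 100)
importance(rf2)              # impurity-based scores (MeanDecreaseGini)
varImpPlot(rf2)              # X1 largest, X4 smallest, X2 and X3 split their share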

Partial dependence plot

The importance score of a feature doesn’t tell how the feature and the class are related. The partial dependence plot can visualize the marginal effect of a feature on the class probability.

When a feature is correlated with the class, the plot looks like the left figure below, indicating that X1 ≤ 0.5 and X1 > 0.5 are associated with different classes.

However, in our illustrative dataset, the partial dependence plot looks like the right figure — it does not indicate any relationship between X1 and the class, even though X1 has the largest importance score.

The reason is that X1 has to interact with X2 or X3 to be predictive of the class. X1 alone is not predictive. Therefore, partial dependence plots can be misleading for this case.

Left: when X1 alone is correlated with the class, partial dependence is informative. Right: when X1 alone is not correlated with the class, partial dependence can be misleading.
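Plots like these can be produced with partialPlot from the randomForest package; a minimal sketch, reusing the forest fitted above (which.class picks the class whose probability is plotted):

# Partial dependence of the predicted class probability on X1.
# For the XOR data the curve comes out roughly flat, even though X1's importance is high.
partialPlot(rf2, pred.data = d, x.var = "X1", which.class = "TRUE")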

inTrees

Neither importance scores nor partial dependence plots reveal how multiple features interact with the class. The inTrees framework can be used to get a clearer picture of what happens inside a random forest.

For the illustrative dataset, the highly predictive interactions and their associated classes can be extracted with inTrees and are shown below. The frequency (0%–100%) measures the popularity of an interaction in the random forest, and the accuracy (0–1) measures how accurately an interaction predicts the class.
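A sketch of the inTrees workflow on the forest built earlier (all settings are left at their defaults; the accuracy discussed above corresponds to 1 minus the reported rule error):

library(inTrees)

treeList <- RF2List(rf)                        # turn the randomForest into a list of trees
X <- d[, c("X1", "X2", "X3")]
ruleExec <- extractRules(treeList, X)          # extract rule conditions from the trees
rules <- getRuleMetric(ruleExec, X, d$Class)   # measure frequency, error, and length of each rule
rules <- pruneRule(rules, X, d$Class)          # drop irrelevant variable-value pairs
presentRules(rules, colnames(X))[1:5, ]        # readable rules such as X1<=0.5 & X2>0.5 => TRUE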

Bias towards features with more categories

For the illustrative dataset, let’s add a random feature X5 with 30 categories. Even though the feature is irrelevant to the class, the importance score of X5 is larger than those of the truly informative features X2 and X3, indicating an incorrect bias towards features with more categories.

X1, X2, and X3 are truly informative, X4 and X5 are irrelevant, and X5 has many categories.

One solution is to perform feature selection. For example, in the randomForest R package, one can use each feature’s impact on accuracy (importance$MeanDecreaseAccuracy) to evaluate features. The accuracy impact plot below shows that X5’s accuracy impact is quite small compared to the truly informative features, indicating that the feature contributes noise rather than signal and should be removed before fitting the final classifier.
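A sketch of this check (X5 below gets 30 made-up category labels; importance = TRUE is needed to obtain the permutation-based MeanDecreaseAccuracy column):

set.seed(3)
d$X5 <- factor(sample(paste0("c", 1:30), nrow(d), replace = TRUE))  # irrelevant, 30 categories

rf3 <- randomForest(Class ~ ., data = d, ntree = 100, importance = TRUE)
importance(rf3)[, c("MeanDecreaseGini", "MeanDecreaseAccuracy")]
# X5 looks deceptively important by Gini, but its accuracy impact is small.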

Handling redundant features

When features are similar to each other, their importance scores can be misleading. In the illustrative dataset, X2 and X3 are identical, and they “share” the importance score (shown in the left figure below). When there are more redundant features, the importance of each individual feature becomes even smaller.

This may not hurt accuracy but can be misleading in interpretation. One solution is the regularized random forest (RRF). During tree building, RRF memorizes the features used in previous tree nodes and prefers these features when splitting future tree nodes, thereby avoiding redundant features in the trees. The right figure below shows the importance scores from RRF.

Left: feature importance from a random forest; Right: feature importance from a regularized random forest.
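A sketch with the RRF package (flagReg = 1 turns the regularization on; the other settings are left at their defaults):

library(RRF)

Xr  <- d[, c("X1", "X2", "X3", "X4")]
rrf <- RRF(Xr, d$Class, flagReg = 1)   # regularized random forest
rrf$feaSet                             # indices of the features actually used in the trees
rrf$importance                         # the score is no longer split between X2 and X3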

Outlier detection with random forests

Outlier detection with random forests avoids the need for feature transformation (e.g., encoding categorical features). In addition, other random forest capabilities, such as class probabilities and interpretation, can also be used here. We demonstrate the method with a two-dimensional data set plotted in the left figure below.

The idea is to generate a random data set that contrasts with the original data. Here we randomly permute each feature independently. The two data sets are then labeled with two classes (say, “normal” and “random”). The combined data set is shown in the right figure below.

Left: original data. Right: the combined two-class data set; class 1 is the original data, and class 2 has the same size as the original data but with X1 and X2 randomly permuted.

A random forest is built on the combined dataset. The classifier can then be applied to test instances: if the predicted class is “random”, the instance is identified as an outlier. The outliers identified are illustrated in the left figure below.

We can get insight into which features contribute to the outlier detection by looking at the importance scores. For illustration, we add a random feature X3 that is irrelevant to the classes. The importance scores are shown in the right figure below: X1 and X2 are identified as important features, while X3 is less important.

Left: outlier detection. Right: feature importance score in outlier detection.
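A sketch of the whole recipe in R; the two-dimensional data below is a made-up stand-in with correlated features, not the dataset shown in the figures:

set.seed(4)
orig <- data.frame(X1 = rnorm(300))
orig$X2 <- orig$X1 + rnorm(300, sd = 0.3)        # X1 and X2 are correlated in the "normal" data

rand <- data.frame(X1 = sample(orig$X1),         # permuting each feature independently
                   X2 = sample(orig$X2))         # destroys that structure
both <- rbind(orig, rand)
both$y <- factor(rep(c("normal", "random"), each = nrow(orig)))

rf_out <- randomForest(y ~ ., data = both, ntree = 200)

# A test point that breaks the structure of the original data tends to be
# predicted as "random", i.e., flagged as an outlier.
test <- data.frame(X1 = c(1, 2), X2 = c(1.1, -2))
predict(rf_out, test)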

Clustering with random forests

Similar to outlier detection, clustering with random forests saves effort in feature preprocessing.

The procedure is similar to outlier detection. First, create a synthetic dataset of the same size as the original data. Then label the original data and the synthetic data with two different classes. A random forest is then built for this classification problem.

From the built random forest, a similarity score between each pair of data instances is extracted. The similarity of two data instances is measured by the percentage of trees where the two data instances appear in the same leaf node.

With the similarity scores, clustering algorithms such as hierarchical clustering can then be applied. The figures below show the clustering results with the number of clusters pre-defined as 2 and 4, respectively.

Left: clustering with 2 clusters. Right: clustering with 4 clusters.
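A sketch of this procedure, reusing the two-class data from the outlier-detection snippet; proximity = TRUE records how often two instances fall into the same leaf, and 1 minus that proximity is used as a distance for hierarchical clustering (k = 2 here is an arbitrary choice):

rf_clu <- randomForest(y ~ ., data = both, ntree = 200, proximity = TRUE)

# Keep only the rows of the original data and convert proximity to a distance.
keep <- both$y == "normal"
prox <- rf_clu$proximity[keep, keep]
hc   <- hclust(as.dist(1 - prox))
clusters <- cutree(hc, k = 2)    # pre-defined number of clusters
table(clusters)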

Summary

Random forests are powerful not only for classification and regression but also for purposes such as outlier detection, clustering, and interpreting a data set (e.g., serving as a rule engine with inTrees).

However, mistakes are easy to make when using random forests. First, importance scores are biased towards multi-level categorical features when such features exist in a data set. Second, importance scores can be misleading when features are redundant. This article provides solutions for both issues.

You may be interested in reading a relevant article on why random forests outperform decision trees.

You are welcome to contribute to the inTrees or RRF packages on GitHub.

Find more content in my book freely available at dataanalyticsbook.info.
