
There exists a vast number of great articles describing how Bagging methods like random forests work on an algorithmic level and why Bagging is a good idea. Usually, the essence is the following:
"You train a lot of decision trees on different parts of the training set and average their predictions into a final prediction. The prediction gets better, because the variance of the random forest is smaller compared to the variance of a single decision tree. (dartboard.png)"
- some article
Of course, I am paraphrasing here. The articles include great pictures, code, and many more thoughts. But what I often miss is a good intuition for why Bagging is a good idea and a way to see the variance reduction in action on a real dataset.
Thus, in this article, I want to address both of these shortcomings: I will give an intuitive explanation of why the random forest algorithm works and show how you can see the improvement in variance graphically. You can consider this article an exploration of these two topics that goes deeper than the average article about the Bias-Variance Dilemma while not being as deep as a fully-fledged research paper. Still, I will provide links to resources that I find helpful, so you can go into greater depth whenever desired.
I will try to keep the math at an understandable level so that people without a mathematics major can follow along, while also giving some high-level ideas and illustrations that more mathematically inclined readers can enjoy.
Still, I will not explain in detail how decision trees, random forests, and all the other models mentioned work, since this has been covered numerous times already. I will only explain the very high-level ideas, starting with decision trees.
Decision Trees
Disclaimer: I will only talk about vanilla decision trees here. We do not consider pruning in the rest of this article. The trees can grow arbitrarily deep.
Output Description
A decision tree with k leaves is a model of the form
f(x) = \sum_{i=1}^{k} w_i \cdot \mathbb{1}_{\{x \in R_i\}}
meaning that a decision tree is a piecewise constant function that takes a real value wᵢ on a region Rᵢ of the feature space. Here, x comes from a feature space X and y is the corresponding label from an output space Y. The constraints on the Rᵢ's are that

- they are rectangles with boundaries parallel to the coordinate axes of the feature space
- the set of all rectangles is a partition of the feature space, i.e. any two distinct rectangles do not intersect and the union of all rectangles is the complete feature space.
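As a small made-up example with k=2 leaves on X=[0, 10]: the tree "if x < 5, predict 2, otherwise predict 7" takes the value w₁=2 on R₁=[0, 5) and w₂=7 on R₂=[5, 10]. The two rectangles (here: intervals) do not intersect, and together they cover the whole feature space.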
Now that we have established that, let us examine why a decision tree is called a high-variance algorithm, while, for example, linear regression is considered low-variance.
Comparison with Linear Regression
For a quick recap, linear regression models have the following form:
f(x) = w_0 + \sum_{i=1}^{d} w_i x_i
where the weights wᵢ are real numbers and d is the dimension of the samples, i.e. the number of features.
For comparing the variance of these models, we have to take one step back and think about what a learning problem actually is.
Usually, we are given a fixed number of samples (the learning set, or training samples), we let our algorithm do some magic and fit all the necessary parameters, and in the end we can predict values for unseen samples. However, this is quite a rigid view of things.
In Learning Theory, we model the training set as coming from a distribution D over the space X×Y, where X is the feature space and Y is the output space. We sample a training set L (and also a validation and test set) of size n from the distribution:
L = ((x_1, y_1), \ldots, (x_n, y_n)) \sim D^n
Imagine a distribution to be a black box with a button; if you hit the button once, you get a random sample (x₁, y₁) from the distribution. Hit it again and you get another sample (x₂, y₂), independent of the one(s) before. Repeat, until you have enough samples to work with.
Then we can use the n data points from L to train our model. This outputs a function f with f(xᵢ)≈yᵢ for all (xᵢ, yᵢ) in the sample L (if our model is any good). This approach ensures that the performance of the model is okay on the training set.
But now imagine that we query n new samples from the distribution D and use these as a training set L’. Let us call the model resulting from the training on this new set g. This new model g will also meet the condition g(xᵢ’)≈yᵢ’ for all (xᵢ’, yᵢ’) in L’.
Now, since L’ consists of different points (xᵢ’, yᵢ’), the new model g will have a different shape than f. The models f and g might or might not differ a lot, depending on how different L and L’ are, how the models are created, and what randomness the algorithms use internally.
If for a fixed algorithm (e.g. "decision tree") the models for different training sets L and L’ tend to differ a lot, we call this algorithm a high-variance one.
Of course, this is not a precise definition, but a precise one is also unnecessary for this article. In the following, we will use graphics to determine whether an algorithm has a higher variance than another algorithm.
If you are interested in the mathematics (cheers!), I can recommend the dissertation of Gilles Louppe [1] and the book of Shai Shalev-Shwartz and Shai Ben-David [2], which explains the theoretical foundations of Machine Learning in great detail.
Let us get back to our comparison of decision trees and linear regression. We will use the following running example: X=[0, 10] and Y=ℝ, i.e. the feature space is of dimension 1, and this single feature x can take real values between 0 and 10, while the labels can take any real value.
In our example, we define a distribution D as follows: the feature x is chosen uniformly between 0 and 10, and the label y is computed explicitly via the hidden function
y = h(x) = 3\sin(x) + x
The function h describes the underlying structure of the labels; it is the truth that we want to learn about the labels. We call it hidden since we will not give this information to the algorithms. They have to figure it out on their own. 🙂
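To make the black-box picture concrete, here is a minimal sketch of such a sampling procedure in Python. The helper name generate_data_1d is my own invention, and the labels are deterministic, y = h(x), as described above:

import numpy as np

# The hidden truth h(x) = 3*sin(x) + x that the algorithms should discover.
def h(x):
    return 3 * np.sin(x) + x

# One "button press" per sample: x is drawn uniformly from [0, 10], y = h(x).
def generate_data_1d(n_samples):
    x = np.random.uniform(0, 10, n_samples)
    return x.reshape(-1, 1), h(x)

# Querying the distribution three times for 10 samples each gives three training sets.
training_sets = [generate_data_1d(10) for _ in range(3)]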
Following the reasoning above, if we query our distribution D three times for 10 samples each, we might end up with the following three training sets:

Let us use the rightmost training set and plot the results of fitting a decision tree and a linear regression.

Bias and Variance of Decision Trees and Linear Regression
Let us conduct the same experiment 3000 times, i.e. with 3000 independently sampled training sets of size 10 each. On the left side, we see the results of the decision trees; on the right side, the linear regression results are stacked on top of each other.

Here we can see that the decision trees (left side) fit the data quite well on average. People also refer to this property as decision trees having a low bias. Meanwhile, for linear regression on the right side, the model clearly cannot capture the complex pattern of the underlying label structure. We say that linear regression has a high bias; in this case, it is not able to learn the truth.
Yet, if you consider the vertical width of these black tubes, the one stemming from the decision trees is wider than the linear regression one on the right. This means that the decision tree predictions wiggle around more extremely than the linear regression predictions when the training dataset is re-sampled, which we refer to as decision trees having a high variance and linear regression having a low variance.
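For the curious, here is a sketch of how such tube pictures can be produced; this is my reconstruction, not the article's original plotting code, and it assumes the generate_data_1d helper from above:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

grid = np.linspace(0, 10, 300).reshape(-1, 1)
fig, (ax_tree, ax_lin) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for _ in range(3000):
    x, y = generate_data_1d(10)
    # Overplot each model's prediction curve with a low alpha;
    # 3000 nearly transparent curves stack into the black tubes.
    ax_tree.plot(grid.ravel(), DecisionTreeRegressor().fit(x, y).predict(grid), color='black', alpha=0.01)
    ax_lin.plot(grid.ravel(), LinearRegression().fit(x, y).predict(grid), color='black', alpha=0.01)
plt.show()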

What we actually want are algorithms with a low bias (they hit the truth on average) and a low variance (they do not wiggle around the truth too much). Luckily, there are numerous ways to lower the bias (e.g. with a technique called Boosting) and also ways to lower the variance. The latter can be achieved with so-called Bagging. The good thing about Bagging is that it does not increase the bias again, which we will motivate in the following section.
That is why the effect of using Bagging together with linear regression is low: you cannot decrease the bias via Bagging, only via Boosting. The funny thing is that it has proven useful to combine decision trees with Boosting, too. In this case, heavily pruned decision trees are used; they have a lower variance (at the cost of a higher bias), and Boosting then takes care of reducing the bias.
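For completeness, a minimal sketch of that combination in scikit-learn; note that GradientBoostingRegressor limits the tree depth instead of pruning, and the parameter values here are only illustrative:

from sklearn.ensemble import GradientBoostingRegressor

# Boosting stacks many shallow (high-bias, low-variance) trees and
# successively reduces the bias of the ensemble.
gb = GradientBoostingRegressor(n_estimators=100, max_depth=2)
# Usage is the standard scikit-learn API: gb.fit(x_train, y_train), gb.predict(x_test).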
Bagging
In this section, we will see what Bagging does, why it works, and how to see the decrease in variance.
An Easy Motivation
Imagine we have access to the standard normal distribution, i.e. the mean of an observed value is 0 and the variance is 1. Let us assume that we love to see values around 0 (just as we love to see prediction functions around 3sin(x)+x). But the variance of 1 is too large for our taste (just like the width of the black tubes), and we search for a way to decrease it. An easy way to do this is to sample b values from the standard normal distribution and take their average. The following result is well-known and easy to verify:
\mathrm{Var}\left(\frac{1}{b}\sum_{i=1}^{b} X_i\right) = \frac{1}{b} + \rho \quad \text{with} \quad \rho = \frac{1}{b^2}\sum_{i \neq j} \mathrm{Cov}(X_i, X_j)
Thus, by averaging, we simulate drawing from another normal distribution with the same mean but with a smaller variance, as long as ρ is not too big. This is great because we get values closer to zero with a higher probability than before!
In the special case of independent random variables (ρ=0) and b=100, for example, the variance drops from 1 to 0.01. The result is the following:

Attention: if the random variables Xᵢ are all pairwise correlated with correlation 1, the formula above gives ρ=(b-1)/b, i.e. the variance of the average will be 1 again. This corresponds to the case where each sample is, in fact, the same number. Averaging over a lot of copies of the same number does not give us any new information, so this is as good as drawing only a single value.
In the best case, we can average independent samples. The more correlated they are, the less useful they become in the averaging process.
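A quick numerical sanity check of the two extreme cases; this is a sketch, and b = 100 as well as the repetition count are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
b, repetitions = 100, 100_000

# Independent samples: the variance of the mean drops to roughly 1/b = 0.01.
means_indep = rng.standard_normal((repetitions, b)).mean(axis=1)
print(means_indep.var())

# Perfectly correlated "samples": averaging b copies of one draw changes nothing,
# so the variance stays at roughly 1.
single_draws = rng.standard_normal(repetitions)
means_corr = np.tile(single_draws[:, None], (1, b)).mean(axis=1)
print(means_corr.var())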
The Core Idea of Bagging
Now, the useful insight is that we can do the same with prediction models. Running a decision tree algorithm on a randomly drawn training dataset gives us a model, which essentially means sampling a function from a distribution. Averaging several such models gives us another model (e.g. a random forest) with the same bias but a lower variance. On average, this ensemble model is closer to the truth than a single decision tree.
But the question is: how strongly are these functions correlated? Consider the following: if we come across a dataset, we can fit a single decision tree on it. So far, so good. But if we do it again, the result will be (nearly) the same in the case of decision trees. This means that the functions we sample this way are highly correlated (ρ≈1) and do not improve upon a single decision tree.
The correlation is not necessarily exactly 1, since decision tree algorithms occasionally have to break ties, which can be done at random. But since this is the only source of randomness, it does not produce trees that are fundamentally different from each other.
Somehow we have to decorrelate these trees, and we will see how to do this in the next section.
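As a preview of that decorrelation, here is a minimal hand-rolled Bagging sketch; the helper name and the bootstrap-with-replacement scheme are my choices, not a fixed prescription. Each tree sees a different bootstrap sample of the training set (given as numpy arrays), and the ensemble prediction is the plain average of the tree predictions:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees_predict(x_train, y_train, x_test, n_trees=100):
    predictions = []
    for _ in range(n_trees):
        # Bootstrap: draw n samples with replacement, so every tree
        # is grown on a slightly different training set.
        idx = np.random.randint(0, len(y_train), size=len(y_train))
        tree = DecisionTreeRegressor().fit(x_train[idx], y_train[idx])
        predictions.append(tree.predict(x_test))
    # Averaging the single-tree predictions gives the ensemble prediction.
    return np.mean(predictions, axis=0)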
Towards Random Forests
Random forests were invented by Leo Breiman [3]. The idea is to fit numerous decision trees on the training set in a special way, giving an equally large number of tree models (= functions). Afterward, these trees are combined into a single model, e.g. by averaging their outputs for any given input x, which makes the random forest a special Bagging method. This results in a model with a lower variance, similar to what we have seen before with normally distributed random variables.
The idea behind getting many trees that are not maximally correlated is the following:
- Use a random subset of the training samples for each tree.
- Use a random subset of features in each step of growing each tree.
Having two sources of randomization helps to reduce the correlation between different trees even more than using only one of them. Feel free to add more sources of randomness if you happen to design a new Bagging algorithm! There are various other methods of combining single decision trees, for example Extremely Randomized Trees by Geurts et al. [4].
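In scikit-learn, these two sources of randomness map to parameters of RandomForestRegressor; a minimal sketch with purely illustrative values:

from sklearn.ensemble import RandomForestRegressor

# bootstrap=True gives every tree a random sample of the training set;
# max_features limits how many features each split may consider.
rf = RandomForestRegressor(n_estimators=100, bootstrap=True, max_features=1)
# Usage is the same as for a single tree: rf.fit(x_train, y_train), rf.predict(x_test).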
Comparison of the Variance between Decision Trees and Random Forests in one Dimension
Let us draw 10 samples from our distribution again and fit a decision tree and a random forest containing 100 decision trees. We repeat this procedure 1000 times and get the following picture:

We see that the vertical width of the red tube formed by the random forests is smaller than that of the decision trees’ black tube. So random forests have a lower variance than decision trees, as expected. Furthermore, it seems that the centers of the two tubes coincide, which means that the process of averaging did not change the bias. We still hit the underlying true function 3sin(x)+x quite well.
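If you prefer numbers over eyeballing tube widths, you can estimate the pointwise variance of the predictions across the 1000 re-fitted models; this is a sketch that assumes the generate_data_1d helper from above, with an arbitrary grid size:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

grid = np.linspace(0, 10, 200).reshape(-1, 1)
tree_preds, forest_preds = [], []
for _ in range(1000):
    x, y = generate_data_1d(10)
    tree_preds.append(DecisionTreeRegressor().fit(x, y).predict(grid))
    forest_preds.append(RandomForestRegressor(n_estimators=100).fit(x, y).predict(grid))

# Variance across the 1000 models at each grid point, averaged over the grid.
print(np.var(tree_preds, axis=0).mean())    # the wider black tube
print(np.var(forest_preds, axis=0).mean())  # the narrower red tube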
Note that the random forest algorithm could not show its full potential here, since we have used a dataset with only one feature, which every single decision tree has to use. So the 100 decision trees inside the random forest can only differ in the training samples that were chosen to grow each tree. In this case, the random forest algorithm collapses into a simpler Bagging algorithm that only uses different training samples for each tree.
If we want to widen the gap between the variances while still being able to interpret the results visually, we have to move to a 2-dimensional feature space. This allows the random forest algorithm to choose exactly one of the two available features at random at each step within the algorithm.
Comparison of the Variance between Decision Trees and Random Forests in two Dimensions
Let us define a distribution for the training data that is similar to the one we used in the 1-dimensional case. We choose X=[0, 10]² and Y=ℝ, where D samples a point (x, x’) uniformly from the square with vertices (0, 0), (0, 10), (10, 0), and (10, 10), and
y = 3\sin(x + x') + x - x' + \varepsilon \quad \text{with} \quad \varepsilon \sim \mathcal{N}(0, 1)
A random dataset comprising 50 points can look like this:

Now, let us look at how the variance behaves for decision trees and random forests in this context. Enjoy the results!
Let us first start with the case of decision trees. We use 9 different training datasets for growing 9 different trees.

Looks familiar? 😉
Now, let us do the same with random forests. Here, we again train 100 decision trees per random forest, each on a different subset of samples and using only one of the two given features, chosen at random, at each step! Each model is trained on 50 random sample points.

Conclusion
It has become evident that high-variance algorithms change their outcome (the model) drastically when the training set changes. This is bad because we can never know how far away our concrete model is from the truth, even if the bias of our algorithm is zero.
But we learned how to increase our chance of getting a good model using Bagging. We have also gained an intuition for why Bagging lowers the variance while leaving the bias unchanged, and we have seen these results in a lot of illustrations.
References
[1] G. Louppe, Understanding Random Forests: From Theory to Practice (2014), Dissertation
[2] S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms (2014), Cambridge University Press
[3] L. Breiman, Random Forests (2001), Machine Learning 45(1): 5–32
[4] P. Geurts, D. Ernst and L. Wehenkel, Extremely Randomized Trees (2006), Machine Learning 63(1): 3–42
I created all formulas using LaTeX. For the other graphics, I used the Python library matplotlib together with numpy. For the model training, I used scikit-learn.
Acknowledgment
My thanks go to Dr. Patrick Bormann for proofreading and for giving a lot of helpful advice on improving my article. Also thanks to Andre Esser for his help!
Bonus: Code for the Mosaic
Please think of me when you are starting to generate money with this decision tree mosaic art. 😀
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Sample from the distribution with a true function f.
def generate_data_2d(f, n_samples):
    x1 = np.random.uniform(0, 10, n_samples)
    x2 = np.random.uniform(0, 10, n_samples)
    y = f(x1, x2) + np.random.randn(n_samples)
    return np.vstack([x1, x2]).transpose(), y

# Parameters to play around with.
f = lambda x1, x2: 3 * np.sin(x1 + x2) + x1 - x2
n_samples = 50
n_rows = 3
n_cols = 3

# Increase numbers to remove white spaces in the pictures.
n_points = 100
size_points = 6

# Prepare the plotting.
fig = plt.figure(constrained_layout=True, figsize=(12, 12))
all_points = np.array([(x1, x2) for x1 in np.linspace(0, 10, n_points) for x2 in np.linspace(0, 10, n_points)])

# Start plotting.
for i in range(1, n_rows * n_cols + 1):
    # Get a random training set.
    x, y = generate_data_2d(f, n_samples)
    # Train a decision tree.
    dt = DecisionTreeRegressor()
    dt.fit(x, y)
    predictions = dt.predict(all_points)
    # Create one mosaic picture.
    ax = fig.add_subplot(n_rows, n_cols, i)
    ax.axis('off')
    ax.scatter(all_points[:, 0], all_points[:, 1], c=predictions, s=size_points)

plt.show()
I hope that you learned something new, interesting, and useful today. Thanks for reading!
As the last point, if you
- want to support me in writing more about machine learning and
- plan to get a Medium subscription anyway,
why not do it via this link? This would help me a lot! 😊
To be transparent, the price for you does not change, but about half of the subscription fees go directly to me.
Thanks a lot, if you consider supporting me!
If you have any questions, write me on LinkedIn!