
There is something genuinely unique about random forests.
Our usual variance reduction techniques – like limiting the depth of a decision tree, or adding a norm penalty to a logistic regression – work by penalizing the complexity of a single hypothesis. The variance reduction in a random forest, however, comes from simultaneously exploring **multiple competing hypotheses**.
Note the word "competing" here. Diversity among individual predictors is key to the performance of ensemble models like random forests. And in the language of statistics, "diversity" means that the predictions of the individual models are uncorrelated – or even anti-correlated – with one another.
Let’s do something few machine learning classes do and dive deep into the statistics of random forests. To do this, we’ll derive some of the results in the seminal Breiman paper in a reader-friendly way, and ultimately answer the meaty question: So what?
Defining Our Metric
Recall that the output of a random forest classifier comes from taking the proportions of individual trees voting for different classes and picking the class with the largest corresponding proportion. Therefore, a random forest that correctly classifies a point will have its maximum proportion (a quantity between 0 and 1) belong to the correct class.
Let’s attach a function to our random forest called the margin function; this function takes a single (X, Y) pair from our dataset’s distribution, and outputs the difference between the forest’s proportion of votes for the correct class Y and the forest’s maximum proportion among all incorrect classes.
All you need to know about the margin function is that a larger positive margin at a point means greater cushion room in the forest’s "correctness"; a margin value below zero, however, means that the forest straight up misclassifies the point (with more negative margins being worse.)
Example: A margin of 0.5 at point (X, Y) means that the forest correctly picked the class Y, with a vote share 50 percentage points higher than that of any incorrect class. Conversely, a margin of -0.3 means that the forest misclassified the point, with the winning incorrect class receiving a vote share 30 percentage points higher than that of the correct class Y.
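As a quick sketch, here is the same computation in plain NumPy. The `margin` helper below is hypothetical (nothing standard), and the vote proportions are made up to mirror the 0.5 and -0.3 examples above:

```python
import numpy as np

def margin(vote_proportions, correct_class):
    """Margin = (vote share of the correct class)
              - (largest vote share among the incorrect classes)."""
    proportions = np.asarray(vote_proportions, dtype=float)
    correct_share = proportions[correct_class]
    incorrect_shares = np.delete(proportions, correct_class)
    return correct_share - incorrect_shares.max()

# 70% of trees vote for class 0, 20% for class 1, 10% for class 2.
print(round(margin([0.7, 0.2, 0.1], correct_class=0), 2))  # 0.5  -> correct, with cushion
print(round(margin([0.2, 0.5, 0.3], correct_class=0), 2))  # -0.3 -> misclassified
```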
Note: For the rest of the article, we’ll work with random forests that have a very large number of trees, i.e. random forests in the limit, so that their outputs are effectively not random once you’ve fixed your training dataset. This does not change any of our conclusions since even moderately large forests quickly converge to their limits, but it does simplify our analysis.
Now let’s define our classification metric: the probability of sampling a training dataset of points D, and independently sampling a single evaluation point (X, Y) from the same distribution, such that the forest fitted on D has a positive margin on that (X, Y) point:
$$
P_{D,\,(X,Y)}\big(\mathrm{mg}_D(X, Y) > 0\big),
$$

where $\mathrm{mg}_D(X, Y)$ denotes the margin function of the (infinite) forest grown on dataset D, evaluated at the point (X, Y).
Again, a positive margin at a point means that the forest correctly classifies the point, so the probability above is totally equivalent to the population % accuracy of our forest-generating procedure. (If you’re struggling to see this, try this more explicit reframing: If you sample a validation set from the population distribution and the probability of a correct classification on any randomly sampled point is Q, then the % accuracy on the validation set is roughly Q.)
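If you want to see this equivalence on data, here is a rough scikit-learn sketch (the dataset and all numbers are arbitrary choices, not anything from the paper). With fully grown trees, `predict_proba` is essentially the per-class vote share, so the fraction of validation points with a positive margin matches the validation accuracy, up to tie-breaking:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# With fully grown trees, predict_proba is (approximately) the per-class vote share.
votes = forest.predict_proba(X_val)                  # shape (n_points, n_classes)
correct_share = votes[np.arange(len(y_val)), y_val]  # vote share of the true class
votes_masked = votes.copy()
votes_masked[np.arange(len(y_val)), y_val] = -np.inf # ignore the true class
margins = correct_share - votes_masked.max(axis=1)   # margin at each validation point

print((margins > 0).mean())        # fraction of positive margins...
print(forest.score(X_val, y_val))  # ...matches the validation accuracy (ties aside)
```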
Now here comes a phrase you may not have heard in a long while: the Chebyshev Inequality.
Statistics Power Move #1: You can (quadratically) bound the tail probabilities of any random variable using only its expectation and variance.
With Chebyshev’s Inequality, we can calculate a concrete upper bound on the probability of sampling a sub-zero margin – and therefore a lower bound on the probability of a positive margin – using just the expectation and variance of the margin function. Intuitively speaking, a margin with low variance and a large positive mean will rarely take on sub-zero values, and Chebyshev just makes this rigorous. For now, just know that we can tweak and trade off mean and variance to get a larger lower bound on the probability expression above (our population accuracy.)
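To make that bound explicit: applying Chebyshev’s Inequality to the margin function (this requires the expected margin to be positive) gives

$$
P_{D,\,(X,Y)}\big(\mathrm{mg}_D(X, Y) \le 0\big)
\;\le\;
\frac{\mathrm{Var}_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]}{\Big(E_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]\Big)^{2}},
$$

so the population accuracy is at least one minus the ratio on the right. We raise that floor either by growing the mean margin or by shrinking its variance.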
Bias vs. Variance at a High Level
Since the expectation and variance of the margin function are key to our accuracy trade-off, it behooves us to get an intuitive understanding of the two.
Let’s look at each of these terms, starting with expectation:
$$
E_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]
$$
A larger expectation means that any forest we generate is usually correct with a larger margin than it is incorrect. From our knowledge of probability and Chebyshev’s Inequality, a larger expectation (for a fixed variance) shrinks the upper bound on the probability of sub-zero margin values, i.e. it raises the lower bound on the population accuracy.
One slightly lowers the expectation above by randomizing individual trees with methods like bootstrap sampling or feature subsampling. This is one of the trade offs that a forest makes: marginally increasing the "bias". But what do we gain by increasing bias?
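(Before answering that, here is what those randomization knobs look like concretely in scikit-learn’s RandomForestClassifier – a quick sketch with arbitrary settings, not a tuning recommendation:)

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree sees a bootstrap resample of the rows and, at every split,
# only a random subset of the features -- both knobs trade a little bias
# for less similar trees.
forest = RandomForestClassifier(
    n_estimators=500,
    bootstrap=True,       # row randomization: resample rows with replacement per tree
    max_features="sqrt",  # feature randomization: consider sqrt(n_features) per split
    random_state=0,
)
```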
This is where variance comes in:
$$
\mathrm{Var}_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]
$$
Smaller variance for a fixed positive mean implies that the forests we generate are less likely to have a sub-zero margin on any sampled point. As with a larger expectation, a smaller variance shrinks the Chebyshev upper bound on the probability of sampling a sub-zero margin, raising the lower bound on our population accuracy.
Note that we can’t just build infinitely many trees to shrink this variance down to zero; if all of our trees are identical, our random forest is indistinguishable from a lone decision tree, no matter how many trees it contains. Clearly, having differences between trees matters quite a bit. What is the rigorous quantity representing "difference between trees" here?
As it turns out, covariance between tree errors is all we need. (Note that covariance is just correlation scaled by the standard deviations involved.) But how? And where does this covariance term come from? We’ll need to break down the variance term to find out.
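Before we do, here is one rough way to put a number on that "difference between trees": fit a forest, collect each tree’s errors on held-out data (scikit-learn exposes the fitted trees via `estimators_`), and average the pairwise correlations. The dataset and settings below are arbitrary; dialing up the randomization (e.g. a smaller `max_features`) should typically push this average down.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=8,
                           random_state=0)  # binary 0/1 labels
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# One row per tree: 1 where that tree misclassifies a validation point, else 0.
# (With 0/1 labels, each sub-tree's encoded prediction matches the original coding.)
errors = np.array([
    (tree.predict(X_val).astype(int) != y_val).astype(int)
    for tree in forest.estimators_
])

corr = np.corrcoef(errors)                       # tree-by-tree error correlations
off_diag = corr[~np.eye(len(corr), dtype=bool)]  # drop the 1s on the diagonal
print(off_diag.mean())  # average pairwise error correlation between trees
```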
Breaking Down Variance
We can use the moment definition of variance to expand the variance of the margin function as follows:
$$
\mathrm{Var}_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]
\;=\;
E_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)^2\big]
\;-\;
\Big(E_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]\Big)^{2}
\tag{Fig. 4}
$$
Let’s zoom in on the first expectation term on the right hand side.
Remember, the margin function at (X, Y) is the difference between the forest’s proportion of votes for the correct class Y and its largest proportion of votes among all incorrect classes. And we can represent the proportion of total trees voting for some class as the probability of a randomly selected tree voting for that class. (Recall that a forest is grown from a particular dataset through independent and identically distributed draws of individual trees.)
This connection between forest vote proportions and sampling probabilities brings us to our second statistics power move:
Statistics Power Move #2: All proportions – or probabilities – can be rewritten as the means of binary/indicator/Bernoulli random variables.
To reiterate, we can write a forest’s proportion of votes for a given class C as the mean over each tree’s binary vote for class C – where the vote is 1 if the tree voted for C, and 0 if it didn’t. In the limit (i.e. with infinitely many randomly generated trees), this "mean over trees" becomes an expectation, taken over the distribution from which we are randomly sampling individual trees.
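Written out for some class C – with Θ denoting the randomness used to grow one tree on dataset D, and h(X; Θ, D) that tree’s prediction at X – the identity is:

$$
\underbrace{\frac{1}{N}\sum_{k=1}^{N} \mathbf{1}\{h_k(X) = C\}}_{\text{share of the } N \text{ trees voting for } C}
\;\longrightarrow\;
E_{\Theta}\big[\mathbf{1}\{h(X; \Theta, D) = C\}\big]
\;=\;
P_{\Theta}\big(h(X; \Theta, D) = C\big)
\quad \text{as } N \to \infty.
$$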
Let’s rewrite our margin function using this tree distribution and our frequency identity:
$$
\mathrm{mg}_D(X, Y)
\;=\;
E_{\Theta}\Big[\,\mathbf{1}\{h(X; \Theta, D) = Y\} \;-\; \mathbf{1}\{h(X; \Theta, D) = K\} \;\Big|\; D,\, (X, Y)\Big],
$$

where K is the incorrect class receiving the forest’s largest share of votes at X.
Quickly note that I’ve conditioned the expectation on the training dataset D and fixed evaluation point (X, Y), since the margin function on the left is evaluated on a fixed D and fixed (X, Y).
In this substitution, we’ve taken the difference in proportions of trees spitting out either class Y or K for input X (this is our original definition of the margin function evaluated at X), and turned it into the difference in the averages over a randomly sampled tree’s binary vote for class Y or K (the figure above has an average of differences, which is equivalent to a difference of averages.) To make the expression more compact, let’s call the difference of binary votes within the above expectation a "raw" margin function, or "rmg", associated with an individual tree.
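Spelled out in the same notation, a single tree’s raw margin is

$$
\mathrm{rmg}(\Theta, X, Y)
\;=\;
\mathbf{1}\{h(X; \Theta, D) = Y\} \;-\; \mathbf{1}\{h(X; \Theta, D) = K\},
$$

so the forest’s margin function is simply the expectation of rmg over randomly drawn trees.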
We can write the following new identity by first squaring both sides, then applying an expectation over D and (X, Y), and finally substituting the more compact raw margin function in place of the difference of binary votes:
$$
E_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)^2\big]
\;=\;
E_{D,\,(X,Y)}\Big[\Big(E_{\Theta}\big[\mathrm{rmg}(\Theta, X, Y)\big]\Big)^{2}\Big]
\tag{Fig. 6}
$$
Why did we do this? If you recall, the left-hand expectation is the second moment in our formula for variance in Fig. 4. We are well on our way to rewriting the variance term.
Now we’re going to pull a covariance expression out of the right-hand term above. Get excited!

Before we can do this, we’ll need to use three short yet clever tricks. The first trick is to rewrite the right-hand expectation in the expression above by introducing two independent, identically distributed tree distributions: "Trees" and "Trees-prime". Since these distributions are i.i.d., their expectations will be the same, so we can substitute the product of two (identically valued) expectations over "Trees" and "Trees-prime" in place of the squared expectation inside the right-hand term. Then, we get the following rewriting of the right-hand term:
$$
E_{D,\,(X,Y)}\Big[\,E_{\Theta}\big[\mathrm{rmg}(\Theta, X, Y)\big] \cdot E_{\Theta'}\big[\mathrm{rmg}'(\Theta', X, Y)\big]\Big]
$$
Note that the raw margin function is a unique property of a particular tree (just as the margin function is a unique property of a particular forest), so I’ve delineated the raw margin function of the tree sampled from "Trees" as "rmg" and the raw margin function of the tree sampled from "Trees-prime" as "rmg-prime".
The second trick uses the fact that the product of expectations of two independent random variables is equal to the expectation of the product. Since the distribution "Trees" and its twin "Trees-prime" are independent by construction, the evaluated "rmg" and "rmg-prime" functions in the expression above are themselves independent over draws of trees (upon fixing the evaluation point (X, Y) and dataset D generating our trees.) We can then use the second trick to substitute a joint expectation on the product of the two raw margin functions in place of the product of expectations. As a result, we can rewrite the expression as:
$$
E_{D,\,(X,Y)}\Big[\,E_{\Theta,\,\Theta'}\big[\mathrm{rmg}(\Theta, X, Y) \cdot \mathrm{rmg}'(\Theta', X, Y)\big]\Big]
$$
Finally, the third trick is to "permute" the expectations so that you have an expectation over training dataset D and the tree distributions on the outside, and an expectation over the evaluation point (X, Y) on the inside. To see how this works, it’ll first be helpful to consolidate some terms. Using the law of iterated expectation, we can rewrite the iterated expectation above as a total expectation:
$$
E_{D,\,(X,Y),\,\Theta,\,\Theta'}\big[\mathrm{rmg}(\Theta, X, Y) \cdot \mathrm{rmg}'(\Theta', X, Y)\big]
$$
Now, we’ll use the fact that (X, Y) is independent of the joint distribution over "Trees", "Trees-prime", and D to factor the above expectation into a new iterated expectation:
$$
E_{D,\,\Theta,\,\Theta'}\Big[\,E_{(X,Y)}\big[\mathrm{rmg}(\Theta, X, Y) \cdot \mathrm{rmg}'(\Theta', X, Y)\big]\Big]
$$
We are extremely close to pulling out a covariance term here (which, again, is just a scaled correlation!) Note that the term inside is the expectation of a product of two random variables, which is a component of one of the textbook formulas for covariance: for any two random variables A and B, Cov(A, B) = E[AB] − E[A]E[B]. Let’s rewrite the term inside the expectation in terms of covariance to get a new overall expression (this is as long as the expressions will get, I promise):
$$
E_{D,\,\Theta,\,\Theta'}\Big[\,\mathrm{Cov}_{(X,Y)}\big(\mathrm{rmg},\, \mathrm{rmg}'\big)
\;+\;
E_{(X,Y)}\big[\mathrm{rmg}\big] \cdot E_{(X,Y)}\big[\mathrm{rmg}'\big]\Big]
$$
Recall, from Fig. 6, that we’ve been calculating the second moment of the margin function all along. We can then substitute the above expression into the second moment in the moment definition of variance in Fig. 4. I’ll save some pages of repetitive math here; just know that we can consolidate the second term inside the above expectation with the second term on the right-hand side in Fig. 4. Making this substitution-and-consolidation gets us the following expression for the variance of the margin function, one that only involves the above covariance term (wrapped in an expectation) and an additional variance term:
$$
\mathrm{Var}_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]
\;=\;
E_{D,\,\Theta,\,\Theta'}\Big[\mathrm{Cov}_{(X,Y)}\big(\mathrm{rmg},\, \mathrm{rmg}'\big)\Big]
\;+\;
\mathrm{Var}_{D}\Big(E_{(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]\Big)
$$
These are the two components that make up the variance of our forest’s margin function over D and (X, Y). Let’s intuitively understand each term:
- The term on the left: Roughly speaking, this term represents how much the "errors" of any two randomly sampled trees "co-vary" over the population distribution of points, where this covariance is averaged over all possible combos of trees and training datasets producing those trees. Note that having one tree’s point-wise errors be uncorrelated with, or even opposite to, another tree’s errors reduces the overall value of the expression.
- The term on the right: This term is a lot less important since, for large training datasets D, the variance of the forest’s expected margin function over different training datasets is tiny (another way to say this is that two large random forests fitted on two i.i.d. large datasets will make very similar predictions.) We can then reasonably assume that the term on the right is much smaller than the term on the left.
Let’s eliminate the term on the right and rewrite the variance of the margin function using just the term on the left:
$$
\mathrm{Var}_{D,\,(X,Y)}\big[\mathrm{mg}_D(X, Y)\big]
\;\approx\;
E_{D,\,\Theta,\,\Theta'}\Big[\mathrm{Cov}_{(X,Y)}\big(\mathrm{rmg},\, \mathrm{rmg}'\big)\Big]
$$
Amazing! The bias-variance tradeoff of our forest is really a bias-covariance tradeoff, with the covariance in question being that of the errors of individual trees.
And the sharp-eyed reader may have noticed something even more mind-blowing: no part of this math is exclusive to ensembles of trees. Did you hear me mention "leaf", "node", or "split" anywhere?
That’s right; you can randomize just about any classifier, train many copies, and average them together to achieve a favorable bias-covariance tradeoff – one that is likely to get you better accuracy than any single one of those classifiers alone! Wild.
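Here is a sketch of that idea using scikit-learn’s BaggingClassifier wrapping a plain logistic regression – bootstrapped rows plus random feature subsets, nothing tree-specific. The dataset and numbers are arbitrary, and whether the ensemble actually wins depends on the data and how much you randomize:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)

# A randomized ensemble of logistic regressions: each copy sees a bootstrap
# resample of the rows and a random half of the features.
ensemble = BaggingClassifier(
    estimator=LogisticRegression(max_iter=1000),  # named base_estimator before sklearn 1.2
    n_estimators=200,
    max_features=0.5,   # feature subsampling decorrelates the copies
    bootstrap=True,     # row resampling
    random_state=0,
)

print(cross_val_score(LogisticRegression(max_iter=1000), X, y).mean())  # single model
print(cross_val_score(ensemble, X, y).mean())                           # randomized ensemble
```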
An Investing Analogy

In traditional portfolio theory, one goal of investing is to minimize the variance of your investment returns for a fixed average return.
Suppose your portfolio consists of a bunch of investments in various companies: Google, Coca-Cola, Reddit, etc. Then, the cumulative return of your investments is the sum of the individual returns. We can write the variance of your cumulative returns as:
$$
\mathrm{Var}\Big(\sum_{i} R_i\Big)
\;=\;
\sum_{i} \mathrm{Var}(R_i)
\;+\;
\sum_{i \neq j} \mathrm{Cov}(R_i, R_j),
$$

where $R_i$ is the return on stock $i$.
Let’s walk through both of the sums on the right.
The first sum on the right adds up the variances of each individual stock’s returns. It makes sense that the more a certain stock’s returns fluctuate, the more the overall portfolio returns also fluctuate. You can think of the variance of the return as a crude measure for the risk of an investment.
The second sum on the right is a little more interesting, and contains covariance terms (can you see where I’m going with this?) If two stocks co-vary negatively, it actually decreases the overall variance of the portfolio’s returns.
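To see the effect with numbers, take a made-up two-stock portfolio in which each return has variance 1 (so covariance and correlation coincide):

$$
\mathrm{Var}(R_1 + R_2)
= \mathrm{Var}(R_1) + \mathrm{Var}(R_2) + 2\,\mathrm{Cov}(R_1, R_2)
= \begin{cases}
4 & \text{if } \mathrm{Cov}(R_1, R_2) = +1 \ (\text{perfectly correlated}),\\
2 & \text{if } \mathrm{Cov}(R_1, R_2) = 0,\\
1 & \text{if } \mathrm{Cov}(R_1, R_2) = -0.5.
\end{cases}
$$

Perfect correlation leaves you no better off than doubling down on a single stock (Var(2R₁) = 4); zero correlation halves the variance; negative correlation shrinks it further still.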
This is why analysts always tell individual investors to diversify their investments. If you only hold stock in Gamestop, then your fortunes will swing as wildly as the fortunes of Gamestop. But if you hold stocks in a variety of different companies across different industries, then the individual ups and downs will to an extent cancel out, and the only risk you face is the risk of the entire market going down (what is called a "systematic" risk.) Diversification is the investing equivalent of not putting all your eggs in one basket.
Back to random forests – our individual trees (stocks) can sometimes be too speculative, chasing training accuracy at the cost of generalization. However, by randomizing the trees, we are exploring (investing in) diverse, less correlated "sectors" within the problem, and by taking the average we are making the output (our portfolio) less dependent on any one tree. By doing this, we create a robust yet nonetheless powerful model.
Conclusion
Let’s summarize our two main learnings:
- Random forests employ a unique method of variance reduction: ensembling individual models that have less correlated errors
- There is nothing here that is unique to tree-based ensembles. You can construct a random ensemble of any machine learning model, tweaking the randomization to improve accuracy
One thing to note is that gradient boosting may be doing something implicitly similar to the first bullet. With every new boosting iteration, we effectively upsample the "errors" of the current iteration and train a new model; finally, we add the new model to the existing collective. Because we upsample the previous iteration’s errors, the incremental models will tend to have a relatively lower error on those points while having a relatively higher error on points that previous iterations did well on. This is anti-correlation in action! Note, however, that no rigorous proof was presented here; I will leave the discussion on boosting vs. random ensembles to actual experts.
The second bullet is practically interesting. It suggests that many models can be improved simply by randomizing them and training many copies in parallel – something that is becoming increasingly easy to do with cheaper cloud compute.
Who knows? Maybe the next big Kaggle competition will come down to a clever randomization technique.
References
[1] L. Breiman, Random Forests (2001), Machine Learning 45, 5–32