
The Great British Baking Show: Random Forests Edition

Introducing random forests step-by-step so you can see every ingredient as it's combined

Photo by Rod Long on Unsplash

If only machine learning could be as delicious as it is oftentimes perplexing. While learning data science concepts, I’ve found that it’s a good idea to approach them the same way as learning to bake: start simple, grab the basic ingredients, and combine them yourself one by one until you get a feel for it. APIs are a wonderful resource, but if all you know how to do is use them, it’s a bit like buying muffins and a can of frosting to make cupcakes and calling yourself a baker.

Welcome to the tent.

For your first technical, Paul and Prue would like you to make a bread vs. cake classifier

You’re probably thinking: I’ve had cake before. I’ve eaten loads of bread. I don’t need some hoity-toity algorithm to tell me the difference.

Think again.

Banana bread. Babka. New England cornbread. Little brioches in cupcake wrappers with a light dusting of icing sugar and a chocolate straw. Daunted yet? You should be. The wide and diverse world of baking is the kind of stuff Machine Learning was made for.

Let’s start by gathering some training data. As with any good classifier, you need to make sure you have both samples that are obvious exemplars of their class and samples that are somewhat on the edge. I’ve color-coded it so that yellow is bread and blue is cake.

You don’t get a perfect tree in the forest

That’s a quote from the ever-chill Selasi justifying his matte chocolate finish, but it applies just as well to random forests. The power of a random forest comes from the fact that each tree is never allowed to be perfect. This is due to bagging (bootstrapping + aggregating), which means that both the samples and the features for each tree in the forest are only a subset of the full set of samples and features. In the case of samples, they’re bootstrapped: each tree’s samples are drawn with replacement, so the same sample can appear more than once in a single tree’s training set.

Since no single tree sees the whole set of samples or features, but the final answer is aggregated from the majority vote of all the trees, a random forest guards against overfitting while taking advantage of the "wisdom of the crowd" effect.

Get ready, get set…BAKE

We are going to show what happens during the training of one random forest, which will end up having just three trees made from four training samples and three features.

For our first round of bagging, we bootstrap the samples. Note how one of the bread samples, corn bread, ended up getting duplicated because we chose 4 samples (the size N of the total sample set) randomly with replacement.

We also select a proportion of the features, highlighted in purple.
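As a minimal sketch, here’s what one round of bagging might look like in code. The three features come straight from the walkthrough; of the four training samples, corn bread, devil’s food, and brioche all appear elsewhere in the article, while banana bread is a stand-in for the fourth.

import random

samples  = ["corn bread", "banana bread", "brioche", "devil's food"]
features = ["flour ratio", "sugar ratio", "frosting"]

# Bootstrap: draw N samples *with* replacement, so duplicates can show up
bootstrap = random.choices(samples, k=len(samples))

# Feature bagging: each tree only sees a subset of the features
feature_subset = random.sample(features, k=2)

print(bootstrap)        # e.g. ['corn bread', 'brioche', 'corn bread', "devil's food"]
print(feature_subset)   # e.g. ['flour ratio', 'frosting']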

Now flour ratio and frosting have to compete to be the top node in the first tree by being the best separator of B and C. The values of the leftmost feature, flour ratio, span from 0.24 to 0.4. We take its smallest value and first try a split at FR > 0.24.

Amazingly, we get a pure split on our first try! FR > 0.24 perfectly separates breads from cakes, which makes for a Gini score of 0.0 in each resulting node (see below for an explanation of what this is). Let’s see what happens with the next split, FR > 0.38.

This split is impure: the top node has two B’s and one C, even though the bottom node has just the one C. The top node ends up with a Gini score of 0.444, which is very impure, indeed.

If we look at the only available split for the next feature, F==N, we see that it’s just as impure.

Thus, we go back to our first lucky split, FR > 0.24, and make our first tree:

Since it produces two pure nodes, we’ve hit a stopping criterion. After all, if your node contains ["C", "C"], how do you split it any further? We ship the tree off to the forest and move on to the next one.
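In code, that stopping criterion is about as simple as checks get. Here’s a quick sketch, assuming class labels are kept in plain lists like the ones in the Gini example below:

def node_is_pure(labels):
    # Nothing left to split once a node holds only one class
    return len(set(labels)) == 1

node_is_pure(["C", "C"])
> True
node_is_pure(["B", "B", "C"])
> False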

Commercial break: What’s a Gini score?

def _prob(s, j):
    # Fraction of items in s that belong to class j
    return len([x for x in s if x == j]) / len(s)

def gini_score(s, classes):
    # Gini impurity: 1 minus the sum of squared class probabilities
    return 1 - sum([_prob(s, j)**2 for j in classes])

A Gini score is not a classification score like precision or recall, because it doesn’t care whether you got anything right or not; it only measures impurity, or how mixed the different classes in a set are. When the classes in an array are pure, it doesn’t matter which classes they are; the score is the same:

classes = ["B", "C"]
gini_score(["B", "B"], classes)
> 0.0
gini_score(["C", "C", "C"], classes)
> 0.0

However, when the set gets to be impure:

gini_score(["B", "C", "C", "C"], classes)
> 0.375
gini_score(["B", "C", "C"], classes)
> 0.44444444
gini_score(["B", "C"], classes)
> 0.5

We won’t go into it now, but the Gini score assumes class balance here, and to some extent, imbalance can be mitigated by using a weighted Gini score.
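To make the head-to-head between candidate splits concrete, here’s one common way to score a split in code: average the Gini of the two child nodes, weighted by how many samples land in each (weighted by node size, which is a different kind of weighting than the class weights just mentioned). This sketch reuses gini_score from above, and the flour-ratio values are illustrative, chosen to reproduce the FR > 0.24 and FR > 0.38 outcomes from the first tree:

def split_quality(values, labels, threshold, classes):
    # Weight each child node's Gini score by its share of the samples
    above = [lab for val, lab in zip(values, labels) if val > threshold]
    below = [lab for val, lab in zip(values, labels) if val <= threshold]
    n = len(labels)
    return (len(above) / n) * gini_score(above, classes) \
         + (len(below) / n) * gini_score(below, classes)

classes     = ["B", "C"]
flour_ratio = [0.24, 0.24, 0.38, 0.40]   # illustrative values
labels      = ["B",  "B",  "C",  "C"]
split_quality(flour_ratio, labels, 0.24, classes)
> 0.0
split_quality(flour_ratio, labels, 0.38, classes)
> 0.33333333

The lower the score, the better the split, so FR > 0.24 wins handily here.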

Bakers! You have one hour left! You’re halfway through

The next tree begins the same way as the last: with bagging. If you look carefully, you’ll see that the set of samples this round is slightly different because every round of bootstrapping is random. Instead of two corn bread samples, we have two devil’s food samples.

As with last time, we randomly select two features and have them compete for best splitting power of the bootstrapped samples.

Again, let’s try the first split of the leftmost feature, SR > 0.23.

This is an impure split, with Gini scores of 0.444 and 0.0. Let’s try the next split, SR > 0.25.

This is unfortunately also impure, with Gini scores of 0.5 and 0.0. You can see that this is also the case if you try to split by frosting.

It seems that, despite being imperfect, SR > 0.23 wins by a narrow margin. The tree doesn’t stop there, however. We can split the resulting nodes again, creating the branching tree structure popularly associated with this algorithm.

The top node didn’t need more splitting: it only had one sample. The second node could now be split by the other feature, frosting, via F==N. It turns out that once you make this kind of branching split, you have pure splits at the leaf nodes. The resulting tree looks like this:
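In code form, that same branching structure reads as nested conditionals. This is just a sketch: the prediction section below sends a frosting-free sample down the F==N branch to bread, so the other pure frosting leaf must be cake, but the single-sample leaf on the far side of the sugar-ratio split never gets a label in the walkthrough and is left as a placeholder (the dictionary keys are illustrative, too):

def round_two_tree(sample):
    # Top of the tree: the sugar-ratio split
    if sample["sugar_ratio"] > 0.23:
        # This child node was still impure, so it gets split again on frosting
        if sample["frosting"] == "N":
            return "B"    # bread
        return "C"        # cake (the other pure leaf)
    # The single-sample leaf; its label isn't shown in the walkthrough
    return "?"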

Bakers! Your time is up. Please step away from your forest

The wisdom of the trees

We’ve manually trained two trees together, but let’s add in another tree for good measure and take a look at the forest we’ve created.

Now let’s use this forest to predict a label for a sample that doesn’t have one yet. This sample is very closely based on one of our original samples:

What does the classifier think, cake or bread?

Let’s take the first tree. It needs to look at whether FR > 0.24. At 0.61, our sample is quite a bit above that, so we classify it as cake, and the tree terminates there.

The next tree (this is actually a tree trained off-camera) also just looks at one feature, frosting. Our sample has no frosting, so this tree says it’s bread.

Our last tree starts by looking at sugar ratio, and we can see that it falls to the left side of the split on the tree, since 0.26 > 0.23.

That node is split by frosting, and F==N on the unlabeled sample makes it go to the left split, finally classifying this sample as bread.

The judges deliberate and decide

Huzzah! We have our answer. Since this sample was based on brioche, it makes intuitive sense, despite the appallingly small number of samples and features. Also, as a special bonus, note that if you had one of those cupcake brioches with the frosting, it would in fact have flipped the predicted label over to cake 😉
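To close the loop, here’s a sketch of that deliberation in code, with each tree written out as a tiny function in the order the judges consulted them. The sample’s values (flour ratio 0.61, sugar ratio 0.26, no frosting) come from the walkthrough; the label of tree three’s unvisited leaf is a placeholder, as before:

from collections import Counter

def tree_one(s):
    # First tree: a single split on flour ratio
    return "C" if s["flour_ratio"] > 0.24 else "B"

def tree_two(s):
    # The off-camera tree: a single split on frosting
    return "B" if s["frosting"] == "N" else "C"

def tree_three(s):
    # The tree from our second round of baking: sugar ratio, then frosting
    if s["sugar_ratio"] > 0.23:
        return "B" if s["frosting"] == "N" else "C"
    return "?"   # unvisited single-sample leaf; label not shown above

sample = {"flour_ratio": 0.61, "sugar_ratio": 0.26, "frosting": "N"}
votes = [tree(sample) for tree in (tree_one, tree_two, tree_three)]
Counter(votes).most_common(1)[0][0]
> 'B'

# Add frosting (the cupcake-brioche case) and the vote flips to cake:
Counter([tree({**sample, "frosting": "Y"}) for tree in (tree_one, tree_two, tree_three)]).most_common(1)[0][0]
> 'C'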

Conclusion

I hope this recipe for a random forest has inspired some of you to try it yourself at home. See you next time in the tent!

