Gradient Boosted Trees (XGBoost)
An intuitive explanation of GBT using the MNIST database
Hi everyone, welcome back to another article in the Visual Guide to Machine Learning series! We’ll learn yet another popular model ensembling method called Gradient Boosted Trees. If you haven’t already, check out the previous article to learn about Random Forests, where we introduce the concept of model ensembling and Decision Trees, the building blocks of these models. Check out the video down below for a 4-minute walkthrough of this article as well as links to previous videos in the series!
We’ll use gradient boosted trees to perform classification: specifically, to identify the digit drawn in an image. We’ll use MNIST, a large database of handwritten digits commonly used in image processing. It contains 60,000 training images and 10,000 testing images. Each of the 784 pixels in a 28×28 image is a feature, and there are 10 possible classes, the digits 0 through 9.
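As a rough sketch of that setup (this loading code is ours, not from the article; it pulls MNIST via scikit-learn’s fetch_openml and applies the standard 60,000/10,000 split):

from sklearn.datasets import fetch_openml

# Download MNIST: 70,000 grayscale 28x28 images, flattened to 784 pixel features each
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
y = y.astype(int)  # labels arrive as strings ("0".."9"); convert to integers

# Standard split: first 60,000 images for training, last 10,000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

print(X_train.shape)          # (60000, 784): one row per image, one column per pixel
print(sorted(set(y_train)))   # the 10 classes: digits 0 through 9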

Let’s first learn a bit more about this model. Gradient Boosted Trees and Random Forests are both ensembling methods that perform regression or classification by combining the outputs from individual trees. They both combine many decision trees to reduce the risk of overfitting that each individual tree faces. However, they differ in the way the individual trees are built, and the way the results are combined. Random forests use bagging to build independent decision trees and combine them in parallel. On the other hand, gradient boosted trees use a method called boosting. Boosting combines weak learners (usually decision trees with only one split, called decision stumps) sequentially, so that each new tree corrects the errors of the previous one.
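To make that contrast concrete, here is an illustrative sketch of the two ensembles in scikit-learn; the hyperparameters are just examples, not recommendations. The random forest grows deep, independent trees in parallel, while the boosted model grows shallow trees, decision stumps here, one after another:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Bagging: 100 deep trees trained independently on bootstrap samples, predictions averaged
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1)

# Boosting: 100 decision stumps (max_depth=1) added sequentially,
# each one fit to the errors of the ensemble built so far
boosted = GradientBoostingClassifier(n_estimators=100, max_depth=1, learning_rate=0.1)

# Both expose the same fit/predict interface, e.g. forest.fit(X_train, y_train)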

The first step is to fit a single decision tree. In our case, the tree looks like this:

We’ll evaluate how well this tree does using a loss function. There are many different loss functions we can choose from. For multi-class classification, Cross entropy is a popular choice. Here’s the equation for cross entropy, where p is the label and q is the prediction.
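For a single image, with p the one-hot label vector over the 10 classes and q the model’s predicted probabilities, the standard form of the cross-entropy loss is:

H(p, q) = − Σᵢ pᵢ log(qᵢ)

Since p is 1 at the correct class c and 0 everywhere else, this reduces to −log(q_c): the loss depends only on how much probability the model puts on the right answer.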

Basically, the loss is high when the label and prediction do not agree, and the loss is 0 when they’re in perfect agreement.
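For example, if the model puts probability 0.9 on the correct digit, the loss is −log(0.9) ≈ 0.11; if it puts only 0.1 on the correct digit, the loss jumps to −log(0.1) ≈ 2.3.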
Now that we have our first tree and the loss function we’ll use to evaluate the model, let’s add in a second tree. We want the second tree to be such that when added to the first, it lowers the loss compared to the first tree alone.
Keep in mind, we’re trying to move in a direction that lowers the loss; specifically, we look for the direction in which the loss decreases the fastest.

Mathematically, this is given by the negative derivative of loss with respect to the previous model’s output. Here’s what this looks like, where eta is the learning rate.
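Writing out the standard gradient-boosting update (a sketch of the step the figure illustrated), the new ensemble is:

F(2) = F(1) + η · h_2

where η (eta) is the learning rate and the new tree h_2 is fit to the negative gradient −∂L/∂F(1).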

We choose the learning rate so that each step doesn’t overshoot by walking too far in any one direction. At the same time, if the learning rate is too low, the model may need many more trees and take too long to converge to a good answer.
Therefore, we fit the second weak learner on the negative derivative of the loss function with respect to F(1), the ensemble at step 1. In other words, each new tree is fit using the gradient of the loss with respect to the previous model’s output. That’s why this method is called Gradient Boosting.
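To make “fit the weak learner on the negative gradient” concrete, here is a minimal hand-rolled sketch on a made-up binary problem (the data, names, and hyperparameters are ours for illustration, not from the article; with the logistic loss, the negative gradient is simply the residual y − p):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary problem: label is 1 when the two features sum to a positive number
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

eta = 0.1              # learning rate
n_trees = 50           # number of weak learners to add
F = np.zeros(len(y))   # the ensemble's output (log-odds), starts at zero
trees = []

for m in range(n_trees):
    p = sigmoid(F)                              # current predicted probabilities
    pseudo_residuals = y - p                    # negative gradient of the log loss w.r.t. F
    stump = DecisionTreeRegressor(max_depth=1)  # weak learner: a decision stump
    stump.fit(X, pseudo_residuals)              # fit the new tree to the negative gradient
    F += eta * stump.predict(X)                 # step a little in the loss-lowering direction
    trees.append(stump)

print("training accuracy:", np.mean((sigmoid(F) > 0.5) == y))

Libraries like xgboost build on this same loop with smarter tree construction and extra regularization, but the core idea is the one above.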

For any step ‘m’, Gradient Boosted Trees produce a model such that:
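F(m) = F(m−1) + η · h_m

That is, the model at step m is the model at step m−1 plus a small, learning-rate-scaled step along the new weak learner h_m, which was fit to the negative gradient −∂L/∂F(m−1).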

Compared to random forests, Gradient Boosted Trees have a lot of model capacity, so they can model very complex decision boundaries and relationships. However, as with all models with high capacity, this can lead to overfitting very quickly. So be careful!
We fit a Gradient Boosted Trees model using the xgboost library on MNIST with 330 weak learners and achieved 89% accuracy. Try it out for yourself using this colab link, and let us know your thoughts! Subscribe to Econoscent on YouTube for the full Visual Guide to Machine Learning series and more!
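If you’d rather not open the notebook, here is a rough sketch of that experiment using xgboost’s scikit-learn wrapper; the 330 weak learners matches the article, but the other settings are illustrative guesses, not necessarily what the notebook uses:

from xgboost import XGBClassifier

# 330 weak learners added one after another, each correcting the ensemble built so far
model = XGBClassifier(n_estimators=330, learning_rate=0.1, max_depth=3)

# X_train, y_train, X_test, y_test are the MNIST arrays from the loading sketch above
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))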