
Gradient Boosted Decision Trees Explained with a Real-Life Example and Some Python Code

Gradient Boosting algorithms tackle one of the biggest problems in Machine Learning: bias.

Carolina Bento
Towards Data Science
9 min read · Aug 19, 2021


Image by author.

This is the third and last article in a series dedicated to Tree Based Algorithms, a group of widely used Supervised Machine Learning Algorithms.

The first article was about Decision Trees, while the second explored Random Forests. Everything explained with real-life examples and some Python code.

Gradient Boosting algorithms tackle one of the biggest problems in Machine Learning: bias.

Decision Trees is a simple and flexible algorithm, so simple that it can underfit the data.

An underfit Decision Tree has low depth, meaning it splits the dataset only a few times in an attempt to separate the data. Because it doesn’t separate the dataset into more and more distinct observations, it can’t capture the true patterns in it.

Bias of a simplistic (left) vs a complex model (right). [Image by author]

When it comes to tree-based algorithms, Random Forests was revolutionary because it used Bagging to reduce the overall variance of the model with an ensemble of random trees.

In Gradient Boosted algorithms the technique used to control bias is called Boosting.

In the late 1980s and 1990s, Robert Schapire and Yoav Freund developed one of the most popular Boosting algorithms, AdaBoost, to address underfitting and reduce bias.

But they were not alone. The field of Computer Science and Theoretical Machine Learning was riding this wave of enthusiasm and groundbreaking algorithms, and more scientists started developing Boosting approaches.

Friedman explored how Boosting could serve as an optimization method for an adequate loss function, while Schapire and Freund took inspiration from a question posed by Michael Kearns: Can a set of weak learners create a single strong learner?

In this context, a weak learner is any model that performs only slightly better than a random model and can never efficiently achieve a training error of zero. A strong learner is the opposite: a model that can be improved to efficiently bring the training error down to zero.

Around the same time, Jerome Friedman was also experimenting with Boosting. He ended up developing several new algorithms, the most popular being Gradient Boosted Decision Trees.

The ingenious efforts of these and other scientists contributed to a vibrant new chapter in Machine Learning. It generated robust and powerful algorithms, many of which are still relevant today.

Boosting 101

The motto of Boosting is that the whole is greater than the sum of its parts.

Boosting is based on the assumption that it’s much easier to find several simple rules to make a prediction than it is to find the one rule that is applicable to all data and generates the best possible prediction[1].

The intuition behind Boosting is that you train the same weak learner, a model with simple rules, several times, then combine its weak predictions into a single, more accurate result. The model is always of the same type but, each time, it is trained with different weights on the observations the previous learner misclassified.

Observations that were previously misclassified get more weight when training the next weak learner, so each weak learner focuses on the hardest observations[1].

Boosting algorithms give more weight to the observations the previous learner misclassified. (Image by author)

When you combine different weak learners sequentially, each one will make some incorrect predictions, but they will cover each other’s blind spots, making better predictions where other learners were not as good. Each model learns slowly from the errors the previous one made[2].

What’s common across all Boosting Algorithms

All Boosting algorithms are ensemble Classifiers or Regressors, depending on the task at hand, because they combine the results of several models.

An ensemble of m weak learners. (Image by author)

In the aggregation step of a Regression task, you might compute the weighted sum of all predictions for each observation, while in a Classification task a majority vote decides which class to assign.
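
As a toy illustration of that aggregation step (the predictions and weights below are made-up values, not from the article's dataset):

```python
import numpy as np

# Toy aggregation of m = 3 weak learners for a single observation.
# The predictions and weights are made-up values, purely for illustration.
regression_preds = np.array([2.1, 1.8, 2.4])   # each weak learner's numeric prediction
weights = np.array([0.5, 0.3, 0.2])            # weight given to each weak learner
print("Regression (weighted sum):", np.dot(weights, regression_preds))

class_preds = np.array([1, 0, 1])              # each weak learner's predicted class
print("Classification (majority vote):", np.bincount(class_preds).argmax())
```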

Creating a Boosting Ensemble

Most algorithms focus on parameter estimation, to find the parameters that minimize their loss function. Boosting algorithms focus on function estimation.

The ensemble of weak learners is actually a set of functions that map features to targets and minimize the loss function[3].

Each new weak learner added minimizes the ensemble’s loss function. (Image by author)

And because the optimization algorithm used to find the next weak learner is Gradient Descent, the loss function must be differentiable and convex.

However, this time Gradient Descent is used in a function space.

Typically, Gradient Descent starts off at a random point in a differentiable convex function, and only stops when it finds its minimum value.

The same concept translates to Boosting algorithms with Gradient Descent, but in the function space.

The algorithm starts off with a very simple weak learner, for instance, always predicting the class 0.

With each weak learner added to the ensemble, the algorithm takes a step towards the right targets, because it only adds weak learners that reduce the loss function.

Loss function for the ensemble. (Image by author)
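
In symbols (my notation, not necessarily the original figure’s), each iteration fits a new weak learner h_m to the negative gradient of the loss and adds it to the current ensemble F_{m-1}, scaled by a learning rate ν:

```latex
F_m(x) = F_{m-1}(x) + \nu \, h_m(x),
\qquad
h_m \approx -\left[ \frac{\partial L\big(y, F(x)\big)}{\partial F(x)} \right]_{F = F_{m-1}}
```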

The exponential loss is typically used in Classification tasks, while Regression tasks typically use the Squared Loss.
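
Written out for a single observation (again, my notation), these two losses are:

```latex
L_{\text{exp}}\big(y, F(x)\big) = e^{-y\,F(x)} \quad \text{with } y \in \{-1, +1\} \ \text{(Classification)}
\qquad
L_{\text{sq}}\big(y, F(x)\big) = \big(y - F(x)\big)^2 \quad \text{(Regression)}
```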

Gradient Boosted Decision Trees vs AdaBoost vs …

All Boosting algorithms share a common principle: they use Boosting to create an ensemble of learners that are later combined. But each of them has distinct characteristics that make them more suitable for particular tasks.

For instance, in Gradient Boosted Decision Trees, the weak learner is always a decision tree.

In Stochastic Gradient Boosting, Friedman introduces randomness into the algorithm, similar to what happens in Bagging. At each iteration, instead of using the entire training dataset with different weights, the algorithm picks a sample of the training dataset at random, without replacement, and uses that to train the next learner[4].

AdaBoost is in many ways similar to Gradient Boosted Decision Trees, in the sense that it keeps a distribution of weights over the observations in the training dataset and, in each iteration, adjusts those weights based on how the previous weak learner performed. But its adaptive nature comes into play when, during each iteration, it chooses a new learning rate alpha based on the weak learner’s error rate[1].
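
For reference, in the standard notation used in Schapire’s overview (not code from this article), the weight alpha given to the t-th weak learner is derived from its weighted error rate ε_t:

```latex
\alpha_t = \frac{1}{2} \ln\!\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
```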

Now that you know a bit more about the different Boosting algorithms, let’s see them in practice!

🛩🏝 Reducing model Bias with Boosting Algorithms

Planning a vacation is challenging. The hardest part: picking a destination.

But you’re a Data Scientist. You’re confident that Machine Learning can give an algorithmic opinion on what should be your next vacation destination.

Whenever planning a vacation, you always take into account:

  • Duration of the vacation,
  • Personal budget,
  • Weather forecast,
  • If your extended family is joining,
  • If you’re feeling adventurous and want to explore new places.

When it comes to picking a classification algorithm, you know Decision Trees are said to mimic how humans make decisions. But they have limitations: a single Decision Tree is not a very powerful predictor.

Random Forests would get you better results than a single tree, but they focus on controlling model variance, not bias. So Random Forests is not an option.

Since using a low bias model is your goal, Boosting algorithms are exactly what you’re looking for.

The first thing you do is go through your old notes and collect some data on how you picked previous vacation destinations. You gather all the information, like your budget at the time, the duration of the vacation, and your previous destinations, which were typically a choice between Beach and Countryside.

In the code below I’m creating a synthetic dataset to simulate that data.

Creating a synthetic dataset with 250 observations. (Image by author)
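
The original code is shown as an image; here is a minimal sketch of how such a dataset could be generated. The feature names, ranges, and labeling rule are my own assumptions, not the author’s exact code:

```python
import numpy as np
import pandas as pd

# Minimal sketch of a synthetic vacation dataset with 250 observations.
# Feature names, ranges, and the labeling rule are hypothetical.
rng = np.random.RandomState(42)
n = 250

data = pd.DataFrame({
    "duration_days": rng.randint(2, 15, n),       # duration of the vacation
    "budget": rng.randint(500, 5000, n),          # personal budget
    "weather_forecast": rng.randint(10, 35, n),   # forecast temperature (Celsius)
    "family_joining": rng.randint(0, 2, n),       # 1 if the extended family joins
    "feeling_adventurous": rng.randint(0, 2, n),  # 1 if exploring new places
})

# Noisy rule: warm weather and a bigger budget lean towards Beach (1), otherwise Countryside (0).
score = 0.05 * data["weather_forecast"] + 0.0005 * data["budget"] + rng.normal(0, 1, n)
data["destination"] = (score > score.median()).astype(int)

X = data.drop(columns="destination")
y = data["destination"]
```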

Spotting Model Bias

The whole reason you want to use a Boosting algorithm is to control model bias.

In a Regression task, where you predict numerical values, Bias is described by this equation:
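
The equation itself appears as an image in the original; in standard notation, the bias of an estimate f-hat of the true function f at a point x is:

```latex
\text{Bias}\big[\hat{f}(x)\big] = \mathbb{E}\big[\hat{f}(x)\big] - f(x)
```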

In this case, you just need to plug the numbers into the formula.

But you’re trying to get an algorithmic opinion on your next vacation destination, a classification task.

For this purpose, you can use the misclassification rate, i.e., the average 0–1 loss.

(Image by author)
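
In symbols (my notation), the average 0–1 loss over the n training observations is:

```latex
\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big( \hat{y}_i \neq y_i \big)
```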

If your model has a high proportion of misclassified observations in the training set, it means it has high bias.

Boosting algorithms ✅ Method to calculate Bias ✅

So you start with the simplest algorithm: Decision Trees. With scikit-learn’s DecisionTreeClassifier you create a single, shallow decision tree that is at most two levels deep. That’s why max_depth=2.

Code sample to train a Decision Tree. (Image by author)
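
Since the original code is an image, here is a minimal sketch of that step, reusing the hypothetical X and y from the dataset sketch above:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import zero_one_loss

# A single, shallow decision tree: at most two levels of splits.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# 0-1 loss on the training set: the proportion of misclassified observations.
print("Decision Tree training 0-1 loss:", zero_one_loss(y, tree.predict(X)))
```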

The average misclassification rate in the training set, the 0–1 loss, is 44%. That’s almost as good as a random model! High Bias ✅

Result of 0–1 misclassification error for the Decision Tree algorithm. (Image by author)

Now you turn to a Gradient Boosted Classifier.

You set the depth of each weak learner in the ensemble to 1, using the parameter max_depth once again. In this case you’re training the smallest tree possible, what in machine learning lingo is called a decision stump.

And in terms of the size of the ensemble, you picked a rather arbitrary number: 210 trees. That’s the parameter n_estimators of the GradientBoostingClassifier.

Code sample to train Gradient Boosted Decision Trees. (Image by author)
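
Again, the original code is an image; a sketch with the parameters described in the text (210 estimators, decision stumps) could look like this:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import zero_one_loss

# An ensemble of 210 decision stumps (trees of depth 1), trained on the sketch dataset above.
gbt = GradientBoostingClassifier(n_estimators=210, max_depth=1, random_state=42)
gbt.fit(X, y)

print("Gradient Boosting training 0-1 loss:", zero_one_loss(y, gbt.predict(X)))
```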

Gradient Boosted Decision Trees did indeed reduce Bias. A 0–1 loss of 32% is much better!

Result of 0–1 misclassification error for the Gradient Boosted Decision Trees algorithm. (Image by author)

Ok, you built a model with relatively low bias. But you know there are other Boosting algorithms out there, besides Gradient Boosted Decision Trees.

How much better would AdaBoost be at predicting your next vacation destination?

Code sample to train an Adaboost model. (Image by author)
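
A sketch of the equivalent AdaBoost model, with the same ensemble size and scikit-learn’s default decision-stump base learner (assumptions on my part, since the original code is an image):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import zero_one_loss

# Same ensemble size as before; the default base learner is a decision stump.
ada = AdaBoostClassifier(n_estimators=210, random_state=42)
ada.fit(X, y)

print("AdaBoost training 0-1 loss:", zero_one_loss(y, ada.predict(X)))
```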

In the case of AdaBoost, training the same number of trees didn’t reduce bias as much, but it still brought the misclassification rate down to 38% from the 44% observed with Decision Trees.

Result of 0–1 misclassification error for the Adaboost algorithm. (Image by author)

When it comes to picking your next vacation destination, with the dataset at hand, Gradient Boosted Decision Trees is the model with the lowest bias.

Now all you need to do is give the algorithm all the information about your next vacation, i.e., the duration, whether you’re feeling adventurous, and all that.

The algorithm will tell you whether your next vacation should be at the Beach or in the Countryside.
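
With the hypothetical features from the dataset sketch above, that final prediction step could look like this (the vacation values below are made up):

```python
import pandas as pd

# A hypothetical next vacation: 10 days, a 3000 budget, 28 degree forecast,
# no extended family, feeling adventurous. Reuses the gbt model trained above.
next_vacation = pd.DataFrame([{
    "duration_days": 10,
    "budget": 3000,
    "weather_forecast": 28,
    "family_joining": 0,
    "feeling_adventurous": 1,
}])

prediction = gbt.predict(next_vacation)[0]
print("Beach" if prediction == 1 else "Countryside")
```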

Conclusion

Now you can be confident about using Gradient Boosted Decision Trees to predict your next vacation destination, instead of training just a single Decision Tree.

What sets Gradient Boosted Decision Trees apart from other approaches is:

  • Speed: the algorithm is memory-efficient, so you can make predictions much faster.
  • No preprocessing required: you don’t need to prepare the data before building the model.
  • Data robustness: the algorithm handles all types of data nicely.

Even though it is more complex than Decision Trees, it continues to share the advantages of tree-based algorithms.

But there’s always a catch. Gradient Boosted Decision Trees make two important trade-offs:

  • Faster predictions at the cost of slower training,
  • Performance over interpretability.

Faster predictions vs slower training

Even though one of the advantages mentioned above is memory efficiency, that only applies when it comes to making predictions.

The algorithm is more complex and builds each tree sequentially. Its predictions are fast, but the training phase is computationally more demanding.

Performance over interpretability

Similarly to what happens with Random Forests, you trade off interpretability for performance.

You’ll no longer be able to see exactly how the algorithm split each tree node to reach the final prediction.

Hope you enjoyed learning about Boosting algorithms, the problems they address, and what ultimately makes them unique.

Make sure to check the other two articles in this Tree-based Machine Learning algorithms series. The first one explores Decision Trees and the second Random Forests.

Thanks for reading!

References

  1. Schapire, R.E. (2002). The Boosting Approach to Machine Learning: An Overview.
  2. James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.
  3. Friedman, J.H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, pp. 1189–1232.
  4. Friedman, J.H. (1999). Stochastic Gradient Boosting. Computational Statistics and Data Analysis, 38, pp. 367–368.
