
A tree ensemble is a supervised machine learning technique that combines a set of individually trained decision trees, known as weak or base learners, which may not perform well on their own. Aggregating the weak learners produces a new strong model, which is often more accurate than any of them individually. There are three main types of ensemble learning methods: bagging, boosting, and gradient boosting. Each method can be used with other weak learners, but in this post only trees are considered.
The rest of the article is divided into two sections:
- Intuition & History. It explains the origin and gives a short description of each ensemble learning method.
- Practical Demonstration. It develops each ensemble learning method step by step, using a small synthetic dataset to support the explanations.
All images unless otherwise noted are by the author.
Intuition & History
Bagging
The term was first defined by Breiman (1996) [1], and it is an acronym for Bootstrap Aggregation. For this ensemble, each decision tree uses as input data a bootstrap sample of the training dataset. A bootstrap sample is a sample drawn at random with replacement, which means that observations can appear once, more than once, or never. All the tree predictions are then combined using a statistic such as the average for regression or the most common class for classification. Random Forest [2] uses this technique (among others) and is one of the most successful and widely used ensemble methods.
![Bagging visual explanation. Each sample with replacement serves as input data for the weak learner. [3]](https://towardsdatascience.com/wp-content/uploads/2023/01/1nY2ofTx3lKAa3gt3p_oByA.png)
Boosting
Kearns and Valiant (1988, 1989) [4][5] left open the following question: can a set of weak learners create a single strong learner? In 1990, Schapire’s affirmative answer [6] led to the development of boosting. Boosting, as opposed to bagging, takes a more iterative approach in which trees are trained sequentially. Each observation is assigned a weight, and each tree is built in an additive manner, assigning greater weights (more importance) to the observations misclassified by the previous learners. There are many boosting algorithms, but the first one to take full advantage of weak learners was AdaBoost [7], formulated by Freund and Schapire in 1995.
![AdaBoost visual explanation. Each learner focuses on the misclassified observations of the previous tree [8]](https://towardsdatascience.com/wp-content/uploads/2023/01/0-Xbak3wIYrvpVRG2.png)
Gradient Boosting
Friedman proposed the "Gradient Boosting Machine" [9], which also takes an iterative approach in which trees are trained sequentially. However, instead of updating observation weights, each tree fits the pseudo residuals of the previous tree. The first pseudo residuals are the result of subtracting the average of the output feature from the real values. Many algorithms such as XGBoost [10], CatBoost [11] or LightGBM [12] are based on this technique.
![Gradient Boosting Visual Explanation [13]](https://towardsdatascience.com/wp-content/uploads/2023/01/1SooaUqZTZY6HhTc52TGb9w.png)
Practical Demonstration
The practical demonstration is highly inspired by Josh Starmer’s YouTube channel [14]. If you want a visual and extremely concise explanation of almost any machine learning model or technique, just check it out!
Disclaimer: the data have been tailor-made so that the results show what is expected. This does not mean that the techniques are infallible or that they can be compared with each other.
Data
When I learn new algorithms or methods, I really like to test them on small datasets where I can focus on the details. So for the practical demonstration, we will use a small synthetic dataset containing made-up information about houses and their prices. Bagging and boosting are explained with the price transformed into a categorical feature, while gradient boosting is explained with the price as a numerical feature.
All the Jupyter Notebooks developed for the article can be found in this GitHub Repository, but the main pieces of code are also shown throughout the article. For example, the data creation process can be seen below.
The data created contain 10 houses described by the following features:
- Square Meters: Numerical
- If the house has a garage: Boolean
- If the house has a garden: Boolean
- Number of rooms: Numerical
- Price: Categorical or Numerical
As mentioned before, the price feature will be categorical or numerical depending on the tree ensemble being explained.
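The exact data creation code is in the notebooks; below is a minimal sketch with made-up values (the column names and prices are assumptions) that is enough to follow the rest of the examples.

```python
import pandas as pd

# Hypothetical values: the real dataset is defined in the article's notebooks.
data = pd.DataFrame({
    "square_meters": [150, 70, 100, 160, 55, 95, 140, 65, 120, 80],
    "garage":        [True, False, True, True, False, True, True, False, True, False],
    "garden":        [True, False, False, True, False, True, True, False, True, False],
    "rooms":         [5, 2, 3, 5, 1, 3, 4, 2, 4, 2],
    # Numerical target, used for gradient boosting.
    "price":         [450000, 170000, 280000, 500000, 140000,
                      270000, 420000, 160000, 350000, 200000],
})
# Categorical version of the target, used for bagging and boosting.
data["price_cat"] = pd.cut(
    data["price"], bins=[0, 200000, 350000, float("inf")],
    labels=["LOW", "MEDIUM", "HIGH"]
).astype(str)
feature_cols = ["square_meters", "garage", "garden", "rooms"]
```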


Bagging
First, remember that bagging is an acronym for Bootstrap Aggregation. These two words lead the way, so let’s start with the first one: bootstrap. The following code builds 5 bootstrap samples of 7 observations from the data generated in the previous section.
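A minimal sketch of this sampling step, reusing the `data` DataFrame from the previous snippet (the random seeds, and therefore the exact samples, are assumptions):

```python
# Draw 5 bootstrap samples of 7 observations each: sampling with replacement,
# so the same index can appear several times within a sample.
n_trees, sample_size = 5, 7
bootstrap_samples = [
    data.sample(n=sample_size, replace=True, random_state=i)
    for i in range(n_trees)
]
print([list(sample.index) for sample in bootstrap_samples])
```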

As defined before, bootstrap samples are drawn with replacement, which is why they contain repeated indexes. The next step trains a weak learner on each sample. In this case, the chosen learner is the DecisionTreeClassifier from scikit-learn [15].
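Continuing the sketch, one classifier is trained per bootstrap sample (the default hyperparameters are an assumption, not necessarily the article's exact settings):

```python
from sklearn.tree import DecisionTreeClassifier

# Train one weak learner per bootstrap sample.
trees = []
for sample in bootstrap_samples:
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(sample[feature_cols], sample["price_cat"])
    trees.append(tree)
```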
Once every tree is trained, let’s dive into the first sample and its corresponding tree. The first bootstrap sample had the observations [6, 0, 2, 2, 7, 6, 8], so these are the observations the tree used for training. The induced tree can correctly classify every observation but one (observation 0). If you follow the tree, it is clear that this observation is classified as LOW, while its price is actually MEDIUM. Nevertheless, we can say the performance of the tree is fairly good.

However, take a look at the observation indexes again: there are only 5 unique values [0, 2, 6, 7, 8], so the first tree used only 50% of the dataset for training. Thus, the resulting accuracy can be misleading. To better understand the performance of the decision trees, we are going to predict the whole dataset (10 observations) with each of them. Furthermore, let’s make use of the second word that makes up bagging: aggregation. Our strong learner (called bagging) will take the most common prediction from the five trained trees.
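A sketch of this aggregation step: every tree predicts the full dataset, and a bagging column keeps the most common prediction per row.

```python
import pandas as pd

# Predict the whole dataset (10 observations) with each weak learner.
predictions = pd.DataFrame(
    {f"tree_{i}": tree.predict(data[feature_cols]) for i, tree in enumerate(trees)}
)
# Aggregation: the most common prediction across the five trees.
predictions["bagging"] = predictions.mode(axis=1)[0]
```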

As we can see below, in terms of accuracy the weak learners (trees) perform worse individually than the strong learner (bagging).
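One way to compute those accuracies, continuing the sketch above:

```python
from sklearn.metrics import accuracy_score

# Accuracy of each weak learner and of the bagging aggregation.
for column in predictions.columns:
    print(column, accuracy_score(data["price_cat"], predictions[column]))
```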

Boosting
Boosting, as opposed to bagging, trains trees sequentially. A first tree is trained with all the data. For the following tree, misclassified observations are given a higher weight (scikit-learn’s DecisionTreeClassifier accepts sample weights through the sample_weight argument of its fit method). That weight allows the tree to focus more on certain observations. Here is the code used to implement a boosting ensemble.
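Since the full implementation lives in the notebooks, here is a simplified sketch of the idea; the weight-doubling rule follows the description below, while the depth limit and the number of trees are assumptions made to keep the learners weak.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X, y = data[feature_cols], data["price_cat"]
weights = np.ones(len(data))      # every observation starts with the same weight
boosting_trees = []
for _ in range(5):
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y, sample_weight=weights)         # weights guide the splits
    misclassified = (tree.predict(X) != y).to_numpy()
    weights[misclassified] *= 2                   # focus the next tree on the errors
    boosting_trees.append(tree)
```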
As we said, the first step is to train the first tree with all the observations.

Note that the tree is not able to properly learn observations [0, 6, 9]. So, according to boosting theory and the code, these observations will have higher weights in the second tree.

As we can see, the second tree, trained with the updated weights, is able to properly learn the previously misclassified observations [0, 6, 9]. The higher weights assigned to these indexes have forced the tree to learn them. However, the learning process has also changed for the rest of the observations: the tree now fails to learn observations [3, 4, 7].
Therefore, the third tree will double the weights of these misclassified observations and, tree after tree, the errors made by previous learners will be corrected. The next picture shows how each tree improves on the errors of the previous one. In addition, there is a column called boosting that takes the most common prediction of all the trees.
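The boosting column of that picture can be computed in the same way as for bagging, by taking the most common prediction across the trained trees; a sketch:

```python
import pandas as pd

# Most common prediction across all the boosted trees.
boosting_preds = pd.DataFrame(
    {f"tree_{i}": tree.predict(X) for i, tree in enumerate(boosting_trees)}
)
boosting_preds["boosting"] = boosting_preds.mode(axis=1)[0]
```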

If we calculate the accuracy of each tree and of the boosting aggregation, the results clearly show that the boosting technique also improves on the results of each individual tree.

Gradient Boosting
Remember that for gradient boosting we are going to use the DataFrame whose target variable, price, is numerical. Also remember that gradient boosting, like boosting, takes an iterative approach. The difference is that each tree fits the pseudo residuals of the previous tree, instead of the same observations with different weights.
The first thing we are going to do is calculate the first pseudo residuals (residuals_0), which are the result of subtracting the mean of the price from the real value of the price.
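With the numerical price column from the data sketch, this first step looks as follows:

```python
# First pseudo residuals: real price minus the mean price.
mean_price = data["price"].mean()
data["residuals_0"] = data["price"] - mean_price
```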

The Mean Absolute Error (MAE) of predicting every observation with the mean is 100397.68. This is the metric that is going to be improved with each trained tree. That being said, let’s train the first tree with the pseudo residuals shown before as the target.
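A sketch of this step (the baseline MAE will differ for the made-up data, and max_depth is an assumption used to keep the regression trees weak):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Baseline error: predict the mean price for every observation.
baseline_mae = mean_absolute_error(data["price"], np.full(len(data), mean_price))

# The first tree fits the pseudo residuals, not the price itself.
tree_0 = DecisionTreeRegressor(max_depth=2, random_state=0)
tree_0.fit(data[feature_cols], data["residuals_0"])
residual_predictions_0 = tree_0.predict(data[feature_cols])
```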

Observe that each final node has a different number of samples. The first one on the left contains only one observation, the observation with index 1, whose pseudo residual is –138461.4. The second final node from left to right contains 5 samples, which correspond to the observations with indexes [0, 2, 3, 6, 7]. The residual value predicted by the tree (-72705) is the average of those 5 residuals. From now on, these values will be referred to as predicted pseudo residuals.
To predict real price values with this first tree we should do the following operation.
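In code, with an assumed learning rate (the article's exact value is in the notebooks):

```python
# Price prediction = mean price + learning rate * predicted pseudo residuals.
learning_rate = 0.4   # assumed value
predictions_0 = mean_price + learning_rate * residual_predictions_0
```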

The predictions shown below achieve a MAE of 70347.53, which improves on the MAE achieved before by predicting only with the mean (100397.68). With these predictions, we can calculate the next pseudo residuals (residuals_1), which are the result of subtracting the predictions of the first tree (predictions_0) from the real value of the price.
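Continuing the sketch:

```python
from sklearn.metrics import mean_absolute_error

# Second pseudo residuals: real price minus the first round of price predictions.
data["residuals_1"] = data["price"] - predictions_0
mae_0 = mean_absolute_error(data["price"], predictions_0)   # expected to improve on baseline_mae
```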

Note that residuals_1 are always closer to zero than residuals_0. This is because we are using a portion of the information given by the tree to improve the prediction made with the mean. That portion is defined by the learning rate.
Until now, only one tree has been involved, but as we said before, gradient boosting uses several trees sequentially. It’s time to train the second tree, and for this purpose the previously calculated residuals (residuals_1) are used as the target, as in the sketch below. The second trained tree looks as follows:
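Training it follows the same pattern as the first tree, under the same assumptions:

```python
from sklearn.tree import DecisionTreeRegressor

# The second tree fits the new pseudo residuals.
tree_1 = DecisionTreeRegressor(max_depth=2, random_state=0)
tree_1.fit(data[feature_cols], data["residuals_1"])
residual_predictions_1 = tree_1.predict(data[feature_cols])
```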

If we follow the same steps as in the first tree, we get the following results.

The only difference from the first tree is that we use the pseudo residual predictions of both the first and second trees (residual_predictions_0 and residual_predictions_1) to make the predictions (predictions_1). Instead of aggregating the predictions as is done in bagging and boosting, gradient boosting adds to the mean price a small portion of the information from each tree (learning rate * predicted residuals of tree x).
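In code, the combined prediction after two trees looks like this (same assumed learning rate):

```python
from sklearn.metrics import mean_absolute_error

# Each tree contributes only a small portion (the learning rate) of its predicted residuals.
predictions_1 = (
    mean_price
    + learning_rate * residual_predictions_0
    + learning_rate * residual_predictions_1
)
mae_1 = mean_absolute_error(data["price"], predictions_1)
```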

After training 5 trees, we can clearly see in the following results how the MAE has been reduced. Each iteration provides better learning, and it is clear that the union of all the trees provides better metrics than using any of them individually.
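For reference, the whole procedure can be condensed into a single loop; a sketch under the same assumptions as above:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Gradient boosting sketch: 5 trees trained sequentially on the current pseudo residuals.
current_predictions = np.full(len(data), data["price"].mean())
maes = [mean_absolute_error(data["price"], current_predictions)]
for _ in range(5):
    residuals = data["price"] - current_predictions
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(data[feature_cols], residuals)
    current_predictions = current_predictions + learning_rate * tree.predict(data[feature_cols])
    maes.append(mean_absolute_error(data["price"], current_predictions))
print(maes)   # the MAE typically decreases as trees are added
```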

Conclusion
This article is intended to be a step-by-step guide explaining how different ensemble learning methods work: bagging, boosting and gradient boosting.
In the first section, we reviewed the history, description and uses of these techniques, while in the second one, we developed all of them in a practical way, showing how the union of different weak learners can create a strong learner with higher predictive power.
I hope you have found the reading useful and enjoyable. And above all, I am happy to receive any kind of feedback. So feel free to share your thoughts!
References
[1] Breiman, L. «Bagging Predictors». Machine Learning 24, (1996): 123–40. https://doi.org/10.1007/BF00058655.
[2] Breiman, L. «Random Forests». Machine Learning 45, (2001): 5–32. https://doi.org/10.1023/A:1010933404324.
[3] Image Source: https://www.researchgate.net/figure/The-bagging-approach-Several-classifier-are-trained-on-bootstrap-samples-of-the-training_fig4_322179244
[4] Kearns, M. «Thoughts on Hypothesis Boosting», Unpublished manuscript (Machine Learning class project) (1988)
[5] Kearns, M; Valiant, L. «Cryptographic limitations on learning Boolean formulae and finite automata». Symposium on Theory of Computing 21, (1989): 433–444 https://dl.acm.org/doi/10.1145/73007.73049
[6] Schapire, R.E. «The Strength of Weak Learnability». Machine Learning 5, (1990): 197–227 https://doi.org/10.1007/BF00116037
[7] Freund, Y.; Schapire, R.E. (1995). «A decision-theoretic generalization of online learning and an application to boosting». Computational Learning Theory, (1995) https://doi.org/10.1007/3-540-59119-2_166
[8] Image Source: https://www.researchgate.net/figure/Training-of-an-AdaBoost-classifier-The-first-classifier-trains-on-unweighted-data-then_fig3_306054843
[9] Friedman, J.H. «Greedy Function Approximation: A Gradient Boosting Machine» The Annals of Statistics 29 (2001). https://doi.org/10.1214/aos/1013203451.
[10] Chen, Tianqi, and Carlos Guestrin. «XGBoost: A Scalable Tree Boosting System» Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016). https://doi.org/10.1145/2939672.2939785.
[11] Dorogush, A.V.; Ershov, V.; Gulin, A. «CatBoost: Gradient Boosting with Categorical Features Support». ArXiv:1810.11363, 24 (2018). http://arxiv.org/abs/1810.11363.
[12] Ke, G.; Meng, Q.; Finley, T; Wang, T; Chen, W; Ma, W; Ye, Q; Liu, T. «LightGBM: A Highly Efficient Gradient Boosting Decision Tree». Advances in Neural Information Processing Systems, 20 (2017). https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html.
[14] StatQuest with Josh Starmer (Youtube Channel): https://www.youtube.com/c/joshstarmer
[15] https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html