Overview
Ensemble methods in machine learning are algorithms that make use of more than one model to get improved predictions. This post will serve as an introduction to tree-based ensemble methods. We will first go over how they use an idea similar to the Delphi method, combining many individual estimates into one, to improve predictive power with Bootstrap Aggregation (Bagging for short). Then we will move into boosting, a technique where algorithms use a combination of weak learners to boost performance. The following ensemble algorithms, from scikit-learn and the xgboost library, will be covered:
- BaggingClassifier
- RandomForestClassifier
- ExtraTreesClassifier
- AdaBoostClassifier
- GradientBoostingClassifier
- XGBClassifier (from the xgboost library)

With tree-based ensemble methods, we know that there isn’t a model that will give us perfect estimates, so we use multiple models to make predictions, and average them. By doing this, the overestimates and underestimates will more than likely cancel out and improve our prediction.
Bootstrap Aggregation
Bagging, short for Bootstrap Aggregation, is one of the main concepts that make ensemble methods possible. Bootstrap aggregation is made up of two ideas: bootstrap resampling and aggregation. Bootstrap resampling is a sampling technique in which subsets of the dataset are created, with replacement. Bootstrap aggregation is a technique that trains a model on each of these subsets and averages their predictions. This procedure is highly applicable to decision trees, since they are prone to overfitting, and it helps to reduce the variance of the predictions. The process for training an ensemble algorithm with bootstrap aggregation is as follows (a minimal code sketch appears after the list):
- Take a sample of the dataset, with replacement
- Train a classifier on this subset
- Repeat steps 1 and 2 until all classifiers have gone through training on their own subset
- Make a prediction with each classifier in the ensemble
- Aggregate the predictions from all the classifiers into one, with an ensemble method, e.g., max voting, averaging, weighted averaging.
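To make the procedure concrete, here is a minimal hand-rolled sketch of it on a synthetic dataset. The dataset, the number of trees, and the sample size are illustrative choices for this sketch only, not the values used with the Titanic data later in the post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data (not the Titanic data used later in the post)
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

n_trees, sample_size = 50, 200
rng = np.random.default_rng(42)
trees = []

for _ in range(n_trees):
    # Steps 1-3: draw a bootstrap sample (with replacement) and train a tree on it
    idx = rng.integers(0, len(X_train), size=sample_size)
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Steps 4-5: predict with every tree, then aggregate with a majority (max) vote
all_preds = np.array([tree.predict(X_test) for tree in trees])
ensemble_pred = (all_preds.mean(axis=0) > 0.5).astype(int)  # majority vote for 0/1 labels
print("bagged accuracy:", (ensemble_pred == y_test).mean())
```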

Let’s go ahead and fit a single decision tree classifier as a baseline before using ensemble classifiers. I will be using the Titanic dataset for binary classification, with the target being the Survived feature. For the purposes of this overview, the dataset being used has been cleaned. We will be using pipelines from sklearn to perform the preprocessing steps. For a description of the features in the dataset, see the data dictionary below.




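The original notebook isn’t reproduced here, so the following is a minimal sketch of what the baseline pipeline could look like. The file name titanic_cleaned.csv, the train/test split, and the preprocessing choices (scaling numeric columns, one-hot encoding categorical ones) are assumptions; only the Survived target comes from the text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed file name for the cleaned dataset -- adjust to wherever your data lives.
df = pd.read_csv("titanic_cleaned.csv")
X = df.drop(columns="Survived")
y = df["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Assumed preprocessing: scale numeric columns, one-hot encode categorical ones.
numeric = X.select_dtypes(include="number").columns
categorical = X.select_dtypes(exclude="number").columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

tree_pipe = Pipeline([
    ("preprocess", preprocess),
    ("model", DecisionTreeClassifier(random_state=42)),
])
tree_pipe.fit(X_train, y_train)
print("train accuracy:", accuracy_score(y_train, tree_pipe.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, tree_pipe.predict(X_test)))
```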
The decision tree classifier is overfitting to the training data with an accuracy score of 98%, while only reaching 76% accuracy on the test data. To prevent the decision tree from overfitting we could tune hyperparameters such as the maximum depth of the tree. However, we will generally get better results by combining many decision trees into an ensemble. To do this, we can fit an ensemble of bagged decision trees with a bagging classifier: a tree-based ensemble method that fits a classifier on each random subset drawn from the data, then aggregates the individual predictions into one. Below, we’ll create a bagging pipeline with sklearn’s BaggingClassifier, setting the n_estimators parameter to 200 (200 classification trees) and the max_samples parameter to 20 (the number of samples drawn to train each estimator).
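A sketch of the bagging pipeline, reusing the preprocess transformer and train/test split from the baseline sketch above. The n_estimators=200 and max_samples=20 values come from the text; the rest of the setup is assumed.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import Pipeline

# BaggingClassifier bags decision trees by default.
bag_pipe = Pipeline([
    ("preprocess", preprocess),  # reused from the baseline sketch
    ("model", BaggingClassifier(n_estimators=200, max_samples=20, random_state=42)),
])
bag_pipe.fit(X_train, y_train)
print("train accuracy:", bag_pipe.score(X_train, y_train))
print("test accuracy:", bag_pipe.score(X_test, y_test))
```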

By using a bagging classifier, we were able to improve the accuracy on the test set to 79%. This classifier is not overfitting, with similar accuracy scores on the train and test sets. We’re bootstrapping by resampling the data with replacement, and aggregating by combining all of the individual predictions into one. Since we’re bootstrapping and aggregating in the same model, we are bagging.
Random Forests

A random forest classifier is an ensemble method similar to a bagging classifier, but each tree uses only a subset of the features from the dataset rather than all of them. In the bagging phase of a random forest classifier, roughly two-thirds of the training data is sampled with replacement for each tree in the ensemble. This portion is used to build the tree, and the remaining one-third is used to calculate the out-of-bag error, a running, unbiased estimate of the performance of each tree in the ensemble. Random forests also make use of the subspace sampling method to provide more variability amongst the trees in the ensemble. This method randomly selects only a subset of features at each node in a tree. By using these subsets we’ll end up with a random forest containing diverse decision trees. Because the trees in the ensemble have been trained on different subsets of the data, the model will be less sensitive to noisy data. Let’s go ahead and create a pipeline with a random forest classifier, setting the n_estimators parameter to 500 and the max_samples parameter to 20.
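A sketch of the random forest pipeline under the same assumed setup as the previous snippets; n_estimators=500 and max_samples=20 are from the text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

rf_pipe = Pipeline([
    ("preprocess", preprocess),  # reused from the baseline sketch
    ("model", RandomForestClassifier(n_estimators=500, max_samples=20, random_state=42)),
])
rf_pipe.fit(X_train, y_train)
print("train accuracy:", rf_pipe.score(X_train, y_train))
print("test accuracy:", rf_pipe.score(X_test, y_test))
```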

The random forest classifier achieved an accuracy of 81% on both the test and train sets. This is a slight improvement from the bagging classifier but we also used 500 classification trees instead of 200. For a more detailed explanation, here’s a great paper on random forests.
Extremely Randomized Trees

If we want to add even more randomization to our ensemble, we can use the extra trees classifier. Like the random forest classifier, extra trees uses a random subset of the features, but instead of searching for the most optimal split at each node, candidate split thresholds are drawn at random and the best of those random candidates is used. This reduces our model’s reliance on particular features in the training data and helps to prevent overfitting. We will fit an extra trees classifier below with the n_estimators parameter set to 400 and the max_samples parameter set to 30. Something to note with the extra trees algorithm: the bootstrap parameter defaults to False, so in order to sample the data with replacement we need to set it to True.
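A sketch of the extra trees pipeline; n_estimators=400, max_samples=30, and bootstrap=True come from the text, and the rest follows the assumed setup above.

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.pipeline import Pipeline

et_pipe = Pipeline([
    ("preprocess", preprocess),  # reused from the baseline sketch
    ("model", ExtraTreesClassifier(n_estimators=400, max_samples=30,
                                   bootstrap=True, random_state=42)),
])
et_pipe.fit(X_train, y_train)
print("train accuracy:", et_pipe.score(X_train, y_train))
print("test accuracy:", et_pipe.score(X_test, y_test))
```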

The extra trees classifier performed slightly better than the random forest classifier with a 1% increase in accuracy.
Boosting

Boosting refers to the process of sequentially training weak learners to build a model. Weak learners are machine learning algorithms that perform only slightly better than random chance. Boosting algorithms usually assign a weight to each learner to determine the importance of its input to the final prediction. When the weighted votes are combined, the collective weight of the learners that classify a point correctly should outweigh the learners that classify it incorrectly. The boosting process is as follows:
- Train one weak learner
- Evaluate where the learner misclassified
- Train another weak learner focusing on the areas where the previous learner misclassified
- Repeat this process until some stopping criterion is met, e.g., a performance plateau
Adaptive Boosting
AdaBoost (short for Adaptive Boosting) was the first practical boosting algorithm. It works by sequentially updating two sets of weights: one on the data points and one on the weak learners. The points that are classified incorrectly are given a greater weight so the next weak learner can focus on them. At the end of the sequence, a higher weight is given to the learners that made better predictions, especially the learners that correctly predicted the data points previous learners misclassified. The weights given to the learners are then used in a final weighted vote to determine the ensemble’s prediction. The main idea is that AdaBoost creates new classifiers by continually adjusting the distribution of the data used to train the next classifier.
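A sketch of the AdaBoost pipeline with default parameters, again reusing the assumed preprocessing and split from the baseline sketch; the F1 score is printed since it is referenced below.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

ada_pipe = Pipeline([
    ("preprocess", preprocess),            # reused from the baseline sketch
    ("model", AdaBoostClassifier(random_state=42)),  # default parameters
])
ada_pipe.fit(X_train, y_train)
print("train accuracy:", ada_pipe.score(X_train, y_train))
print("test accuracy:", ada_pipe.score(X_test, y_test))
print("test F1:", f1_score(y_test, ada_pipe.predict(X_test)))
```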

The AdaBoost classifier performed well using just the default parameters, and also achieved a higher F1 score than the ensembles of strong learners used above.
Gradient Boosting
The gradient boosted trees algorithm is more advanced than AdaBoost and makes use of gradient descent. Like AdaBoost, gradient boosting begins by training a weak learner. However, it goes a step further and calculates the residual for each data point, which indicates how wrong the learner’s prediction was. The overall loss is calculated from the residuals with a loss function, and the gradients of that loss are used to train the next learner. Data points where the previous learners were most wrong contribute the largest gradients, so gradient descent effectively pushes the next learner to focus on those points. Something to note: we generally want to limit the step size, so a small learning rate is used to shrink each learner’s contribution and help the ensemble converge on better values.
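A sketch of the gradient boosting pipeline with scikit-learn’s GradientBoostingClassifier; the text doesn’t specify parameter values here, so the defaults are kept and the rest follows the assumed setup above.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

gb_pipe = Pipeline([
    ("preprocess", preprocess),  # reused from the baseline sketch
    ("model", GradientBoostingClassifier(random_state=42)),  # learning_rate defaults to 0.1
])
gb_pipe.fit(X_train, y_train)
print("train accuracy:", gb_pipe.score(X_train, y_train))
print("test accuracy:", gb_pipe.score(X_test, y_test))
```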

The gradient boosting classifier looks to be performing well on the test set with an accuracy score of 82%. However, this model does look to be overfitting to the training data, with a 9% drop-off between the train and test accuracy scores.
Extreme Gradient Boosting
XGBoost (short for Extreme Gradient Boosting) is currently one of the highest-performing implementations of gradient boosting. There are many optimizations "under the hood" of XGBoost that give it some of the fastest training times of any gradient boosting algorithm. A big one is that XGBoost distributes the construction of each tree across all of the cores in a computer’s CPU. XGBoost also makes use of second-order derivatives of the loss function to minimize the error, as opposed to the first-order gradients used in standard gradient boosted trees.
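A sketch of the XGBoost pipeline using the xgboost library’s scikit-learn wrapper, XGBClassifier; the learning rate of 0.15 comes from the text, everything else is left at its default and follows the assumed setup above.

```python
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

xgb_pipe = Pipeline([
    ("preprocess", preprocess),  # reused from the baseline sketch
    ("model", XGBClassifier(learning_rate=0.15, random_state=42)),
])
xgb_pipe.fit(X_train, y_train)
print("train accuracy:", xgb_pipe.score(X_train, y_train))
print("test accuracy:", xgb_pipe.score(X_test, y_test))
```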

The XGBoost classifier performed the best on the test set with an accuracy score of 84%. This was achieved with a learning rate of 0.15 (default is 0.1). The classifier does look to be slightly overfitting to the training data with a 6% drop off in accuracy scores between the train and test sets.
Conclusion
Ensemble methods in machine learning can significantly improve the performance of a model. While we only went over a few algorithms here, there are many more that make use of these ensemble techniques, and I’d recommend taking a further look into them. The classifiers used above were not optimized and can be improved upon: performing a grid search with these classifiers would help find more optimal parameters, though it would also be more time-consuming. I hope the concepts covered were clear and helpful. Thank you for taking the time to check out my post!
References:
- Delphi Method – Overview, Process, and Applications. (2020, July 15). Retrieved from https://corporatefinanceinstitute.com/resources/knowledge/other/delphi-method/
- Ensemble Learning in Python. (n.d.). Retrieved from https://www.datacamp.com/community/tutorials/ensemble-learning-python
- 1.11. Ensemble methods. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/ensemble.html
- Sklearn.ensemble.BaggingClassifier. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
- Sklearn.ensemble.RandomForestClassifier. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- Bootstrap Aggregation, Random Forests and Boosted Trees. (n.d.). Retrieved from https://www.quantstart.com/articles/bootstrap-aggregation-random-forests-and-boosted-trees/
- Jones, C. (2014, February 12). Retrieved from https://businessforecastblog.com/random-subspace-ensemble-methods-random-forest-algorithm/
- Sklearn.ensemble.GradientBoostingClassifier. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
- Sklearn.ensemble.AdaBoostClassifier. (n.d.). Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
- Mayr, A., Binder, H., Gefeller, O., & Schmid, M. (2014, November 18). The Evolution of Boosting Algorithms – From Machine Learning to Statistical Modelling. Retrieved from https://arxiv.org/abs/1403.1452
- Xgboost Documentation. (n.d.). Retrieved from https://xgboost.readthedocs.io/en/latest/