Advanced Ensemble Learning Techniques

Ensembling is both an art and a science

Charu Makhijani
Towards Data Science


In my previous post about ensemble learning, I explained what ensemble learning is, how it relates to bias and variance in machine learning, and what the simple ensemble learning techniques are. If you haven't read that post, please refer to it here.

In this post, I will cover the types of ensemble learning and the advanced ensemble methods (Bagging, Boosting, Stacking, and Blending) with code samples. At the end, I will discuss some pros and cons of using ensemble learning.

Ensemble Learning Types

Ensemble learning methods can be categorized into two groups:

1. Sequential Ensemble Methods

In this method, each base learner depends on the results of the previous base learners. Every subsequent base model corrects the predictions made by its predecessor, fixing its errors. Overall performance improves because each new model gives more weight to the examples that earlier models got wrong.

2. Parallel Ensemble Methods

In this method, there is no dependency between the base learners; they all execute in parallel, and their results are combined at the end (using averaging for regression problems and voting for classification problems).

Parallel Ensemble methods are divided into two categories-

1. Homogeneous Parallel Ensemble Methods- In this method, a single machine learning algorithm is used as a base learner.

2. Heterogeneous Parallel Ensemble Methods- In this method, multiple machine learning algorithms are used as base learners.

Advanced Ensemble Techniques

Bagging

Bagging or Bootstrap Aggregation is a parallel ensemble learning technique to reduce the variance in the final prediction.

The bagging process is very similar to averaging; the only difference is that bagging uses random sub-samples of the original dataset to train the same (or multiple) models and then combines their predictions, whereas averaging trains the models on the same dataset. The technique is called Bootstrap Aggregation because it combines bootstrapping (sampling of data) with aggregation to form an ensemble model.

Bagging workflow (Image by Author)

Bagging is a 3-step process-

1. Random sub-samples (bootstrap samples) are drawn from the original dataset, with replacement.

2. A model (a classifier such as a decision tree, for example) is built on every sub-sample.

3. The predictions of all the base models are combined (using averaging or weighted averaging for regression problems, or majority voting for classification problems) to get the final result.

The sampling and model-training steps can be parallelized across the different sub-samples, hence training can be done faster when working on larger datasets.

In bagging, every base model is trained on a different subset of data and all the results are combined, so the final model is less overfitted and variance is reduced.

Bagging is most useful when the model is unstable; with stable models, bagging does little to improve performance. A model is called stable when it is not very sensitive to small fluctuations in the training data.

Some examples of bagging are Random Forest, bagged decision trees, and Extra Trees. The sklearn library also provides the BaggingClassifier and BaggingRegressor classes so you can create your own bagging ensembles.

Let's see this in the example below-
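The full code is in the GitHub repository linked at the end. Below is a minimal sketch of the comparison, assuming a synthetic make_classification dataset as a stand-in for the data used in the article; the dataset, hyperparameters, and random seeds here are illustrative, so the exact numbers will differ from the output shown afterwards.

# A sketch of the bagging comparison; the dataset is a synthetic stand-in.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

base_models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "ExtraTreesClassifier": ExtraTreesClassifier(),
    "KNeighborsClassifier": KNeighborsClassifier(),
}

for name, model in base_models.items():
    # Score the model on its own.
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    print(f"{name} :::: Mean: {scores.mean()} , Std Dev: {scores.std()}")
    # Wrap the same model in a bagging ensemble of 10 bootstrapped copies.
    bagged = BaggingClassifier(model, n_estimators=10, random_state=42)
    scores = cross_val_score(bagged, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    print(f"Bagging {name} :::: Mean: {scores.mean()} Std Dev: {scores.std()}")

# Hard-voting ensemble over all the base models.
voting = VotingClassifier(estimators=list(base_models.items()), voting="hard")
scores = cross_val_score(voting, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
print(f"VotingClassifier :::: Mean: {scores.mean()} Std Dev: {scores.std()}")

Running the author's version of this comparison produced the output below.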

LogisticRegression :::: Mean: 0.7995780505454071 , Std Dev: 0.006888373667690784
Bagging LogisticRegression :::: Mean: 0.8023420359806932 Std Dev: 0.00669463780099821
DecisionTreeClassifier :::: Mean: 0.8119073077604059 , Std Dev: 0.005729415273647502
Bagging DecisionTreeClassifier :::: Mean: 0.849639923635328 Std Dev: 0.0046034229502244905
RandomForestClassifier :::: Mean: 0.8489381115139759 , Std Dev: 0.005116577814042257
Bagging RandomForestClassifier :::: Mean: 0.8567037712754901 Std Dev: 0.004468761007278419
ExtraTreesClassifier :::: Mean: 0.8414792383547726 , Std Dev: 0.0064275238258043816
Bagging ExtraTreesClassifier :::: Mean: 0.8511317483045042 Std Dev: 0.004708539080690846
KNeighborsClassifier :::: Mean: 0.8238853221249702 , Std Dev: 0.006423083088668752
Bagging KNeighborsClassifier :::: Mean: 0.8396364017767104 Std Dev: 0.00599320955270458
VotingClassifier :::: Mean: 0.8462174468641986 Std Dev: 0.006423083088668752

As we see in the example, wrapping each model in a BaggingClassifier improves its mean accuracy and reduces the standard deviation of its scores, i.e., the variance goes down. The VotingClassifier behaves similarly, improving on the average performance of the individual models.

Boosting

Boosting is a sequential ensemble learning technique that converts weak base learners into a strong learner that performs better and has lower bias. The intuition is that an individual model may not perform well on the entire dataset, but it works well on some part of it. Hence each model in the ensemble boosts the overall performance.

Boosting is an iterative method that adjusts the weight of an observation based on the previous classification. If an observation was classified incorrectly, then the weight of that observation is increased in the next iteration. In the same way, if an observation was classified correctly then the weight of that observation is reduced in the next iteration.

Boosting workflow (Image by Author)

Boosting is used to decrease the bias error, but it can also overfit the training data. That is why parameter tuning is an important part of boosting algorithms, to avoid overfitting the data.

Boosting was originally designed for classification problems but has been extended to regression problems as well.

Some examples of Boosting algorithms are — AdaBoost, Gradient Boosting Machine (GBM), XGBoost, LightGBM, and CatBoost.

Let's see Boosting with an example-
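As before, this is a minimal sketch of the comparison, assuming the same synthetic stand-in dataset; XGBClassifier additionally requires the third-party xgboost package. The article's actual output follows the sketch.

# A sketch of the boosting comparison; synthetic stand-in data,
# and XGBClassifier requires the xgboost package.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

boosters = [
    ("AdaBoostClassifier", AdaBoostClassifier(n_estimators=100)),
    ("GradientBoostingClassifier", GradientBoostingClassifier(n_estimators=100)),
    ("XGBClassifier", XGBClassifier(n_estimators=100, eval_metric="logloss")),
]

for name, model in boosters:
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    print(f"{name} :::: Mean: {scores.mean()} Std Dev: {scores.std()}")

# Combine the three boosters with hard voting.
voting = VotingClassifier(estimators=boosters, voting="hard")
scores = cross_val_score(voting, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
print(f"VotingClassifier :::: Mean: {scores.mean()} Std Dev: {scores.std()}")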

AdaBoostClassifier :::: Mean: 0.8604337082284473 Std Dev: 0.0032409094349287403
GradientBoostingClassifier :::: Mean: 0.8644262257222698 Std Dev: 0.0032315430892614675
XGBClassifier :::: Mean: 0.8641189579917322 Std Dev: 0.004561102596800773
VotingClassifier :::: Mean: 0.864645581703271 Std Dev: 0.0032985215353102735

Stacking

Stacking, also known as stacked generalization, is an ensemble learning technique that combines multiple machine learning algorithms via meta learning (either a meta-classifier or a meta-regressor).

The base-level algorithms are trained on the entire training dataset, and then the meta-model is trained on the predictions from all the base-level models as features. The base models are called level-0 models, and the meta-model that combines the base models' predictions is called a level-1 model.

Stacking workflow (Image by Author)

The level-1 model training data is prepared via k-fold cross-validation of the base models, and out-of-fold predictions (real numbers for regression and class labels for classification) are used as the training dataset.

The level-0 models can be either a diverse range of algorithms or the same algorithm (most often they are diverse). The level-1 meta-model is most often a simple model, like Linear Regression for regression problems and Logistic Regression for Classification problems.

The stacking method can reduce the bias or variance based on the algorithms used in level-0.

There are several options for implementing stacking: sklearn provides the StackingClassifier and StackingRegressor classes, and libraries such as ML-Ensemble (mlens) and H2O also support stacked ensembles.

Let's see Stacking with an example using StackingClassifier-
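Here is a minimal sketch of the setup, again assuming a synthetic stand-in dataset: five diverse level-0 learners stacked with sklearn's StackingClassifier and a LogisticRegression meta-model at level 1. The article's output follows.

# A sketch of the stacking example; synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)

# Level-0: a diverse set of base learners.
level0 = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
    ("svc", SVC()),
    ("nb", GaussianNB()),
]

for _, model in level0:
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
    print(f"{model.__class__.__name__} :::: Mean: {scores.mean()} Std Dev: {scores.std()}")

# Level-1: a simple LogisticRegression meta-model trained on the
# out-of-fold predictions of the level-0 models (5-fold CV inside the stack).
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
scores = cross_val_score(stack, X, y, scoring="accuracy", cv=cv, n_jobs=-1)
print(f"StackingClassifier :::: Mean: {scores.mean()} Std Dev: {scores.std()}")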

LogisticRegression :::: Mean: 0.799198344149096 Std Dev: 0.004958323931953346
DecisionTreeClassifier :::: Mean: 0.8130779055654346 Std Dev: 0.008467878845801694
KNeighborsClassifier :::: Mean: 0.8251287819886122 Std Dev: 0.00634438876282278
SVC :::: Mean: 0.8004562250294449 Std Dev: 0.005221775246052317
GaussianNB :::: Mean: 0.7964780515718138 Std Dev: 0.004996489474526567
StackingClassifier :::: Mean: 0.8376917712960184 Std Dev: 0.005593816155570199

As we see in this example, we have used different ML models at level 0 and stacked them with StackingClassifier, using LogisticRegression at level 1; the stacked model achieves a higher mean accuracy than any of the individual base models.

Multi-level Stacking

Multi-level stacking is an extension of stacking in which stacking is applied across multiple layers.

Multi-level stacking workflow (Image by Author)

For example, in 3-level stacking, level-0 is the same as before: a diverse range of base learners is trained using k-fold cross-validation. At level-1, instead of a single meta-model, N such meta-models are used. At level-2, a final meta-model combines the predictions of the N level-1 meta-models.
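One way to sketch this with sklearn is to nest StackingClassifiers: the outer stack holds the level-0 learners, its final estimator is an inner stack holding the level-1 meta-models, and the inner stack's final estimator is the level-2 meta-model. The model choices and data below are illustrative assumptions, not taken from the article.

# A sketch of 3-level stacking via nested StackingClassifiers
# (model choices and data are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Level-0: diverse base learners.
level0 = [
    ("dt", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
    ("nb", GaussianNB()),
]

# Level-1: N meta-models that learn from the level-0 out-of-fold predictions.
level1 = [
    ("rf", RandomForestClassifier()),
    ("gb", GradientBoostingClassifier()),
]

# Level-2: the final meta-model that combines the level-1 outputs.
inner_stack = StackingClassifier(estimators=level1,
                                 final_estimator=LogisticRegression(max_iter=1000),
                                 cv=5)
multi_level_stack = StackingClassifier(estimators=level0,
                                       final_estimator=inner_stack,
                                       cv=5)

multi_level_stack.fit(X_train, y_train)
print("3-level stack accuracy:", multi_level_stack.score(X_test, y_test))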

Adding multiple levels is both data expensive (a lot of data is needed to train every layer) and time expensive (each layer adds multiple models to train).

Blending

Blending is often used interchangeably with stacking, and it is indeed very similar, with one difference: stacking uses out-of-fold predictions as the training set for the next layer, whereas blending uses predictions on a hold-out (validation) set, typically 10-20% of the training set.

Although blending is simpler than stacking and uses less data, the final model may overfit the hold-out set.
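Here is a minimal blending sketch under the same synthetic-data assumption: the base models are fit on the training split, and the meta-model (the blender) is fit on their predictions for a held-out validation split rather than on out-of-fold predictions.

# A sketch of blending (illustrative assumptions throughout): base models are fit
# on the training split, and the meta-model is fit on their hold-out predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Keep ~15% of the training data aside as the hold-out (validation) set.
X_train, X_hold, y_train, y_hold = train_test_split(
    X_train_full, y_train_full, test_size=0.15, random_state=42)

# Level-0: fit the base models on the training split only.
base_models = [DecisionTreeClassifier(), KNeighborsClassifier()]
for model in base_models:
    model.fit(X_train, y_train)

# Meta-features: each base model's predictions on the hold-out set.
meta_hold = np.column_stack([m.predict(X_hold) for m in base_models])
blender = LogisticRegression()
blender.fit(meta_hold, y_hold)

# At prediction time, the base models' test predictions feed the blender.
meta_test = np.column_stack([m.predict(X_test) for m in base_models])
print("Blended accuracy:", accuracy_score(y_test, blender.predict(meta_test)))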

Advantages/Benefits of ensemble methods

1. Ensemble methods have higher predictive accuracy, compared to the individual models.

2. Ensemble methods are very useful when the dataset contains both linear and non-linear patterns; different models can be combined to handle each kind of data.

3. With ensemble methods, bias/variance can be reduced, so most of the time the model is neither underfitted nor overfitted.

4. An ensemble of models is usually less noisy and more stable than a single model.

Disadvantages of Ensemble learning

1. Ensembling is less interpretable; the output of an ensembled model is hard to explain. This makes the ensemble harder to sell to stakeholders and to derive useful business insights from.

2. The art of ensembling is hard to learn, and a wrong selection of models can lead to lower predictive accuracy than an individual model.

3. Ensembling is expensive in terms of both time and space, so the return on investment can decrease with ensembling.

Conclusion

After looking at the fundamentals of ensemble learning techniques above and the pros and cons of ensemble learning, it is fair to say that, used correctly, ensemble methods are great for improving the overall performance of ML models. When it is hard to rely on a single model, an ensemble makes life easier, which is why ensemble methods are so often the choice of winners in ML competitions.

Although selecting the right ensemble methods and using them well is not an easy job, this art can be learned with experience. The techniques described in this post are reliable starting points for ensembling, but other variants are also possible depending on the specific problem and needs.

To access the complete code for the Advanced Ensemble Techniques, please check this GitHub link.

Thanks for reading. If you liked the story, please like, share, and follow for more such content. As always, please reach out with any questions, comments, or feedback.

Github: https://github.com/charumakhijani
LinkedIn:
https://www.linkedin.com/in/charu-makhijani-23b18318/
