The Power of Ensembles in Deep Learning

Introducing DeepStack, a Python package for building Deep Learning ensembles

Julio Borges
Towards Data Science



Ensemble building is the leading winning strategy in machine learning competitions and often the technique used for solving real-world problems. What often happens is that while solving a problem or participating in a competition you end up with several trained models, each one slightly different from the others, and you end up picking your best model based on your best evaluation score. The truth is that your best model “knows less” about the data than all the other “weak” models combined. Combining several base models into a more powerful ensemble model then arises as a natural by-product of this workflow.

“There is really no way you can win most Kaggle competitions without a very strong ensemble. As good as your best individual model is, it won’t match a good ensemble.” Giuliano Janson on Quora (2016)

Ensembles range from simple weighted averages up to complicated 2nd- and 3rd-level meta-models. While building ensemble models for my machine learning competitions, I ended up creating several blocks of code that helped me test different ensemble strategies and win or perform much better in such challenges. So I asked myself the natural question: what if I could make building Deep Learning ensembles as easy as writing a few lines of Python code, and share this with the community? These solo projects gestated into scripts and matured into a Python package I named DeepStack.

Short Introduction to Ensembles

Ensemble Learning is all about learning how to best combine predictions from multiple existing models (called the base-learners). Every member of the ensemble contributes to the final output, and individual weaknesses are offset by the contributions of the other members. The combined learned model is called the meta-learner. There are several flavors of Ensemble Learning. In this article we will focus on the two methods currently supported by DeepStack, namely:

#1 Stacking: in stacking, the outputs of the base-learners are taken as input for training a meta-learner, which learns how to best combine the base-learners’ predictions.

Stacking combines multiple predictive models in order to generate a new combined model.

Oftentimes the stacking model will outperform each of the individual models due to its smoothing nature, offsetting the deficiencies of the individual models and leading to better prediction performance. Consequently, stacking works best when the base models are substantially different from one another.

#2 Weighted Average Ensemble: this method weights the contribution of each ensemble member based on its performance on a hold-out validation dataset. Models with better performance receive a higher weight.

Weighted Average Ensembling is all about weighting the predictions of each base-model to generate a combined prediction.

The main difference between the two methods is that in stacking, the meta-learner takes every single output of the base-learners as a training instance, learning how to best map the base-learners’ decisions into an improved output. The meta-learner can be any classic machine learning model. The weighted average ensemble, on the other hand, only optimizes the weights used for averaging the outputs of the base-learners. There is no meta-learner here (besides the weights), and the number of weights equals the number of base-learners.

Needless to say, when building an ensemble you need a training dataset that has not been seen by the base-learners in order to avoid overfitting. You also need a hold-out (validation) dataset to guide and test the performance of your ensembles.

There are several techniques for ensemble learning, and I would definitely recommend this article by Joseph Rocca to dig deeper into concepts such as boosting and bagging. The Kaggle Ensembling Guide is also a good read.

DeepStack walkthrough with the CIFAR-10 dataset

Ready for coding with a real example? In the next minutes I will show you how to easily build ensembles with DeepStack using the CIFAR-10 dataset. The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. We will start by creating simple CNNs for classifying the images and then build ensembles out of those CNNs using the two strategies previously described. Let’s get started.

pip install deepstack==0.0.9
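
First we need some trained base models. Here is a minimal sketch of what such a base-learner can look like, assuming plain Keras (tensorflow.keras) and its built-in CIFAR-10 loader; the architecture and hyperparameters are illustrative only:

from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.utils import to_categorical

# Load CIFAR-10 and scale pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

def build_cnn():
    # A deliberately small CNN; train several variants (e.g. different
    # depths or filter counts) to obtain diverse base-learners
    model = Sequential([
        Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(64, activation="relu"),
        Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model1 = build_cnn()
model1.fit(x_train, y_train, epochs=25, batch_size=64, validation_split=0.1)
# model2, model3, model4: train further variants the same way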

Now we can make use of the DeepStack interface and create instances of the class deepstack.base.Member, which provides all the necessary logic for building the base-learners of the ensembles. In this demonstration we will use the class KerasMember, which inherits from Member but adds logic specific to Keras models. The rationale is to initialize each instance with a training and a validation dataset for the meta-learner, and we already built a utility function for that!
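
A sketch of what the member creation can look like, following the KerasMember usage documented in DeepStack's README for version 0.0.9 (keyword names may differ in later releases). The CIFAR-10 test set, which the CNNs never saw during training, is split into an ensemble-training and an ensemble-validation half:

from deepstack.base import KerasMember

# Held-out data: one half for training the meta-learner,
# one half for validating the ensemble
x_ens_train, y_ens_train = x_test[:5000], y_test[:5000]
x_ens_val, y_ens_val = x_test[5000:], y_test[5000:]

member1 = KerasMember(name="model1", keras_model=model1,
                      train_batches=(x_ens_train, y_ens_train),
                      val_batches=(x_ens_val, y_ens_val))
member2 = KerasMember(name="model2", keras_model=model2,
                      train_batches=(x_ens_train, y_ens_train),
                      val_batches=(x_ens_val, y_ens_val))
# member3 and member4 are created the same way, each wrapping its own CNN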

Now you may be asking yourself: why are we initializing all four objects with the same training and validation datasets? Well, you don’t have to. Your Keras models could have been trained with different ImageDataGenerators, and you could pass the data generators directly as an argument to the KerasMember. What is important here is that the class labels (y_train / y_val) are the same across members, so that we can validate and compare the base-learners in a later step.

Now that we have all the necessary building blocks, let’s create our first ensemble.
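
A minimal sketch, following the DirichletEnsemble usage documented in DeepStack's README (version 0.0.9):

from deepstack.ensemble import DirichletEnsemble

wAvgEnsemble = DirichletEnsemble()
wAvgEnsemble.add_members([member1, member2, member3, member4])
wAvgEnsemble.fit()       # searches for the best member weights
wAvgEnsemble.describe()  # prints per-member and ensemble scores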

And voilà, you should see output similar to this:

model1 - Weight: 0.2495 - roc_auc_score: 0.9269
model2 - Weight: 0.4498 - roc_auc_score: 0.9422
model3 - Weight: 0.0031 - roc_auc_score: 0.9090
model4 - Weight: 0.2976 - roc_auc_score: 0.9135
DirichletEnsemble roc_auc_score: 0.9523

The weighted average ensemble performs ~1% better than the best single model w.r.t. the AUC, the default score function. As you can see, models with a higher score also received higher weights. What happened under the hood? The fit() method optimizes the weights of the base-learners based on the performance of a target score function on the training dataset. The weight optimization is a greedy randomized search based on the Dirichlet distribution, evaluated on a validation dataset. By default the score function is sklearn.metrics.roc_auc_score, but you can pass any score function to the constructor. The describe() method simply prints the performance of the single models and of the ensemble on the validation dataset.
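
If you care about a different metric, the idea is to hand your score function to the constructor. A sketch, assuming the metric keyword of version 0.0.9:

from sklearn.metrics import accuracy_score
from deepstack.ensemble import DirichletEnsemble

# Assumption: the constructor accepts the score function via `metric`
wAvgEnsemble = DirichletEnsemble(metric=accuracy_score)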

How about stacking?
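
A sketch along the lines of the StackEnsemble usage in DeepStack's README; describe() is given scikit-learn's accuracy_score here so the report matches the output below:

from sklearn.metrics import accuracy_score
from deepstack.ensemble import StackEnsemble

stack = StackEnsemble()  # defaults to a RandomForestRegressor meta-learner
stack.add_members([member1, member2, member3, member4])
stack.fit()
stack.describe(metric=accuracy_score)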

model1 - accuracy_score: 0.6116
model2 - accuracy_score: 0.6571
model3 - accuracy_score: 0.5500
model4 - accuracy_score: 0.6062
StackEnsemble accuracy_score: 0.6989

This example takes a RandomForestRegressor from scikit-learn as meta-learner. The regression task is to predict the output probabilities of the 10 classes based on the class-probability outputs of the base-learners. A classifier would work here just as well: DeepStack focuses on flexibility, and everything is up to you. This time we focus on the accuracy score, and we gained ~4 percentage points of accuracy with stacking compared to the best single model.
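
Swapping in your own meta-learner is just a constructor argument away. A sketch, assuming any scikit-learn estimator can be passed via the model keyword (hyperparameters are illustrative):

from sklearn.ensemble import RandomForestRegressor
from deepstack.ensemble import StackEnsemble

meta = RandomForestRegressor(n_estimators=200, max_depth=15)
stack = StackEnsemble(model=meta)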

3rd Level Stacking with the scikit-learn StackingClassifier

In stacked generalization there is no limit to the number of stacking levels. However, a higher number of levels does not necessarily guarantee better results. The best architecture can only be found by experimentation, helped by high variation between the base-learners, as discussed previously.

Example of a 3-level stacking

Scikit-learn 0.22 introduces a StackingClassifier and a StackingRegressor, allowing you to have a stack of scikit-learn estimators with a final classifier or regressor. Nice! We can now leverage the scikit-learn stacking interface for building a 3rd-level meta-learner with DeepStack:
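A sketch of the idea: the scikit-learn stack itself becomes the meta-learner that DeepStack trains on the base-learners' probabilities. Because the 10-class probability target is multi-output, the StackingRegressor is wrapped in a MultiOutputRegressor here; this wrapper and the estimator choices are assumptions for illustration, not necessarily the package's exact recipe:

from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import accuracy_score
from deepstack.ensemble import StackEnsemble

# 2nd level: scikit-learn estimators stacked on the base-learners' outputs;
# 3rd level: a final regressor on top of the 2nd-level predictions
sk_stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=100)),
                ("ridge", Ridge())],
    final_estimator=Ridge())

stack3 = StackEnsemble(model=MultiOutputRegressor(sk_stack))
stack3.add_members([member1, member2, member3, member4])
stack3.fit()
stack3.describe(metric=accuracy_score)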

But why do I need DeepStack if scikit-learn already supports stacking? DeepStack is designed for cases where you already have an army of pre-trained models and wish to bundle (ensemble) them into an even more powerful model. That is, you don’t create or train your base-learners with DeepStack. DeepStack is also generic and does not depend on the library used for creating the base-learners (Keras, PyTorch, TensorFlow, you name it). The scikit-learn stacking API, in contrast, is built around creating (training) base-learners from scratch with scikit-learn models, which is not our use-case here.

And this is just scratching the surface. To use DeepStack with the outputs of any of your PyTorch, TensorFlow, etc. models and train meta-learners for them, check the class deepstack.base.Member. You can also specify any custom target scoring function for the ensembles. DeepStack also allows you to save and load your ensembles for later optimization.
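
For a framework-agnostic member you feed precomputed class probabilities in directly. A sketch based on the Member constructor of version 0.0.9 (the argument names and the file paths below are assumptions for illustration):

import numpy as np
from deepstack.base import Member

# train_probs / val_probs: class probabilities predicted by any model
# (PyTorch, TensorFlow, ...) on data that model was never trained on.
# Shapes: (n_samples, n_classes); loaded here from hypothetical files.
train_probs = np.load("probs_train.npy")
val_probs = np.load("probs_val.npy")

member5 = Member(name="external_model",
                 train_probs=train_probs, train_classes=y_ens_train,
                 val_probs=val_probs, val_classes=y_ens_val)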

For the examples given above, we took relatively simple CNN models trained for only a few (25) epochs. In reality, more complex and well-trained models have been reported to reach up to 93.17% accuracy on CIFAR-10, with their weighted average ensembles reported to reach 94.12%. This ~1 percentage point reduction in error rate pushed the results beyond human classification accuracy.

Got excited about trying DeepStack? Then happy ensembling!
