
A Deep Dive into Stacking Ensemble Machine Learning – Part II

How to use stacking effectively in machine learning by implementing stacking in Python, Jupyter and Scikit-Learn

Photo by Tim Wildsmith on Unsplash

Background

In my recent article on stacking I explored what stacking is and how it works by building up a visual workflow of the 4 main steps involved in creating a stacking model.

A Deep Dive into Stacking Ensemble Machine Learning – Part I

However, to really understand what is going on inside a stacking model (and hence to understand when and how to adopt this approach in machine learning) it is essential to get inside the algorithms by writing some code.

This article builds a stacking model using the scikit-learn library and then Part III of this series will go into the final level of detail by developing the code to implement stacking from scratch.

Getting Started

To get started the required libraries need to be imported and some constants created that will be used in the main body of code …
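The original listing is not reproduced here, but a typical set-up might look like the following sketch (the constant names and values are illustrative assumptions, not the author's originals):

```python
# Illustrative imports and constants (names and values are assumptions).
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

RANDOM_STATE = 42   # fixed seed so every run is reproducible
TEST_SIZE = 0.2     # proportion of the data held back for validation
K_FOLDS = 5         # folds for the stacking classifier's internal CV
```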

Getting Some Data

We also need some data to work on and for that we are going to use a helper function to easily create the data as a single DataFrame with labelled columns.

Note that stacking can be applied to regression as well as to both binary and non-binary classification. This dataset has been created to build an example that models binary classification, i.e. where the target can take precisely two values …
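The helper function itself is not shown here; a minimal stand-in using scikit-learn's make_classification could look like this (the function name, feature count and column names are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification

def make_binary_classification_data(n_samples=10_000, n_features=10):
    """Hypothetical helper: assemble a labelled DataFrame for binary
    classification (a stand-in for the article's helper function)."""
    X, y = make_classification(
        n_samples=n_samples, n_features=n_features,
        n_informative=5, n_classes=2, random_state=42)
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(n_features)])
    df["target"] = y  # the target takes precisely two values: 0 and 1
    return df

df = make_binary_classification_data()
print(df.shape)  # (10000, 11)
```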

Image by Author

The data has been split as follows –

  • X_train – training dataset features
  • y_train – training dataset labels / classes
  • X_val – validation dataset features
  • y_val – validation dataset labels / classes

The X_train, y_train dataset will be used to build the stacking model and X_val, y_val will be held back and used solely for the purposes of model evaluation.
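A split along those lines might be produced like this (sketch; the synthetic data and the 80/20 ratio are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article's dataset.
X, y = make_classification(n_samples=10_000, n_features=10, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y, name="target")

# stratify keeps the class balance identical in the two splits; X_val and
# y_val are held back solely for model evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_val.shape)  # (8000, 10) (2000, 10)
```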

Preparing for Stacking

Part I of this series of articles showed that stacking is a two level model with "Level 0" used to generate classification labels that become new features in the data and a "Level 1" model used to generate the final label predictions.

The following code sets this up …
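Something along these lines (a sketch, not the author's exact listing; GradientBoostingClassifier is used here as a dependency-free stand-in for XGBoost's XGBClassifier, so substitute XGBClassifier if xgboost is installed):

```python
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression

# Level 0: a dictionary of base classifiers. GradientBoostingClassifier
# stands in for xgboost.XGBClassifier to keep this sketch dependency-free.
level_0_classifiers = {
    "logreg": LogisticRegression(max_iter=1000, random_state=42),
    "forest": RandomForestClassifier(random_state=42),
    "xgboost": GradientBoostingClassifier(random_state=42),
    "xtrees": ExtraTreesClassifier(random_state=42),
}

# Level 1 (the "final estimator"): the best-performing standalone model.
level_1_classifier = RandomForestClassifier(random_state=42)
```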

Level 0 has been implemented as a dictionary of classification machine learning algorithms. Four different types of model have been chosen – logistic regression, random forest, XGBoost and extra random trees.

The latter three are all high-performing classifiers and the logistic regression provides some variation. High-performing algorithms were deliberately chosen to see if the stacking model can successfully out-perform them.

A random forest has been selected for the "Level 1" algorithm (or "final estimator"). Experimentation showed it to be the highest performing individual algorithm on this dataset, hence it was selected as the level 1 model to drive higher performance from the stacking model.

Creating a Stacking Model using Scikit-Learn

scikit-learn provides an easy-to-use stacking implementation; we will be exploring the output and data in detail to understand exactly what it is doing.

A StratifiedKFold is passed as the cv parameter because scikit-learn uses cross-validated predictions from the Level 0 models to train the Level 1 model / final estimator –

"Note that estimators are fitted on the full X while final_estimator is trained using cross-validated predictions of the base estimators using cross_val_predict." (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html)

This approach was explored and explained in detail in Part I of the article.
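Putting the pieces together, the construction might look like this (sketch; n_splits=5 and the GradientBoostingClassifier stand-in for XGBoost are assumptions):

```python
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

level_0_classifiers = {
    "logreg": LogisticRegression(max_iter=1000, random_state=42),
    "forest": RandomForestClassifier(random_state=42),
    "xgboost": GradientBoostingClassifier(random_state=42),  # XGBoost stand-in
    "xtrees": ExtraTreesClassifier(random_state=42),
}

# The folding used to generate the cross-validated Level 0 predictions
# that train the final estimator.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

stacking_model = StackingClassifier(
    estimators=list(level_0_classifiers.items()),
    final_estimator=RandomForestClassifier(random_state=42),
    stack_method="predict_proba",
    cv=kfold,
    passthrough=True)
```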

Generating the Level 0 Predictions as Engineered Features

Image by Author

The next stage of stacking is to generate classification predictions using the Level 0 models and to append them as features to the training data.

Here is how it is done in the scikit-learn implementation …
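A compact sketch of that step, on small synthetic data with two Level 0 models (the column names mirror the article's output; they are labels I am applying for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(4)])

stacking_model = StackingClassifier(
    estimators=[("logreg", LogisticRegression(max_iter=1000)),
                ("forest", RandomForestClassifier(random_state=42))],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    stack_method="predict_proba",
    passthrough=True)

# fit_transform fits every Level 0 model and returns their class-1
# probabilities with the original features appended after them.
data = stacking_model.fit_transform(X, y)
transformed = pd.DataFrame(
    data, columns=["logreg_prediction", "forest_prediction"] + list(X.columns))
print(transformed.shape)  # (500, 6): 2 prediction columns + 4 features
```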

Image by Author

It can be seen from the output that the scikit-learn StackingClassifier.fit_transform() method has appended the original features to the Level 0 predictions because passthrough=True. If passthrough were set to False the output would contain just the predictions.

So how did Scikit-Learn do it?

A little bit of experimentation shows exactly what the library is doing under the hood. The values in the logreg_prediction, forest_prediction, xgboost_prediction and xtrees_prediction columns were generated as follows –
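The experiment can be reproduced in a few lines (a sketch on synthetic data; the prediction column names follow the article's output):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(4)])
y_train = pd.Series(y, name="target")

# What the library is doing under the hood: fit each Level 0 model on the
# full training data, then keep column 1 of predict_proba (the probability
# of class 1) as the new engineered feature.
engineered = X_train.copy()
for name, classifier in [("logreg", LogisticRegression(max_iter=1000)),
                         ("forest", RandomForestClassifier(random_state=42))]:
    classifier.fit(X_train, y_train)
    engineered[f"{name}_prediction"] = classifier.predict_proba(X_train)[:, 1]
```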

Image by Author

And that is all there is to it! The scikit-learn stacking classifier is simply training each of the Level 0 classifiers sequentially and then using the second column of each classifier's predict_proba output to populate the new feature –

Note that the second column of predict_proba (sliced using [:, 1]) is simply the probability for class=1 (in the case of binary classification where the possible values are zero and one).

Also to note is that if the StackingClassifier sets passthrough=True the classification predictions are appended to the data whereas if passthrough=False is set the original features are removed leaving just the newly generated features / classification predictions.
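The effect of the passthrough flag can be seen directly from the shape of the transformed output (sketch on small synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=6, random_state=42)

for passthrough in (False, True):
    model = StackingClassifier(
        estimators=[("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=42))],
        final_estimator=RandomForestClassifier(random_state=42),
        stack_method="predict_proba",
        passthrough=passthrough)
    print(passthrough, model.fit_transform(X, y).shape)
# False -> (300, 2): just the two Level 0 prediction columns
# True  -> (300, 8): the predictions plus the six original features
```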

My experimentation has shown that performance is significantly improved by leaving the original features in the datasets, so the next code block will retrain the stacking classifier to retain all of the original features …

Generating the Level 1 / Final Predictions for the Test Data

Image by Author

This next stage looks complicated but it is very easy to carry out this operation in scikit-learn.

This first line of code is not really necessary; we could just skip to stacking_model.predict(X_val), but I have included it to show the transformed data represented in the diagram above as the grey test data rectangle with the orange, blue and green predictions appended as new features.

Image by Author

This line of code certainly is necessary; it performs the Level 1 model predictions on the transformed test data that includes the Level 0 predictions as features.
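Both lines together might look like this (a self-contained sketch on synthetic data with two Level 0 models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

stacking_model = StackingClassifier(
    estimators=[("logreg", LogisticRegression(max_iter=1000)),
                ("forest", RandomForestClassifier(random_state=42))],
    final_estimator=RandomForestClassifier(random_state=42),
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    stack_method="predict_proba",
    passthrough=True).fit(X_train, y_train)

# Optional: inspect the transformed test data (Level 0 predictions
# appended to the original features).
X_val_transformed = stacking_model.transform(X_val)

# The essential line: predict() applies the same transform internally and
# lets the Level 1 / final estimator make the final predictions.
y_val_pred = stacking_model.predict(X_val)
print(y_val_pred[:5])
```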

y_val_pred is represented in the diagram above as the final purple rectangle.

array([0, 1, 0, ..., 0, 0, 0])

So was it all worth it?

It is time to evaluate the performance by comparing the accuracy of the full Level 0 and Level 1 stacking model with the performance we would have seen had we chosen one of the Level 0 models to make the predictions for y_val
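The comparison might be coded like this (a sketch on synthetic stand-in data, so the scores it prints will differ from the article's results; "xgboost" is again represented by GradientBoostingClassifier):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

level_0_classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=42),
    "xgboost": GradientBoostingClassifier(random_state=42),
    "xtrees": ExtraTreesClassifier(random_state=42),
}

stacking_model = StackingClassifier(
    estimators=list(level_0_classifiers.items()),
    final_estimator=RandomForestClassifier(random_state=42),
    stack_method="predict_proba", passthrough=True).fit(X_train, y_train)

print("Accuracy of scikit-learn stacking classifier:",
      accuracy_score(y_val, stacking_model.predict(X_val)))

# Each Level 0 model evaluated standalone on the same held-back data.
for name, classifier in level_0_classifiers.items():
    classifier.fit(X_train, y_train)
    print(f"Accuracy of standalone {name} classifier:",
          accuracy_score(y_val, classifier.predict(X_val)))
```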

Accuracy of scikit-learn stacking classifier: 0.8825
Accuracy of standalone logreg classifier: 0.737
Accuracy of standalone forest classifier: 0.8675
Accuracy of standalone xgboost classifier: 0.8645
Accuracy of standalone xtrees classifier: 0.8635

The full stacking model has an accuracy score of 88.25% vs. 86.75% for the highest performing standalone classifier – a 1.5 percentage point improvement.

A 1.5% accuracy improvement may not seem huge, but consider the following –

  1. In a Kaggle competition a 1.5% improvement could represent a significant change in position on the leader board.
  2. If the business problem being solved were operationally critical (stock market predictions, predicting rocks vs. mines etc.) then every fraction of a percent counts.
  3. Other options for squeezing improvement like hyper-parameter tuning are likely to achieve smaller improvements.
  4. Each Level 0 model and the Level 1 model / final estimator could be individually hyper-parameter tuned which could yield more improvement when combined with stacking.

Can we do better?

Well, I think we can. During my investigation I came across some very clever code a data scientist had written that performed a lot of calculations to pick the optimum combination of Level 0 and Level 1 classifiers.

I really liked the idea, but I could not help thinking that it was overly complex and that there must be a way of achieving the same outcome in less code, which led to an experiment using grid searching to find the optimum combinations.

The first thing we need is a small function to help generate the various permutations of parameters that will be used in the grid search –
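One way to write such a function (the name power_set matches the article; the body is my sketch, excluding the empty set and singletons because a single Level 0 model is not a stack):

```python
from itertools import combinations

def power_set(items, min_size=2):
    """Return every combination of `items` with at least `min_size`
    members, smallest combinations first."""
    return [list(combo)
            for size in range(min_size, len(items) + 1)
            for combo in combinations(items, size)]

names = ["logreg", "forest", "xgboost", "xtrees"]
print(len(power_set(names)))  # 11 useful combinations of the 4 models
```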

The idea is to use the power_set function to generate all of the useful combinations of Level 0 models to use as grid search parameters …

[['logreg', 'forest'],
 ['logreg', 'xgboost'],
 ['logreg', 'xtrees'],
 ['forest', 'xgboost'],
 ['forest', 'xtrees'],
 ['xgboost', 'xtrees'],
 ['logreg', 'forest', 'xgboost'],
 ['logreg', 'forest', 'xtrees'],
 ['logreg', 'xgboost', 'xtrees'],
 ['forest', 'xgboost', 'xtrees'],
 ['logreg', 'forest', 'xgboost', 'xtrees']]

Level 1 consists of a single model, so the grid search parameters will be a simple list of single objects …

['logreg', 'forest', 'xgboost', 'xtrees']

Armed with these building blocks the entire optimisation of the stacking model can be realised in just 3 lines of code after the grid search parameters have been defined.

Note that the PredefinedSplit forces the grid search to use the training and test data in the same way as the standalone stacking model so that the results are directly comparable …
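A condensed sketch of the search (the candidate lists here are deliberately small so it runs quickly; the article's real search fed in the full power_set output plus final-estimator and stack-method candidates):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (GridSearchCV, PredefinedSplit,
                                     train_test_split)

X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# PredefinedSplit: -1 marks rows that are always training data, 0 marks the
# single validation fold, so the search scores on exactly the held-back data.
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val))])
split = PredefinedSplit(test_fold)
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

classifiers = {"logreg": LogisticRegression(max_iter=1000),
               "forest": RandomForestClassifier(random_state=42),
               "xtrees": ExtraTreesClassifier(random_state=42)}

# "estimators", "final_estimator" and "passthrough" are real
# StackingClassifier parameters; the candidate values are illustrative.
param_grid = {
    "estimators": [[(name, classifiers[name]) for name in combo]
                   for combo in [["logreg", "forest"],
                                 ["logreg", "forest", "xtrees"]]],
    "final_estimator": [RandomForestClassifier(random_state=42)],
    "passthrough": [True, False],
}

grid = GridSearchCV(
    StackingClassifier(estimators=[("logreg", LogisticRegression(max_iter=1000))]),
    param_grid, cv=split, scoring="accuracy")
grid.fit(X_all, y_all)
print("Best accuracy score: ", grid.best_score_)
print("Best passthrough: ", grid.best_params_["passthrough"])
```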

Image by Author
Best accuracy score:  0.8895
Best estimators:  ['logreg', 'forest', 'xgboost', 'xtrees']
Best final estimator:  RandomForestClassifier(random_state=42)
Best passthrough:  True
Best stack method:  predict_proba

And the results are in!

The optimal configuration is to include all 4 classifiers as Level 0 models, but if we use a RandomForestClassifier instead of an ExtraTreesClassifier for the Level 1 model we squeeze out even more performance. This combination achieves 88.95% accuracy, which is now a full 2.2 percentage points better than the highest performing individual algorithm.

It is also very interesting to note that the optimum choice of Level 0 models includes the LogisticRegression classifier which has a much lower performance (73.7% accuracy) than the others. This just goes to show that including a diverse range of models including lower performers can drive higher performance from the overall ensemble.

Conclusion

In conclusion, stacking really does work when it comes to improving the accuracy of both classification and regression machine learning models.

It is unlikely to yield accuracy improvements in the 5%-10% range but the improvements could still be significant depending on the context of the business or competition problem being solved.

Stacking should be used with due consideration; it takes more time and adds complexity. If that additional complexity, and the associated difficulty of explaining how the model reached its answers, is a worthwhile trade-off for a sub-5% improvement in accuracy, then stacking will be worth using.

This article has provided a simple, low-code, worked example using the scikit-learn implementation of stacking and explained in detail how it works which could help data scientists to decide when to use and when not to use stacking in their solutions.

Thank you for reading!

If you enjoyed reading this article, why not check out my other articles at https://grahamharrison-86487.medium.com/? Also, I would love to hear from you to get your thoughts on this piece, any of my other articles or anything else related to Data Science and data analytics.

If you would like to get in touch to discuss any of these topics please look me up on LinkedIn – https://www.linkedin.com/in/grahamharrison1 or feel free to e-mail me at [email protected].

If you would like to support the author and thousands of others who contribute to article writing worldwide by subscribing, please use the following link (note: the author will receive a proportion of the fees if you sign up using this link, at no extra cost to you).

Join Medium with my referral link – Graham Harrison

