Ensembles in Machine Learning

A Tutorial with Python Examples and Code

Pádraig Cunningham
Towards Data Science


TL;DR

Ensemble methods are well established as an algorithmic cornerstone in machine learning (ML). Just as in real life, in ML a committee of experts will often perform better than an individual, provided appropriate care is taken in constituting the committee. Since the earliest days of ML research, a variety of ensemble strategies have been developed, with random forests and gradient boosting emerging as leading-edge methods in classification today. The main recommendations in this tutorial are as follows:

  1. If the objective is to maximise accuracy, gradient boosting is recognised as perhaps the leading supervised learning method in ML [16].
  2. Random forests also achieve very good accuracy and have the added advantage of providing insight into the data through the feature importance mechanism.
  3. It is worth understanding how stacking works because it has an emerging role in AutoML.

Introduction

It has been recognised since the early days of ML research that ensembles of classifiers can be more accurate than individual models. In ML, ensembles are effectively committees that aggregate the predictions of individual classifiers. They are effective for much the same reasons a committee of experts works in human decision making: the members can bring different expertise to bear and the averaging effect can reduce errors. This article presents a tutorial on the main ensemble methods in use in ML with links to Python notebooks and datasets illustrating these methods in action. The objective is to help practitioners get started with ML ensembles and to provide an insight into when and why ensembles are effective.

Research on ML ensembles dates back to the 1990s [25, 1, 6]. There have been a lot of developments since then and the ensemble idea is still to the forefront in ML applications. For example, random forests [2] and gradient boosting [7] would be considered among the most powerful methods available to ML practitioners today.

The generic ensemble idea is presented in Figure 1. All ensembles are made up of a collection of base classifiers, also known as members or estimators. When presented with a query these will each make a prediction and these predictions will be combined to produce an ensemble prediction. Different ensemble strategies vary on how exactly the base classifiers are trained and how the combination of predictions is achieved. For the purpose of this tutorial we organise ensemble methods into four categories:

  • Bagging: Bootstrap aggregation (bagging) refers to ensembles that achieve diversity in the estimators by training on random bootstrap resamples of the data. The aggregation of the outputs of these estimators is achieved by averaging or majority voting. Under this category we also consider ensembles based on random subspaces rather than random subsets. Random forests are also included in this category.
Figure 1: An ensemble is a group of classifiers (estimators) that produce predictions that are combined to produce an aggregate prediction. The different ensemble architectures are differentiated by how the classifiers are trained and how the aggregation is performed. Image by author.
  • Boosting: While bagging ensemble members are effectively trained independently, with boosting the estimators are trained in series with the training of a new member being influenced by overall performance so far. The estimator performance also determines their contribution in the aggregation process. Gradient boosting seeks to optimise the training of new estimators in tandem with the aggregation process.
  • Heterogeneous Ensembles: Given the plethora of classifier methods available, why not use a variety of models to achieve diversity in an ensemble? While this may seem an obvious thing to do, heterogeneous ensembles have not received a lot of attention in the research literature and are not used much in practice. This topic is covered below.
  • Stacking: (a.k.a. Stacked Generalisation) Stacking treats the aggregation process itself as a learning process. A meta-model is trained to learn the relationship between the outputs of the estimators and the targets. The data used for this should not also be used for training the estimators, so stacking entails some data management challenges.

This article includes a section on each of these four ensemble architectures. In the next section we provide some historical context that stretches right back to the 18th century. Then we present a framework for categorising prediction error as this provides some important insight into when and why ensembles can improve accuracy. Before embarking on the in-depth discussions of the main ensemble categories, we provide a high-level summary of how ensembles work. The article finishes with some recommendations and conclusions.

Historical Context

In a sense, ensembles embody the idea that “two heads are better than one” and this has been known for some time. In fact the Condorcet Jury Theorem, proposed by the Marquis de Condorcet in 1785 [3], claims that the decision making of a committee will be better than that of individuals. The theorem states that, for a committee of n voters where each voter has a probability p of being correct and M is the probability that a majority of the voters is correct:

  1. p > 0.5 ⇒ M > p
  2. p > 0.5 ⇒ M → 1.0 as n → ∞

That is, if an individual has a >50% chance of being correct, then an ensemble of such individuals will be even more likely to be correct. The first claim is generally true provided there is diversity in the ensemble. After all, if all members vote the same way the ensemble will be no better than an individual. So there must be diversity in the pool of voters — i.e. there must be some disagreement between their decisions.

The second claim, that larger committees are very likely to make correct decisions, is more problematic. The probability of a majority of voters being correct will increase as the ensemble grows only if the diversity in the ensemble continues to grow as well [15]. Eventually, new ensemble members will have voting patterns collinear with existing members — i.e. they will vote the same way. Typically the diversity of the ensemble will plateau as will the accuracy of the ensemble at some size between 10–50 members.
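To make the first claim concrete, the short calculation below (a minimal sketch, assuming independent voters and an odd committee size) computes the probability M that a majority of n voters is correct when each voter is correct with probability p:

from math import comb

def majority_prob(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct,
    where each voter is correct with probability p (n assumed odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# With p = 0.6, the majority becomes more reliable as the committee grows.
for n in [1, 11, 51, 101]:
    print(n, round(majority_prob(n, 0.6), 4))

Because the calculation assumes the voters are independent, it overstates the benefit for real ensembles, where the members' errors are correlated.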

Components of Error

The benefits of ensembles are often described in terms of how the ensemble contributes to reducing error. In this context, error has three components:

  • Bayes Error: (a.k.a. irreducible error) If a model does not have access to all the features that influence an outcome then it will not be possible for that model to be 100% accurate. Imagine a model for predicting house prices that did not have information on the location of the property. Often the information that is missing will not be so obvious but the key point is that models will not be able to reduce error to zero. This remaining error is the Bayes error, sometimes called the irreducible error. So the Bayes error represents the limit of performance, and many models will not even achieve it.
  • Bias: If a model does not achieve the Bayes error the additional error has two components, one of which is bias. In regression, bias might be evident if the mean (the central tendency) of the distribution of the predictions is higher or lower than it should be (see Figure 2 left column). In classification, a particular class might be under- or over-represented in the predictions.
Figure 2: An illustration of the Bias and Variance components of error. This is a 2D regression task where the target is at the point (0,0). The target is shown in red and predictions for that target are shown in blue. Image by author.
  • Variance: In regression a model can have very low bias but still have high error (image on top right in Figure 2). The mean of the predictions is right but the predictions vary a lot. In classification, the model is not biased towards a single outcome but errors show up as misclassifications in multiple directions.

The ‘Low Bias High Variance’ example at the top right in Figure 2 shows the link between ensembles and the ‘Wisdom of Crowds’ idea as presented in the book by James Surowiecki [21]. This book opens with the ‘guess the weight of the ox’ example published in Nature in 1907 by Galton [8]. In this example 787 people entered a competition to guess the weight of an ox at a county fair. Galton analysed the estimates to show that, while they varied widely, the median estimate was within 1% of the correct weight. In the same way, in Figure 2 we can see how an ensemble might reduce the variance component of error by averaging many estimates. This is the main claim for ensembles, that they can reduce the variance component of error. Later we will see that boosting ensembles can in some circumstances also reduce the bias component of error.

Why do Ensembles Work?

Before examining the details of specific ensemble strategies it is worth looking at the general principles on which ensembles depend. Earlier we saw from the Condorcet Jury Theorem and subsequent work in political science that there are two important considerations:

  • Size: The more ensemble members the better — within reason. The accuracy of the ensemble will increase as more members are added until eventually new members don’t bring new information.
  • Diversity: In order for the accuracy of the ensemble to be better than that of the individuals, there must be diversity among the ensemble members.
Figure 3: The impact of ensemble size and diversity on accuracy. On the left, accuracy increases as more estimators (ensemble members) are added. On the right, for an ensemble of 10 estimators, diversity can be increased by using smaller subsets of the data resulting in less training set overlap. Accuracy increases as estimators are less similar. Graphs by author.

The impact of these factors is illustrated in the Ensembles-Preliminaries notebook and the results are presented in Figure 3. They were produced using the bagging ensemble implementation from scikit-learn on the wine dataset from the UCI Repository. The accuracy estimates were obtained using cross-validation. The impact of ensemble size is shown in Figure 3 on the left. With two estimators (members) the ensemble has an accuracy of 0.87. This increases to above 0.91 with seven estimators but there is no material increase with the addition of more estimators. This increase-then-plateau pattern is typical, although for other datasets the plateau might not be reached until significantly more estimators are added.

Illustrating the impact of diversity is less straightforward because diversity is less easy to quantify. In Figure 3 on the right, the strategy we use to manage diversity is to control the overlap in the datasets used to train the estimators. The ensemble size is always 10 and the first ensemble members are trained using 95% of the data sampled without replacement. So the estimators are all very similar to each other. Then the next ensemble is trained using 90% of the data and so on. So diversity is increased by training estimators with less and less of the available data allowing them to be less similar. Presumably these estimators become less accurate as less data is used but the ensemble accuracy increases as any loss in accuracy in the individual estimators is more than offset by the increase in diversity. Eventually the benefits tail off as it becomes more and more difficult to add diversity without being just plain wrong.
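The sketch below gives a flavour of how such an experiment can be set up with scikit-learn. It is a simplified stand-in for the notebook code, assuming decision trees as the base estimator and plain 10-fold cross-validation on the wine dataset:

from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Impact of ensemble size: accuracy typically rises and then plateaus.
for n in [2, 5, 10, 20, 50]:
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n, random_state=0)
    print(f"{n:3d} estimators: {cross_val_score(bag, X, y, cv=10).mean():.3f}")

# Impact of diversity: smaller training fractions (sampled without
# replacement) reduce the overlap between the estimators' training sets.
for frac in [0.95, 0.8, 0.6, 0.4]:
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            max_samples=frac, bootstrap=False, random_state=0)
    print(f"max_samples={frac}: {cross_val_score(bag, X, y, cv=10).mean():.3f}")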

Quantifying Diversity

If diversity is such an important factor in determining the effectiveness of an ensemble, it would be good to be able to quantify it. Krogh and Vedelsby [12] have shown the following very important relationship between error and ambiguity (diversity) in regression ensembles:

E = Ē − Ā

where E is the overall error of the ensemble over the input distribution, Ē is the average generalisation error of the ensemble components and Ā is the ensemble ambiguity averaged over the input distribution. Ē is a standard squared error estimation (L2 loss) and Ā is an aggregation (average) of individual ambiguities ā(x), the ambiguity of the ensemble of N estimators on a single input x:

ā(x) = (1/N) Σₖ (fₖ(x) − f̄(x))²

where fₖ(x) is the prediction of estimator k, f̄(x) is the average of the N predictions and the terms in the summation are the square of the difference between the estimate from each estimator and the average estimate. Thus the ambiguity is effectively the variance in the predictions coming from the ensemble members.

Unfortunately no such neat analysis exists for classifiers. Kuncheva and Whitaker [14] analyse 10 different diversity measures for classifier ensembles. They find that they all do a reasonable job of quantifying diversity but, after a fairly extensive evaluation, they conclude that there is no single winner. In the Kuncheva and Whitaker analysis they distinguish between pairwise and non-pairwise measures. To assess ensemble diversity using a pairwise measure, the measure is calculated for all pairs of N estimators and the average is taken. Their analysis considers four pairwise and six non-pairwise measures.

To provide some insight into how diversity measures work we provide two examples here, one of each type. Perhaps the most straightforward pairwise measure is the plain disagreement measure [23] (sometimes called the classifier output difference [24]). This measure simply counts the proportion of a test set on which the two classifiers disagree, a measure in the range [0,1]. The plain disagreement measure for two classifiers hᵤ and hᵥ is:

div(hᵤ, hᵥ) = (1/m) Σᵢ Diff(hᵤ(xᵢ), hᵥ(xᵢ))

where m is the number of instances in the data set and the summation counts the disagreements between the two classifiers, that is, Diff(a,b) = 0 if a = b and Diff(a,b) = 1 otherwise. The overall ensemble diversity would be the average of these measures over the N×(N−1) ordered pairs of estimators. This plain disagreement measure is used in some of the evaluations in this article, see for example Figures 5 and 11.
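A minimal implementation of this measure is sketched below; the helper names are our own and it assumes the classifiers have already produced predictions on a common test set:

import numpy as np

def plain_disagreement(preds_u, preds_v):
    """Proportion of test samples on which two classifiers' predictions differ."""
    preds_u, preds_v = np.asarray(preds_u), np.asarray(preds_v)
    return np.mean(preds_u != preds_v)

def ensemble_diversity(all_preds):
    """Average pairwise disagreement over a list of prediction vectors."""
    pairs = [(u, v) for u in range(len(all_preds)) for v in range(len(all_preds)) if u != v]
    return np.mean([plain_disagreement(all_preds[u], all_preds[v]) for u, v in pairs])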

By contrast a non-pairwise measure considers all ensemble members at once. At the level of a single test sample, we want to measure the diversity in the set of N predictions for that sample. The entropy of this set is an obvious choice. Then for a test set of m samples where there are C classes the entropy is [4, 14]:

Ent = (1/m) Σᵢ Σₖ −Pᵢₖ log(Pᵢₖ)

where Pᵢₖ is the frequency of class k in the N predictions for sample i — the more dispersion or randomness in the predictions the more diversity.
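A sketch of this entropy measure is shown below; it assumes the predictions are collected in an m × N array, one row per test sample and one column per estimator:

import numpy as np

def entropy_diversity(pred_matrix):
    """Average Shannon entropy of the class-label distribution predicted
    for each sample by the N ensemble members (rows are samples)."""
    pred_matrix = np.asarray(pred_matrix)
    entropies = []
    for row in pred_matrix:
        _, counts = np.unique(row, return_counts=True)
        p = counts / counts.sum()
        entropies.append(-np.sum(p * np.log(p)))
    return float(np.mean(entropies))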

Bagging (and Variants)

The bagging ensemble idea was introduced by Breiman in 1996 [1]. Bagging works by bootstrap aggregation, hence the name. Diversity in the set of classifiers (Figure 1) is achieved by bootstrap sampling and then the predictions are aggregated by simple majority voting. Bootstrap sampling means sampling with replacement. The characteristics of bootstrap sampling are illustrated with a simple example at the beginning of the Python notebook on bagging. We create a list of 1,000 unique samples and then create a sub-sample of size 1,000 from this. If the sampling is done with replacement, we find that roughly 63% of samples get selected with some being selected multiple times. This means that ∼37% of samples are not selected in each ‘bag’ — these are the out-of-bag (OOB) samples. We will see later that these prove to be very useful.
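The following sketch reproduces that bootstrap experiment in a few lines; the 63%/37% split follows directly from sampling with replacement:

import numpy as np

rng = np.random.default_rng(0)
m = 1000
samples = np.arange(m)                                   # 1,000 unique samples
bootstrap = rng.choice(samples, size=m, replace=True)    # one 'bag' of the same size

selected = np.unique(bootstrap)
print("selected:   ", len(selected) / m)                 # roughly 0.63
print("out-of-bag: ", 1 - len(selected) / m)             # roughly 0.37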

To implement an ensemble such as is shown in Figure 1 using bagging, the ensemble members are trained with bootstrap samples from the training data. Typically these bootstrap samples are the same size as the full training set. In the right circumstances, as we will see, this sampling strategy will produce enough diversity to produce an effective ensemble. The Aggregation step is simply majority voting over the ensemble members with members having equal votes.

Figure 4: The effectiveness of bagging on stable and unstable classifiers. (a) Bagging has less impact with k-NN, a stable classifier. (b) For five different classifiers, bagging only improves accuracy for the two unstable models (Trees and Neural Nets). Graphs by author.

In Figure 4 we see how bagging works on the wine dataset. These results come from evaluations in the bagging notebook. In Figure 4(a) we see that ensemble accuracy improves for the neural net ensemble as more estimators (i.e. models or classifiers) are added. However, a similar improvement is not achieved with the k-NN ensemble.

The reason for this is well understood: bagging only works for unstable classifiers [1, 20]. In this context an unstable classifier is one where minor changes in the model inputs (in this case the training data) produce a significantly different model. With bagging this instability is an advantage because it produces the diversity required for the ensemble to work. In Figure 4(b) we show results of bagging ensembles built with five different base models. Decision trees and neural nets are known to be unstable, whereas logistic regression, k-NN and naive Bayes are stable. Sure enough, bagging delivers benefits for the unstable models but not for the others.

We can use the plain disagreement measure to explore this further. Figure 5 shows colour maps of the plain disagreement measures for neural network and k-NN ensembles with five members each. The picture for the neural net ensemble is quite healthy with significant pairwise disagreement between the estimators. The situation with the k-NN ensemble is quite different with some of the estimators only differing on a few percent of the samples. So k-NN is stable in the face of significant changes in the training data. This means that bagging doesn’t work with k-NN. However, there are other possible strategies, the most popular of which is random subspacing.

Figure 5: These colour maps show the plain disagreement measure for two ensembles with five members. The neural network ensemble is quite diverse while there is little diversity in the k-NN ensemble. In these colour maps lighter is better, i.e. greater disagreement. Graphs by author.

Random Subspacing

In the example shown in Figure 6, bagging would entail selecting different subsets of the rows in the original dataset. By contrast random subspacing [11, 10] will select different subsets of the columns as shown in the figure. We have seen in Figure 4 that bagging has no benefit when the base classifier is stable. The colour maps in Figure 5 suggest that this is because different bootstrap training sets result in little variation in the output of the estimators.

Figure 6 shows the random subspacing strategy for ensuring diversity. Instead of subsampling the rows, the columns are subsampled. If all the features together represent the full feature space, the different estimators are trained on random subspaces of it. It may not be immediately obvious, but this will normally be a more decisive mechanism for achieving diversity. The different estimators will reflect different views on the data. Some estimators will not have access to the most predictive features and will need to rely on the less predictive features. Research shows that random subspace strategies do produce improvements for stable estimators [10, 11, 13].

Figure 6: An ensemble based on random subspacing — the different ensemble members are trained using random subsets of the features. Graphs by author.

The impact of random subspacing in combination with k-NN on the wine dataset is shown in Figure 7. The original neural network and k-NN results from Figure 4(b) are shown again in Figure 7 but with results for ensembles based on random subspacing added. Whereas bagging was not effective with k-NN, random subspacing does produce some improvement. The results across multiple model combinations shown in Figures 4 and 7 suggest that a cross-validation accuracy of 97% to 98% may be the best that is achievable with that dataset, i.e. the Bayes error is roughly 2%.

Figure 7: The impact of random subspacing in combination with k-NN. Whereas bagging produces no improvements with k-NN, random subspacing does. Both strategies are equally effective with neural networks. Graph by author.

Bagging in scikit-learn

One justification for considering bagging and random subspacing in the same category is that they are implemented in a single integrated framework in scikit-learn. The BaggingClassifier framework has the following parameters:

  • base_estimator: the component classifier type.
  • n_estimators: number of estimators.
  • max_samples: number (or proportion) of samples used to train each estimator.
  • bootstrap: a boolean parameter indicating whether sampling is done with replacement (bootstrap) or not.
  • max_features: the number (or proportion) of features used in each estimator.

The combination of these parameters determines whether the ensemble uses bagging or random subspacing. If max_samples and max_features are set to 1.0 and bootstrap is set to True we have bagging. If, for example, max_samples is set to 1.0, max_features is set to 0.5 and bootstrap is set to False we have random subspacing. We could also have a combination of bagging and random subspacing; this is what happens with random forests.
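The sketch below shows these two configurations side by side; the base estimator choice (a decision tree) and the ensemble size are illustrative:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: bootstrap samples of the full training set, all features.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=1.0, max_features=1.0, bootstrap=True)

# Random subspacing: all training samples, each estimator sees half the features.
subspace = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             max_samples=1.0, max_features=0.5, bootstrap=False)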

Random Forests

Random forests represent the state-of-the-art in ensembles based on feature and sample selection. As the name suggests, the base estimator is a decision tree. Random forests use both bootstrapping and random subspacing to ensure diversity [2]. The number of trees tends to be relatively large because estimates of generalisation accuracy and feature importance are often generated as a side effect of the ensemble building process.

For a dataset D of m instances described by a set of features F the strategy is to build a lot of trees, typically 100 to 1,000 trees. Then, for each tree:

  1. As in bagging, for each ensemble member the training set D is sub-sampled with replacement to produce a training set of size m.
  2. Where F is the set of features that describes the data, a number q ≪ |F| is selected as the number of features to be considered for splitting. At each stage (i.e. node) in the building of a tree, q features are selected at random to be the candidates for splitting at that node.

Given that we are working with a model that has a lot of hyperparameters, it is interesting to examine the default parameters in scikit-learn. The default number of trees is 100. The default bootstrap set size is m. The tree pruning parameters are set so that there is no pruning, i.e. trees are bushy. q = √|F| by default. In the original paper on random forests Breiman highlights a number of benefits in addition to prediction accuracy [2]. It would be generally accepted that the OOB estimate of generalisation accuracy and variable importance scores are significant benefits of the random forest approach.

OOB Generalisation Estimate: One of the highlighted benefits of random forests is the potential to get very good estimates of generalisation accuracy without performing cross validation or holding back data for testing. In fact this OOB strategy might be viewed as an implicit cross-validation. The process is as follows:

1. Identify all samples that are OOB in some trees.
2. For each OOB sample:
2.1. Find the trees not trained using that sample.
2.2. Generate predictions for that sample using those trees.
2.3. Determine the majority prediction and compare it with the true value.
3. Compile the OOB error over all the OOB samples.
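In scikit-learn this whole procedure is available through the oob_score option. A minimal sketch on the wine dataset (hold-out split and forest size chosen for illustration) might look as follows:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print("OOB estimate:     ", round(rf.oob_score_, 3))        # implicit estimate, no test data used
print("Hold-out accuracy:", round(rf.score(X_test, y_test), 3))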

This OOB strategy is indeed effective for getting accurate estimates of generalisation accuracy. A simple illustration of this is presented in Figure 8(a). In this evaluation we hold back one third of the wine dataset for testing and train random forests on the training data. This process is repeated for random forests with between 10 and 100 estimators. Each step is repeated 50 times to ensure reliable results. This code is available in the Ensembles-RandomF notebook. The evaluation shows that once the ensemble has 60 estimators the generalisation estimate is in line with the estimate from repeated hold-out testing. It is worth remembering that the default random forest size in scikit-learn is 100.

Figure 8: Two benefits of random forest are (a) OOB estimates of generalisation accuracy, (b) estimates of feature importance. Graphs by author.

Feature Importance: The principle underlying the random forest feature importance mechanism is to add noise to a variable to see what happens. The actual details are closely connected to the OOB strategy already described. For each feature in turn we permute the value for that feature in the OOB samples and re-run those samples through the trees in which those samples are OOB. The policy for permuting the values is to shuffle the values for a particular feature, i.e. the values in the column for that feature are shuffled. The generalisation error estimates before and after permuting are compared. If a variable is important this shuffling will have a significant impact.

Variable importance scores for the wine dataset are shown in Figure 8 (b). We see for instance that the Ash feature has a low feature importance score because it has a minimal impact on error. By contrast Proline seems to be a very important feature. This proves to be a very effective mechanism for assessing variable/feature importance because features are evaluated in context.
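Note that scikit-learn's feature_importances_ attribute on a random forest is impurity-based rather than the OOB permutation scheme described above. The permutation_importance utility in sklearn.inspection applies the same shuffle-and-compare idea, albeit on a supplied evaluation set rather than on the OOB samples; a minimal sketch on the wine dataset is shown below:

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in accuracy on the test set.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(data.feature_names, result.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name:30s} {score:.3f}")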

Boosting

Boosting refers to the way an ensemble can ‘boost’ a weak learner into an arbitrarily accurate strong learner. In PAC learning theory [19] a weak learner is one that is only slightly better than random guessing. A strong learner can achieve an arbitrarily low error rate given enough training data. In boosting, the weak learner is typically a decision stump, a decision tree with just one decision node as can be seen in Figure 9(c).

Whereas with bagging the ensemble members are trained independently, with boosting the estimators are trained in series, with the performance of estimator k influencing the training of estimator (k+1). The key innovation is to focus on the misclassified examples so that they are up-sampled when the next estimator is trained. This focus on where the errors lie gives boosting the potential to reduce the bias component of error as well as the variance. While this boosting idea is a general principle for training ensembles, the specific implementation that first became popular is AdaBoost, introduced by Freund and Schapire [6]. The overall AdaBoost algorithm to train an ensemble of N estimators from a dataset of m examples is as follows:

1. For estimator 0 assign an equal weight of 1/m to all training examples: D₀(i) = 1/m.
2. FOR each k of the N estimators to be trained:
2.1. Randomly sample l examples from the full training set with replacement, based on the current weights.
2.2. Train estimator hₖ on this sample.
2.3. Identify the examples misclassified by this estimator and calculate the weighted error εₖ.
2.4. Calculate the weight αₖ for this estimator based on εₖ: αₖ = ½ ln((1−εₖ)/εₖ).
2.5. Increase the weights of misclassified examples and decrease the weights of the other examples.
3. Output the final model based on all N estimators (e.g. a weighted majority voting model).

Depending on how the sampling is implemented (step 2.1. above) it may be necessary in step 2.5 to normalise the updated weights to ensure that a proper distribution is maintained, i.e. the weights sum to 1.
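The loop below is a minimal sketch of these steps for a two-class problem (it is not the notebook code). It uses example re-weighting via sample_weight rather than the weighted resampling of step 2.1, which is a common alternative formulation, and it reduces the wine data to two classes for simplicity:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
mask = y < 2                                          # keep two classes; this sketch is binary
X, y = X[mask], np.where(y[mask] == 0, -1, 1)         # labels in {-1, +1}

m, N = len(y), 10
D = np.full(m, 1 / m)                                 # step 1: equal weights 1/m
stumps, alphas = [], []

for k in range(N):
    stump = DecisionTreeClassifier(max_depth=1, random_state=k)
    stump.fit(X, y, sample_weight=D)                  # step 2.2 (weighting instead of resampling)
    miss = stump.predict(X) != y                      # step 2.3
    eps = np.clip(np.sum(D[miss]), 1e-10, 1 - 1e-10)  # weighted error
    alpha = 0.5 * np.log((1 - eps) / eps)             # step 2.4
    D *= np.exp(np.where(miss, alpha, -alpha))        # step 2.5
    D /= D.sum()                                      # normalise to a proper distribution
    stumps.append(stump)
    alphas.append(alpha)

# step 3: weighted majority vote
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", np.mean(ensemble_pred == y))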

To illustrate the details of boosting in operation we provide an example on the simple Athlete Selection dataset which has just two features. The code for this example is available in the Ensembles-Boosting notebook. The two features are Speed and Agility and the class labels are Selected / Not Selected — see Figure 9(a).

Figure 9: A simple boosting example with two estimators. (a) The training data, with the decision surfaces for the two decision stumps shown in green. Two samples misclassified by Est 0 are highlighted in orange. (b) The sample weights for the two estimators. (c) The two decision stumps (estimators). Image by author.

The figure shows two estimators which are decision stumps. Est 0 partitions on Speed with a threshold of 5. This misclassifies two of the 13 training examples, x15 and x11; these are highlighted in orange in Figure 9(a). In Figure 9(b) we see the impact this has on the weights for Est 1. The weights for the misclassified examples are significantly higher than those of the correctly classified examples. If the ensemble is set to have just two members then the two decision stumps in Figure 9(c) would be the estimators and their weights would be determined using the equation in step 2.4.

Gradient Boosting

In recent years gradient boosting has emerged as perhaps the most powerful prediction algorithm in ML [16]. The idea with gradient boosting is to use gradient descent to optimise the parameters of new estimators as they are added to the ensemble. We can describe boosting in very general terms using the following equation:

ŷᵢ = Σₖ αₖ hₖ(xᵢ; pₖ)

where ŷᵢ is the estimate for the true outcome yᵢ and αₖ and pₖ are the weight and parameters of estimator hₖ. This is valid for regression where yᵢ is a numeric value. It is also valid in a binary classification scenario where the class labels are [-1,+1] if we take the sign of the summation as the predicted label. In these terms, boosting is an additive model where new estimators are trained to compensate for problems with the earlier estimators.

When adding a new estimator the objective is to eliminate the difference (error) between the true value yᵢ and the existing ensemble estimate Fₖ(xᵢ).

So the objective in adding a new estimator is to fit it to this ‘residual’ yᵢ − Fₖ(xᵢ). For instance, when dealing with a squared error (RMSE) objective in regression, the loss for a single example is:

L(yᵢ, Fₖ(xᵢ)) = (yᵢ − Fₖ(xᵢ))²

On the training data, the overall loss is:

J = Σᵢ L(yᵢ, Fₖ(xᵢ))

If this loss function is differentiable, as it is for squared error, then we can use gradient descent to update the αₖ and pₖ parameters. The details of this gradient descent training depend on the loss function and the nature of the base estimator (e.g. logistic regression, decision tree, decision stump) [7, 18].
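To make the additive, residual-fitting idea concrete, the toy regression sketch below fits each new tree to the residuals of the current ensemble under a squared-error loss; the learning rate, tree depth and synthetic data are illustrative choices, not the scikit-learn defaults:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

F = np.full_like(y, y.mean())                    # initial constant model F_0
trees, lr = [], 0.1

for k in range(100):
    residuals = y - F                            # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(tree)
    F += lr * tree.predict(X)                    # additive update: F_{k+1} = F_k + lr * h_k

print("training MSE:", np.mean((y - F) ** 2))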

A gradient boosting implementation is provided as part of the standard scikit-learn distribution. Here we provide an informal assessment of the performance of that implementation against the main ensemble methods implemented in scikit-learn. We compare against bagging, random subspace, random forest and standard AdaBoost. In this case we do not use the wine dataset because with accuracies around 98% there is not much ‘headroom’ for improvement. Instead we use a hotel review dataset where the objective is to predict if users will consider reviews to be useful. The dataset and the code for the evaluation are available on GitHub.

Figure 10: A comparison of five ensemble methods on the hotel review dataset. The base estimators for all methods are decision trees. On this single dataset AdaBoost does best with the random subspace ensemble performing worst. Graph by author.

The results are shown in Figure 10. All ensembles are scored using repeated 10-fold cross validation (10 repetitions). All ensembles have 100 estimators and default parameters are used, i.e. there is no attempt at parameter tuning. There is not too much to choose between the models. Random subspace does worst at 70%. AdaBoost does best, with gradient boosting and random forest coming close behind within 1%. There is considerable scope for parameter tuning for all these models so the overall ranking might shift if the models were tuned.

Heterogeneous Ensembles

When students of ML are introduced to the ensemble idea for the first time it is often assumed that ensembles would be a heterogeneous collection of the basic classifier models, e.g. Naive Bayes, k-NN, decision trees, etc. Given that the ensemble members need to be diverse, what better way to achieve diversity than to use different model types? Indeed, if there is variety in the data, some models may succeed on some data samples where the other models completely fail. While there has been some research on heterogeneous ensembles it is by no means a mainstream topic. The best successes for heterogeneous ensembles are in specialised areas such as malware detection [17] or ML on data streams [24].
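A heterogeneous ensemble of this kind is straightforward to assemble with scikit-learn's VotingClassifier; the particular mix of models and the wine data below are illustrative rather than the seven-member ensemble used in our evaluation:

from sklearn.datasets import load_wine
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

hetero = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=5000)),
    ("nb", GaussianNB()),
    ("knn", KNeighborsClassifier()),
    ("tree", DecisionTreeClassifier()),
], voting="hard")                      # simple majority vote over the four models

print(cross_val_score(hetero, X, y, cv=10).mean())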

One explanation for the lack of prominence of heterogeneous ensembles is that there are easier ways to achieve diversity. We provide a simple demonstration of that here. In Figure 11 we compare a heterogeneous ensemble of seven classifiers with a bagging ensemble of decision trees of the same size on the hotel reviews dataset. Because our objective is to show the role of diversity/disagreement we present results from a single hold-out evaluation. The performance of the seven estimator types is shown in Figure 11(a). The most accurate are logistic regression and the support vector classifier (SVC) with the least accurate being the decision tree. However, the heatmap in Figure 11(c) shows that the decision tree has the best pairwise disagreement so it may still be quite useful. The heterogeneous ensemble is moderately effective because the ensemble accuracy (red bar) is better than the average of the component estimators. However, the ensemble is not better than logistic regression or SVC so a model-selection strategy rather than an ensemble might have been more effective.

Figure 11: A comparison of a heterogeneous ensemble and a bagging ensemble of the same size. The bagging ensemble achieves higher accuracy on the test set, probably because there is more diversity in the estimators. In the colourmaps brighter is better. Graphs by author.

By contrast the bagging ensemble (Figure 11(b) and (d)) is very effective, with the ensemble outperforming the component estimators. The two colour maps showing the pairwise disagreement between ensemble members provide some insight into what is going on. The lighter colours for the bagging ensemble indicate better diversity among the ensemble members. In the heterogeneous ensemble, logistic regression, SVC and quadratic discriminant analysis (QDA) are all good classifiers but they prove to be quite similar. Furthermore, with bagging it is straightforward to add new ensemble members. With the heterogeneous strategy we have ‘run out of road’: we need to come up with something new if we want to add new estimators.

One strategy to allow for more ensemble members in a heterogeneous ensemble would be to relax the criterion for heterogeneity. Variety can be achieved by using different variants on a model, for example neural networks with different numbers of units in the hidden layer [9]. By varying models using different sets of hyper-parameters the number of estimators in heterogeneous ensembles can be considerably increased. This idea is explored further in the next section in the context of stacking.

So while heterogeneous ensembles may seem the obvious way to achieve diversity, the variety of models does introduce added software engineering complexity and there are easier ways to achieve the same levels of diversity with a single model type. We will see later that heterogeneous ensembles may have a role in the AutoML movement if the objective is to automate the overall ML pipeline without investing a lot of effort in model selection.

Stacking

In moving from bagging through boosting to gradient boosting, there is an increased focus on the aggregation phase (Figure 1) of the ensemble. Stacking brings this to what might be considered a logical conclusion by treating this step itself as a supervised learning task, i.e. a final estimator is trained to optimize the aggregation process (see Figure 12). While this might seem an obvious strategy there are a few important implementation issues to be considered.

Figure 12: A stacking ensemble in classification mode. The sample is passed to the base estimators and the resulting predictions are passed to the final estimator. The original sample may or may not be passed through to the final estimator. Image by author.

The management of the training data requires careful consideration. It is important that the data used to train the final estimator (the one that performs the aggregation) is not used to train the base estimators.

  • Should the final estimator use the output of the base estimators only or should it have access to the original input features as well? This is normally called pass through as shown in Figure 12.
  • If the base estimator can produce a probability (e.g. Naive Bayes) as an alternative to a crisp class label should this be used as the input to the final estimator?
  • Why stop at one level of stacking? Is there any merit in adding further layers?

We assess the merits of stacking in the evaluation presented in Figure 13. The code for this evaluation is available in the Ensembles-Stacking notebook. This evaluation uses 10-fold cross validation repeated 20 times. The two green bars show the performance of ensembles based on the seven heterogeneous ensemble members introduced earlier. With stacking the aggregation is done using logistic regression as the final estimator and there is a slight improvement in accuracy. These results are based on no pass through and crisp class labels passed to the final estimator. When we allow pass through and use ‘probabilistic’ outputs there is no improvement.
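scikit-learn's StackingClassifier handles the data management issue internally by using cross-validated (out-of-fold) predictions to train the final estimator. The sketch below mirrors the setup just described (logistic regression as the final estimator, crisp class labels, no pass through), though with a smaller, illustrative set of base estimators and the wine data standing in for our evaluation dataset:

from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

stack = StackingClassifier(
    estimators=[("svc", SVC()), ("nb", GaussianNB()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=5000),
    stack_method="predict",   # pass crisp class labels rather than probabilities
    passthrough=False,        # final estimator sees only the base estimators' outputs
    cv=5,                     # out-of-fold predictions train the final estimator
)

print(cross_val_score(stack, X, y, cv=10).mean())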

Figure 13: A comparison of the accuracy of a basic heterogeneous ensemble and the stacking equivalent. The performance of stacking ensembles based on SVC variants is also shown. Graph by author.

So stacking does improve the performance of a heterogeneous ensemble, but the limited number of different model types available constrains the size of the ensemble. We have already mentioned that using different hyper-parameters within a single model type might overcome this problem. The red bars on the right in Figure 13 show the potential for this strategy. The single model considered is an SVC and variety is introduced to the ensemble members by randomly sampling the hyper-parameters. The options considered are as follows:

  • Kernel: one of rbf, linear or poly.
  • C: one of [0.05, 0.1, 0.2].
  • Gamma: one of [0.1, 0.5].

Here C is the regularization parameter for the SVC and Gamma is the kernel parameter when the rbf and poly kernels are chosen. The blue bar in the figure shows the performance of a single SVC model selected at random from those used in the stacking ensembles. Clearly selecting hyper-parameters at random can sometimes result in poor models. However, the stacked ensembles with 7 and 20 estimators do produce good performance. So this hyper-parameter strategy allows us to generate stacking ensembles with many base estimators. It is interesting to note that the stacked ensemble of size 20 is no better than the one of size 7. This may be due to a tendency to overfit in the final estimator.
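A sketch of this hyper-parameter sampling strategy is shown below. The sampling ranges match the options listed above; the wine data is used as a stand-in since the hotel review dataset lives in the accompanying repository:

import random
from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
random.seed(0)

# Build base estimators by sampling SVC hyper-parameters at random.
estimators = []
for i in range(7):
    svc = SVC(kernel=random.choice(["rbf", "linear", "poly"]),
              C=random.choice([0.05, 0.1, 0.2]),
              gamma=random.choice([0.1, 0.5]))
    estimators.append((f"svc_{i}", svc))

stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(max_iter=5000))

print(cross_val_score(stack, X, y, cv=10).mean())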

In setting up these stacking ensembles based on hyper-parameter selection a lot of manual tuning was required because some hyper-parameter combinations produce very poor estimators. So our experience is that stacking is harder to get right than bagging, boosting or random forest. It may be for this reason that stacking has received less attention than these other ensemble alternatives. Nevertheless stacking does seem to have a role in AutoML so this section concludes with a short introduction to that topic.

AutoML

AutoML refers to automated machine learning — a movement that aspires to help developers with limited ML experience to build effective ML models. The objective is to automate all aspects of the ML pipeline including data preparation, model selection and model tuning [22]. In this context researchers at Amazon present stacking and heterogeneous ensembles as a solution for automating model selection [5]. Their framework, called AutoGluon, seeks to avoid the tasks of model and hyper-parameter selection by using stacking to learn the best overall architecture. So while there may not be many ML practitioners explicitly working with stacking, it is entirely possible that it will be in widespread use behind the scenes in AutoML.

Recommendations and Conclusions

The objective of this article was to provide a practical tutorial on ensembles in ML. We have covered the main ensemble architectures and discussed how ensembles can be effective in improving on the accuracy of individual classifiers. In the Appendix we provide links to Python code that will allow experimentation with all the ensemble architectures covered here. Our main recommendations and observations can be summarised as follows:

  1. If the objective is to maximise accuracy, gradient boosting is recognised as perhaps the leading supervised learning method in ML [16].
  2. Random forests also achieve very good accuracy and have the added advantage of providing insight into the data through the feature importance mechanism.
  3. It is worth understanding how stacking works because it has an emerging role in AutoML.

Appendix: Python Code

The GitHub repository (https://github.com/PadraigC/EnsemblesTutorial) associated with this tutorial contains the following Python Notebooks:

  • Ensembles-Preliminaries: Code demonstrating the impact of ensemble size and diversity on accuracy.
  • Ensembles-Bagging: Code for bagging and random subspace ensembles.
  • Ensembles-RandomF: Using random forest to generate feature importance scores and OOB estimates of generalisation accuracy.
  • Ensembles-Boosting: A simple AdaBoost example to illustrate the internal workings.
  • Ensembles-GBoost: Gradient boosting compared with other ensemble methods.
  • Ensembles-Hetero: A heterogeneous ensemble with 7 estimators compared with a bagging ensemble.
  • Ensemble-Stacking: A comparison of a heterogeneous ensemble with some stacking alternatives.

References

  1. L. Breiman, Bagging predictors (1996), Machine learning, 24(2):123–140.
  2. L. Breiman, Random forests (2001), Machine learning, 45(1):5–32.
  3. Marquis de Condorcet, Essai sur l’application de l’analyse a la probabilite des decisions rendues a la pluralite des voix (1785), de l’Imprimerie Royale, Paris.
  4. P. Cunningham and J. Carney, Diversity versus quality in classification ensembles based on feature selection (2000), In European Conference on Machine Learning, pages 109–116. Springer.
  5. N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, A. Smola, Autogluon-tabular: Robust and accurate automl for structured data (2020), arXiv preprint arXiv:2003.06505.
  6. Y. Freund, R. Schapire, Experiments with a new boosting algorithm (1996), In ICML’96, pages 148–156.
  7. J. Friedman, Greedy function approximation: a gradient boosting machine (2001), Annals of statistics, pages 1189–1232.
  8. F. Galton, Vox populi (1907), Nature, 75(1949):450–451.
  9. L. Hansen and P. Salamon, Neural network ensembles (1990), IEEE transactions on pattern analysis and machine intelligence, 12(10):993–1001.
  10. T. Kam Ho, Nearest neighbors in random subspaces (1998), In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 640–648. Springer.
  11. T. Kam Ho, The random subspace method for constructing decision forests (1998), IEEE transactions on pattern analysis and machine intelligence, 20(8):832–844.
  12. A. Krogh and J. Vedelsby, Neural Network Ensembles, Cross Validation, and Active Learning (1995), Advances in neural information processing systems 7, 7:231.
  13. L. Kuncheva, J. Rodríguez, C. Plumpton, D. Linden, and S. Johnston, Random subspace ensembles for fMRI classification (2010), IEEE transactions on medical imaging, 29(2):531–542.
  14. L. Kuncheva and C. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy (2003), Machine learning, 51(2):181–207.
  15. K. Ladha, The Condorcet jury theorem, free speech, and correlated votes (1992), American Journal of Political Science, pages 617–634.
  16. A. Mangal, N. Kumar, Using big data to enhance the Bosch production line performance: A Kaggle challenge (2016), In 2016 IEEE International Conference on Big Data (Big Data), pages 2029–2035. IEEE.
  17. E. Menahem, A. Shabtai, L. Rokach, Y. Elovici, Improving malware detection by applying multi-inducer ensemble (2009), Computational Statistics & Data Analysis, 53(4):1483–1494.
  18. A. Natekin, A. Knoll, Gradient boosting machines, a tutorial (2013), Frontiers in Neurorobotics, 7:21.
  19. R. Schapire, The strength of weak learnability (1990), Machine Learning, 5(2):197–227.
  20. M. Skurichina, R. Duin, Bagging, boosting and the random subspace method for linear classifiers (2002), Pattern Analysis & Applications, 5(2):121–135.
  21. J. Surowiecki, The wisdom of crowds, (2005), Anchor.
  22. C. Thornton, F. Hutter, H. Hoos, K. Leyton-Brown, Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms, (2013), In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 847–855.
  23. A. Tsymbal, M. Pechenizkiy, P. Cunningham. Diversity in search strategies for ensemble feature selection (2005), Information Fusion, 6(1):83–98.
  24. J. van Rijn, G. Holmes, B. Pfahringer, J. Vanschoren, The on-line performance estimation framework: heterogeneous ensemble learning for data streams, (2018), Machine Learning, 107(1):149–176.
  25. D. Wolpert. Stacked generalization (1992), Neural networks, 5(2):241–259.
