
Ensemble Averaging – Improve machine learning performance by voting


Ensemble averaging utilizes multiple model predictions to develop robust performance

Photo by Parker Johnson on Unsplash

Many hands often make light work. Surprisingly, the same observation holds in the field of machine learning. Ensemble averaging is the technique of engineering multiple different models, such as a linear regression or gradient boosted trees, and allowing each of them to contribute an opinion towards the final prediction.

The technique can be easily applied following these three steps:

  • Develop multiple predictive models that are each capable of making their own predictions
  • Train each of the predictive models using the same set of training data
  • Predict the result with each and every model and average their values, as sketched below
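To make the three steps concrete, here is a minimal sketch (my own illustration, not the walkthrough that follows later in this article) that trains three Scikit Learn regressors on the same data and averages their predictions with NumPy; the dataset and choice of models are arbitrary.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# Step 1: develop several different predictive models
models = [LinearRegression(), DecisionTreeRegressor(), KNeighborsRegressor()]

# Step 2: train every model on the same training data
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in models:
    model.fit(X_train, y_train)

# Step 3: predict with every model and average their values
predictions = np.column_stack([model.predict(X_test) for model in models])
ensemble_prediction = predictions.mean(axis=1)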

This may seem almost too easy to implement. Will it really yield a better result than a single independent predictive model? In fact, an ensemble of models has frequently shown better results than any individual model. Why so?


Why does ensemble averaging work?

Bias and Variance

To understand why it works, we first look at the two most important properties of a machine learning model: bias and variance.

Code output plot by author

Bias is essentially the set of assumptions a model makes to map the given inputs to the outputs through the target function. A simple ordinary least squares (OLS) regression, for example, is a model with high bias. A model with high bias learns a relationship between input and output easily, but because it makes strong assumptions about the target function it has less flexibility. Models with high bias therefore often suffer from underfitting (image on the left).

The variance of a model, on the other hand, measures how much its performance changes when it is trained on different training data. A decision tree makes few assumptions about the target function; it is therefore highly flexible and usually has low bias. With that flexibility, however, it tends to learn the noise in the data it was trained on. Such models introduce the opposite problem, high variance, which often leads to overfitting (image on the right).
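As a quick illustration (my own sketch, not one of the article's experiments), we can compare a high-bias linear model with a low-bias, high-variance decision tree on a noisy, nonlinear synthetic dataset; the choice of make_friedman1 and the noise level are assumptions made just for this demo.

from sklearn.datasets import make_friedman1
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Nonlinear target with noise: a linear model underfits it, a deep tree overfits it
X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [('OLS (high bias)', LinearRegression()),
                    ('Decision Tree (high variance)', DecisionTreeRegressor(random_state=0))]:
    model.fit(X_train, y_train)
    train_rmse = mean_squared_error(y_train, model.predict(X_train), squared=False)
    test_rmse = mean_squared_error(y_test, model.predict(X_test), squared=False)
    # A near-zero training error paired with a much larger test error signals overfitting
    print('{}: train RMSE {:.2f}, test RMSE {:.2f}'.format(name, train_rmse, test_rmse))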

Photo by Jon Tyson on Unsplash

The bias-variance dilemma therefore exists in the world of machine learning because there is a conflict in trying to minimize bias and variance at the same time. An increasingly complex model often comes with low bias and high variance, while a relatively simple model often comes with high bias and low variance. The dilemma describes how difficult it is for a model to learn the optimal relationship between input and output while also generalizing beyond the original training set.


Overview of the Voting Machine

An ensemble of models works just like a voting machine. Each ensemble member contributes equally by casting a vote with its prediction. Ensembling the models therefore introduces diversity, and this diversity reduces variance and improves the ability to generalize beyond the training data.
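The variance-reduction effect can be seen in a toy NumPy simulation (my own, purely illustrative): if each model's error is independent noise around the truth, the averaged prediction has a much smaller error variance than any single model.

import numpy as np

rng = np.random.default_rng(0)
n_models, n_points = 8, 10_000
truth = rng.normal(size=n_points)

# Each "model" predicts the truth plus its own independent error
predictions = truth + rng.normal(scale=1.0, size=(n_models, n_points))

single_error_var = np.var(predictions - truth, axis=1).mean()
ensemble_error_var = np.var(predictions.mean(axis=0) - truth)

# With independent errors, averaging n models divides the error variance by roughly n
print('Average single-model error variance: {:.3f}'.format(single_error_var))
print('Averaged-ensemble error variance:    {:.3f}'.format(ensemble_error_var))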

Let’s pretend that the eight cups of coffee in the image below each represent a different predictive model. Each of them sits at a different point on the taste scale (some are sweeter while some are slightly more bitter). We cannot be sure that the coffee we serve a customer (predict a result) is suitable, as we do not know the customer beforehand. Our main objective here is to come up with a new type of coffee that tastes nice and fits the taste of most of the population.

Photo by Nathan Dumlao on Unsplash

What we can do here is mix the eight cups of coffee to come up with a new blend that keeps a touch of the caramel, chocolate and bitterness of every cup. In theory, this blend should be able to serve a wider range of customers than, for example, a pure espresso, because the bitterness of the espresso that may not fit some customers is averaged out in the blend.

Definition of diversity?

There is, however, no precise definition of diversity. Even so, it has been shown empirically that an ensemble performs better than its individual models most of the time. As data science is a field that relies on empirical evidence, working most of the time is more convincing than you might think. It means the technique can ease the bias-variance dilemma, reducing variance by using an ensemble of models without the expense of increased bias.

Looking at the problem from a financial perspective, the concept works much like portfolio diversification. If the individual stocks in the portfolio are independent or only weakly correlated, the unsystematic risk (the variance or limitation of each model) can be mitigated through diversification while the returns of each individual component are not altered.


Python demonstration

In this article, I will demonstrate an implementation of ensemble averaging on both a classification problem and a regression problem.

Photo by Chris Ried on Unsplash

Classification

First, we import some necessary modules from Scikit Learn.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier 
from sklearn.neighbors import KNeighborsClassifier

We use make_classification() from Scikit Learn to simulate a binary classification problem with 50 features and 10,000 samples.

The simulated data is further split into train and test sets.

X, y = make_classification(n_samples=10000, n_features=50, n_redundant= 15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=43)

Now, we train several independent models on the classification problem. In this article, we will be using a Decision Tree Classifier, KNN and Naive Bayes.

models = [('Decision Tree', DecisionTreeClassifier()),
          ('KNN', KNeighborsClassifier()),
          ('Naive Bayes', GaussianNB())]

# Fit each base model on the same training data and report its test accuracy
for name, model in models:
    model.fit(X_train, y_train)

    prediction = model.predict(X_test)
    score = accuracy_score(y_test, prediction)
    print('{} Model Accuracy: {}'.format(name, score))
Code output by author

After fitting the models, we can observe that each model already does pretty well on its own. Let’s see the result after we implement the ensembling technique using VotingClassifier from Scikit Learn.

# voting='hard' (the default) combines the models by majority vote on the predicted labels
ensemble = VotingClassifier(estimators=models)
ensemble.fit(X_train, y_train)
prediction = ensemble.predict(X_test)
score = accuracy_score(y_test, prediction)
print('Ensemble Model Accuracy: {}'.format(score))
Code output by author

VotingClassifier ensembles the three models by selecting the predicted class label using majority-rule voting. For example, if two of the models predict Class A for an instance while only one model predicts Class B, the final prediction will be Class A.

From this experiment, we can see a noticeable improvement in the predictive performance after we implement ensemble averaging.
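As a side note, VotingClassifier also supports voting='soft', which averages the predicted class probabilities instead of counting label votes. Since all three base models above expose predict_proba, trying it only takes a few extra lines (an optional extra that reuses models, X_train and X_test from the snippets above).

# Soft voting averages predict_proba outputs rather than counting hard label votes
soft_ensemble = VotingClassifier(estimators=models, voting='soft')
soft_ensemble.fit(X_train, y_train)
soft_score = accuracy_score(y_test, soft_ensemble.predict(X_test))
print('Soft Voting Ensemble Accuracy: {}'.format(soft_score))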


Regression

Photo by Markus Winkler on Unsplash

The technique can also be applied to regression problems. Similar to how we introduced it in the classification problem, we first import some necessary modules from Scikit Learn.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import VotingRegressor

We use make_regression() this time to simulate a regression problem with 10 features and 10000 samples.

The simulated data is again further split into train and test sets.

X, y = make_regression(n_samples=10000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=43)

We train a Support Vector Regressor, a Decision Tree Regressor and a KNN Regressor on the regression problem.

models = [('Support Vector', SVR()),
         ('Decision Tree', DecisionTreeRegressor()),
         ('KNN', KNeighborsRegressor())]
scores = []
for name, model in models:
    model.fit(X_train, y_train)

    prediction = model.predict(X_test)
    # Keep each model's RMSE so we can rank the models later
    score = mean_squared_error(y_test, prediction, squared = False)
    scores.append(score)
    print('{} Model RMSE: {}'.format(name,score))
Code output by author

As usual, we now implement the ensembling technique using VotingRegressor from Scikit Learn.

ensemble = VotingRegressor(estimators=models)
ensemble.fit(X_train, y_train)
prediction = ensemble.predict(X_test)
score = mean_squared_error(y_test, prediction, squared = False)
print('Ensemble Model RMSE: {}'.format(score))
Code output by author

VotingRegressor works quite similarly to VotingClassifier, except that it implements the ensembling simply by averaging the individual predictions to form a final prediction.
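As a quick sanity check (my own addition, reusing ensemble, prediction and X_test from the snippet above), we can confirm that with equal weights the ensemble output is simply the arithmetic mean of the fitted base estimators' predictions.

# VotingRegressor keeps its fitted clones in estimators_; with equal weights the
# ensemble prediction equals the plain mean of their individual predictions
base_predictions = np.column_stack([est.predict(X_test) for est in ensemble.estimators_])
print(np.allclose(prediction, base_predictions.mean(axis=1)))  # expected: True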


Extra handy tips

In both VotingRegressor and VotingClassifier, we can utilize the parameter "weights" to change the ensemble outcome. The "weights" parameter determines how much say each model has in the final outcome. For example, in this regression problem, we can see that the KNN model is relatively stronger at predicting the outcome than its peers. In this case, we can assign a larger weight to its contribution to the final prediction. To do this, we simply pass a list of weights to the VotingRegressor constructor.
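For instance (an illustrative, hand-picked choice of weights, not taken from the article), giving KNN twice the say of the other two regressors looks like this; internally, the final prediction then becomes a weighted average of the base predictions.

# Hand-picked weights: SVR and the decision tree get weight 1, KNN gets weight 2
weighted_ensemble = VotingRegressor(estimators=models, weights=[1, 1, 2])
weighted_ensemble.fit(X_train, y_train)
weighted_rmse = mean_squared_error(y_test, weighted_ensemble.predict(X_test), squared=False)
print('Hand-weighted Ensemble RMSE: {}'.format(weighted_rmse))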

Photo by Graphic Node on Unsplash

In fact, we can assign each model a weight according to its rank relative to the other models. By doing this, we give the models that predict better a bigger say in the vote on the final outcome.

# Double argsort turns sorted positions into per-model ranks, so the model with the
# lowest RMSE ends up with the highest weight
scores = [-x for x in scores]
ranking = 1 + np.argsort(np.argsort(scores))
print(ranking)
Code output by author

Using the RMSE of each model, we use np.argsort to convert the scores into ranks. As np.argsort sorts in ascending order, we negate the scores before sorting, and applying argsort a second time turns the sorted positions into each model's rank. This way, a model with a lower RMSE is given a higher weight; the KNN model, for example, is given a weight of 3 out of a total of 6.
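Here is a tiny worked example with made-up RMSE values (not the actual numbers from the run above) showing how the double argsort produces the ranks.

# Toy example with invented RMSE values for three models
rmse = np.array([12.0, 30.0, 5.0])
ranks = 1 + np.argsort(np.argsort(-rmse))  # lower RMSE -> higher rank
print(ranks)  # [2 1 3]: the third model has the lowest RMSE and gets the largest weight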

Let’s train the ensemble again and see if our results turn out better.

ensemble = VotingRegressor(estimators=models, weights = ranking)
ensemble.fit(X_train, y_train)
prediction = ensemble.predict(X_test)
score = mean_squared_error(y_test, prediction, squared = False)
print('Ensemble Model RMSE: {}'.format(score))
Code output by author

Unfortunately, our ensemble model this time has a higher RMSE compared to plain averaging. You may then ask: how do we know which weights to set so that we get the perfect weight for each model?

Luckily, we can use a Stacked Regressor, which learns the best combination of the base predictors' outputs to reach the final output with improved prediction accuracy. I will be elaborating more on the Stacked Regressor in my next article.
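For a taste of what that looks like, here is a minimal sketch using Scikit Learn's StackingRegressor, reusing the models, X_train and X_test from above; the choice of RidgeCV as the meta-learner is just one reasonable option, not necessarily what the follow-up article will use.

from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV

# The final_estimator (meta-learner) learns how to combine the base models' predictions
stack = StackingRegressor(estimators=models, final_estimator=RidgeCV())
stack.fit(X_train, y_train)
stack_rmse = mean_squared_error(y_test, stack.predict(X_test), squared=False)
print('Stacking Regressor RMSE: {}'.format(stack_rmse))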


Take-away

Ensemble averaging is a great technique that can be used to improve model performance by introducing diversity to reduce variance. Note that although this technique frequently works, it is not a magic pill: sometimes the ensemble of models performs worse than an individual model.

However, given its ease of implementation, it is definitely worth including in your machine learning pipeline.

Once again, thank you for taking time out of your busy schedule to read this piece of writing.

Photo by Kelly Sikkema on Unsplash
