
Boosting Algorithms in Machine Learning, Part I: AdaBoost

Understanding the logic behind AdaBoost and implementing it using Python

Photo by Jeffrey Brandjes on Unsplash

Introduction

In machine learning, boosting is a kind of ensemble learning method that combines several weak learners into a single strong learner. The idea is to train the weak learners sequentially, each trying to do its best not to repeat the mistakes its predecessor made, eventually building a strong ensemble.

In this article, we will learn about one of the popular boosting techniques known as AdaBoost and see how elegantly it allows each weak learner to pass on its mistakes to the next weak learner, eventually improving the quality of the predictions.

We will cover the following topics in this article:

  • Understand what an ensemble is and its different types. That’s where we will define boosting.
  • Understand what is meant by a weak learner, using an example, since boosting involves fitting multiple weak learners.
  • Understand why there was a need for boosting algorithms.
  • Get introduced to the AdaBoost algorithm.
  • Implement the AdaBoost algorithm for binary classification in Python and look at the individual weak learners as well as the final predictions, with some interesting visualizations.
  • Finally, wrap up.

In the opening sentence, I mentioned that boosting is a kind of ensemble learning that combines several weak learners into a single strong learner. Before digging in, we must first be familiar with these terms. So, let’s get straight into it!

What is an Ensemble?

An ensemble is a collection of multiple base models that can be aggregated together in such a way that the quality of the final predictions is much better than what an individual base model could have produced by itself.

In terms of the diversity of base models, ensembles can be of two types:

  • Homogeneous ensemble, in which the base learners are the same. For example, a random forest is an ensemble of multiple decision trees.
  • Heterogeneous ensemble, in which the base learners are different from each other. For example, we can train a support vector machine (SVM), a decision tree, and a neural network (NN) on the same classification dataset and then aggregate their predictions to get the final result.

In terms of how the base learners are trained and aggregated together, ensemble learning can be of three types:

  • Bagging (short for bootstrap aggregation), in which homogeneous base learners are trained in parallel on random sub-samples of the training data, and their predictions are combined using a deterministic aggregation method, e.g., majority voting for classification and numerical averaging for regression.
  • Boosting (originally called hypothesis boosting), in which homogeneous base learners are trained sequentially on the same training data, in such a way that each base learner tries to correct the mistakes of the previous base learner. It also follows a deterministic aggregation approach except that the individual predictions are weighted according to the accuracy of the base learners.
  • Stacking (short for stacked generalization), in which heterogeneous base learners are trained (often in parallel) on the same training data, but their predictions are not aggregated using a deterministic approach. Instead, a separate model, known as a blender or meta learner, is used for aggregation; it takes the predictions from the base learners as inputs and generates the final prediction as its output.
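
To make the three flavors concrete, here is a minimal scikit-learn sketch of each. The toy dataset, base learners, and hyperparameters are arbitrary placeholders chosen for illustration, and the sketch assumes scikit-learn 1.2 or later, where the base learner argument is named estimator:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# A toy dataset, just to have something to fit on
X_toy, y_toy = make_classification(n_samples=500, random_state=0)

# Bagging: homogeneous trees trained in parallel on bootstrap samples
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50)

# Boosting: homogeneous stumps trained sequentially, each correcting its predecessor
boosting = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50)

# Stacking: heterogeneous base learners whose predictions feed a meta learner (blender)
stacking = StackingClassifier(
    estimators=[("svm", SVC()), ("tree", DecisionTreeClassifier()),
                ("nn", MLPClassifier(max_iter=1000))],
    final_estimator=LogisticRegression(),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, model.fit(X_toy, y_toy).score(X_toy, y_toy))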

What is a Weak Learner?

A weak learner is any model or algorithm that performs just slightly better than random guessing.

For example, let’s say Santiago wants to try his luck betting on horse racing, and for this, he needs to maximize his winnings by using some logic (or a rule of thumb). Sophisticated ML people call it a model. Anyway, assuming that he knows very little about horse racing, he can either bet on a random horse or use some common sense.

The following figure shows some of the options for him to consider.

Example of weak learners (Image by author)

In this example, the latter two options are slightly better than random guessing but are overly simplistic, and not very helpful on their own. That’s why these are called weak learners.
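
In machine learning terms, a decision stump (a decision tree that makes a single split) is the classic example of a weak learner. The quick sketch below, on an arbitrary synthetic dataset of my choosing, only illustrates the idea of "slightly better than random guessing"; the exact numbers will differ on other data:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A synthetic binary classification problem (illustrative only)
X_toy, y_toy = make_classification(n_samples=1000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# A stump: a decision tree allowed to make only one split
stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)

# Usually somewhat better than the 0.5 accuracy of random guessing,
# but far from a strong model on its own
print(f"Stump accuracy: {stump.score(X_te, y_te):.2f}")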

The question that arises from here is – "Can a weak learning algorithm that performs just slightly better than random guessing be boosted into an arbitrarily accurate strong learning algorithm?" [2]

Kearns and Valiant were the first to pose this question [2], and researchers at the time faced the following challenges:

  • How to present the training data to the weak learners such that they generate the most useful rules of thumb (i.e., models).
  • How to combine multiple rules of thumb (i.e., models) from different weak learners into a single, highly accurate prediction rule.

Since there can be multiple weak learners, one obvious choice is to use an ensemble method. But which one? A good answer would be: most likely the one that addresses the challenges we just discussed.

Boosting solves the above-mentioned challenges by combining low-accuracy rules of thumb to generate a highly accurate prediction rule. It is achieved by training the weak learners sequentially such that each weak learner tries to improve upon the mistakes of its predecessor.

Continuous research in this direction eventually led to the introduction of the AdaBoost algorithm by Freund and Schapire in 1995 [2], which is what we are going to discuss next.

AdaBoost

AdaBoost, short for Adaptive Boosting, is one of the earliest and most popular boosting methods. The main idea is to pay a bit more attention to the training instances that the predecessor underfitted [3]. This results in new predictors focusing more and more on the hard cases.

The first question that popped up in my head after reading this definition was: how does a weak learner know which training instances it should pay more attention to? And, how to even quantify attention so that it can be assigned to the training instances? 🤔

Well, after reading further, it occurred to me that the idea is simple:

Which training instances get more attention?

  • Pay more attention to the training instances that were misclassified by the previous weak learner in the case of classification. For regression, more attention is paid to the training instances with large errors.

How to quantify attention?

  • Assign weights to the training instances to capture attention. The higher the weight, the more attention the training instance gets. These weights are adjusted after every iteration, i.e., every time a new predictor is trained.

How to aggregate the final predictions?

  • Once all the predictors are trained, the ensemble makes predictions very much like bagging, except that the predictions from individual predictors are weighted according to their accuracy and then aggregated together.

The core principle of AdaBoost is to fit a sequence of weak learners (e.g. Decision Trees) on repeatedly re-sampled versions of the data. Each sample carries a weight that is adjusted after each training step, such that misclassified samples will be assigned higher weights. – Source [1]

The Algorithm

If the training data has m samples, the AdaBoost algorithm (for classification) can be summarized as follows. By default, the predictor in AdaBoost is a Decision Tree with max_depth = 1 (also known as a stump).

  1. Initialize the weights. Each training sample gets the same weight to start with. Let the index of all training samples be denoted by i.
Initialization of sample weights (Image by author)
  2. Fit a decision tree stump on the currently weighted training data. Let the index of all decision tree stumps (predictors) be denoted by j.
  3. Calculate the weighted error rate of the current decision tree stump. This is the ratio of the sum of weights (when there is a misclassification) to the total sum of weights.
Calculation of error for the predictor (Image by author)
  4. Calculate the weight of the current decision tree stump.
Calculation of weight for the predictor (Image by author)

The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. The weight can also be negative if it is wrong most of the time i.e., less accurate than random guessing. This weight will be used in the weight update rule as well as in the final aggregation rule (we’ll see shortly).

Learning rate η is an important hyperparameter that can affect the final convergence. It controls the degree to which the weights of misclassified samples are increased (or boosted). Its default value is 1. The original AdaBoost algorithm did not use a learning rate as a hyperparameter; it was added later as an enhancement.

  5. Update the weights of all the training samples. The following weight update rule is then used to boost the weights of misclassified samples for each training sample i = 1, 2, …, m.
Update rule for sample weights (Image by author)
  6. Normalize the weights as follows.
Normalize the sample weights (Image by author)

Repeat steps 2–6 for all the decision tree stumps. These steps are repeated n_estimators times, which is another hyperparameter we can specify.

  7. Get the final prediction. Once all the decision tree predictors are trained individually, we can get the final prediction as follows.
Final prediction for a sample x (Image by author)

Here N is the total number of decision trees we trained. The formula might look confusing at first, so let’s break it down:

  • For each class label k, calculate the sum of the weights α_j of all the decision tree predictors that predict x as belonging to class k. This is the summation part of the formula.
  • Find the class label k that has the maximum sum of weights. This is the argmax part of the formula.
  • Assign the class label k with the highest sum as the predicted class of sample x.
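
For reference, here is a compact LaTeX sketch of the standard formulas behind the steps above, following the notation of [3]: w^(i) is the weight of training sample i, r_j and α_j are the error rate and weight of stump j, η is the learning rate, and N is the number of stumps (step 2, fitting the stump, has no formula of its own).

Step 1 (initialization): w^{(i)} = \frac{1}{m}, \quad i = 1, 2, \dots, m

Step 3 (weighted error rate of stump j): r_j = \frac{\sum_{i:\, \hat{y}_j^{(i)} \neq y^{(i)}} w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

Step 4 (weight of stump j): \alpha_j = \eta \, \log\!\left(\frac{1 - r_j}{r_j}\right)

Step 5 (weight update): w^{(i)} \leftarrow w^{(i)} \exp(\alpha_j) \text{ if } \hat{y}_j^{(i)} \neq y^{(i)}, \text{ otherwise } w^{(i)} \text{ is unchanged}

Step 6 (normalization): w^{(i)} \leftarrow \frac{w^{(i)}}{\sum_{i=1}^{m} w^{(i)}}

Step 7 (final prediction): \hat{y}(x) = \underset{k}{\operatorname{argmax}} \sum_{j=1,\, \hat{y}_j(x) = k}^{N} \alpha_j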

Hopefully, the underlying logic of the AdaBoost algorithm is clear now and we can move forward with the implementation.
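
Before handing things over to Scikit-Learn, here is a minimal from-scratch sketch of the loop described above, just to make the weight bookkeeping tangible. The dataset and all variable names are my own choices for illustration, and the sketch assumes integer class labels 0, 1, …, K-1:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A toy dataset with integer labels (the vote counting below assumes labels 0..K-1)
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

m, n_estimators, eta = len(X_demo), 50, 1.0
weights = np.full(m, 1 / m)    # step 1: every sample starts with weight 1/m
stumps, alphas = [], []

for _ in range(n_estimators):
    # step 2: fit a stump on the weighted training data
    stump = DecisionTreeClassifier(max_depth=1).fit(X_demo, y_demo, sample_weight=weights)
    pred = stump.predict(X_demo)

    # step 3: weighted error rate (clipped to avoid division by zero in step 4)
    err = np.clip(weights[pred != y_demo].sum() / weights.sum(), 1e-10, 1 - 1e-10)

    # step 4: weight (alpha) of this stump
    alpha = eta * np.log((1 - err) / err)

    # steps 5 and 6: boost the misclassified samples, then normalize
    weights[pred != y_demo] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# step 7: weighted vote across all stumps
def adaboost_predict(X_new, classes=np.unique(y_demo)):
    votes = np.zeros((len(X_new), len(classes)))
    for stump, alpha in zip(stumps, alphas):
        votes[np.arange(len(X_new)), stump.predict(X_new)] += alpha
    return classes[np.argmax(votes, axis=1)]

print("Training accuracy:", (adaboost_predict(X_demo) == y_demo).mean())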

Implementing AdaBoost for Binary Classification

In this section, we will implement the AdaBoost algorithm for classification on a linearly inseparable dataset with two features and a binary target variable. Let’s set up our Python notebook and import all necessary libraries first.

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_gaussian_quantiles
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Step 1: Generating the dataset

We will use make_gaussian_quantiles to create our dataset as follows.

X1, y1 = make_gaussian_quantiles(
    cov=2.0, n_samples=200, n_features=2, n_classes=2, random_state=1
)
X2, y2 = make_gaussian_quantiles(
    mean=(3, 3), cov=1.5, n_samples=300, n_features=2, n_classes=2, random_state=1
)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, -y2 + 1))

Following is what our dataset looks like:

Scatter plot of data (Image by author)
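
If you would like to reproduce a similar scatter plot yourself, a minimal sketch is shown below (the styling choices are mine and not necessarily those used for the figure above):

plt.figure(figsize=(6, 5))
plt.scatter(X[y == 0, 0], X[y == 0, 1], c="tab:blue", label="Class 0", alpha=0.7)
plt.scatter(X[y == 1, 0], X[y == 1, 1], c="tab:orange", label="Class 1", alpha=0.7)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.title("Linearly inseparable dataset")
plt.show()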

Note: The link to the full code along with visualizations will be provided at the end of this article.

Step 2: Train-test split

We will use Scikit-Learn’s train_test_split to divide our dataset into – a training set (80%) and a test set (20%). Since we have 500 data points in total, the training set will have 400 data samples and the test set will have 100 data samples.

# 80-20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=42)

print(f"Size of training set is {X_train.shape[0]}")
print(f"Size of test set is {X_test.shape[0]}")

Cell Output:

Size of training set is 400
Size of test set is 100

The AdaBoostClassifier will be fit against the training set, and we will use the test set to see how well the final model performs on unseen examples.

Step 3: Fit the AdaBoostClassifier from Scikit-Learn

We can specify the base estimator, i.e., the weak learner, to be a decision tree classifier with max_depth = 1, as discussed above. This estimator is then passed to the AdaBoostClassifier along with n_estimators=100, which means the algorithm will fit 100 decision tree stumps on the given data.

weak_learner = DecisionTreeClassifier(max_depth=1)
clf = AdaBoostClassifier(estimator=weak_learner, 
                         n_estimators=100, 
                         random_state=42)
clf.fit(X_train, y_train)

Cell Output:

The original AdaBoost algorithm for classification was designed for binary classification; however, several enhancements paved the way for multiclass classification. Scikit-Learn implements a multiclass version of AdaBoost called SAMME (Stagewise Additive Modeling using a MultiClass Exponential loss function), so we can easily extend this example to multiclass data as well.
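
As a quick illustration of that point, the same workflow runs unchanged on a toy three-class problem (the dataset below is an arbitrary example of mine, not part of the original notebook):

from sklearn.datasets import make_classification

# A toy 3-class dataset, only to show that the same API handles multiclass data
X_mc, y_mc = make_classification(n_samples=600, n_classes=3, n_informative=6,
                                 random_state=42)

clf_mc = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                            n_estimators=100, random_state=42)
clf_mc.fit(X_mc, y_mc)
print(f"Training accuracy on 3 classes: {clf_mc.score(X_mc, y_mc):.2f}")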

Step 4: Visualizing the results

Following are the plots of the decision boundaries as well as the stumps of the first five estimators, out of the 100 estimators trained by the AdaBoostClassifier.

Decision boundaries and stumps of the first five estimators (Images by author)
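
If you prefer not to open the notebook, here is a rough sketch of how such plots can be produced with DecisionBoundaryDisplay (available in scikit-learn 1.1+) and the plot_tree helper we imported earlier; clf.estimators_ holds the fitted stumps, and the styling is my own:

from sklearn.inspection import DecisionBoundaryDisplay

# Decision boundary and tree structure of the first five stumps
for idx, stump in enumerate(clf.estimators_[:5]):
    fig, (ax_boundary, ax_tree) = plt.subplots(1, 2, figsize=(10, 4))
    DecisionBoundaryDisplay.from_estimator(stump, X_train, response_method="predict",
                                           alpha=0.4, ax=ax_boundary)
    ax_boundary.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=15)
    ax_boundary.set_title(f"Decision boundary of stump {idx + 1}")
    plot_tree(stump, ax=ax_tree, filled=True)
    ax_tree.set_title(f"Stump {idx + 1}")
    plt.show()

# Decision boundary of the full ensemble
disp = DecisionBoundaryDisplay.from_estimator(clf, X_train, response_method="predict",
                                              alpha=0.4)
disp.ax_.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=15)
disp.ax_.set_title("Decision boundary of the final AdaBoost ensemble")
plt.show()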

Although the individual decision trees look very basic and are not very accurate on their own, the final ensemble after aggregating the predictions of all the 100 stumps is quite impressive!

The plot of the decision boundary of the final ensemble is shown below.

Decision Boundary of the Final Ensemble and Training Set Predictions (Image by author)

Step 5: Performance Evaluation

We can use accuracy_score to evaluate the performance of our final model as follows:

# Calculate the accuracy score
y_pred = clf.predict(X_train)
accuracy = accuracy_score(y_train, y_pred)
print(f"Accuracy score on the training set: {accuracy}")

y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy score on the test set: {accuracy}")

Cell Output:

The accuracy score is a common measure for evaluating classification models; it computes the ratio of correct predictions to the total number of predictions. In our case, the accuracy score on the test set is 0.83, which means 83% of the test samples were correctly classified by our model.

The predictions for the test set are shown below (not bad 😸 ):

Test Set Predictions (Image by author)

Link to Code

The full notebook can be found here.

Why choose AdaBoost?

AdaBoost has many nice properties that make it a good candidate for both classification and regression (yes, there are versions for regression too, which we did not cover here, but they are totally possible!).
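
For regression, Scikit-Learn provides AdaBoostRegressor; a minimal sketch on a toy regression problem of my own (not from the article's notebook) looks like this:

from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

# A toy regression problem, only to show the analogous API
X_reg, y_reg = make_regression(n_samples=500, n_features=2, noise=10, random_state=42)

reg = AdaBoostRegressor(estimator=DecisionTreeRegressor(max_depth=3),
                        n_estimators=100, random_state=42)
reg.fit(X_reg, y_reg)
print(f"R^2 on the training data: {reg.score(X_reg, y_reg):.2f}")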

Apart from increasing the accuracy of weak learners, some other benefits of AdaBoost are:

  • There is less need for hyperparameter tuning (there aren’t many to worry much about).
  • AdaBoost is also capable of identifying outliers. Samples that are either mislabeled in the training data or inherently ambiguous will be assigned higher weights, and such samples can be flagged as potential outliers.

Disadvantages of AdaBoost

  • One major drawback of AdaBoost is its sensitivity to noise and outliers. As discussed earlier, outlier samples will be assigned higher weights, which means the algorithm will try very hard to get their predictions correct. This, in turn, can lead to overfitting, which is not desirable.
  • AdaBoost can be computationally intensive because of the sequential nature of boosting. Each predictor can only be trained after the previous one has already been trained and evaluated, which limits the scope for parallelization and scalability.

Conclusion

In this article, we covered the basics of ensemble learning and boosting, with a focus on the AdaBoost algorithm for classification. We covered the algorithm in theory to get an understanding of the underlying logic and then proceeded with the implementation in Python using Scikit-Learn. We also looked at some interesting visualizations to take a peek at the intermediate weak learners and their predictions.

Following are some concluding remarks:

  • Boosting is a type of ensemble learning method in which the weak learners are trained sequentially, each trying to correct the mistakes of its predecessor.
  • AdaBoost is one of the earliest and most popular boosting algorithms. It stands for Adaptive Boosting.
  • The core principle of AdaBoost is to fit a sequence of weak learners (e.g. Decision Trees) on repeatedly re-sampled versions of the data. Each sample carries a weight that is adjusted after each training step, such that misclassified samples will be assigned higher weights. In this way, the subsequent learners will pay more attention to hard cases.
  • By default, AdaBoost uses a Decision Tree with max_depth = 1, also known as a stump, as the individual weak learner. A stump is a decision tree with a root node and two leaf nodes. We are free to choose weak learners other than stumps.
  • The original AdaBoost algorithm works only for binary classification; however, there have been several enhancements to date. Scikit-Learn uses a multiclass implementation of AdaBoost known as SAMME (Stagewise Additive Modeling using a MultiClass Exponential loss function).
  • In addition to classification, AdaBoost can also be used for regression. Scikit-Learn provides AdaBoostRegressor and AdaBoostClassifier for regression and classification, respectively.
  • AdaBoost has several advantages such as increased accuracy for weak learners, less need for hyperparameter tuning, etc.
  • AdaBoost also has some drawbacks such as sensitivity to outliers and the potential risk of overfitting when the weak learners are too complex.

Thank you for reading, I hope it was helpful 🙂

Open to any feedback or suggestions.


References

[1] Scikit-Learn documentation, Multi-class AdaBoosted Decision Trees: https://scikit-learn.org/stable/auto_examples/ensemble/plot_adaboost_multiclass.html#sphx-glr-auto-examples-ensemble-plot-adaboost-multiclass-py

[2] Freund, Y. and R.E. Schapire, 1999. A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence.

[3] Géron, A., 2019. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems (2nd ed.). O’Reilly

