Adaboost for Dummies: Breaking Down the Math (and its Equations) into Simple Terms

Haebichan Jung
Towards Data Science
7 min read · Apr 9, 2018


Learning machine learning is a long walk up the stairs. But we can shorten the path. (Photo by Gabriel Izgi on Unsplash)

Adaboost, short for Adaptive Boosting, is a machine learning approach that is conceptually easy to understand but less easy to grasp mathematically. Part of the reason is that its equations and formulas are rarely broken down into simple terms, with basic math used to demonstrate them. This essay intends to do just that with Adaboost, with newcomers to data science as the primary target audience.

There are many wonderful lectures, videos, and papers that effectively and neatly explain the concept behind Adaboost. Simply put, the idea is to set weights to both classifiers and data points (samples) in a way that forces classifiers to concentrate on observations that are difficult to correctly classify. This process is done sequentially in that the two weights are adjusted at each step as iterations of the algorithm proceed. This is why Adaboost is referred to as a sequential ensemble method — ensemble referring to a type of learning that combines several models to improve the final predictive performance.

The concept is easy to follow, but once we try to dig a little deeper to understand the math that supports it, we are confronted with many articles and lectures presenting a similar-looking wall of notation:

For those without academic background in math and/or the non-math types now studying data science following the recent machine learning boom, the notations alone can seem quite daunting. This article intends to clear away the confusion and the intimidation by breaking down the notations and the formula, and using simple math to explain how Adaboost works.

Basic Terminologies

First, let’s go over some basic terminologies.

Boosting: combining many weak (simple) learners to create a highly accurate prediction.

Weak learners: classifiers that produce predictions only slightly better than random guessing. Random guessing is equivalent to 50% accuracy, like flipping a coin. This will be familiar to those who are conversant with information theory, particularly the idea of Shannon’s entropy.

Hypothesis: our classifier, i.e., the function our machine learning algorithm builds to approximate the unknown target (true) function, which models the relationship between the input values x and the output values y.

Adaboost: The first practical boosting algorithm, invented by Freund and Schapire (1995). It is based on Vapnik and Chervonenkis’ idea that for a trained classifier to be effective and accurate in its predictions, it should meet these three conditions:

1) the classifier should be trained on “enough” training examples

2) it should provide a good fit to those examples by producing low training error

3) it should be simple (in that simpler models are better than overly complex ones)

Explaining Adaboost, One Step at a Time

Let us go over the iterative formula, breaking down every notation at each step on a granular level for easy digestion.

1) Given (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}

Useful Notations

∈: “element of”

{}: set

ex: if A = {1, 2, 3, 7}, then 2 ∈ A

(x_1, y_1): first training sample, (x_m,y_m) = m-th training sample

Now that we have all the notations down, we can read the first part of the formula as:

“Given a training set containing m samples, where each input x is an element of the total set X and each output y is an element of a set comprising only two values, -1 (the negative class) and +1 (the positive class)…”

2) Initialize: D1(i) = 1/m for i = 1, …,m.

Here, D = the weights of the samples and i = the i-th training sample. In other papers, D may be written as W. Thus the next statement reads:

“…initialize all sample weights to 1 divided by the number of training samples, m…”
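As a quick sketch in Python (using NumPy, which the rest of this walkthrough’s examples will also assume), this initialization is a single line:

```python
import numpy as np

m = 4                      # number of training samples (a toy value)
D = np.full(m, 1.0 / m)    # D_1(i) = 1/m for every sample i
print(D)                   # [0.25 0.25 0.25 0.25]
```

Since the weights start out equal and sum to 1, D_1 is a valid (uniform) probability distribution over the samples.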

3) For t=1, …, T:

* train weak learner using distribution Dt.

* Get weak hypothesis h_t: X -> {-1, +1}

* Aim: select h_t with low weighted error:

ε_t = Pr_{i∼D_t}[h_t(x_i) ≠ y_i]

* Choose α_t = (1/2) · ln((1 − ε_t) / ε_t)

* Update, for i = 1,…,m:

D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t

Useful Notations

Pr = probability

h_t = hypothesis/classifier

ε = the (minimum) weighted misclassification error of the model

α = weight for the classifier

exp = the exponential function with base e (Euler’s number, ≈ 2.71828)

Zt = normalization factor, used to ensure that weights represent a true distribution

With these notations at hand, we can read the next portion as:

“For t=1 to T classifiers, fit it to the training data (where each prediction is either -1 or 1) and select the classifier with the lowest weighted classification error.”

The formula to formally compute ε is described as follows:

Let’s break down this particular model.

Useful Notations

Σ = sum

y_i ≠ h_j(x_i) = an indicator that is 1 if sample i is misclassified and 0 if it is correctly classified

w_i = weight

Thus, the formula reads: “The error equals the weighted misclassification rate: sum the weight w_i of every training sample i where y_i does not equal our prediction h_j (the indicator is 1 if misclassified and 0 if correctly classified), then divide by the sum of all the weights.”

Let us apply simple math to make sense of the formula. Consider having 4 different samples with weights 0.5, 0.2, 0.1, and 0.04. Imagine our classifier h predicted the values 1, 1, -1, and -1, but the actual output values y were -1, 1, -1, 1.

predicted: 1 1 -1 -1

actual: -1 1 -1 1

weights: 0.5 0.2 0.1 0.04

1 or 0: 1 0 0 1

This leads to the following calculation for the misclassification rate:

misclassification rate / error = (0.5*1 + 0.2*0 + 0.1*0 + 0.04*1) / (0.5 + 0.2 + 0.1 + 0.04)

error ≈ 0.642857
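The same arithmetic can be checked with a few lines of NumPy (a sketch using the toy numbers above):

```python
import numpy as np

predicted = np.array([1, 1, -1, -1])
actual    = np.array([-1, 1, -1, 1])
weights   = np.array([0.5, 0.2, 0.1, 0.04])

miss = (predicted != actual).astype(float)          # the "1 or 0" row: [1, 0, 0, 1]
error = np.sum(weights * miss) / np.sum(weights)    # weighted misclassification rate
print(error)                                        # ~0.642857
```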

Next, we choose our weight for the classifier, α, by the formula (1/2) · ln((1 − error) / error).

Simple math might explain better than words could here. Assume for instance, that we have errors 0.30, 0.70, 0.5.

Our classifier weights would be calculated as follows:

ε = 0.3

α = (1/2) · ln((1 − 0.3) / 0.3) = 0.42365

ε = 0.7

α = (1/2) · ln((1 − 0.7) / 0.7) = −0.42365

ε = 0.5

α = (1/2) · ln((1 − 0.5) / 0.5) = 0

Notice three interesting observations: 1) a classifier with accuracy higher than 50% results in a positive classifier weight (in other words, α > 0 if ε < 0.5), 2) a classifier with exactly 50% accuracy has a weight of 0 and thus does not contribute to the final prediction, and 3) the errors 0.3 and 0.7 lead to classifier weights with opposite signs.
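All three observations are easy to reproduce in code (a sketch; `classifier_weight` is just an illustrative helper name, not part of the formula):

```python
import numpy as np

def classifier_weight(error):
    """alpha = (1/2) * ln((1 - error) / error)"""
    return 0.5 * np.log((1 - error) / error)

for eps in (0.3, 0.7, 0.5):
    print(eps, classifier_weight(eps))   # ~0.4236, ~-0.4236, 0.0
```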

Now comes the very important part of the equation: updating the weights for each sample. I have mentioned above that the point of the update is to force classifiers to concentrate on observations that are difficult to correctly classify. This is done by giving misclassified cases increased weights after an iteration. Increased weights make our learning algorithm pay more attention to those observations in the next iteration. Conversely, correctly classified cases receive decreased weights, and reduced attention from our classifiers in the next iteration.

Again, with simple numbers as demonstration, information uptake is painless. Let us plug the error rate of 0.3 from above into the formula. Remember that we are looking for a low weighted error; in other words, we should not use an error rate of 0.5 or above. With this low error rate, let us check what happens when cases are misclassified and when cases are correctly classified.

misclassified: y_i · h_t(x_i) = −1, so exp(−α · y_i · h_t(x_i)) = exp(0.4236) ≈ 1.527

correctly classified: y_i · h_t(x_i) = +1, so exp(−α · y_i · h_t(x_i)) = exp(−0.4236) ≈ 0.655

And there you have it! In the case of incorrect classification, the exp term becomes larger than 1, and in the case of correct classification, it falls below 1. Therefore, misclassified samples receive higher weights, prompting our classifiers to pay more attention to them in the next iteration, while correctly classified samples receive the converse.
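The two exp factors can be verified directly (a sketch, again with the α obtained from ε = 0.3):

```python
import numpy as np

alpha = 0.5 * np.log((1 - 0.3) / 0.3)        # ~0.4236

# y_i * h_t(x_i) is -1 for a misclassified sample and +1 for a correct one
misclassified_factor = np.exp(-alpha * -1)   # ~1.527, > 1: weight grows
correct_factor       = np.exp(-alpha * +1)   # ~0.655, < 1: weight shrinks
print(misclassified_factor, correct_factor)
```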

We continue this iteration until a) low training error is achieved, or b) a preset number of weak learners has been added (a parameter under our control). We then take the final prediction as the sign of the weighted sum of every classifier’s prediction: H(x) = sign(Σ_t α_t · h_t(x)).
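Putting every step together, here is a minimal end-to-end sketch: AdaBoost over simple threshold stumps on a one-dimensional toy dataset. The stump search, the function names, and the data are all illustrative assumptions, not part of the original formula.

```python
import numpy as np

def adaboost(X, y, T=10):
    """Minimal AdaBoost with threshold stumps on a 1-D feature.
    X: 1-D array of inputs; y: labels in {-1, +1}.
    Returns the chosen stumps (threshold, polarity) and their alphas."""
    m = len(X)
    D = np.full(m, 1.0 / m)                    # step 2: initialize weights to 1/m
    stumps, alphas = [], []
    for _ in range(T):
        # step 3a: pick the stump with the lowest weighted error
        best = None
        for thr in X:
            for pol in (1, -1):
                pred = np.where(X < thr, pol, -pol)
                err = np.sum(D[pred != y])
                if best is None or err < best[0]:
                    best = (err, thr, pol, pred)
        err, thr, pol, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)   # avoid log(0) on a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)  # classifier weight
        D = D * np.exp(-alpha * y * pred)      # raise/lower sample weights
        D /= D.sum()                           # normalize (the Z_t factor)
        stumps.append((thr, pol))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    # final prediction: sign of the weighted sum of all stump predictions
    total = np.zeros(len(X))
    for (thr, pol), a in zip(stumps, alphas):
        total += a * np.where(X < thr, pol, -pol)
    return np.sign(total)

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1, 1, 1, -1, -1, -1])
stumps, alphas = adaboost(X, y, T=5)
print(predict(X, stumps, alphas))   # should recover y on this separable toy set
```

Note how each line of the loop corresponds to one step of the formula we just walked through: initialize D, fit a weak learner, compute ε, compute α, then update and normalize the weights.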

Summary

We have seen how Adaboost works at a granular level by breaking down every notation in the formula. We then applied simple math to understand how each component of the formula works. This practice of approaching formulas through their decomposed parts can be useful for understanding machine learning algorithms.
