
AdaBoost: Intuition and Explanation

AdaBoost is one of the most popular boosting algorithms. Look under the hood and see what’s going on.

Photo by Ashkan Forouzani on Unsplash

Boosting is an important tool to have in your Machine Learning toolkit. It is an ensemble method – a machine learning technique that combines multiple models to create a better model. Boosting is widely used in the machine learning world, and AdaBoost is one of the most popular boosting algorithms. Let’s understand how it works.

An Intuitive Story

Let’s say you’re a medical student, and you’re looking at x-rays to determine whether or not someone has a rare disease – a binary classification problem. Because this is a rare and complicated disease, there is only one expert in the entire country. This expert can’t articulate exactly how to classify an x-ray because it’s too complicated to write down. Instead, he can help you by giving rules of thumb. For example, he might say "if the x-ray shows high contrast, the patient is more likely to have the disease" or "if the x-ray shows more blood vessels, the patient is less likely to have the disease". Over time, by combining these rules of thumb, you begin to classify patients better and better.

Before you’ve learned any rules of thumb, you know nothing, so your guess on any x-ray is no better than chance. Once you’ve learned some rules, however, you may notice that some x-rays are usually classified correctly with the rules you’ve learned, and others are usually classified incorrectly. It seems sensible to put more importance on the x-rays that you keep misclassifying. You ask the doctor to start providing rules of thumb geared towards these difficult x-rays, as opposed to general rules of thumb.

The story we just described is a metaphor for how AdaBoost works. We learn multiple weak classifiers, combining them to get a stronger classifier. At the beginning of learning, we don’t know anything, so we have no preference on what rules we learn. Later on, we focus on rules that help classify the data points we are having trouble with.

The Mathematical Formulation

Here is the AdaBoost algorithm in formal notation:

Source: Freund and Schapire, A Short Introduction to Boosting, 1999
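
If the image doesn’t load for you, here is the same algorithm written out in text. This is the standard binary AdaBoost from the cited paper, so the notation should match the figure up to minor details:

```latex
% Standard binary AdaBoost (Freund & Schapire)
\textbf{Given: } (x_1, y_1), \dots, (x_m, y_m), \quad y_i \in \{-1, +1\} \\
\textbf{Initialize: } D_1(i) = 1/m \\
\textbf{For } t = 1, \dots, T: \\
\quad \text{train a weak classifier } h_t \text{ using distribution } D_t \\
\quad \epsilon_t = \Pr_{i \sim D_t}\left[ h_t(x_i) \neq y_i \right] \\
\quad \alpha_t = \tfrac{1}{2} \ln\!\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) \\
\quad D_{t+1}(i) = \frac{D_t(i)\, \exp\!\left( -\alpha_t\, y_i\, h_t(x_i) \right)}{Z_t}
    \quad \text{($Z_t$ normalizes $D_{t+1}$ to a distribution)} \\
\textbf{Output: } H(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t h_t(x) \right)
```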

As in our story, we assume a binary classification problem. The training set consists of m labeled examples (x_i, y_i), drawn from some parent distribution. There are T total timesteps in the algorithm. For each timestep t, we define a weight D_t(i) for each training example i. This weight reflects the "hardness" of that particular training example, and AdaBoost will give higher weights to harder examples. At the beginning (t = 1) we don’t know which examples are hard, so we set all the weights to be equal: D_1(i) = 1/m.

Each timestep of the training goes like this. First, we train a "weak classifier" based on the distribution defined on the training set by the hardness weights. Intuitively, we are making sure our weak classifier puts more importance on the hard data, since we currently have trouble with that data. Each weak classifier corresponds to a rule of thumb given by the doctor in our story. Roughly speaking, a "weak classifier" is a classifier that is better than chance. The mathematically precise definition of "weak" requires some more theory. For the purposes of understanding AdaBoost, our rough definition is good enough.
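
To make the "train a weak classifier on the weighted data" step concrete, here is a minimal sketch using scikit-learn, with a decision stump (a depth-1 decision tree) as the weak classifier and synthetic data standing in for the x-rays. The data, the variable names, and the choice of stump are illustrative assumptions, not part of the original algorithm description:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the x-rays: m examples, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)

m = len(X)
D = np.full(m, 1.0 / m)  # t = 1: every example gets the same hardness weight

# The weak classifier: a decision stump trained on the weighted data.
# Passing D as sample_weight makes the stump focus on the examples
# that currently carry the most weight, i.e. the "hard" ones.
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=D)
```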

Once we have a weak classifier, we calculate its error on the training set, denoted ϵ_t (second bullet point). Notice that this error is weighted by the hardness weights. Then we calculate α_t (third bullet point). Notice that the smaller the training error, the larger α_t is. Also notice that by the definition of a "weak classifier", ϵ_t is strictly less than 0.5, so α_t is always positive. Next we re-weight our training data by a factor of e^(-α_t) or e^(α_t), depending on whether each example was classified correctly or incorrectly, and then renormalize so the weights still form a distribution. Note that since α_t is always positive, e^(-α_t) < 1 and e^(α_t) > 1 always. Thus, if an example is classified correctly, its weight decreases, and if it is classified incorrectly, its weight increases. This is where the name "AdaBoost" comes from – AdaBoost stands for "adaptive boosting", and "adaptive" in this context means changing the weights as we just described.
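
Continuing the sketch above, the bookkeeping for one timestep – the weighted error ϵ_t, the coefficient α_t, and the weight update with renormalization – might look like this:

```python
# Weighted training error eps_t of the current weak classifier.
# D sums to 1, so summing the weights of the misclassified examples
# gives exactly the weighted error.
pred = stump.predict(X)
eps = np.sum(D[pred != y])

# alpha_t: the smaller the weighted error, the larger alpha_t.
alpha = 0.5 * np.log((1 - eps) / eps)

# Reweight: e^(-alpha) for correctly classified examples, e^(+alpha)
# for misclassified ones (y * pred is +1 when correct, -1 when wrong),
# then renormalize so the weights again form a distribution.
D = D * np.exp(-alpha * y * pred)
D = D / D.sum()
```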

Over time, data that is consistently misclassified will receive a higher and higher weight. The algorithm is then incentivized to learn a weak classifier that is able to classify the difficult data. In this way, AdaBoost makes sure that all data is covered.

After T timesteps, we take the sign of the weighted sum of all the weak classifiers as our final classifier, where the weights are the α_t. This makes sense because lower ϵ_t corresponds to higher α_t – we trust a classifier with lower error more than one with higher error. How good is our final classifier? It turns out that it is a "strong" classifier. Again, as with weak classifiers, a proper definition of "strong" requires some more theory. But roughly speaking, a "strong classifier" is one that, with enough time and data, can achieve arbitrarily small error.
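
In the same sketch, if each round’s stump and α_t are collected in lists stumps and alphas (hypothetical names), the final classifier is just the sign of the weighted vote:

```python
import numpy as np

def final_classifier(X_new, stumps, alphas):
    # Weighted vote of the weak classifiers; the sign gives the label.
    votes = sum(a * h.predict(X_new) for a, h in zip(alphas, stumps))
    return np.sign(votes)
```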

There is one important clarification to make. We said the final classifier was strong. Recall that the training set we ran T iterations of AdaBoost on was drawn from some parent distribution. So is the final classifier strong on just the training set, or on the entire parent distribution? It turns out the answer is both. The theoretical results are as follows. A proof of these results, as well as more rigorous definitions of "strong" and "weak", can be found in Freund and Schapire’s A Short Introduction to Boosting (cited above).

  • The training set error decreases exponentially with the number of AdaBoost iterations T (see the bound sketched after this list). Therefore, we can make the training set error arbitrarily small by running more iterations.
  • The generalization error (error on the parent distribution) exceeds the training error by a term that shrinks roughly like the inverse square root of the training set size. Therefore, we can also make the generalization error small by increasing the size of the training set.
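
For the curious, the exponential decrease in the first bullet comes from the following bound in the paper, where γ_t = 1/2 − ϵ_t measures how much the t-th weak classifier beats chance:

```latex
% Training error bound for AdaBoost (Freund & Schapire)
\frac{1}{m}\,\bigl|\{\, i : H(x_i) \neq y_i \,\}\bigr|
\;\le\; \prod_{t=1}^{T} Z_t
\;\le\; \exp\!\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right),
\qquad \gamma_t = \tfrac{1}{2} - \epsilon_t
```

So if every weak classifier beats chance by at least some fixed margin γ, the training error falls off exponentially in T.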

Recap

We’ve gone through the intuition behind AdaBoost and told a story of how it might be used. Then we looked at the formal algorithm and discussed its most important parts. Finally, we stated the main theoretical result: AdaBoost converts weak classifiers into strong classifiers. I hope you leave this article with a better understanding of how AdaBoost works. Any feedback is welcome, thanks for reading!

