All You Need to Know about Gradient Boosting Algorithm − Part 2. Classification

Algorithm explained with an example, math, and code

Tomonori Masui
Towards Data Science


Image by author

In the Part 1 article, we learned the gradient boosting regression algorithm in detail. As we reviewed in that post, the algorithm is flexible enough to work with any loss function as long as it is differentiable. That means if we simply replace the loss function used for regression, mean squared error, with a loss function suited to classification problems, we can perform classification without changing the algorithm itself. Even though the base algorithm is the same, there are still some differences worth knowing. In this post, we will dive into all the details of the classification algorithm.

Algorithm with an Example

Gradient boosting is a variant of ensemble methods in which you create multiple weak models (often decision trees) and combine them to get better performance as a whole. In this section, we build a gradient boosting classification model on very simple example data to understand intuitively how it works.

The picture below shows the sample data. It has the binary class y (0 and 1) and two features x₁ and x₂.

Sample for the classification problem (Image by author)

Our goal is to build a gradient boosting model that classifies these two classes. The first step is making a uniform prediction of the probability of class 1 (we will call it p) for all the data points. The most reasonable value for the uniform prediction might be the proportion of class 1, which is just the mean of y.

Here is a 3D representation of the data and the initial prediction. At this moment, the prediction is just a plane with the uniform value p = mean(y) everywhere.

Prediction plane (Image by author)

In our data, the mean of y is 0.56. As it is bigger than 0.5, everything is classified into class 1 with this initial prediction. Some of you might feel that this uniform value prediction does not make sense, but don’t worry. We will improve our prediction as we add more weak models to it.
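In code, this initial prediction is a one-liner. The snippet below is a minimal sketch using hypothetical stand-in targets with the same class-1 proportion of 0.56 (the article's exact data points are not listed here):

```python
import numpy as np

# Hypothetical stand-in targets: 14 of 25 samples are class 1, so mean(y) = 0.56
y = np.array([1] * 14 + [0] * 11)

p0 = y.mean()      # uniform initial prediction of the class-1 probability
print(p0)          # 0.56
print(p0 > 0.5)    # True: every data point starts out classified as class 1
```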

To improve our prediction quality, we might want to focus on the residuals (i.e. prediction errors) from our initial prediction, as that is what we want to minimize. The residuals are defined as rᵢ = yᵢ − p (i represents the index of each data point). In the figure below, the residuals are shown as brown lines: the perpendicular segments from each data point to the prediction plane.

Residuals (Image by author)

To minimize these residuals, we are building a regression tree model with both x₁ and x₂ as its features and the residuals r as its target. If we can build a tree that finds some patterns between x and r, we can reduce the residuals from the initial prediction p by utilizing those found patterns.

To simplify the demonstration, we are building very simple trees, each of which has only one split and two terminal nodes; such a tree is called a “stump”. Please note that gradient boosting usually uses somewhat deeper trees, such as ones with 8 to 32 terminal nodes.

Here we are creating the first tree predicting the residuals with two different values r = {0.1, -0.6}.

Created tree (Image by author)

If you have already read the post on the regression algorithm, you might now expect us to add these predicted values to our initial prediction p to reduce its residuals, but things are slightly different with classification. The value γ (gamma) that we add to our initial prediction is computed with the following formula:

γⱼ = ( Σxᵢ∈Rⱼ rᵢ ) / ( Σxᵢ∈Rⱼ p(1 − p) )

Σxᵢ∈Rⱼ means we are aggregating the values inside the sigma Σ over all the samples xᵢ that belong to the terminal node Rⱼ, and j represents the index of each terminal node. You might notice that the numerator of the fraction is the sum of the residuals in the terminal node j, while the denominator sums p(1 − p) computed from the previously predicted probability p of each sample. We will go through all the calculations that give us this formula in the next section, but let’s just use it to calculate γ for now. Below are the computed values of γ₁ and γ₂.

This γ is not simply added to our initial prediction p. Instead, we convert p into log-odds (we will call this log-odds converted value F(x)) and then add γ to it. For those who are not familiar with log-odds, it is defined below. You might have seen it used in logistic regression.

log(odds) = log( p / (1 − p) )

One more tweak to the prediction update is that γ is scaled down by the learning rate ν, which ranges between 0 and 1, before it is added to the log-odds-converted prediction F(x). This helps keep the model from overfitting the training data.

In this example, we use a relatively big learning rate ν = 0.9 to make the optimization process easier to see, but it is usually set to a much smaller value such as 0.1.

The update rule is F₁(x) = F₀(x) + ν · γ, where γ is the value from the terminal node that a given x falls into. By substituting actual values for the variables on the right-hand side, we get our updated prediction F₁(x).

If we convert the log-odds F(x) back into the predicted probability p(x) (we will cover how to convert it in the next section), it looks like the stair-like surface below.

Updated prediction plane (Image by author)

The purple-colored plane is the initial prediction p₀ and it is updated to the red and yellow plane p₁.

Now, the updated residuals r look like this:

Updated residuals (Image by author)

In the next step, we are creating a regression tree again using the same x₁ and x₂ as the features and the updated residuals r as its target. Here is the created tree:

Created tree (Image by author)

We apply the same formula to compute γ. The calculated γ along with the updated prediction F₂(x) are as follows.

Again, if we convert the log-odds F₂(x) back into the predicted probability p₂(x), it looks like the surface below.

Updated prediction plane (Image by author)

We iterate these steps until the model prediction stops improving. The figures below show the optimization process from 0 to 4 iterations.

Prediction update through iterations (Image by author)

You can see the combined prediction p(x) (red and yellow plane) is getting closer to our target y as we add more trees into the combined model. This is how gradient boosting works to predict complex targets by combining multiple weak models.

The image below summarizes the whole process of the algorithm.

Process of the algorithm (Image by Author)
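To make the walkthrough concrete, here is a minimal Python sketch of the loop described above. It is an illustration rather than the article's actual code: the arrays X and y are hypothetical stand-ins for the sample data, while ν = 0.9, the stumps, and the γ leaf formula follow the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the sample data: two features (x1, x2), a 0/1 target
X = np.array([[0.1, 0.2], [0.4, 0.8], [0.9, 0.5], [0.7, 0.1], [0.3, 0.9]])
y = np.array([1, 0, 1, 1, 0])

nu = 0.9                                   # learning rate used in this example
p = y.mean()                               # uniform initial probability
F = np.full(len(y), np.log(p / (1 - p)))   # initial prediction as log-odds

for _ in range(4):                         # a few boosting iterations
    p = 1 / (1 + np.exp(-F))               # convert log-odds back to probability
    r = y - p                              # residuals (neg. gradient of log loss)
    stump = DecisionTreeRegressor(max_depth=1).fit(X, r)   # one-split tree
    leaves = stump.apply(X)                # terminal node id of each sample
    for j in np.unique(leaves):            # overwrite each leaf with its gamma
        idx = leaves == j
        stump.tree_.value[j, 0, 0] = r[idx].sum() / (p[idx] * (1 - p[idx])).sum()
    F += nu * stump.predict(X)             # update the combined prediction
```

After the loop, 1 / (1 + np.exp(-F)) gives the combined predicted probability p(x) for each training point, mirroring the stair-like surfaces above.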

Math

In this section, we are learning a more generalized algorithm by looking at its math details. Here is the whole algorithm in math formulas.

Source: adapted from Wikipedia and Friedman’s paper

Let’s take a close look at it line by line.

Step 1

The first step is creating an initial constant prediction value F₀ = argmin_γ Σ L(yᵢ, γ). L is the loss function, and we are using log loss (more generally called cross-entropy loss) for it.

Log loss: L(yᵢ, p) = −( yᵢ log(p) + (1 − yᵢ) log(1 − p) )

yᵢ is our classification target and it is either 0 or 1, and p is the predicted probability of class 1. L therefore takes different forms depending on the target class yᵢ: for yᵢ = 1 the loss is −log(p), and for yᵢ = 0 it is −log(1 − p).

As −log(x) is a decreasing function of x, the better the prediction (i.e. a larger p for yᵢ = 1), the smaller the loss we will have.

argmin means we are searching for the value γ (gamma) that minimizes ΣL(yᵢ, γ). While it would be more straightforward to assume γ is the predicted probability p, we assume γ is log-odds, as that makes all the following computations easier. For those who forgot the log-odds definition we reviewed in the previous section, it is defined as log(odds) = log(p/(1−p)).

To be able to solve the argmin problem in terms of log-odds, we transform the loss function into a function of log-odds.

Now, we want to replace p in the above equation with something expressed in terms of log-odds. By rearranging the log-odds expression shown earlier, p can be represented by log-odds:

p = e^log(odds) / ( 1 + e^log(odds) )

Then, we substitute this value for p in the previous L equation and simplify it:

L = −yᵢ · log(odds) + log( 1 + e^log(odds) )

Now we find the γ (remember, we are assuming it is log-odds) that minimizes ΣL by taking the derivative of ΣL with respect to log-odds:

∂ΣL/∂log(odds) = Σ( −yᵢ + e^log(odds) / (1 + e^log(odds)) ) = Σ( −yᵢ + p )

In the equations above, we replaced the fraction containing log-odds with p to simplify the expression. Next, we set ∂ΣL/∂log(odds) equal to 0 and solve it for p:

Σ( −yᵢ + p ) = 0  →  p = ( Σ yᵢ ) / n = mean(y)

In this binary classification problem, y is either 0 or 1. So, the mean of y is actually the proportion of class 1. You might now see why we used p = mean(y) for our initial prediction.

As γ is log-odds instead of the probability p, we convert this p into log-odds to get the initial prediction:

F₀ = γ = log( mean(y) / (1 − mean(y)) )

In our example, this is F₀ = log(0.56 / 0.44) ≈ 0.24.
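As a quick numerical sanity check, the sketch below minimizes the total log loss over γ directly and confirms that the optimum matches the log-odds of mean(y). The targets are hypothetical stand-ins whose mean is 0.56, as in the example:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical stand-in targets with mean(y) = 0.56
y = np.array([1] * 14 + [0] * 11)

def total_log_loss(log_odds):
    p = 1 / (1 + np.exp(-log_odds))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

best = minimize_scalar(total_log_loss)
print(best.x)                               # ~0.241
print(np.log(y.mean() / (1 - y.mean())))    # log-odds of mean(y), also ~0.241
```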

Step 2

The whole of step 2, processes 2–1 to 2–4, is iterated M times. M denotes the number of trees we are creating, and the lowercase m represents the index of each tree.

Step 2–1

We calculate the residuals rᵢₘ by taking the derivative of the loss function with respect to the previous prediction Fₘ₋₁ and multiplying it by −1:

rᵢₘ = −[ ∂L(yᵢ, F(xᵢ)) / ∂F(xᵢ) ] evaluated at F = Fₘ₋₁

As you can see from the subscripts, rᵢₘ is computed for each single sample i. Some of you might be wondering why we call rᵢₘ residuals. This value is actually the negative gradient, which gives us the direction (+/−) and the magnitude in which the loss function can be minimized. You will see why we call it residuals shortly. By the way, this technique, where you use the gradient to minimize the loss on your model, is very similar to the gradient descent technique typically used to optimize neural networks. (In fact, they are slightly different from each other. If you are interested, please look at this article detailing that topic.)

Let’s compute the residuals here. Fₘ₋₁ in the equation means the prediction from the previous step; in this first iteration, it is F₀. As in the previous step, we take the derivative of L with respect to log-odds instead of p, since our prediction Fₘ is log-odds. Using L expressed in terms of log-odds, which we got in the previous step:

rᵢₘ = −∂L/∂log(odds) = yᵢ − e^log(odds) / ( 1 + e^log(odds) )

In the previous step, we also got this equation:

p = e^log(odds) / ( 1 + e^log(odds) )

So, we can replace the second term in the rᵢₘ equation with p:

rᵢₘ = yᵢ − p

You might now see why we call r residuals. This also gives us the interesting insight that the negative gradient, which provides the direction and the magnitude in which the loss is minimized, is actually just the residual.
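Here is a small numerical sketch of that insight: a central-difference estimate of the negative gradient of the log loss (written in terms of log-odds) matches y − p. The values of y and F below are arbitrary:

```python
import numpy as np

y, F, eps = 1.0, 0.24, 1e-6

def loss(F):
    # log loss expressed in terms of log-odds, as derived above
    return -y * F + np.log(1 + np.exp(F))

neg_grad = -(loss(F + eps) - loss(F - eps)) / (2 * eps)   # central difference
p = 1 / (1 + np.exp(-F))
print(neg_grad, y - p)    # both are approximately 0.44
```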

Step 2–2

In this step, we fit a regression tree to the residuals rᵢₘ and obtain its terminal nodes Rⱼₘ. j represents a terminal node (i.e. leaf) in the tree, m denotes the tree index, and capital J means the total number of leaves.

Step 2–3

We are searching for the γⱼₘ that minimizes the loss function on each terminal node j. Σxᵢ∈Rⱼₘ L means we aggregate the loss over all the xᵢ that belong to the terminal node Rⱼₘ. Let’s plug the loss function into the equation:

γⱼₘ = argmin_γ Σxᵢ∈Rⱼₘ ( −yᵢ · (Fₘ₋₁(xᵢ) + γ) + log( 1 + e^(Fₘ₋₁(xᵢ) + γ) ) )

Solving this equation for γⱼₘ is going to be very hard. To make it more easily solvable, we approximate L with its second-order Taylor polynomial. A Taylor polynomial is a way to approximate a function as a polynomial built from its derivatives at a point. While we are not looking into its details here, you can look at this tutorial which explains the idea nicely if you are interested.

Here is the approximation of L using the second-order Taylor polynomial (the derivatives are taken with respect to F and evaluated at Fₘ₋₁):

L(yᵢ, Fₘ₋₁ + γ) ≈ L(yᵢ, Fₘ₋₁) + (∂L/∂F) γ + ½ (∂²L/∂F²) γ²

We substitute this approximation for L in the equation of γⱼₘ, then find the value of γⱼₘ that makes the derivative of the sum equal zero:

∂/∂γ Σxᵢ∈Rⱼₘ ( L + (∂L/∂F) γ + ½ (∂²L/∂F²) γ² ) = Σxᵢ∈Rⱼₘ ( ∂L/∂F + (∂²L/∂F²) γ ) = 0

γⱼₘ = − Σxᵢ∈Rⱼₘ (∂L/∂F) / Σxᵢ∈Rⱼₘ (∂²L/∂F²)

As we already computed ∂L/∂F in the previous step as below:

∂L/∂F = −yᵢ + p = −(yᵢ − p) = −rᵢₘ

We substitute this for ∂L/∂F in the γ equation. For the denominator, differentiating ∂L/∂F = −yᵢ + p once more with respect to F gives ∂²L/∂F² = p(1 − p), since ∂p/∂F = p(1 − p).

Finally, we get this simplified equation for the value of γⱼₘ, which we used in the previous section:

γⱼₘ = Σxᵢ∈Rⱼₘ rᵢₘ / Σxᵢ∈Rⱼₘ pₘ₋₁(1 − pₘ₋₁)

where pₘ₋₁ is the probability predicted in the previous step for each sample.

Step 2–4

In the final step, we update the prediction of the combined model Fₘ:

Fₘ(x) = Fₘ₋₁(x) + ν Σⱼ γⱼₘ 1(x ∈ Rⱼₘ)

γⱼₘ1(x ∈ Rⱼₘ) means that we pick the value γⱼₘ if a given x falls in the terminal node Rⱼₘ. As all the terminal nodes are exclusive, any given x falls into exactly one terminal node, and the corresponding γⱼₘ is added to the previous prediction Fₘ₋₁ to make the updated prediction Fₘ.

As mentioned in the previous section, ν is the learning rate, ranging between 0 and 1, which controls the degree of contribution of the additional tree prediction γ to the combined prediction Fₘ. A smaller learning rate reduces the effect of each additional tree, but it also reduces the chance of the model overfitting the training data.

Now we have gone through all the steps. To get the best model performance, we iterate step 2 M times, which means adding M trees to the combined model. In reality, you may often need more than 100 trees to get the best performance.

If you read my article on the regression algorithm, you might feel that the math of the classification algorithm is much more complicated than that of regression. While the argmin and the derivative computations for the log-loss function are more involved, the underlying algorithm shown at the beginning of this section is exactly the same. That is actually the elegance of gradient boosting: exactly the same algorithm works for any loss function as long as it is differentiable. In fact, popular gradient boosting implementations such as XGBoost and LightGBM offer a wide variety of loss functions, so you can choose whichever loss function suits your problem (see the various loss functions available in XGBoost or LightGBM).

Code

In this section, we translate the math we just reviewed into viable Python code to help us understand the algorithm further. We use DecisionTreeRegressor from scikit-learn to build the trees, which lets us focus on the gradient boosting algorithm itself instead of the tree algorithm. We imitate the scikit-learn style of implementation, where you train the model with the fit method and make predictions with the predict method.
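Below is a minimal sketch of such a CustomGradientBoostingClassifier, following the four steps from the math section. The hyperparameter names (learning_rate, n_estimators, max_depth) are assumptions mirroring scikit-learn's conventions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class CustomGradientBoostingClassifier:
    def __init__(self, learning_rate=0.1, n_estimators=100, max_depth=1):
        self.learning_rate = learning_rate
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        # Step 1: the initial prediction F0 is the log-odds of the class-1 proportion
        p = np.mean(y)
        self.F0 = np.log(p / (1 - p))
        Fm = np.full(len(y), self.F0)

        for _ in range(self.n_estimators):
            p = 1 / (1 + np.exp(-Fm))
            # Step 2-1: the residuals are the negative gradient of the log loss
            r = y - p
            # Step 2-2: fit a regression tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, r)
            # Step 2-3: overwrite each leaf with gamma = sum(r) / sum(p * (1 - p))
            leaf_ids = tree.apply(X)
            for j in np.unique(leaf_ids):
                idx = leaf_ids == j
                tree.tree_.value[j, 0, 0] = (
                    r[idx].sum() / (p[idx] * (1 - p[idx])).sum())
            # Step 2-4: update the combined prediction with the scaled tree output
            Fm += self.learning_rate * tree.predict(X)
            self.trees.append(tree)

    def predict_proba(self, X):
        # Sum the scaled tree outputs on top of F0, then convert log-odds back
        # into a probability
        Fm = self.F0 + self.learning_rate * np.sum(
            [tree.predict(X) for tree in self.trees], axis=0)
        return 1 / (1 + np.exp(-Fm))

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
```

Overwriting tree_.value lets us reuse scikit-learn's tree-growing machinery while replacing each leaf's default mean-residual output with the γ from Step 2–3.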


Please note that all the trained trees are stored in the self.trees list, and they are retrieved when we make predictions with the predict_proba method.

Next, we check whether our CustomGradientBoostingClassifier performs the same as GradientBoostingClassifier from scikit-learn by comparing their log-loss on our data.
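The comparison might look like the sketch below. Since the article's dataset is not reproduced here, make_classification serves as a stand-in, and both models share the same hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Stand-in data; the article uses its own small sample
X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

custom_gbm = CustomGradientBoostingClassifier(
    learning_rate=0.1, n_estimators=20, max_depth=1)
custom_gbm.fit(X, y)
print("Custom GBM log-loss:   ", log_loss(y, custom_gbm.predict_proba(X)))

sklearn_gbm = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=20, max_depth=1)
sklearn_gbm.fit(X, y)
print("scikit-learn log-loss: ", log_loss(y, sklearn_gbm.predict_proba(X)[:, 1]))
```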

As you can see in the output above, both models have exactly the same log-loss.

Recommended Resources

In this blog post, we have reviewed all the details of the gradient boosting classification algorithm. If you are also interested in the regression algorithm, please look at the Part 1 article.

There are also some other great resources if you want further details of the algorithm:

  • StatQuest, Gradient Boost Part 3 and Part 4
    These are YouTube videos explaining the gradient boosting classification algorithm with great visuals in a beginner-friendly way.
  • Terence Parr and Jeremy Howard, How to explain gradient boosting
    While this article focuses on gradient boosting regression instead of classification, it nicely explains every detail of the algorithm.
  • Jerome Friedman, Greedy Function Approximation: A Gradient Boosting Machine
    This is the original paper by Friedman. While it is a little hard to follow, it clearly shows the flexibility of the algorithm: he presents a generalized algorithm that can deal with any type of problem as long as it has a differentiable loss function.

You can also look at the full Python code in the Google Colab link or the Github link below.
