
Gradient Boosting is a widely used machine learning technique that is based on a combination of boosting and gradient descent.
Boosting is an ensemble method that combines multiple weak learners (or base learners) to create a strong predictive model. The base models are trained sequentially, where each model focuses on correcting the errors made by the previous models.
In gradient boosting, each base model is trained to predict the negative gradients of the loss function with respect to the predictions of the previous models. As a result, adding the newly trained base learner to the ensemble takes a step in the steepest descent direction towards the minimum of the loss. This process is similar to gradient descent, but it operates in the function space rather than the parameter space. Therefore, it is known as functional gradient descent.
When the weak learners are decision trees, the resulting method is known as gradient-boosted decision trees (GBDT) or gradient boosting machine (GBM).
Gradient boosting is one of the best algorithms that exist today for dealing with structured (tabular) data, and provides state-of-the-art results on many standard classification benchmarks. Alongside deep learning, it is one of the most commonly used algorithms in Kaggle competitions.
The gradient boosting algorithm was originally developed by Jerome Friedman in 2001 [1]. Since then, it has evolved into a family of algorithms that includes XGBoost, CatBoost and LightGBM. These variants of the algorithm incorporate various enhancements that further improve the performance and scalability of gradient boosting.
This article covers in depth the theory and implementation of gradient boosting. In the first part of the article we will focus on the theoretical concepts of gradient boosting, present the algorithm in pseudocode, and demonstrate its usage on a small numerical example. In the second part, we will explore the classes in Scikit-Learn that implement gradient boosting, and use them to solve different regression and classification tasks.
Intuitive Introduction
As a reminder, in supervised machine learning problems, we are given a training set of n labeled samples: D = {(x₁, y₁), (x₂, y₂), … , (xₙ, yₙ)}, where xᵢ is an m-dimensional vector that contains the features of sample i, and yᵢ represents the label of that sample. Our goal is to build a model whose predictions are as close as possible to the true labels.
For our initial discussion, we will assume that the learning problem is a regression problem, i.e., the targets yᵢ are continuous values.
The basic idea of gradient boosting is to build an ensemble of weak models, where each model is trained to predict the residuals of the previous model. This process can be described as follows:
- Fit a base model _h_₁(x) to the given labels y.
- Set the initial ensemble to _F_₁(x) = _h_₁(x).
- Fit a base model _h_₂(x) to the residuals y − _F_₁(x).
- Combine the two models into a new ensemble: _F_₂(x) = _h_₁(x) + _h_₂(x). The predictions of _F_₂(x) should be closer to the targets than _F_₁(x).
- Fit a base model _h_₃(x) to the residuals y − _F_₂(x).
- Combine the three models into a new ensemble: _F_₃(x) = _h_₁(x) + _h_₂(x) + _h_₃(x). The predictions of _F_₃(x) should be closer to the targets than _F_₂(x).
- Continue this process for M steps.
- Return _F_ₘ(x) as the final hypothesis.
We can demonstrate this process in Python by manually building a sequence of regression trees, where each tree is trained to predict the residuals of the previous trees.
Let’s first generate a noisy quadratic training set:
import numpy as np
import matplotlib.pyplot as plt

n_samples = 100
X = np.random.rand(n_samples, 1) - 0.5
y = 5 * X[:, 0] ** 2 + 0.1 * np.random.randn(n_samples)
plt.scatter(X, y, s=20)

Our base learners will be decision trees with a maximum depth of 2. The first decision tree _h_₁ is fit to the given data set:
from sklearn.tree import DecisionTreeRegressor

h1 = DecisionTreeRegressor(max_depth=2)
h1.fit(X, y)
The first ensemble _F_₁ consists of this single tree, and its _R_² score on the training set is:
from sklearn.metrics import r2_score

F1 = [h1]  # ensemble of one tree
F1_pred = h1.predict(X)
print(f'R2 score of F1: {r2_score(y, F1_pred):.4f}')
R2 score of F1: 0.7819
The second tree _h_₂ is fitted to the residual errors made by the first tree:
h2 = DecisionTreeRegressor(max_depth=2)
y2 = y - F1_pred
h2.fit(X, y2)
_h_₂ is added to the ensemble to create the second ensemble _F_₂. We can use _F_₂ to make predictions by simply adding up the predictions of both trees:
F2 = [h1, h2] # ensemble of two trees
F2_pred = sum(h.predict(X) for h in F2)
print(f'R2 score of F2: {r2_score(y, F2_pred):.4f}')
R2 score of F2: 0.8802
Lastly, a third tree _h_₃ is fitted to the residuals of the second ensemble _F_₂, and then added to the ensemble:
h3 = DecisionTreeRegressor(max_depth=2)
y3 = y - F2_pred
h3.fit(X, y3)
F3 = [h1, h2, h3] # ensemble of three trees
F3_pred = sum(h.predict(X) for h in F3)
print(f'R2 score of F3: {r2_score(y, F3_pred):.4f}')
R2 score of F3: 0.9124
Notice how the _R_² score of the model gradually increases as we add more trees to the ensemble.
Let’s plot the residuals and the ensemble’s predictions after every iteration:
fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(12, 7))
X_test = np.linspace(-0.5, 0.5, 500).reshape(-1, 1)
for i, h, residuals in zip([0, 1, 2], [h1, h2, h3], [y, y2, y3]):
    ax = axes[0, i]
    y_test_pred = h.predict(X_test)
    ax.scatter(X, residuals, c='k', s=20, marker='x', label='Residuals')
    ax.plot(X_test, y_test_pred, 'r', linewidth=2)
    ax.set_title(f'$h_{i + 1}(x)$')
    ax.legend(loc='upper center')

for i, ensemble in enumerate([F1, F2, F3]):
    ax = axes[1, i]
    y_test_pred = sum(h.predict(X_test) for h in ensemble)
    ax.scatter(X, y, s=20, label='Training set')
    ax.plot(X_test, y_test_pred, 'm', linewidth=2)
    ax.set_title(f'$F_{i + 1}(x)$')
    ax.legend(loc='upper center')

We can see that the ensemble’s predictions gradually improve as more trees are added to the ensemble.
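If we wanted to keep going beyond three trees, we could wrap the same residual-fitting steps in a loop. A minimal sketch (the names M, ensemble and residuals below are our own, not part of the original code):
M = 10
ensemble = [h1]
residuals = y - h1.predict(X)
for _ in range(M - 1):
    h = DecisionTreeRegressor(max_depth=2)
    h.fit(X, residuals)            # fit the next tree to the current residuals
    ensemble.append(h)
    residuals -= h.predict(X)      # update the residuals of the enlarged ensemble
ensemble_pred = sum(h.predict(X) for h in ensemble)
print(f'R2 score of F{M}: {r2_score(y, ensemble_pred):.4f}')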
The generalization of gradient boosting to other types of problems (e.g., classification problems) and other loss functions follows from the observation that the residuals yᵢ − Fₘ₋₁(xᵢ) that hₘ is fitted to are proportional to the negative gradients of the squared loss function with respect to Fₘ₋₁(xᵢ):

$$-\frac{\partial}{\partial F_{m-1}(\mathbf{x}_i)}\big(y_i - F_{m-1}(\mathbf{x}_i)\big)^2 = 2\big(y_i - F_{m-1}(\mathbf{x}_i)\big)$$
Therefore, we can generalize this technique to any differentiable loss function by using the negative gradients of the loss function instead of the residuals.
The Gradient Boosting Algorithm
We will now derive the general gradient boosting algorithm for any differentiable loss function.
Boosting approximates the true mapping from the features to the labels, y = f(x), using an additive expansion (ensemble) of the form:

$$F(\mathbf{x}) = \sum_{m=1}^{M} h_m(\mathbf{x})$$
where hₘ(x) are base learners from some class H (usually decision trees of a fixed size), and M is the number of learners.
Given a loss function L(y, F(x)), our goal is to find an approximation F(x) that minimizes the total loss on the training set:

$$J(F) = \sum_{i=1}^{n} L\big(y_i, F(\mathbf{x}_i)\big)$$

The parameters of the objective function J are themselves functions (the base learners hₘ), so it cannot be optimized using traditional optimization methods such as gradient descent. Instead, the model is trained in an additive manner, adding one base learner at a time.
We start with a constant model _F_₀ that minimizes the objective:

$$F_0(\mathbf{x}) = \underset{\gamma}{\arg\min}\sum_{i=1}^{n} L(y_i, \gamma)$$
For example, if the loss function is squared loss (used in regression problems), _F_₀(x) would simply be the mean of the target values.
Then, we incrementally expand the model in a greedy fashion:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + h_m(\mathbf{x})$$
where the newly added base learner hₘ is fitted to minimize the total loss of the new ensemble Fₘ:

$$h_m = \underset{h \in H}{\arg\min}\sum_{i=1}^{n} L\big(y_i, F_{m-1}(\mathbf{x}_i) + h(\mathbf{x}_i)\big)$$
Finding the best function hₘ for an arbitrary loss function L is computationally infeasible, since it would require us to enumerate all the possible functions in H and pick the best one. Instead, we use an iterative approach: in every iteration we choose a base learner hₘ that points in the negative gradient direction of the loss function. As a result, adding hₘ to the ensemble will get us one step closer to the minimum loss.
This process is similar to gradient descent, but it operates in the function space rather than the parameter space, since in every iteration we move to a different function in the hypothesis space H, rather than making a step in the parameter space of a specific function h. This allows h to be a non-parametric machine learning model, such as a decision tree. This process is known as functional gradient descent.
In functional gradient descent, our parameters are the values of F(x) at the points x₁, …, xₙ, and we seek to minimize L(yᵢ, F(xᵢ)) at each individual xᵢ. The steepest-descent direction of the loss function at every point xᵢ is its negative gradient:

$$-g_m(\mathbf{x}_i) = -\left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}$$
gₘ(xᵢ) is the derivative of the loss with respect to its second parameter, evaluated at Fₘ₋₁(xᵢ).
Therefore, the vector

$$-\mathbf{g}_m = \big(-g_m(\mathbf{x}_1), \ldots, -g_m(\mathbf{x}_n)\big)$$

gives the steepest-descent direction of the loss in the n-dimensional data space at Fₘ₋₁(x). However, this gradient is defined only at the data points x₁, …, xₙ, and cannot be generalized to other x-values.
In the continuous case, where H is the set of all differentiable functions, we could simply choose a function hₘ ∈ H such that hₘ(xᵢ) = –gₘ(xᵢ).
In the discrete case (i.e., when the set H is finite), we choose hₘ as the function in H that is closest to −gₘ(xᵢ) at the data points xᵢ, i.e., the hₘ that is most parallel to the vector −gₘ in Rⁿ. This function can be obtained by fitting a base learner hₘ to the training set {(xᵢ, ỹᵢₘ)}, with the labels

$$\tilde{y}_{im} = -g_m(\mathbf{x}_i) = -\left[\frac{\partial L\big(y_i, F(\mathbf{x}_i)\big)}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}$$
These labels are called pseudo-residuals. In other words, in every boosting iteration, we are fitting a base learner to predict the negative gradients of the loss function with respect to the ensemble’s predictions from the previous iteration. Note that this approach is heuristic, and does not necessarily yield an exact solution to the optimization problem.
The complete pseudocode of the algorithm is shown below:
1. Initialize the model with a constant value: F₀(x) = argmin_γ Σᵢ L(yᵢ, γ).
2. For m = 1, …, M:
   a. Compute the pseudo-residuals ỹᵢₘ = −∂L(yᵢ, F(xᵢ))/∂F(xᵢ), evaluated at F = Fₘ₋₁.
   b. Fit a base learner hₘ(x) to the training set {(xᵢ, ỹᵢₘ)}.
   c. Update the model: Fₘ(x) = Fₘ₋₁(x) + hₘ(x).
3. Return F_M(x).
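To make the pseudocode concrete, here is a minimal Python sketch of the generic algorithm. The function and parameter names are our own, and the constant initialization with the mean assumes squared loss; other losses would need their own constant minimizer and gradient:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def squared_loss_gradient(y, F):
    # dL/dF for L(y, F) = 0.5 * (y - F)^2; its negation is the residual y - F
    return F - y

def gradient_boost(X, y, M=100, loss_gradient=squared_loss_gradient):
    F0 = y.mean()                                  # constant model F_0
    F = np.full(len(y), F0)                        # current predictions on the training set
    learners = []
    for _ in range(M):
        pseudo_residuals = -loss_gradient(y, F)    # negative gradient at F_{m-1}
        h = DecisionTreeRegressor(max_depth=2).fit(X, pseudo_residuals)
        F = F + h.predict(X)                       # step in function space
        learners.append(h)
    return F0, learners

def boosted_predict(F0, learners, X):
    return F0 + sum(h.predict(X) for h in learners)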
Gradient Tree Boosting
Gradient tree boosting is a specialization of the gradient boosting algorithm to the case where the base learners h(x) are regression trees (see this article for more details on regression trees).
At each iteration m, a regression tree hₘ(x) is fit to the pseudo-residuals, i.e., we build a tree to predict the pseudo-residuals given our data points. The tree is built in a top-down greedy fashion using mean squared error as the splitting criterion.
The question is: what output value should the tree assign to a given leaf node? Using the mean value of the samples in that leaf works well for regression problems, but not for other types of problems. Thus, we need a general way to assign an output value to each leaf node that minimizes the loss of the model for any differentiable loss function.
Let Jₘ be the number of leaves in the tree. The tree partitions the input space into Jₘ disjoint regions _R_₁ₘ, …, Rⱼₘ, and predicts a constant value γⱼₘ in each one:

$$h_m(\mathbf{x}) = \sum_{j=1}^{J_m} \gamma_{jm}\mathbf{1}(\mathbf{x} \in R_{jm})$$
where 1(⋅) is the indicator function, which has the value 1 if its argument is true, and 0 otherwise.
Our goal is to find the coefficient γⱼₘ in each region Rⱼₘ that will minimize the total loss induced by the points that belong to that region:

$$\gamma_{jm} = \underset{\gamma}{\arg\min}\sum_{\mathbf{x}_i \in R_{jm}} L\big(y_i, F_{m-1}(\mathbf{x}_i) + \gamma\big)$$
i.e., we are trying to find the optimal constant update in each leaf node’s region to be added to the predictions of the previous ensemble Fₘ₋₁(x).
For example, let’s find the optimal γ for the case of least-squares regression. In this case, our objective is to minimize the following sum of squared losses:

$$E = \sum_{\mathbf{x}_i \in R_{jm}}\big(y_i - F_{m-1}(\mathbf{x}_i) - \gamma\big)^2$$
We now take the derivative of E with respect to γ and set it to 0:

$$\frac{\partial E}{\partial \gamma} = -2\sum_{\mathbf{x}_i \in R_{jm}}\big(y_i - F_{m-1}(\mathbf{x}_i) - \gamma\big) = 0$$
That is, the optimal γ is simply the mean of the residuals in the _j_th leaf node at the _m_th iteration:

$$\gamma_{jm} = \frac{1}{|R_{jm}|}\sum_{\mathbf{x}_i \in R_{jm}}\big(y_i - F_{m-1}(\mathbf{x}_i)\big)$$
The case of classification will be discussed later.
Once we find the optimal output values for the leaf nodes, the current approximation Fₘ₋₁(x) is separately updated in each corresponding region:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \sum_{j=1}^{J_m}\gamma_{jm}\mathbf{1}(\mathbf{x} \in R_{jm})$$
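As a small illustration (our own, not taken from the article's code), the following sketch performs one such squared-loss update, using scikit-learn's DecisionTreeRegressor.apply to find the leaf region of each training sample:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 0.5, size=(100, 1))
y = 5 * X[:, 0] ** 2 + 0.1 * rng.normal(size=100)

F_prev = np.full_like(y, y.mean())        # F_{m-1}: here, the constant model F_0
residuals = y - F_prev                    # pseudo-residuals for squared loss

tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
leaf_ids = tree.apply(X)                  # index of the leaf each sample falls into

# For squared loss, the optimal gamma in each leaf region is the mean residual
gammas = {leaf: residuals[leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}
F_m = F_prev + np.array([gammas[leaf] for leaf in leaf_ids])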
Regularization
Compared to standard decision trees, gradient-boosted trees are fairly robust to overfitting. Nevertheless, there are several commonly used regularization techniques that help to control the complexity of gradient-boosted trees.
First, we can use the same regularization techniques that we have in standard decision trees, such as limiting the depth of the tree, the number of leaves, or the minimum number of samples required to split a node. We can also use post-pruning techniques to remove branches from the tree that fail to reduce the loss by a predefined threshold.
Second, we can control the number of boosting iterations (i.e., the number of trees in the ensemble). Increasing the number of trees reduces the ensemble’s error on the training set, but may also lead to overfitting. The optimal number of trees is typically found by early stopping, i.e., the algorithm is terminated once the score on the validation set does not improve for a specified number of iterations.
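For example, scikit-learn's GradientBoostingRegressor (explored in the second part of this article) supports early stopping through its validation_fraction and n_iter_no_change parameters. A minimal sketch:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)

# Stop adding trees once the validation score hasn't improved for 10 iterations
gbr = GradientBoostingRegressor(n_estimators=500, validation_fraction=0.1,
                                n_iter_no_change=10, random_state=0)
gbr.fit(X, y)
print(gbr.n_estimators_)   # number of trees actually built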
Lastly, Friedman [1, 2] has suggested the following regularization techniques, which are more specific to gradient-boosted trees:
Shrinkage
Shrinkage [1] scales the contribution of each base learner by a constant factor ν:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \nu h_m(\mathbf{x})$$
The parameter ν (0 < ν ≤ 1) is called the learning rate, as it controls the step size of the gradient descent procedure. Similar to a learning rate in stochastic optimization, shrinkage reduces the impact of each individual learner and leaves room for future learners to improve the model.
Empirically, it has been found that using small learning rates (e.g., ν ≤ 0.1) can significantly improve the model’s generalization ability. However, smaller learning rates also require more boosting iterations in order to maintain the same training error, thereby increasing the computational time during both training and prediction.
Stochastic Gradient Boosting (Subsampling)
In a follow-up paper [2], Friedman proposed stochastic gradient boosting, which combines gradient boosting with bagging.
In each iteration, a base learner is trained only on a fraction (typically 0.5) of the training set, drawn at random without replacement. This subsampling procedure introduces randomness into the algorithm and helps prevent the model from overfitting.
Like in bagging, subsampling also allows us to use the out-of-bag samples (samples that were not used to train the current base learner) to evaluate the performance of the model, instead of using an independent validation set. Out-of-bag estimates often underestimate the real performance of the model, so they are used only if cross-validation takes too much time.
Another strategy to reduce the variance of the model is to randomly sample the features considered for split in each node of the tree (similar to random forests).
The pseudocode for gradient tree boosting with shrinkage is shown below:
1. Initialize the model with a constant value: F₀(x) = argmin_γ Σᵢ L(yᵢ, γ).
2. For m = 1, …, M:
   a. Compute the pseudo-residuals ỹᵢₘ = −∂L(yᵢ, F(xᵢ))/∂F(xᵢ), evaluated at F = Fₘ₋₁.
   b. Fit a regression tree to {(xᵢ, ỹᵢₘ)}, yielding terminal regions Rⱼₘ, j = 1, …, Jₘ.
   c. For each region, compute γⱼₘ = argmin_γ Σ_{xᵢ∈Rⱼₘ} L(yᵢ, Fₘ₋₁(xᵢ) + γ).
   d. Update the model: Fₘ(x) = Fₘ₋₁(x) + ν Σⱼ γⱼₘ1(x ∈ Rⱼₘ).
3. Return F_M(x).
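Sketched in Python (our own code, assuming squared loss for simplicity), shrinkage and row subsampling modify the earlier boosting loop roughly as follows:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def stochastic_gradient_boost(X, y, M=100, learning_rate=0.1, subsample=0.5):
    rng = np.random.default_rng(0)
    F0 = y.mean()
    F = np.full(len(y), F0)
    learners = []
    for _ in range(M):
        # Draw a random fraction of the training set without replacement
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)
        residuals = y[idx] - F[idx]                  # pseudo-residuals on the subsample
        h = DecisionTreeRegressor(max_depth=2).fit(X[idx], residuals)
        F = F + learning_rate * h.predict(X)         # shrink the learner's contribution
        learners.append(h)
    return F0, learners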
Gradient Tree Boosting for Classification
The same algorithm of gradient tree boosting can also be used for classification tasks. However, since the sum of the trees Fₘ(x) can be any continuous value, it needs to be mapped into a probability (a value between 0 and 1). This mapping depends on the type of the classification problem (binary or multiclass).
Note that gradient-boosted trees are always regression trees, even when they are used for classification problems (since they are built to approximate the negative gradient of the loss function).
Binary Classification
In binary classification problems, we use the sigmoid function σ to model the probability that a sample x belongs to the positive class (similar to logistic regression):

$$p = P(y = 1 \mid \mathbf{x}) = \sigma\big(F(\mathbf{x})\big) = \frac{1}{1 + e^{-F(\mathbf{x})}}$$
This means that F(x) represents the predicted log odds, i.e., the logarithm of the odds:

$$F(\mathbf{x}) = \log\frac{p}{1 - p}$$
The initial model _F_₀(x) is given by the log odds of the positive class in the training set:

$$F_0(\mathbf{x}) = \log\frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n}(1 - y_i)}$$
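A quick numerical check of this formula, with hypothetical labels:
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1])              # hypothetical binary labels
F0 = np.log(y.sum() / (len(y) - y.sum()))     # log odds of the positive class
print(F0, 1 / (1 + np.exp(-F0)))              # 0.693..., 0.666...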
As in logistic regression, we use the binary log loss as our loss function:

$$L\big(y_i, F_{m-1}(\mathbf{x}_i)\big) = -y_i\log p_i - (1 - y_i)\log(1 - p_i)$$
where pᵢ = σ(Fₘ₋₁(xᵢ)) represents the predicted probability of the previous ensemble that sample xᵢ belongs to the positive class.
We can compute the gradient of this loss with respect to Fₘ₋₁(xᵢ) using the chain rule:

$$\frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)} = \frac{\partial L}{\partial p_i}\cdot\frac{\partial p_i}{\partial F_{m-1}(\mathbf{x}_i)} = \left(-\frac{y_i}{p_i} + \frac{1-y_i}{1-p_i}\right)p_i(1-p_i) = p_i - y_i$$
Here we used the fact that the derivative of the sigmoid function is:

$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big)$$
Therefore, the pseudo-residual at point xᵢ is:

$$\tilde{y}_{im} = -\frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)} = y_i - p_i$$
i.e., the pseudo-residual is simply the actual label of xᵢ minus the predicted probability of the previous ensemble that this sample belongs to the positive class.
With regression trees as base learners, we need to find the optimal output values γⱼₘ for each leaf node:

$$\gamma_{jm} = \underset{\gamma}{\arg\min}\sum_{\mathbf{x}_i \in R_{jm}} L\big(y_i, F_{m-1}(\mathbf{x}_i) + \gamma\big)$$
There is no closed-form solution to this optimization problem (note that γ is added to the log-odds and not to the class probabilities directly). Therefore, we approximate the solution by performing a single Newton-Raphson step.
As a reminder, Newton’s method finds the minimum of a twice-differentiable function f: R → R by building a sequence {xₖ} from an initial guess _x_₀ ∈ R. This sequence is constructed from the second-order Taylor approximations of f around the elements xₖ.
The second-order Taylor expansion of f around xₖ is:

$$f(x_k + t) \approx f(x_k) + f'(x_k)t + \frac{1}{2}f''(x_k)t^2$$
The next element in the sequence, xₖ₊₁, is chosen so as to minimize this quadratic expansion in t. The minimum can be found by setting the derivative of the expansion with respect to t to 0:

$$\frac{d}{dt}\left[f(x_k) + f'(x_k)t + \frac{1}{2}f''(x_k)t^2\right] = f'(x_k) + f''(x_k)t = 0$$
Thus, the minimum is achieved for:

$$t = -\frac{f'(x_k)}{f''(x_k)}$$
Therefore, Newton’s method performs the following iteration:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$
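As a quick one-dimensional illustration (not part of the original derivation), here is this iteration applied to a simple function:
def newton_minimize(f_prime, f_double_prime, x0, n_steps=20):
    # Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k)
    x = x0
    for _ in range(n_steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Minimize f(x) = x^4 - 3x^2 + 2, whose local minima are at x = ±sqrt(1.5)
x_min = newton_minimize(lambda x: 4 * x**3 - 6 * x,   # f'(x)
                        lambda x: 12 * x**2 - 6,      # f''(x)
                        x0=2.0)
print(x_min)   # approximately 1.2247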
Similarly, we can write the second-order Taylor expansion of the loss function around the point Fₘ₋₁(x):

$$L\big(y_i, F_{m-1}(\mathbf{x}_i) + \gamma\big) \approx L\big(y_i, F_{m-1}(\mathbf{x}_i)\big) + \frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)}\gamma + \frac{1}{2}\frac{\partial^2 L}{\partial F_{m-1}(\mathbf{x}_i)^2}\gamma^2$$
Taking the derivative of this expression with respect to γ yields:

$$\frac{\partial L\big(y_i, F_{m-1}(\mathbf{x}_i) + \gamma\big)}{\partial \gamma} \approx \frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)} + \frac{\partial^2 L}{\partial F_{m-1}(\mathbf{x}_i)^2}\gamma$$
Therefore, the derivative of the total loss induced by the samples in region Rⱼₘ is:

$$\sum_{\mathbf{x}_i \in R_{jm}}\left[\frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)} + \frac{\partial^2 L}{\partial F_{m-1}(\mathbf{x}_i)^2}\gamma\right]$$
Equating this derivative to 0 gives us the optimal output value for region Rⱼₘ:

$$\gamma_{jm} = -\frac{\sum_{\mathbf{x}_i \in R_{jm}} \frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)}}{\sum_{\mathbf{x}_i \in R_{jm}} \frac{\partial^2 L}{\partial F_{m-1}(\mathbf{x}_i)^2}}$$
The numerator of this expression is the sum of the pseudo-residuals of the data points in region Rⱼₘ (the negatives of the first derivatives of the loss), while the denominator is the sum of the second derivatives of the loss at the same points.
We have already found the first derivative of log loss:

$$\frac{\partial L}{\partial F_{m-1}(\mathbf{x}_i)} = p_i - y_i$$
We now need its second derivative, which can be computed by differentiating the first derivative with respect to Fₘ₋₁(xᵢ):

$$\frac{\partial^2 L}{\partial F_{m-1}(\mathbf{x}_i)^2} = \frac{\partial (p_i - y_i)}{\partial F_{m-1}(\mathbf{x}_i)} = p_i(1 - p_i)$$
Therefore, we can write:

$$\gamma_{jm} = \frac{\sum_{\mathbf{x}_i \in R_{jm}} (y_i - p_i)}{\sum_{\mathbf{x}_i \in R_{jm}} p_i(1 - p_i)}$$
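Putting the pieces together, here is a rough sketch (our own toy data and names) of the boosting loop for binary log loss, with the per-leaf Newton step computed from the formula above:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)   # binary labels

F = np.full(len(y), np.log(y.mean() / (1 - y.mean())))   # F_0: log odds of the positive class
learning_rate = 0.1

for _ in range(10):
    p = 1 / (1 + np.exp(-F))                  # current predicted probabilities
    residuals = y - p                         # pseudo-residuals for log loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    leaf_ids = tree.apply(X)
    # Newton step per leaf: sum of residuals / sum of p * (1 - p)
    gammas = {leaf: residuals[leaf_ids == leaf].sum()
                    / (p[leaf_ids == leaf] * (1 - p[leaf_ids == leaf])).sum()
              for leaf in np.unique(leaf_ids)}
    F = F + learning_rate * np.array([gammas[leaf] for leaf in leaf_ids])

print(((1 / (1 + np.exp(-F)) > 0.5) == y).mean())   # training accuracy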
Example on a Toy Dataset
Assume that we are given the following data set for a binary classification problem:

The goal is to predict whether a customer will buy a given product based on three attributes: the customer’s age, level of income (Low, Medium or High) and level of education (High School or College).
To solve this problem we are going to use an ensemble of gradient-boosted trees with a maximum depth of 2, and a learning rate of ν = 0.5 (we use a relatively large learning rate for illustration purposes). The trees will use only binary splits (as in the CART algorithm).
First, we initialize the model with a constant value, which is the log odds of the positive class:

Next, we calculate the pseudo-residuals. For the log loss function, these are simply the actual labels minus the predicted probabilities:

We now fit a regression tree to the pseudo-residuals. We start by finding the best split for the root node. In regression trees, at every node we select a split that yields the highest reduction in the variance of the values stored at that node (the pseudo-residuals in our case).
The mean of the residuals at the root node is 0, since the initial prediction p = 0.667 is exactly the proportion of positive samples in the training set.
Therefore, the variance is just the average of their squared values:

We will now compute the reduction in the variance that can be achieved by every possible split in every feature. We start from the two categorical attributes:

For the Age attribute, we sort the pseudo-residuals by age, and then consider each middle point between two consecutive ages as a candidate split point:


The split that provides the highest reduction in the variance is Age < 26.5. Therefore, the first level of the tree looks as follows:

The variance of the residuals in the left child node is 0, thus there is no need to split it anymore. We now need to find the best split for the right child node.
First, let’s compute the variance of the residuals at the right child node:

We now consider all the possible splits for the four samples that belong to the right child node (samples 2, 4, 5, 6):


In this case we have multiple candidate splits that lead to the maximum reduction in the variance (0.0835). Let’s choose arbitrarily the split Income = Medium. The resulting regression tree is:

Next, we compute the optimal output values for the leaf nodes (the γ coefficients). Note that since this is our first tree, the most recent predicted probability for all the samples is p = σ(_F_₀(x)) = 0.667.
The output value for the leftmost leaf is:

Similarly, the output values for the other two leaves are:

Thus, we get the following predictions from the first regression tree:

We now scale these predictions by the learning rate and add them to the predictions of the previous iteration. We then use the new predictions to calculate the pseudo-residuals for the next iteration:

We now fit another regression tree to the pseudo-residuals. Following the same process as in the previous iteration, we get the following tree (verify that this is indeed the resulting tree!):

Next, we compute the output values for the leaf nodes. Note that this time the most recent predicted probabilities p = σ(_F_₁(x)) are not the same for all the samples. Going from the leftmost leaf node to the rightmost node we obtain:

Thus, we get the following predictions from the second tree:

We now scale these predictions by the learning rate and add them to the predictions of the previous model:

We can see that at this point our ensemble correctly classifies all the samples in the training set. Hurray!
Let’s see how we can use this ensemble to make predictions on new samples. Assume that we have a new customer with the following attributes: Age = 33, Income = Medium, Education = High School.
We first need to find to which terminal region this sample belongs in each tree of the ensemble. In the first tree, the sample belongs to region _R_₂₁ (since [Age < 26.5] is false and [Income = Medium] is true), and in the second tree it belongs to region _R_₂₂ (since [Education = High School] is true and [Age < 30] is false).
Therefore, the predicted log odds that this customer will buy the product is:

And the predicted probability is:

Since p < 0.5, our prediction is that this customer will not buy the product.
Multiclass Classification
In multiclass classification problems, K trees (one per class) are built at each of the M iterations. The probability that xᵢ belongs to class k is modeled as the softmax of the Fₘ,ₖ(xᵢ) values:

$$P(y_i = k \mid \mathbf{x}_i) = \frac{e^{F_{m,k}(\mathbf{x}_i)}}{\sum_{l=1}^{K} e^{F_{m,l}(\mathbf{x}_i)}}$$
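As a small sketch, converting per-class scores into probabilities looks like this (hypothetical scores for K = 3 classes):
import numpy as np

F_k = np.array([1.2, -0.3, 0.4])           # F_{m,k}(x_i) for K = 3 classes
probs = np.exp(F_k) / np.exp(F_k).sum()    # softmax
print(probs, probs.sum())                  # class probabilities summing to 1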
The initial model in this case is given by the prior probability of each class, and the loss function is the cross-entropy loss. The derivation of the coefficients γⱼₘ in this case is left as an exercise to the reader.
Final Notes
You can find the code examples of this article on my github: https://github.com/roiyeho/medium/tree/main/gradient_boosting
The second part of this article can be found here.
Thanks for reading!
References
[1] Friedman, J.H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29, 1189–1232.
[2] Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38, 367–378.