
XGBoost (short for eXtreme Gradient Boosting) is an open-source library that provides an optimized and scalable implementation of gradient boosted decision trees. It incorporates various software and hardware optimization techniques that allow it to deal with huge amounts of data.
Originally developed as a research project by Tianqi Chen and Carlos Guestrin in 2016 [1], XGBoost has become the go-to solution for supervised learning tasks on structured (tabular) data. It provides state-of-the-art results on many standard regression and classification tasks, and many Kaggle competition winners have used XGBoost as part of their winning solutions.
Although significant progress has been made using deep neural networks for tabular data, they are still outperformed by XGBoost and other tree-based models on many standard benchmarks [2, 3]. In addition, XGBoost requires much less tuning than deep models.
The main innovations of XGBoost with respect to other gradient boosting algorithms include:
- Clever regularization of the decision trees.
- Using second-order approximation to optimize the objective (Newton boosting).
- A weighted quantile sketch procedure for efficiently proposing candidate split points.
- A novel tree learning algorithm for handling sparse data.
- Support for parallel and distributed processing of the data.
- A cache-aware block structure for out-of-core computation.
In this series of articles we will cover XGBoost in depth, including the mathematical details of the algorithm, a from-scratch implementation of the algorithm in Python, and an overview of the XGBoost library and how to use it in practice.
In this first article of the series, we are going to derive the XGBoost algorithm step-by-step, provide an implementation of the algorithm in pseudocode, and then illustrate its working on a toy data set.
The description of the algorithm given in this article is based on XGBoost’s original paper [1] and the official documentation of the XGBoost library (https://xgboost.readthedocs.io/). However, the article goes beyond the existing documentation in the following respects:
- It explains every step of the mathematical derivation in detail. Understanding the mathematical details of the algorithm will help you grasp the meaning of the various hyperparameters of XGBoost (and there are quite a lot of them) and how to tune them in practice.
- It provides complete pseudocode of the algorithm (the pseudocode in [1] describes only specific parts of the algorithm, in a very concise way).
- It goes through a detailed numerical example of applying XGBoost to a toy data set.
This article assumes that you are already familiar with gradient boosting. If not, please check out my previous article on gradient boosting first.
Now let’s dive in!
The XGBoost Algorithm
Recall that in supervised learning problems, we are given a training set with n labeled samples: D = {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where xᵢ is an m-dimensional vector that contains the features of sample i, and yᵢ is the label of that sample. Our goal is to build a model whose predictions are as close as possible to the true labels.
Similar to gradient tree boosting, XGBoost builds an ensemble of regression trees, which consists of K additive functions:
$$\hat{y}_i = F(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$
where K is the number of trees, and F is the set of all possible regression tree functions. Note that here we use f(x) to denote the hypotheses of the base models instead of our typical notation h(x), since the letter h will later be used to represent another entity (the Hessian of the loss function).
Given a loss function L(y, F(x)) that measures the difference between the true label y and the ensemble’s prediction F(x), XGBoost aims to find an ensemble that minimizes the loss on the training set, while also not being overly complex. To that end, it defines the following regularized cost function as the objective function of the model:
$$J = \sum_{i=1}^{n} L\big(y_i, F(x_i)\big) + \sum_{k=1}^{K} \omega(f_k)$$
where ω(fₖ) is the complexity of the tree fₖ, which will be defined in detail later. Similar to ridge regression, this cost function consists of two parts: the total loss of the model on the training set and a regularization term that penalizes the complexity of the base trees.
Note that in XGBoost regularization is an integral part of the tree learning objective, rather than being imposed by external heuristics such as limiting the maximum tree depth or maximum number of leaves as in other tree-based models.
Additive Training
Since the objective function J includes functions as parameters (the regression tree functions), it cannot be optimized using traditional optimization methods such as gradient descent. Instead, the model is trained in a greedy stepwise manner, by adding one new tree at a time.
Formally, let Fₖ be the ensemble at the k-th iteration:
$$F_k(x) = \sum_{j=1}^{k} f_j(x) = F_{k-1}(x) + f_k(x)$$
At this iteration, we add to the previous ensemble a tree fₖ which minimizes the following objective:
$$J_k = \sum_{i=1}^{n} L\big(y_i, F_{k-1}(x_i) + f_k(x_i)\big) + \omega(f_k)$$
That is, we would like to find a tree fₖ that reduces the overall training loss, but also is not too complex (has a low complexity ω(fₖ)).
Newton Boosting
Finding the best tree fₖ for an arbitrary loss function L is computationally infeasible, since it would require us to enumerate all the possible trees and pick the best one. Instead, we use an iterative optimization approach: in every boosting iteration we choose a tree fₖ that will get us one step closer to the minimum cost.
In the original (unextreme) gradient boosting algorithm, the function fₖ was chosen as the one that pointed in the negative gradient direction of the loss function with respect to the predictions of the previous ensemble. This made the optimization process work as gradient descent in the function space (hence it is known as functional gradient descent).
XGBoost uses a similar idea, but instead of working as gradient descent in the function space (i.e., using a first-order approximation), it works as a Newton-Raphson method in the function space (i.e., using a second-order approximation). Newton’s method usually converges faster to the minimum, when the second derivative of the objective function is known and easy to compute (which is true of the XGBoost objective function when using common loss functions such as squared loss or log loss).
As a reminder, Newton’s method tries to find the minimum of a twice-differentiable function f: R → R, by building a sequence {xₖ} from an initial guess x₀ ∈ R. This sequence is constructed from the second-order Taylor approximations of f around the elements xₖ:
$$f(x_k + t) \approx f(x_k) + f'(x_k)\,t + \frac{1}{2} f''(x_k)\,t^2$$
The next element in the sequence xₖ₊₁ is chosen so as to minimize the quadratic expansion on the right-hand side of this equation. We can find the minimum of that expansion by taking its derivative with respect to the step size t and setting it to 0:
$$\frac{d}{dt}\left[f(x_k) + f'(x_k)\,t + \frac{1}{2} f''(x_k)\,t^2\right] = f'(x_k) + f''(x_k)\,t = 0$$
Thus, the minimum is achieved for:
$$t^* = -\frac{f'(x_k)}{f''(x_k)}$$
Accordingly, Newton’s method performs the following iteration:
$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}$$
This sequence is guaranteed to converge quadratically fast to the minimizer x* of f, provided that f is strongly convex and that x₀ is close enough to x*:
$$\|x_{k+1} - x^*\| \le M\,\|x_k - x^*\|^2 \quad \text{for some constant } M > 0$$
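To make the update rule concrete, here is a minimal Python sketch of Newton's method for a one-dimensional function (the function and parameter names are illustrative, not part of any library):

```python
def newton_minimize(f_prime, f_double_prime, x0, n_iter=20, tol=1e-10):
    """Newton's method for minimizing a twice-differentiable 1-D function,
    given its first and second derivatives (illustrative sketch)."""
    x = x0
    for _ in range(n_iter):
        step = f_prime(x) / f_double_prime(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: minimize f(x) = (x - 3)^2 + 1, whose minimum is at x = 3.
x_min = newton_minimize(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0)
print(x_min)  # 3.0, reached in a single step since f is quadratic
```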
Coming back to XGBoost, we first write the second-order Taylor expansion of the loss function at each data point xᵢ around the prediction of the previous ensemble, Fₖ₋₁(xᵢ):
$$L\big(y_i, F_{k-1}(x_i) + f_k(x_i)\big) \approx L\big(y_i, F_{k-1}(x_i)\big) + g_i\, f_k(x_i) + \frac{1}{2} h_i\, f_k^2(x_i)$$
Here, gᵢ is the first derivative (gradient) of the loss function, and hᵢ is the second derivative (Hessian) of the loss function, both with respect to the predicted value of the previous ensemble at xᵢ:
$$g_i = \frac{\partial L\big(y_i, F_{k-1}(x_i)\big)}{\partial F_{k-1}(x_i)}, \qquad h_i = \frac{\partial^2 L\big(y_i, F_{k-1}(x_i)\big)}{\partial F_{k-1}(x_i)^2}$$
Plugging this approximation into our objective function at iteration k yields:
$$J_k \approx \sum_{i=1}^{n}\left[L\big(y_i, F_{k-1}(x_i)\big) + g_i\, f_k(x_i) + \frac{1}{2} h_i\, f_k^2(x_i)\right] + \omega(f_k)$$
After removing the constant terms, we obtain the following simplified objective:
$$J_k = \sum_{i=1}^{n}\left[g_i\, f_k(x_i) + \frac{1}{2} h_i\, f_k^2(x_i)\right] + \omega(f_k)$$
Our goal is to find a tree fₖ(x) (our "step size") that will minimize this objective function. Note that this function only depends on gᵢ and hᵢ, which allows XGBoost to support any custom loss function that is twice differentiable.
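Because the objective depends on the loss only through gᵢ and hᵢ, supplying a custom twice-differentiable loss to the library amounts to providing these two quantities. Below is a minimal sketch using the low-level training API; the data is randomly generated for illustration, and the exact callback signature may differ slightly between library versions, so treat this as an assumption to check against your installed version's documentation:

```python
import numpy as np
import xgboost as xgb

# A custom objective only has to return the per-sample gradients g_i and
# Hessians h_i of the loss with respect to the current predictions.
def squared_error_obj(predt, dtrain):
    y = dtrain.get_label()
    grad = predt - y               # g_i for L = 0.5 * (y_i - F(x_i))^2
    hess = np.ones_like(predt)     # h_i = 1
    return grad, hess

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)
dtrain = xgb.DMatrix(X, label=y)

booster = xgb.train({"max_depth": 2, "eta": 0.3}, dtrain,
                    num_boost_round=10, obj=squared_error_obj)
```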
Definition of Tree Complexity
Now that we have introduced the training step, we need to define a measure for the complexity of the tree. To that end, we first write a more explicit expression for the function f(x) computed by the regression tree.
Let T be the number of leaves in the tree, w = (w₁, …, w_T) ∈ Rᵀ a vector of the scores (or weights) at the leaf nodes, and q(x): Rᵐ → {1, 2, …, T} a function that assigns each sample x (an m-dimensional vector) to its corresponding leaf index (according to the decision rules in the tree nodes). Then we can write f(x) as follows:
$$f(x) = w_{q(x)}$$
That is, the output value assigned to a sample x is the weight of the leaf node to which this sample is mapped by the tree.
We now define the complexity of the tree as follows:
$$\omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
That is, the complexity of the tree is a function of the number of its leaves (γT, where γ is a hyperparameter) and of the sum of the squared leaf weights, scaled by another hyperparameter λ. Increasing γ encourages smaller trees, while increasing λ encourages smaller leaf weights, which in turn decreases the contribution of each tree to the reduction in the loss (similar to the shrinkage factor in gradient boosting).
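As a quick sanity check, here is how this complexity measure could be computed for a given vector of leaf weights (a minimal sketch with an illustrative helper name and made-up numbers):

```python
import numpy as np

def tree_complexity(leaf_weights, gamma, reg_lambda):
    """omega(f) = gamma * T + 0.5 * lambda * sum of squared leaf weights."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * len(w) + 0.5 * reg_lambda * np.dot(w, w)

print(tree_complexity([0.4, -0.2, 0.1], gamma=1.0, reg_lambda=1.0))
# 3.105  (= 3 * 1.0 + 0.5 * 1.0 * 0.21)
```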
Using this definition for the tree complexity, we can now rewrite the objective function at step k as follows:
$$J_k = \sum_{i=1}^{n}\left[g_i\, w_{q(x_i)} + \frac{1}{2} h_i\, w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
$$\phantom{J_k} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T$$
where Iⱼ = {i|q(xᵢ) = j} is the set of indices of the samples that are assigned to the j-th leaf. Note that in the second line of the equation we changed the index of the summation from i to j, since all the samples at the same leaf node get the same weight.
We can further simplify the objective function by defining:
$$G_j = \sum_{i \in I_j} g_i, \qquad H_j = \sum_{i \in I_j} h_i$$
Gⱼ is the sum of the gradients of the loss function over the samples at leaf j, and Hⱼ is the sum of the Hessians over the same samples.
We can now write:
$$J_k = \sum_{j=1}^{T}\left[G_j\, w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2\right] + \gamma T$$
Our goal is to find the weights for the leaves that will minimize this cost function. To that end, we take the partial derivative of Jₖ with respect to each weight wⱼ:
$$\frac{\partial J_k}{\partial w_j} = G_j + \left(H_j + \lambda\right) w_j$$
Setting this derivative to 0 gives us the optimal weight wⱼ* of leaf j:
$$w_j^* = -\frac{G_j}{H_j + \lambda}$$
Plugging the optimal weights back into the objective function gives us the best objective reduction we can get from this tree:
$$J_k^* = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
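In code, these two closed-form results (the optimal leaf weight and its contribution to the objective) are one-liners; the helper names below are illustrative, and G and H are assumed to have already been accumulated per leaf:

```python
def leaf_weight(G, H, reg_lambda):
    """Optimal weight w_j* of a leaf with gradient sum G and Hessian sum H."""
    return -G / (H + reg_lambda)

def leaf_objective(G, H, reg_lambda):
    """Contribution of a single leaf to the simplified objective."""
    return -0.5 * G**2 / (H + reg_lambda)
```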
Learning the Tree Structure
Now that we know how to measure the quality of a given tree, theoretically we could enumerate all the possible trees and pick the best one. In practice this is intractable. Instead, we build the regression tree in a greedy top-down fashion, similar to how standard decision trees are built.
In each node of the tree, we evaluate every possible split by computing how much reduction in the cost function Jₖ can be obtained from that split.
Formally, let Iₗ and Iᵣ be the sets of samples at the left and right child nodes after the split, respectively, and let I = Iₗ ∪ Iᵣ be the set of samples at the parent node. Then the sums of the gradients and Hessians in the child nodes are:
$$G_L = \sum_{i \in I_L} g_i, \quad H_L = \sum_{i \in I_L} h_i, \qquad G_R = \sum_{i \in I_R} g_i, \quad H_R = \sum_{i \in I_R} h_i$$
And the sums of gradients and Hessians in the parent node are:
$$G = G_L + G_R, \qquad H = H_L + H_R$$
Therefore, the reduction in the cost we get after the split is:
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda}\right] - \gamma$$
We use this formula to evaluate all the split candidates and then choose the split with the highest gain (highest reduction in the cost). If the gain obtained by the best split is less than 0, i.e., the reduction in the cost is less than γ, then we choose not to split the parent node at all (it will become a leaf in the tree).
Note that the contribution of each child node to the gain is very similar to the expression for the optimal weight of a leaf, except that here the sum of the gradients in the numerator is squared (and there is no minus sign).
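Expressed in code, the gain of a candidate split is the following (again with illustrative names, where reg_lambda and gamma are the regularization hyperparameters λ and γ):

```python
def split_gain(G_L, H_L, G_R, H_R, reg_lambda, gamma):
    """Reduction in the objective obtained by splitting a node into
    children with gradient/Hessian sums (G_L, H_L) and (G_R, H_R)."""
    def score(G, H):
        return G**2 / (H + reg_lambda)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma
```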
The Complete Algorithm in Pseudocode
The following pseudocode shows the XGBoost algorithm in its full glory:

The algorithm uses a helper function called Build-Tree to build the next regression tree in the ensemble:

This function in turn uses another helper function Find-Best-Split, which evaluates every possible split at a given node and returns the split with the highest gain (called an exact greedy algorithm in the original XGBoost paper). The best split is returned as a tuple that contains the subsets of samples that go to the left and the right child nodes and also the sum of their gradients and Hessians. The pseudocode of this function is shown below:

As an exercise, try to implement the algorithm in your favorite programming language (a Python implementation is provided in the next article of this series).
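If you want a head start on the exercise, here is one possible way to write the exact greedy split search in Python. It is a sketch under the assumptions that the features are numerical and that the gradients and Hessians of the node's samples are given; it is not the library's implementation:

```python
import numpy as np

def find_best_split(X, g, h, reg_lambda, gamma):
    """Exact greedy split search over the samples at one node.
    X: (n, m) array of feature values; g, h: per-sample gradients/Hessians."""
    G, H = g.sum(), h.sum()
    parent_score = G**2 / (H + reg_lambda)
    best_gain, best_split = 0.0, None
    n, m = X.shape
    for feature in range(m):
        order = np.argsort(X[:, feature])
        G_L = H_L = 0.0
        for rank in range(n - 1):
            i = order[rank]
            G_L += g[i]
            H_L += h[i]
            # Only consider splitting between two distinct feature values.
            if X[order[rank], feature] == X[order[rank + 1], feature]:
                continue
            G_R, H_R = G - G_L, H - H_L
            gain = 0.5 * (G_L**2 / (H_L + reg_lambda)
                          + G_R**2 / (H_R + reg_lambda)
                          - parent_score) - gamma
            if gain > best_gain:
                threshold = 0.5 * (X[order[rank], feature]
                                   + X[order[rank + 1], feature])
                best_gain, best_split = gain, (feature, threshold)
    return best_gain, best_split
```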
In the following sections we will derive more explicit formulas for the gradient and Hessian of the loss function for different types of problems.
XGBoost for Regression
In regression problems, the most commonly used loss function is the squared loss:
$$L\big(y_i, F(x_i)\big) = \frac{1}{2}\big(y_i - F(x_i)\big)^2$$
Its first derivative with respect to the predicted value of the previous ensemble is:
$$g_i = \frac{\partial L}{\partial F_{k-1}(x_i)} = F_{k-1}(x_i) - y_i$$
And its second derivative is:
$$h_i = \frac{\partial^2 L}{\partial F_{k-1}(x_i)^2} = 1$$
Therefore, the optimal output value for leaf j in this case is:
$$w_j^* = -\frac{G_j}{H_j + \lambda} = \frac{\sum_{i \in I_j}\big(y_i - F_{k-1}(x_i)\big)}{|I_j| + \lambda}$$
And the contribution of this leaf to the reduction in the loss is:
$$-\frac{1}{2}\,\frac{G_j^2}{H_j + \lambda} = -\frac{\Big(\sum_{i \in I_j}\big(y_i - F_{k-1}(x_i)\big)\Big)^2}{2\,\big(|I_j| + \lambda\big)}$$
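A quick numerical check of the regression formulas: with squared loss and λ = 0, the optimal leaf weight reduces to the mean residual of the samples in the leaf. The residual values below are hypothetical, chosen only for illustration:

```python
import numpy as np

residuals = np.array([0.5, -0.2, 0.1])   # y_i - F_{k-1}(x_i), hypothetical values
G = -residuals.sum()                      # since g_i = F_{k-1}(x_i) - y_i
H = float(len(residuals))                 # h_i = 1 for every sample
w_star = -G / H
print(w_star, residuals.mean())           # both print 0.1333...
```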
XGBoost for Classification
In binary classification, we use log loss as the loss function:
$$L\big(y_i, F(x_i)\big) = -\Big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\Big]$$
where pᵢ is the probability predicted by the previous ensemble:
$$p_i = \sigma\big(F_{k-1}(x_i)\big) = \frac{1}{1 + e^{-F_{k-1}(x_i)}}$$
We have already found the first and second order derivatives of log loss in the article on gradient boosting (see the section "Gradient Tree Boosting for Classification"):
$$g_i = p_i - y_i, \qquad h_i = p_i\,(1 - p_i)$$
Therefore, the optimal output value for leaf j is:
$$w_j^* = -\frac{G_j}{H_j + \lambda} = \frac{\sum_{i \in I_j}(y_i - p_i)}{\sum_{i \in I_j} p_i\,(1 - p_i) + \lambda}$$
Note that except for the λ term, this is the same formula we used to find the output values of the leaves in gradient boosting.
The contribution of this leaf to the reduction in the loss is:
$$-\frac{1}{2}\,\frac{G_j^2}{H_j + \lambda} = -\frac{\Big(\sum_{i \in I_j}(y_i - p_i)\Big)^2}{2\Big(\sum_{i \in I_j} p_i\,(1 - p_i) + \lambda\Big)}$$
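And a small numerical check of the classification formulas, with hypothetical raw scores and labels:

```python
import numpy as np

raw_scores = np.array([0.0, 0.0, 0.0])    # F_{k-1}(x_i), hypothetical values
y = np.array([1, 0, 1])
p = 1.0 / (1.0 + np.exp(-raw_scores))     # previously predicted probabilities
g, h = p - y, p * (1.0 - p)               # g_i = p_i - y_i, h_i = p_i (1 - p_i)
reg_lambda = 0.0
w_star = -g.sum() / (h.sum() + reg_lambda)
print(g, h, w_star)                       # g = [-0.5, 0.5, -0.5], h = [0.25, ...], w* ≈ 0.667
```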
Demonstration on a Toy Dataset
For illustration, we are going to use the same data set that we used to illustrate the classical, unextreme gradient boosting algorithm:

The goal in this data set is to predict whether a customer will buy a given product based on three attributes: the customer’s age, level of income (Low, Medium or High) and level of education (High School or College).
To solve this problem we will build an ensemble of XGBoost trees with a maximum depth of 2, and a learning rate of η = 0.5. To keep the example simple, the regularization parameters will be set to 0 (λ = 0 and γ = 0).
First, we initialize the model with a constant value, which is the log odds of the positive class:

Next, we calculate the gradients and Hessians of the samples in the training set:

We now build the first XGBoost tree. We start by finding the best split for the root node. The sums of the gradients and Hessians at the root node are:
$$G = \sum_{i=1}^{6} g_i = 0, \qquad H = \sum_{i=1}^{6} h_i = 1.333$$
Next, we compute the gain that can be achieved by every possible split in every feature. We start from the two categorical attributes:

For the Age attribute, we first sort the gradients and Hessians according to the age value:
Ages (sorted) = [22, 25, 28, 30, 35, 40]
Gradients (sorted) = [-0.333, -0.333, 0.667, -0.333, 0.667, -0.333]
Hessians (sorted) = [0.222, 0.222, 0.222, 0.222, 0.222, 0.222]
We now consider each middle point between two consecutive ages as a candidate split point:


The split with the highest gain is Age < 26.5. Therefore, the first level of the tree looks as follows:

The gradients and the Hessians of the two samples at the left child node are exactly the same, so there is no need to split it further.
We now need to find the best split for the right child node. The sums of the gradients and Hessians of the samples at this node (samples 2, 4, 5, 6) are G = 0.666 and H = 0.888 (as shown inside the node). We consider all the possible splits of these samples:


In this case we have multiple candidate splits that lead to the maximum gain (1.5). Let’s choose arbitrarily the split Income = Medium. The resulting tree is:

Next, we compute the optimal output values (weights) for the leaf nodes. The weight of the leftmost leaf is:

Similarly, the weights of the other two leaves are:

Thus, we get the following predictions from the first XGBoost tree:

We now scale these predictions by the learning rate and add them to the predictions of the previous iteration. We then use the new predictions to calculate the gradients and Hessians for the next iteration:

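In code, this update step looks as follows (the array values here are hypothetical placeholders rather than the toy data set itself):

```python
import numpy as np

eta = 0.5
raw_scores = np.array([0.7, 0.7, 0.7, 0.7])      # F_{k-1}(x_i), hypothetical
tree_outputs = np.array([-1.5, 1.0, -1.5, 1.0])  # w_{q(x_i)} from the new tree, hypothetical
raw_scores = raw_scores + eta * tree_outputs     # F_k(x_i)

# Recompute the gradients and Hessians of the log loss for the next iteration.
y = np.array([1, 0, 1, 0])                       # hypothetical labels
p = 1.0 / (1.0 + np.exp(-raw_scores))
g, h = p - y, p * (1.0 - p)
```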
We now build a second XGBoost tree. Following the same process as in the previous iteration, we get the following tree (verify that this is indeed the correct tree!):

We now scale the predictions of the second tree by the learning rate and add them to the predictions of the previous ensemble:

We can see that after only two boosting iterations, our ensemble correctly classifies all the samples in the training set!
The keen reader may have noticed that the resulting ensemble is exactly the same ensemble we have obtained with the classical, unextreme gradient boosting algorithm. This is not surprising, since in classification problems the classical algorithm also uses a second-order approximation to optimize the log loss function. The advantage of XGBoost over the classical algorithm is that it allows us to incorporate a regularization coefficient as part of the objective function (which we did not take advantage of in this example).
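If you want to cross-check such a manual walkthrough against the library itself, a sketch along the following lines fits two boosting rounds with matching hyperparameters. The feature encoding and values below are hypothetical placeholders, not the exact toy table above, and version-dependent defaults (such as the initial base score) may cause small numerical differences; also note that min_child_weight is lowered, since its default of 1 would block splits whose Hessian sums are this small:

```python
import numpy as np
import xgboost as xgb

# Hypothetical stand-in for a small tabular data set: age plus two
# integer-encoded categorical attributes, and binary labels.
X = np.array([[22, 0, 1],
              [25, 1, 0],
              [28, 2, 1],
              [30, 0, 0],
              [35, 1, 1],
              [40, 2, 0]], dtype=float)
y = np.array([1, 1, 0, 1, 0, 1])

model = xgb.XGBClassifier(
    n_estimators=2,        # two boosting rounds
    max_depth=2,
    learning_rate=0.5,
    reg_lambda=0.0,
    gamma=0.0,
    min_child_weight=0.0,  # allow splits despite tiny Hessian sums
    objective="binary:logistic",
)
model.fit(X, y)
print(model.predict_proba(X)[:, 1])
```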
Final Notes
All the images are by the author unless stated otherwise.
Thanks for reading!
References
[1] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
[2] Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data?. Advances in Neural Information Processing Systems, 35: 507–520.
[3] Shwartz-Ziv, R., & Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81: 84–90.