XGBoost Mathematics Explained

A walkthrough of the Gradient Boosted Trees algorithm’s maths

6 min readNov 11, 2018

1. Introduction

XGBoost (https://github.com/dmlc/xgboost) is one of the most popular and efficient implementations of the Gradient Boosted Trees algorithm, a supervised learning method that is based on function approximation by optimizing specific loss functions as well as applying several regularization techniques.

The original XGBoost paper can be found here: https://arxiv.org/pdf/1603.02754.pdf

The purpose of this post is to explain the mathematics of some critical parts of the paper as well as to give some insights.

XGBoost has a strong mathematical background

2. XGBoost objective function

The objective function (loss function and regularization) at iteration t that we need to minimize is the following:

It is easy to see that the XGBoost objective is a function of functions (i.e. l is a function of CART learners, a sum of the current and previous additive trees), and as the authors refer in the paper [2] “cannot be optimized using traditional optimization methods in Euclidean space”.

3. Taylor’s Theorem and Gradient Boosted Trees

From the reference [1] we can see as an example that the best linear approximation for a function f(x) at point a is:

Taylor linear approximation of a function around a point a

Why we need to use the Taylor approximation?
Because we need to transform the original objective function to a function in the Euclidean domain, in order to be able to use traditional optimization techniques.

Explanation of the above answer:
Let’s take the simplest linear approximation of a function f(x) from [1]:

Initial function can be written as function of Δx only

The “trick” here is that we can transform a function f(x) to a simplest function of Δx around a specific point a by using Taylor’s theorem.

To understand better, remember that before the Taylor approximation the x in objective function f(x) was the sum of t CART trees and after this it becomes a function of the current tree (step t) only.
Note that the objective function must be differentiable.

In our case f(x) is the loss function l, while a is the previous step (t-1) predicted value and Δx is the new learner we need to add in step t.

Using the above in each iteration t we can write the objective (loss) function as a simple function of the new added learner and thus to apply Euclidean space optimization techniques.

As we already said, a is the prediction at step (t-1) while (x-a) is the new learner that we need to add in step (t) in order to greedily minimize the objective.

So, if we decide to take the second-order Taylor approximation, we have:

XGBoost objective using second-order Taylor approximation

Where:

First and second order gradient statistics of the loss function

Finally, if we remove the constant parts, we have the following simplified objective to minimize at step t:

The above is a sum of simple quadratic functions of one variable and can be minimized by using known techniques, so our next goal is to find a learner that minimizes the loss function at iteration t.

Minimizing a simple quadratic function

4. How to build the next learner

Being at iteration t we need to build a learner that achieves the maximum possible reduction of loss, the following questions arrive:

Is it possible to find the optimal next learner?
Is there any way to calculate the gain (loss reduction) after adding a specific learner?

The good news is that there is a way to “measure the quality of a tree structure q” as the authors refer, and the scoring function is the following (see the correspondence to the “simple quadratic function solution” above):

The tree learner structure q scoring function

While the bad news is that it is impossible in terms of required calculations to “enumerate all the possible tree structures q” and thus find the one with maximum loss reduction.

Note that the “quality scoring function” above returns the minimum loss value for a given tree structure, meaning that the original loss function is evaluated by using the optimal weight values. So, for any given tree structure we have a way to calculate the optimal weights in leaves.

In practice what we do in order to build the learner is to:

Start with single root (contains all the training examples)
Iterate over all features and values per feature, and evaluate each possible split loss reduction:
gain = loss(father instances) - (loss(left branch)+loss(right branch))
The gain for the best split must be positive (and > min_split_gain parameter), otherwise we must stop growing the branch.

The above algorithm is called the “Exact Greedy Algorithm” and its complexity is O(n*m) where n is the number of training samples and m is the features dimension.

5. Binary classification with log loss optimization

Let’s take the case of binary classification and log loss objective function:

Binary classification with Cross Entropy loss function

,where y is the real label in {0,1} and p is the probability score.

Note that p (score or pseudo-probability) is calculated after applying the famous sigmoid function into the output of the GBT model x.
The output x of the model is the sum across the the CART tree learners.

So, in order to minimize the log loss objective function we need to find its 1st and 2nd derivatives (gradient and hessian) with respect to x.
In this stats.stackexchange post you can find that gradient = (p-y) and hessian = p*(1-p).

Summarizing the GBT model which — remember — is a sum of CART (tree) learners will try to minimize the log loss objective and the scores at leaves which are actually the weights have a meaning as a sum across all the trees of the model and are always adjusted in order to minimize the loss. This is why we need to apply the sigmoid function in the output of GBT models in case of binary classification probability scoring.