
In this article, we are going to see how an ensemble of decision trees trained with Gradient Boosting libraries like XGBoost, LightGBM, and CatBoost performs multiclass classification.
Indeed, an ensemble of decision trees associates a real value to a set of features, so the question is: how do decision tree ensembles transform a scalar value into a multiclass label?
Understanding the underlying workings of classification using this kind of model is crucial, as it impacts performance.
We will work our way into the subject progressively, following the plan below:
- Reminder and toy example of binary classification in Python
- First binary classification using XGBoost as a regressor
- Second binary classification using XGBoost as a classifier
- Multiclass classification using XGBoost
The versatility of Decision Tree based Ensemble Models
XGBoost, LightGBM, or CatBoost are libraries that share (by default) the same kind of underlying model: decision trees.
These decision trees are combined iteratively, using Gradient Boosting, i.e. new trees are added to the current ensemble so that a non-linear objective, usually the squared error, is optimized. To handle the non-linearity, the objective is linearized using its gradient and Hessian.
Hence the name Gradient Boosting. More detail in my previous paper:
Predicting with ensemble of decision trees
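As a quick reminder of what this linearization looks like (standard gradient boosting notation, not taken from the article linked above): when a new tree f_t is added, the objective that is actually minimized is approximately

∑_i [ g_i f_t(x_i) + 1/2 h_i f_t(x_i)² ] + Ω(f_t)

where g_i and h_i are the gradient and the second derivative of the loss with respect to the current prediction for sample i, and Ω is a regularization term.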
As a reminder, the prediction process is relatively simple: given a row of data, each decision tree of the ensemble is traversed.
Depending on the values of the features, each tree then outputs a single value, the one attached to the leaf that is reached.
The individual predictions of the trees are then simply summed up to give the overall prediction.
The figure below illustrates this in a simple example, where an ensemble of decision trees models the identity function for an integer between 1 and 4:

For instance, when the input is 1, the first tree generates 8, the second tree -6, and the last one -1. Summing these three values gives 1, which is the expected output.
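As a minimal illustration of this summation in Python (only the per-tree values quoted above for an input of 1 are used; the rest of the figure is not reproduced):

# Outputs of the three trees for the input 1, as in the figure above
tree_outputs = [8, -6, -1]

# The ensemble prediction is simply the sum of the individual tree outputs
prediction = sum(tree_outputs)
print(prediction)  # 1, i.e. the identity of the input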
The example in the figure above is extracted from my book on gradient boosting, Practical Gradient Boosting:
Practical Gradient Boosting: A deep dive into Gradient Boosting in Python
Using a single scalar value, the best we can do is perform a binary classification, labelling negative predictions with one class, and positive ones with the other one.
Binary classification without XGBoost
Before exploring this first option, i.e. binary classification with XGBoost as a regressor, let’s show in detail how binary classification is done.
The problem we are trying to solve here is simple: we want to estimate a student’s probability of success depending on the number of hours they spend studying the subject.
The figure below shows the data collected, i.e. the number of hours of work and the result: pass or fail.

The standard model used for classification is the logistic function. This function is similar to linear regression, except that instead of taking values in ℝ, it generates only values in the range [0, 1]. Its formula is worth knowing:
f(x) = 1 / (1 + e^(-(x - m)))

where m is the parameter of the model.
As always in Machine Learning, finding the best parameters for a model, here the logistic function, is done by minimizing an error. Facing a binary problem, where the positive output can be modelled by a 1 and the negative output by a 0, it’s possible to combine both errors in a single expression:
E(m) = ∑_k [ y_k (1 - f(x_k)) + (1 - y_k) f(x_k) ]
where the y_k are the observed samples and the f(x_k) are the predictions made by the model f.
The difficulty with this basic error is that, due to the binary nature of the logistic function, which mainly takes only two values, zero and one, the error with respect to the model parameter m will also mainly take two values. Hence, outside the vicinity of the optimal parameter, the error will be flat.

We could use this formula, and it would work, as long as we provide a pretty good estimate of the optimal parameter. If that’s not the case, we risk ending up in the flat zone where the error is almost constant. In this area, the gradient will be almost zero, and the convergence of the steepest descent will be agonizingly slow.
We need a way to map the error of a given sample, which is limited to the range [0, 1], onto ℝ+ so that there is no more saturation, with the additional constraint that a null error must remain a null error after the transformation.
The trick is to realize that log(1) is zero, whereas log(0) is –∞.
Therefore the log-loss is used:
E(m) = -∑_k [ y_k log(f(x_k)) + (1 - y_k) log(1 - f(x_k)) ]
where the y_k are the observed samples and the f(x_k) are the predictions made by the model f. Note the minus sign in front of the sum, and the swap of 1 - f(x_k) and f(x_k) compared to the previous expression. This is because log(1) = 0.
Using the log loss, the error is no longer saturated:

The simplest way to minimize this error is to use the steepest descent, which only requires computing the gradient of the error. Many options are possible to do that; here we are going to use symbolic differentiation with sympy:
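The original snippet is not reproduced here; a minimal sketch of the approach, assuming the single-parameter logistic model f(x) = 1 / (1 + e^(-(x - m))) and purely illustrative study-hours data, could look like this:

import sympy as sp

# Illustrative data (hypothetical): hours studied and the outcome (1 = pass, 0 = fail)
hours = [4, 6, 8, 10, 12, 14, 16, 18, 20, 22]
passed = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Single-parameter logistic model: f(x) = 1 / (1 + exp(-(x - m)))
x, m = sp.symbols("x m")
f = 1 / (1 + sp.exp(-(x - m)))

# Log loss over the whole dataset
loss = -sum(
    y * sp.log(f.subs(x, h)) + (1 - y) * sp.log(1 - f.subs(x, h))
    for h, y in zip(hours, passed)
)

# Symbolic gradient with respect to the parameter m, turned into a numerical function
grad = sp.lambdify(m, sp.diff(loss, m))

# Plain steepest descent
m_val = 0.0
for _ in range(1000):
    m_val -= 0.1 * grad(m_val)

print(m_val)  # the estimated threshold, around the middle of the synthetic data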
The algorithm finds the expected value, 14.77, which is very close to the theoretical one.
Let’s now go back to our subject, binary classification with decision trees and gradient boosting.
Binary classification with XGBoost
Let’s start with a simple example, using the Cleveland Heart Disease Dataset (CC BY 4.0), where the classification is done using regression. As we are performing a binary classification, it is possible to use a simple regression, attaching a positive value, 1.0, to positive labels, and a negative value, -1.0, to negative labels:
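The original code is not shown here; a minimal sketch of the idea, where the file name heart_disease.csv and the column name target are placeholders to be adapted to your copy of the dataset, could be:

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Placeholder file and column names
df = pd.read_csv("heart_disease.csv")
X = df.drop(columns=["target"])
# Encode the two classes as -1.0 and 1.0 so that the sign carries the label
y = df["target"].map({0: -1.0, 1: 1.0})

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A plain regressor, which by default minimizes the squared error
model = xgb.XGBRegressor()
model.fit(X_train, y_train)

# Recover the class from the sign of the raw prediction
pred = np.where(model.predict(X_test) >= 0, 1.0, -1.0)
print(confusion_matrix(y_test, pred))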
The default error used by XGBoost is the squared error. The predictions are rounded, and as the confusion matrix shows, the model makes its predictions without error.
A similar result can be achieved by directly using an XGBoost classifier:
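Again, the original code is not reproduced; a sketch reusing the train/test split from the previous example, with the labels mapped back to 0/1 as required by the classifier, could be:

import xgboost as xgb
from sklearn.metrics import confusion_matrix

# XGBClassifier expects labels in {0, 1}
y_train_01 = (y_train > 0).astype(int)
y_test_01 = (y_test > 0).astype(int)

clf = xgb.XGBClassifier()
clf.fit(X_train, y_train_01)

# predict() directly returns the class, no rounding needed
print(confusion_matrix(y_test_01, clf.predict(X_test)))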
In this case, there is no need to round the predictions to get the corresponding class. All the work is done natively by the XGBClassifier. Let’s see how XGBoost handles that.
XGBClassifier trains multiple models
When you are doing classification with XGBoost, using the XGBClassifier (or xgb.train with the right parameters for classification), XGBoost does in fact train multiple models, one for each class.
The snippet of code below shows how to get more insight into the internals of XGBoost.
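The author’s original snippet is not reproduced here; a sketch along the same lines, reusing the clf and X_test defined above, could be:

import numpy as np

# One probability per class and per sample
proba = clf.predict_proba(X_test)
print(proba.shape)  # (n_samples, n_classes)

# The predicted class is the index of the highest probability
print(np.array_equal(np.argmax(proba, axis=1), clf.predict(X_test)))  # should print True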
More specifically, the predict_proba method gives access to the raw data generated by the internal models. This clearly reveals that, when doing classification, XGBoost makes a probability prediction for each class.
The predicted class is then the one with the highest probability.
Looking at the code that integrates XGBoost into sklearn, we get confirmation that XGBoost makes multiple predictions:
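The library source is not reproduced here; the snippet below is only a paraphrase of the decoding logic described next, not the actual wrapper code:

import numpy as np

def decode_predictions(raw_output, objective):
    # Paraphrase of the sklearn wrapper's decoding, for illustration only
    if objective == "multi:softprob":
        # One probability per class: pick the index of the most probable one
        return np.argmax(raw_output, axis=1)
    if objective == "multi:softmax":
        # The booster already outputs the class index, as a float: just cast it
        return raw_output.astype(int)
    raise ValueError(f"unsupported objective: {objective}")

print(decode_predictions(np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]), "multi:softprob"))  # [1 0]
print(decode_predictions(np.array([2.0, 0.0]), "multi:softmax"))                           # [2 0]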
As can be seen, argmax is used to retrieve the index of the class with the highest probability when softprob is used. In the case where the objective used is softmax, the prediction is simply cast into integers.
How does XGBoost perform multiclass classification?
Usually, the explanations regarding how XGBoost handles multiclass classification state that it trains multiple trees, one for each class.
This is not exactly the case. In fact, all the trees are constructed at the same time, using a vector objective function instead of a scalar one, i.e. there is an objective for each class.
The XGBoost documentation gives an example of such an objective:
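The documentation’s listing is not reproduced verbatim here; the sketch below is written in its spirit, i.e. a custom multiclass objective for xgb.train, with the gradient and a variable named hess computed from the softmax (the flattening of the returned arrays may need adjusting depending on the XGBoost version):

import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softprob_obj(predt, dtrain):
    labels = dtrain.get_label()
    n_rows = labels.shape[0]
    n_classes = predt.size // n_rows
    predt = predt.reshape(n_rows, n_classes)

    grad = np.zeros((n_rows, n_classes))
    hess = np.zeros((n_rows, n_classes))
    eps = 1e-6

    for r in range(n_rows):
        target = int(labels[r])
        p = softmax(predt[r, :])
        for c in range(n_classes):
            # Gradient of the log loss of the softmax (not of the softmax itself)
            grad[r, c] = p[c] - 1.0 if c == target else p[c]
            # Diagonal second derivative, floored for numerical stability
            hess[r, c] = max(p[c] * (1.0 - p[c]), eps)

    # Flattened row-major: one value per (row, class) pair
    return grad.reshape(-1), hess.reshape(-1)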
There are two very interesting things in this snippet of code:
- The objective name is multi:softprob when using the integrated objective in XGBoost. This is quite confusing, as the aim is not really the softprob, but the log loss of the softmax. This appears clearly in the code, as the gradient is directly the softmax. But the softmax is not the gradient of the softmax; it is the gradient of its log loss:
∂/∂z_c [ -∑_j y_j log(softmax(z)_j) ] = softmax(z)_c - y_c
- The other point is that the code uses a variable hess that stands for the Hessian. However, this is not really the Hessian that is used, mathematically speaking, but the second derivative. Hence the right name for this would be the Laplacian.
∂²/∂z_c² [ -∑_j y_j log(softmax(z)_j) ] = softmax(z)_c (1 - softmax(z)_c)
Conclusion
We have shown in this paper how classification is handled by XGBoost.
It is crucial to understand that classifying n classes generates an ensemble of trees n times more complex.
It is also important to notice that the names of the objective functions exposed in the XGBoost API are not always very explicit. For instance, when doing classification, the objective optimized is not the softmax or the softprob, but their log loss.
If you want to get more detail on Gradient Boosting methods, please have a look at my book:
Practical Gradient Boosting: A deep dive into Gradient Boosting in Python