
XGBoost: Cardinality, the crucial HyperParameter that is always under-considered


Photo by Patrice Bouchard on Unsplash

Update: discover my new book on Gradient Boosting, Practical Gradient Boosting. It’s a deep dive into Gradient Boosting with many examples in python.

Practical Gradient Boosting: A deep dive into Gradient Boosting in Python

When dealing with hyperparameter tuning, most of the attention is focused on overfitting and on using the right regularisation parameters to ensure that the model is not overlearning.

There is, however, another question that is very important to ask: what is the cardinality of the prediction space? Put another way, how many distinct values can XGBoost, and decision trees more generally, predict?


XGBoost and boosted trees are discrete models

When using XGBoost, CatBoost, or LightGBM, it's absolutely crucial to remember that all these libraries rely on decision trees, whose leaves contain only constant values (unless you have enabled linear trees in LightGBM; see my other article on the subject).

Simply put, this means that a given model can only predict a discrete number of values. Other models like Gaussian Processes or basic linear models for instance can predict an infinite number of values.

The code and the plot below illustrate this:
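A minimal sketch of such an experiment, assuming scikit-learn and matplotlib, with GradientBoostingRegressor standing in for any boosted-tree library:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# A simple 1D regression problem: y is a smooth function of x
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# Fit a small boosted-tree ensemble and, for contrast, a linear model
gbt = GradientBoostingRegressor(n_estimators=5, max_depth=2).fit(X, y)
lin = LinearRegression().fit(X, y)

# Predict on a dense grid: the boosted trees yield a step function
# (a finite set of distinct values), the linear model a continuum
grid = np.linspace(0, 10, 1000).reshape(-1, 1)
plt.scatter(X, y, s=8, alpha=0.3, label="data")
plt.plot(grid, gbt.predict(grid), label="gradient boosted trees (discrete)")
plt.plot(grid, lin.predict(grid), label="linear model (continuous)")
plt.legend()
plt.show()

# The number of distinct predicted values is finite
print("distinct predictions:", np.unique(gbt.predict(grid)).size)
```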

There is no ambiguity regarding the discrete nature of Gradient Boosted Trees, as shown in the figure generated by the code above:

Discrete nature of Gradient Boosted Trees. Plot by the author.

Gradient Boosted Trees can only predict a finite, discrete set of values.

Why care about cardinality?

Cardinality is important for two reasons:

  • It’s directly linked to the cardinality of the set of values you need to predict. Say, for instance, that you need to predict the income for n types of jobs across m countries. If your model can predict fewer than n*m distinct values, it is likely to be unable to predict incomes correctly: the set of possible predictions is not large enough to capture the complexity of reality.
  • It’s a good indicator of overfitting. If your model’s cardinality is much higher than the cardinality of your target set, the odds are that your model is overfitting (as sketched in the code below).
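As a quick check of the second point, one can compare the number of distinct values a fitted model produces with the number of distinct target values. A minimal sketch, assuming scikit-learn and a hypothetical job/country income setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical setup: incomes for n job types x m countries,
# with jobs and countries encoded as integers
rng = np.random.default_rng(0)
n_jobs, m_countries = 10, 20
X = np.array([(j, c) for j in range(n_jobs) for c in range(m_countries)],
             dtype=float)
y = rng.uniform(20_000, 120_000, len(X)).round(-3)  # roughly n*m incomes

model = GradientBoostingRegressor(n_estimators=20, max_depth=3).fit(X, y)

model_card = np.unique(model.predict(X)).size
target_card = np.unique(y).size
print(f"model cardinality: {model_card}, target cardinality: {target_card}")
# If model_card < target_card, the model cannot represent every income;
# if it is much higher, the model may be fitting noise.
```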

Computing the cardinality of Gradient Boosted trees

Computing the cardinality of the predictions that a Gradient Boosted Tree model can generate is not easy: it depends on the structure of each tree and on the decisions taken at each node.

In the case of a simple ensemble with only one estimator, the computation is straightforward: there are as many possible predictions as there are leaves. If the tree is complete, i.e. if every branch has reached the maximum depth, the number of leaves is 2^max_depth.
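A minimal sketch of this count, assuming scikit-learn, whose estimators_ attribute exposes the underlying decision tree; its leaves are the nodes without children:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (500, 3))
y = X.sum(axis=1) + rng.normal(0, 0.1, 500)

max_depth = 4
model = GradientBoostingRegressor(n_estimators=1, max_depth=max_depth).fit(X, y)

# In scikit-learn, leaf nodes are marked by children_left == -1
tree = model.estimators_[0, 0].tree_
n_leaves = int((tree.children_left == -1).sum())
print(n_leaves, "leaves <=", 2 ** max_depth)  # never more than 2^max_depth
```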

When there are n estimators, theoretically speaking, each possible prediction of the first tree can be updated by any of the corrections brought by the second tree, which can, in turn, be updated by any of the corrections brought by the third tree, and so on and so forth. Hence, theoretically speaking, a Gradient Boosted Tree model can generate up to

(2^max_depth)^n_estimators = 2^(max_depth × n_estimators) distinct values

Formula by the author

In practice, this value is an upper bound, as the trees are not independent of one another for a given set of features.

Indeed, consider a leaf of the first tree: it is reached by a Cartesian product of feature intervals. Consequently, when the features vary within these intervals, the response of the first tree is always the same value.

The odds that every leaf of the other trees can be reached while the features vary within these ranges are low. This explains why the formula above is clearly an upper bound: only a subset of the leaf combinations is reachable.
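One way to see the gap between the theoretical maximum and reality is to probe a fitted model on a dense grid and count the distinct predictions. A minimal sketch, assuming scikit-learn:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)

n_estimators, max_depth = 10, 3
model = GradientBoostingRegressor(
    n_estimators=n_estimators, max_depth=max_depth
).fit(X, y)

# Probe the prediction space on a dense grid
grid = np.linspace(0, 10, 100_000).reshape(-1, 1)
observed = np.unique(model.predict(grid)).size
upper_bound = 2 ** (max_depth * n_estimators)
print(f"observed: {observed}, theoretical max: {upper_bound}")
# The observed cardinality is typically far below 2^(max_depth * n_estimators)
```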

Driving Gradient Boosted Tree cardinality

The hyperparameters to tune in order to control model cardinality are the same ones that drive overfitting. Based on the explanation in the previous section, we see that the basic n_estimators and max_depth can be used to control cardinality.

gamma and lambda, as they control node splitting, can also be used. If you are using LightGBM, it’s even simpler, as there is a parameter, num_leaves, that defines the maximum number of leaves for a given estimator, whatever the depth.
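A minimal sketch of how the two basic parameters drive cardinality, again assuming scikit-learn (the exact counts depend on the data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 300)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 300)
grid = np.linspace(0, 10, 100_000).reshape(-1, 1)

# Cardinality grows with both n_estimators and max_depth
for n_estimators in (1, 5, 20):
    for max_depth in (1, 3, 5):
        model = GradientBoostingRegressor(
            n_estimators=n_estimators, max_depth=max_depth
        ).fit(X, y)
        card = np.unique(model.predict(grid)).size
        print(f"n_estimators={n_estimators:2d} max_depth={max_depth} "
              f"-> cardinality {card}")
```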


Conclusion

Cardinality is an essential characteristic of a model based on Gradient Boosted Trees and must be analyzed in relation to the cardinality of the values to predict.

This is a direct consequence of the discrete nature of Gradient Boosted Trees.

It’s important to ensure that the trained model has enough freedom to generate a distinct value for each different case it will encounter at prediction time.

