XGBoost: its Genealogy, its Architectural Features, and its Innovation

Sequential and Parallel Architectures All in One

Michio Suginoo
Towards Data Science


INTRODUCTION

XGBoost (eXtreme Gradient Boosting) is a powerful learning algorithm that has outperformed many conventional Machine Learning algorithms in many past competitions.

In a nutshell, XGBoost combines sequential and parallel architectures in one: while it is a sequential learning algorithm (additive strategy), it incorporates parallel computation into its architecture to enhance system efficiency.


This post is an introductory overview of XGBoost for beginners and serves as a one-stop article that gives you the big picture, if not the details, of XGBoost: its genealogy, its architectural features, and its innovative features. At the end, I will also suggest a short list of supplemental resources so that readers can explore the topics covered in this post in more detail.

Now, let’s start.

In order to understand the features of XGBoost, we can start with a quick overview of its genealogy.

DISCUSSION

A. Genealogy of XGBoost

From a top-down perspective, XGBoost is a sub-class of Supervised Machine Learning. And, as its name suggests, XGBoost is an advanced variant of the Boosting Machine, which, like Random Forest, is a sub-class of Tree-based Ensemble algorithms.

Nevertheless, the Boosting Machine is fundamentally different from Random Forest in how it operates its learning process.

Boosting Machine

Random Forest runs multiple independent decision trees in parallel and combines their results by averaging them. This approach, which relies on random bootstrap sampling, is often called bagging. In this sense, Random Forest is a parallel learning algorithm.

On the contrary, the Boosting Machine uses an additive strategy: that is, to “add one new tree at a time” (xgboost developers, 2022). The Boosting Machine runs individual weak/simple decision trees, called base learners, in sequence. Simply put, the Boosting Machine is conceptually built on a sequential learning architecture.

In this sense, the Boosting Machine learns in sequence, while Random Forest learns in parallel.
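
To make the additive strategy concrete, here is a minimal, hypothetical Python sketch (toy data; scikit-learn's DecisionTreeRegressor as the base learner): each new tree is fit to the residuals left by the trees already in the ensemble, whereas a bagging ensemble such as Random Forest would fit every tree independently on a bootstrap sample.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Additive (boosting-style) strategy: add one shallow tree at a time,
# each one fit to the residuals of the current ensemble prediction.
learning_rate = 0.1
prediction = np.zeros_like(y)
trees = []
for _ in range(50):
    residual = y - prediction                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # shrink and add the new tree
    trees.append(tree)

# In contrast, a bagging ensemble such as Random Forest would fit each tree
# independently on a bootstrap sample and simply average the predictions.
```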

As a reference on the Boosting Machine, here is an MIT lecture on Boosting: https://www.youtube.com/watch?v=UHBmv7qCey4

That said, to avoid confusion, I should add a footnote here from the perspective of system optimization. XGBoost is also designed to perform parallel computation to make efficient use of computational resources (xgboost developers, n.d.). Overall, XGBoost, while inheriting a sequential learning architecture from the Boosting Machine, performs parallel computations for System Optimization.
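
As a small illustration (toy data; parameter names as documented in the xgboost Python package), the nthread parameter caps how many CPU threads XGBoost may use for its internal parallel work, such as split finding across features. This is a sketch, not a tuned configuration.

```python
import numpy as np
import xgboost as xgb

# Toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=1_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "nthread": 4,       # CPU threads XGBoost may use for its internal parallel work
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```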

Gradient Boosting Machine

As its name suggests, XGBoost (eXtreme Gradient Boosting) is an advanced variant of the Gradient Boosting Machine (GBM), a family member of the Boosting Machine.

As a part of its additive strategy, the Gradient Boosting Machine (GBM) uses Gradient Descent for optimization. To reduce the computational burden, GBM approximates the Objective Function using the first-order term of the Taylor expansion and ignores any higher-order terms for its learning optimization. In other words, it uses the first derivative (Gradient) of the Objective Function (Loss Function) to determine the next weak learner. In this way, Gradient Boosting, while retaining the existing weak predictors, adds a new predictor on top of them to reduce the current error and incrementally improve performance. (Friedman, 2000)
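
In symbols (standard notation, not taken from the original post): at boosting round t, the loss of the ensemble with the new learner f_t added is approximated by its first-order Taylor expansion around the current prediction, so the new tree is effectively fit against the negative gradients g_i.

```latex
% First-order (gradient) view of the additive step at round t:
% \hat{y}_i^{(t-1)} is the current ensemble prediction, f_t the new tree.
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right)
\;\approx\; \sum_{i=1}^{n} \left[\, l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i\, f_t(x_i) \right],
\qquad g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\left(y_i, \hat{y}_i^{(t-1)}\right).
```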

B. Algorithmic Features of XGBoost

Newton Boosting Machine

XGBoost extends the idea of Gradient Boosting in the sense that it also uses the second derivative (Hessian: curvature) of the Objective Function, in addition to its first derivative (Gradient), to further optimize its learning process. The method is called the Newton-Raphson method, and a Boosting Machine that uses the Newton-Raphson method is called Newton Boosting. For further discussion of the difference between Gradient Descent and Newton Boosting, you can read the paper Gradient and Newton Boosting for Classification and Regression by Fabio Sigrist.

Because of the specific architecture of the additive strategy, the second-order approximation yields multiple beneficial mathematical properties that streamline the algorithm for further computational efficiency. (Chen & Guestrin, 2016)
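
Concretely, in the notation of Chen & Guestrin (2016), the regularized objective at round t is expanded to second order, keeping both the gradient g_i and the Hessian h_i; for a tree with T leaves and regularization term Ω, this yields a closed-form optimal weight for each leaf.

```latex
% Second-order (Newton) approximation of the regularized objective at round t
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),
\qquad
g_i = \partial_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right),\quad
h_i = \partial^{2}_{\hat{y}^{(t-1)}} l\left(y_i, \hat{y}^{(t-1)}\right).

% With \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2},
% the optimal weight of leaf j (with instance set I_j) is
w_j^{*} = -\,\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}.
```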

Regularization: to address Variance-Bias Trade-off

Jerome Friedman, the architect of the Gradient Boosting Machine (Friedman, 2000), articulated the importance of regularization in addressing the bias-variance trade-off, the problem of the underfitting-overfitting trade-off, specifically recommending that users tune three meta-parameters of the Gradient Boosting Machine: the number of iterations, the learning rate, and the number of terminal nodes/leaves. (Friedman, 2000, pp. 1203, 1214–1215)

In this context, XGBoost inherited the regularization focus of the Gradient Boosting Machine and extended it further, as illustrated in the sketch after the list below.

  • First, XGBoost enables the users to tune various hyperparameters to constrain the trees: e.g. the number of trees, the depth of an individual tree, the minimum sum of instance weights for a partition, the maximum number of boosting rounds, and the number of nodes/leaves.
  • Second, it allows the users to apply a learning rate (shrinkage) during the learning process. (Chen & Guestrin, 2016, p. 3)
  • Third, it enables the users to use random sampling techniques such as column sub-sampling. (Chen & Guestrin, 2016, p. 3)
  • Fourth, it enables the users to tune L1 and L2 regularization terms.
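
Below is a hedged sketch of how these four groups of knobs map onto parameters of the native XGBoost API (parameter names as documented in the xgboost package; the toy data and values are placeholders, not recommendations).

```python
import numpy as np
import xgboost as xgb

# Toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "binary:logistic",
    # 1) tree constraints
    "max_depth": 4,            # depth of an individual tree
    "min_child_weight": 5,     # minimum sum of instance weights required for a partition
    # 2) shrinkage
    "eta": 0.1,                # learning rate
    # 3) random sampling
    "colsample_bytree": 0.8,   # column sub-sampling per tree
    "subsample": 0.8,          # row sub-sampling per boosting round
    # 4) L1 / L2 regularization on leaf weights
    "alpha": 0.5,              # L1 term
    "lambda": 1.0,             # L2 term
}
# The number of trees (boosting rounds) is set here.
booster = xgb.train(params, dtrain, num_boost_round=200)
```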

C. Innovations

Sparsity-aware Algorithm and Weighted Quantile Sketch

More importantly, XGBoost introduced two innovations: Sparsity-aware Algorithm and Weighted Quantile Sketch. (Chen & Guestrin, 2016, p. 10)

First, XGBoost has a built-in feature called the default direction. This feature captures the pattern of the sparse data structure and determines the direction of the split at each node based on that pattern. Chen & Guestrin present three typical causes of sparsity:

“1) presence of missing values in the data; 2) frequent zero entries in the statistics; and, 3) artifacts of feature engineering such as one-hot encoding.” (Chen & Guestrin, 2016)

In principle, this feature makes XGBoost a sparsity-aware algorithm that can handle missing data: the user does not need to impute missing values.
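
A minimal sketch of this behavior, with hypothetical data: rows containing NaN can be passed to XGBoost directly, and the learned default direction routes the missing values at each split.

```python
import numpy as np
import xgboost as xgb

# Toy data with missing values, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X[rng.random(X.shape) < 0.2] = np.nan    # roughly 20% of the entries are missing
y = (np.nan_to_num(X[:, 0]) > 0).astype(int)

# No imputation required: NaN entries are treated as missing, and each split
# learns a default direction to send them to.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    dtrain, num_boost_round=50)
```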

While default direction determines the direction of the split, weighted quantile sketch proposes candidate split points. The following excerpt from Chen and Guestrin’s paper summarizes what it is.

“a novel distributed weighted quantile sketch algorithm … can handle weighted data with a provable theoretical guarantee. The general idea is to propose a data structure that supports merge and prune operations, with each operation proven to maintain a certain accuracy level.” (Chen & Guestrin, 2016)
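
The quantile sketch comes into play when XGBoost proposes candidate splits approximately rather than scanning every feature value. As a hedged illustration (toy data; tree_method is a documented parameter of the xgboost package), switching to the approximate method makes split candidates come from quantiles of each feature:

```python
import numpy as np
import xgboost as xgb

# Toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=10_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    # Approximate split finding: candidate split points are proposed from
    # (weighted) quantiles of each feature instead of scanning every value.
    "tree_method": "approx",
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```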

System Optimization: Efficiency and Scalability

So far, we have looked at the framework of XGBoost from the perspective of its learning-algorithm architecture. Now, we can view it from the perspective of System Optimization.

The native XGBoost API is also innovative in its pursuit of computational efficiency, or system optimization. The API carries eXtreme (X) in its name because XGBoost aims to let users exploit the eXtreme limit of the given system's computational capacity by efficiently allocating computation tasks among the available computational resources: processors (CPU, GPU), memory, and out-of-core disk space, through cache-aware access, block data compression, and sharding. (databricks, 2017)
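
Most of these optimizations (cache-aware access, block compression, sharding) work under the hood, but a few user-facing switches expose them. The sketch below is illustrative only; exact option names vary across xgboost versions, so treat the comments as assumptions to verify against the documentation of your installed version.

```python
import numpy as np
import xgboost as xgb

# Toy data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 16))
y = rng.normal(size=5_000)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "objective": "reg:squarederror",
    "tree_method": "hist",  # cache-friendly, histogram-based split finding
    # Leaving "nthread" unset lets XGBoost use all available CPU cores.
    # Recent versions also accept "device": "cuda" to train on a GPU
    # (older versions used tree_method="gpu_hist" instead).
}
booster = xgb.train(params, dtrain, num_boost_round=100)
```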

For more about the innovative aspects of the native XGBoost API, here is a great piece by the inventors of XGBoost (Chen & Guestrin): XGBoost: A Scalable Tree Boosting System.

CONCLUSION

This quick overview of XGBoost went over its genealogy, its architectural features, and its innovation without getting into details.

In a nutshell, XGBoost has a sequential-parallel hybrid architecture in the sense that it inherits its sequential learning architecture from its Boosting Machine genealogy while, at the same time, incorporating parallel computation into its architecture to enhance system efficiency.

Since the Boosting Machine has a tendency to overfit, the native XGBoost API places an intense focus on addressing the bias-variance trade-off and makes it easy for users to apply a variety of regularization techniques through hyperparameter tuning.

If you are interested in an implementation example of the native XGBoost API, you can read another of my posts, Pair-Wise Hyperparameter Tuning with the Native XGBoost API.

Thanks for reading this post.

Suggested External Resources

For those who want to explore more details of XGBoost, here is a short list of my favorite resources about the algorithm:

ACKNOWLEDGEMENT

I would like to express my gratitude to TDS's editing team, especially Katherine Prairie, for their invaluable editing advice during the editing stage.
