The world’s leading publication for data science, AI, and ML professionals.

XGBoost: The Definitive Guide (Part 2)

Implementation of the XGBoost algorithm in Python from scratch

Image by StockSnap from Pixabay
Image by StockSnap on Pixabay

In the previous article we discussed the XGBoost algorithm and showed its implementation in pseudocode. In this article we are going to implement the algorithm in Python from scratch.

The provided code is a concise and lightweight implementation of the XGBoost algorithm (with only about 300 lines of code), intended to demonstrate its core functionality. As such, it is not optimized for speed or memory usage, and does not include the full spectrum of options provided by the XGBoost library (see https://xgboost.readthedocs.io/ for more details on the features of the library). More specifically:

  1. The code is written in pure Python, whereas the core of the XGBoost library is written in C++ (its Python classes are only thin wrappers over the C++ implementation).
  2. It does not include various optimizations that allow XGBoost to deal with huge amounts of data, such as weighted quantile sketch, out-of-core tree learning, and parallel and distributed processing of the data. These optimizations will be discussed in more detail in the next article in the series.
  3. The implementation currently supports only regression and binary classification tasks, whereas the XGBoost library also supports multi-class classification and ranking problems.
  4. The implementation provides only a small subset of the hyperparameters that exist in the XGBoost library. Specifically, it supports the following hyperparameters:
  • _nestimators (default = 100): the number of regression trees in the ensemble (which is also the number of boosting iterations).
  • _maxdepth (default = 6): the maximum depth (number of levels) of each tree.
  • _learningrate (default = 0.3): the step size shrinkage applied to the trees.
  • _reglambda (default = 1): L2 regularization term applied to the weights of the leaves.
  • gamma (default = 0): minimum loss reduction required to split a given node.

For consistency, I have kept the same names and default values of these hyperparameters as they are defined in the XGBoost library.

Despite the above limitations, the results obtained by the provided implementation are comparable to those obtained by the XGBoost library on a number of benchmark data sets (as will be demonstrated later in the article).

The Code Structure

The source code consists of five classes:

  1. XGBNode represents a single node in the regression tree.
  2. XGBTree represents a regression tree in the ensemble.
  3. XGBBaseModel is a base estimator for the XGBoost models.
  4. XGBRegressor is a subclass of XGBBaseModel that provides an estimator for regression tasks.
  5. XGBClassifier is a subclass of XGBBaseModel that provides an estimator for binary classification tasks.

For better readability and easy maintenance, each class is written in its own module (.py) file.

The complete source code can be found on my github: https://github.com/roiyeho/medium/tree/main/xgboost/xgboost_from_scratch

We will now go through each piece of the code step-by-step.

The XGBNode Class

This class represents a node in the regression tree. It implements the core functionality of the algorithm for finding the best split at a given node.

In the constructor of the class, we initialize the node to be a non-leaf node, and set the pointers to its left and right child nodes to be None:

import numpy as np

class XGBNode:
    """A node object that recursively builds itself to construct a regression tree
    """
    def __init__(self):
        self.is_leaf: bool = False
        self.left_child: XGBNode = None
        self.right_child: XGBNode = None

The build() method recursively builds the node until one of the following stopping criteria is reached:

  1. There is only one sample left at the node.
  2. The current level of the tree has reached the maximum depth.
  3. The best gain (reduction in the loss) we can obtain from splitting the node is less than gamma.
def build(self, X, grads, hessians, curr_depth, max_depth, reg_lambda, gamma):
    """Recursively build the node until a stopping criterion is reached
    """
    if len(X) == 1 or curr_depth >= max_depth:
        # Stopping criterion (1): there is only one sample left at this node
        # Stopping criterion (2): max depth of the tree has been reached 
        self.is_leaf = True
        self.weight = self.calc_leaf_weight(grads, hessians, reg_lambda)
        return

    best_gain, best_split = self.find_best_split(X, grads, hessians, reg_lambda)

    if best_gain < gamma:
        # Stopping criterion (3): the best gain is less than the minimum split gain 
        self.is_leaf = True
        self.weight = self.calc_leaf_weight(grads, hessians, reg_lambda)
        return        
    else:
        # Split the node according to the best split found
        feature_idx, threshold, left_samples_idx, right_samples_idx = best_split

        self.split_feature_idx = feature_idx
        self.split_threshold = threshold

        self.left_child = XGBNode()
        self.left_child.build(X[left_samples_idx],
                            grads[left_samples_idx],
                            hessians[left_samples_idx],
                            curr_depth + 1,
                            max_depth, reg_lambda, gamma)

        self.right_child = XGBNode()
        self.right_child.build(X[right_samples_idx],
                            grads[right_samples_idx],
                            hessians[right_samples_idx],
                            curr_depth + 1,
                            max_depth, reg_lambda, gamma)

If one of the stopping criteria has been reached, we turn the current node into a leaf node and then calculate its weight (score) by calling the helper method _calc_leafweight():

def calc_leaf_weight(self, grads, hessians, reg_lambda):
    """Calculate the optimal weight of this leaf node (eq.(5) in [1])
    """
    return -np.sum(grads) / (np.sum(hessians) + reg_lambda)

On the other hand, if the node can be split, we call the helper method _find_bestsplit(), which iterates over every possible split and returns the split with the highest gain:

def find_best_split(self, X, grads, hessians, reg_lambda):
    """Scan through every feature and find the best split point (Algorithm 1 in [1])
    """
    G = np.sum(grads)
    H = np.sum(hessians)

    best_gain = float('-inf')   
    best_split = None

    # Iterate over all the possible features
    for j in range(X.shape[1]):
        G_left, H_left = 0, 0

        # Sort the samples according to their value in the current feature
        sorted_samples_idx = np.argsort(X[:, j])

        # Calculate the gain of every possible split point
        for i in range(X.shape[0] - 1):   
            G_left += grads[sorted_samples_idx[i]]
            H_left += hessians[sorted_samples_idx[i]]

            G_right = G - G_left
            H_right = H - H_left
            curr_gain = self.calc_split_gain(G, H, G_left, H_left, G_right, H_right, reg_lambda)

            if curr_gain > best_gain:
                # Update the properties of the best split
                best_gain = curr_gain     
                feature_idx = j 
                threshold = X[sorted_samples_idx[i]][j]
                left_samples_idx = sorted_samples_idx[:i + 1]
                right_samples_idx = sorted_samples_idx[i + 1:]
                best_split = (feature_idx, threshold, left_samples_idx, right_samples_idx)

    return best_gain, best_split

The _find_bestsplit() function uses the following helper function to compute the gain obtained by a given split:

def calc_split_gain(self, G, H, G_left, H_left, G_right, H_right, reg_lambda):
    """Compute the loss reduction (eq. (7) in [1])
    """
    def calc_term(g, h):
        return g**2 / (h + reg_lambda)

    gain = calc_term(G_left, H_left) + calc_term(G_right, H_right) - calc_term(G, H)
    return 0.5 * gain

Lastly, the predict() method returns the score of a given sample according to the weight of the leaf to which it is mapped to by the tree. It uses the best split thresholds saved in the tree nodes in order to determine which path in the tree to follow.

def predict(self, x):
    """Return the score of a given sample x
    """
    if self.is_leaf:
        return self.weight
    else:
        if x[self.split_feature_idx] <= self.split_threshold:
            return self.left_child.predict(x)
        else:
            return self.right_child.predict(x)

The XGBTree Class

This class represents a single regression tree. In the constructor of the class we initialize the root node of the tree to None:

from xgb_node import XGBNode

class XGBTree:
    """A single tree object that will be used for Gradient Boosting
    """
    def __init__(self):
        self.root: XGBNode = None

In the build() method, we simply call the build() method of the root node and pass to it the hyperparameters of the training algorithm:

def build(self, X, grads, hessians, max_depth, reg_lambda, gamma):
    """Recursively build the root node of the tree 
    """
    self.root = XGBNode()
    curr_depth = 0
    self.root.build(X, grads, hessians, curr_depth, max_depth, reg_lambda, gamma)

Lastly, the predict() method returns the score of a given sample by delegating the call to the predict() method of the root node:

def predict(self, x):
    """Return the score of a given sample x
    """
    return self.root.predict(x)

The XGBBaseModel Class

We now build the base class for the Xgboost estimators. To make this class compatible with the Scikit-Learn estimator API, we make it a subclass of BaseEstimator (from sklearn.base) and also implement the fit() and predict() methods. This will allow us to integrate this estimator with other Scikit-Learn mechanisms such as pipelines and grid search.

We first import the required libraries:

import numpy as np

from sklearn.base import BaseEstimator
from xgb_tree import XGBTree
from typing import List
from abc import ABC, abstractmethod

In the constructor of the class, we initialize the hyperparameters of the model:

class XGBBaseModel(ABC, BaseEstimator):
    """Base class for the XGBoost estimators
    """
    def __init__(
        self,
        n_estimators=100,     # The number of trees (boosting rounds)
        max_depth=6,          # Maximum depth of a tree        
        learning_rate=0.3,    # Step size shrinkage applied to the leaf weights
        reg_lambda=1,         # L2 regularization term on the leaf weights
        gamma=0,              # Minimum loss reduction required to split a node
        verbose=0             # Verbosity of the log messages (change to 1 for debug mode)
    ):
        self.n_estimators = n_estimators        
        self.max_depth = max_depth       
        self.learning_rate = learning_rate
        self.reg_lambda = reg_lambda
        self.gamma = gamma
        self.verbose = verbose

The fit() method builds an ensemble of XGBoost trees for the given data set:

def fit(self, X, y):
    """Build an ensemble of trees for the given training set
    """
    # Get the initial prediction of the ensemble
    self.base_pred = self.get_base_prediction(y)

    self.estimators: List[XGBTree] = []
    for i in range(self.n_estimators):
        # Compute the first and second order gradients of the loss function with respect
        # to the predictions of the current ensemble
        out = self.get_output_values(X)
        grads = self.calc_gradients(y, out)
        hessians = self.calc_hessians(y, out)

        # Add a new tree to the ensemble
        tree = XGBTree()
        tree.build(X, grads, hessians, self.max_depth, self.reg_lambda, self.gamma)
        self.estimators.append(tree)

        if self.verbose and i % 10 == 0:
            print(f'Boosting iteration {i}')
    return self

The fit() method uses a helper method called _get_outputvalues() to get the predictions of the current ensemble for the given data set. Note that the output values are not necessarily the same as the final predicted labels of the model. For example, in classification tasks the output values of the ensemble are the predicted log odds and not the the class labels.

def get_output_values(self, X):
    """Return the predicted values of the ensemble for the given data set
    """
    # Initialize the output values with the base prediction
    output = np.full(X.shape[0], self.base_pred)

    # Add the predictions of the base trees scaled by the learning rate
    if len(self.estimators) > 0:
        for i in range(len(X)):            
            output[i] += np.sum(self.learning_rate * estimator.predict(X[i]) 
                                    for estimator in self.estimators)
    return output

Finally, we define several abstract methods that need to be implemented by the concrete estimators:

@abstractmethod
def get_base_prediction(self, y):
    """Return the initial prediction of the model"""
    pass

@abstractmethod
def calc_gradients(self, y, out):
    """Calculate the first order gradients""" 
    pass

@abstractmethod
def calc_hessians(self, y, out):
    """Calculate the second order gradients"""
    pass

@abstractmethod
def predict(self, X):
    """Return the final predicted labels for the given samples"""
    pass

The XGBRegressor Class

This is a subclass of XGBBaseModel that is used to solve regression tasks. To implement this class, we first import the required libraries and functions:

import numpy as np

from xgb_base_model import XGBBaseModel
from sklearn.metrics import r2_score

The constructor of the class simply passes the hyperparameters to the constructor of the base estimator:

class XGBRegressor(XGBBaseModel):
    """An XGBoost estimator for regression tasks
    """
    def __init__(
        self, 
        n_estimators=100, 
        max_depth=6,         
        learning_rate=0.3, 
        reg_lambda=1, 
        gamma=0,
        verbose=0
    ):
        super().__init__(n_estimators, max_depth, learning_rate, reg_lambda, gamma, verbose)

The base prediction of the regressor is simply the mean of the target values:

def get_base_prediction(self, y):
    # The initial prediction is the mean of the targets
    return np.mean(y)

The gradients and Hessians of the square loss are computed using the equations given in the previous article (see the section "XGBoost for Regression"):

def calc_gradients(self, y, out):
    # The first order gradients are twice the residuals
    grads = 2 * (out - y)
    return grads

def calc_hessians(self, y, out):
    # The second order gradients are equal to the constant 2
    hessians = np.full(len(y), 2)
    return hessians

The predict() method returns the output values from the ensemble (in regression tasks the predicted labels are the same as the output values from the ensemble):

def predict(self, X):
    # The predicted labels are the same as the output values
    y_pred = self.get_output_values(X)
    return y_pred

Lastly, we add a score() method to return the _R_² score of the model on a given data set:

def score(self, X, y):
    y_pred = self.predict(X)
    return r2_score(y, y_pred)

Regression Example

After all this hard work, we are now ready to test our XGBoostRegressor!

First, we will use the make_regression() function from Scikit-Learn to generate a synthetic data set. The default settings of this function generate a random linear regression data with 100 samples and 100 features, out of which only 10 are informative (i.e., correlated with the target).

from sklearn.datasets import make_regression
X, y = make_regression(random_state=0)

Note that tree-based models such as XGBoost do not require scaling or normalization of the features.

Next, we split the data set into 80% training and 20% test:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

For evaluating different models on this data set, let’s write a general utility function to train a given model, measure its training time and evaluate its performance both on the training and test sets:

def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    print(name)
    print('-' * 30)    
    train_start_time = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - train_start_time
    print(f'Training time: {elapsed:.3f} sec')

    train_score = model.score(X_train ,y_train)
    print(f'R2 score (train): {train_score:.4f}')
    test_score = model.score(X_test, y_test)
    print(f'R2 score (test): {test_score:.4f}')

Let’s start by evaluating our implementation of XGBRegressor on the data set. We will use the default hyperparameter settings defined in the class.

from xgb_regressor import XGBRegressor

my_xgb_reg = XGBRegressor()
evaluate_model('XGBoost (custom)', my_xgb_reg, X_train, y_train, X_test, y_test)

The results we get are:

XGBoost (custom)
------------------------------
Training time: 14.522 sec
R2 score (train): 1.0000
R2 score (test): 0.3218

Our model obtains a perfect _R_² score on the training set but a relatively low score on the test set.

Let’s compare its performance to the XGBClassifier from the xgboost library. First, make sure that you have the library installed by running the following command:

pip install xgboost

We can now create an instance from the XGBRegressor class defined in this package. Since it has the same name as our class, we will use its fully qualified name (including the module name):

import xgboost

original_xgb_reg = xgboost.XGBRegressor()
evaluate_model('XGBoost (original)', original_xgb_reg, X_train, y_train, X_test, y_test)

The results we get are:

XGBoost (original)
------------------------------
Training time: 1.074 sec
R2 score (train): 1.0000
R2 score (test): 0.3077

Surprisingly, our model achieves a slightly better _R_² test score than the original XGBoost! However, its training time is much slower (about 14 times slower). This is due to the fact that we have not implemented any runtime optimizations (and our code is written in Python rather than C++).

For completeness, let’s also compare these models to the classical gradient boosting and histogram-based gradient boosting provided by Scikit-Learn:

from sklearn.ensemble import GradientBoostingRegressor
gboost_reg = GradientBoostingRegressor(random_state=0)
evaluate_model('Gradient boosting', gboost_reg, X_train, y_train, X_test, y_test)

from sklearn.ensemble import HistGradientBoostingRegressor
hist_gboost_reg = HistGradientBoostingRegressor(random_state=0)
evaluate_model('Histogram-based gradient boosting', hist_gboost_reg, X_train, y_train, X_test, y_test)
Gradient boosting
-----------------------------------
Training time: 0.129 sec
R2 score (train): 1.0000
R2 score (test): 0.4376

Histogram-based gradient boosting
-----------------------------------
Training time: 0.188 sec
R2 score (train): 0.9756
R2 score (test): 0.3984

In this case, the best test score is obtained by the classical gradient boosting algorithm. Note that all models have been used with their default hyperparameters, thus these results may change after hyperparameter tuning.


Let’s also evaluate our implementation on a real-world data set, namely the California housing data set, available from Scikit-Learn. This time we will write the evaluation code a bit more succinctly by defining all the models in a list and then calling the evaluation function inside a loop:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
names = ['XGBoost (custom)', 'XGBoost (original)', 'Gradient boosting', 'Histogram-based gradient boosting', 
         'Linear regression']
models = [XGBRegressor(), 
          xgboost.XGBRegressor(), 
          GradientBoostingRegressor(random_state=0), 
          HistGradientBoostingRegressor(random_state=0),
          LinearRegression()]

for name, model in zip(names, models):
    evaluate_model(name, model, X_train, y_train, X_test, y_test)

We get the following results:

XGBoost (custom)
-----------------------------------
Training time: 434.099 sec
R2 score (train): 0.9026
R2 score (test): 0.8253

XGBoost (original)
-----------------------------------
Training time: 1.003 sec
R2 score (train): 0.9396
R2 score (test): 0.8360

Gradient boosting
-----------------------------------
Training time: 2.918 sec
R2 score (train): 0.8027
R2 score (test): 0.7774

Histogram-based gradient boosting
-----------------------------------
Training time: 0.975 sec
R2 score (train): 0.8801
R2 score (test): 0.8381

Linear regression
-----------------------------------
Training time: 0.007 sec
R2 score (train): 0.6089
R2 score (test): 0.5943

The test _R_² score of our model is slightly worse than the score of the XGBoost library (0.825 compared to 0.836), but is significantly higher than the score of the classical gradient boosting algorithm (0.777). As expected, the training time of our model is much longer compared to the other models.

The XGBClassifier Class

This is also a subclass of XGBBaseModel, which is used to solve classification tasks (only binary classification is currently supported). To implement this class, we first import the required libraries and functions:

import numpy as np

from xgb_base_model import XGBBaseModel
from sklearn.metrics import accuracy_score

The constructor of the class simply passes the hyperparameters to the constructor of the base estimator:

class XGBClassifier(XGBBaseModel):
    """An XGBoost estimator for binary classification tasks
    """
    def __init__(
        self, 
        n_estimators=100, 
        max_depth=6,         
        learning_rate=0.3, 
        reg_lambda=1, 
        gamma=0,
        verbose=0
    ):
        super().__init__(n_estimators, max_depth, learning_rate, reg_lambda, gamma, verbose)

The base prediction of the classifier is the log odds of the positive class:

def get_base_prediction(self, y):
    # The initial prediction is the log odds of the positive class
    prob = np.sum(y == 1) / len(y)
    return np.log(prob / (1 - prob))

To compute the gradients of log loss we will need the sigmoid function, so let’s implement it first:

def sigmoid(self, x):
    return 1 / (1 + np.exp(-x))

The gradients and Hessians of the log loss are computed using the equations given in the previous article (see the section "XGBoost for Classification"):

def calc_gradients(self, y, out):
    # The first order gradients are the residuals between the predicted probabilities 
    # (sigmoid of the log odds) and the true labels
    prob = self.sigmoid(out)    
    grads = prob - y
    return grads

def calc_hessians(self, y, out):
    # The second order gradients are p(1 - p) where p is the predicted probability
    prob = self.sigmoid(out)
    hessians = prob * (1 - prob)
    return hessians

Similar to other probabilistic classifiers in Scikit-Learn (such as LogisticRegression), our classifier will provide both _predictproba() and predict() methods.

The method _predictproba() returns the predicted probabilities of the positive class for the given samples. These probabilities are computed as the sigmoid of the log odds returned from the ensemble:

def predict_proba(self, X):
    # The predicted probability is the sigmoid of the log odds
    log_odds = self.get_output_values(X)
    prob = self.sigmoid(log_odds)
    return prob

The method predict() returns class labels for the samples in the given data set, using 0.5 as the threshold for assigning a sample to the positive class:

def predict(self, X):
    # Assign a label of 1 if the probability of the positive class is > 0.5,
    # otherwise assign a label of 0
    prob = self.predict_proba(X)
    y_pred = np.where(prob > 0.5, 1, 0)
    return y_pred

Lastly, we add a score() method that returns the accuracy of the model on a given data set:

def score(self, X, y):
    y_pred = self.predict(X)
    return accuracy_score(y, y_pred)

Classification Example

For evaluating our XGBoostClassifier, we will use the make_classification() function from Scikit-Learn. This function creates clusters of normally distributed points (100 points by default) around vertices of an n-dimensional hypercube (n = 2 by default) with sides of length defined by a hyperparameter called _classsep (equal to 1.0 by default).

First, we will use the function to generate 1,000 sample points with a class separation of 0.1:

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, class_sep=0.1, random_state=0)

Similar to the regression example, we will write a general utility function to evaluate a given model and measure its training time:

def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    print(name)
    print('-' * 35)
    train_start_time = time.time()
    model.fit(X_train, y_train)
    elapsed = time.time() - train_start_time
    print(f'Training time: {elapsed:.3f} sec')

    train_score = model.score(X_train ,y_train)
    print(f'Accuracy (train): {train_score:.4f}')
    test_score = model.score(X_test, y_test)
    print(f'Accuracy (test): {test_score:.4f}')
    print()

We will compare the following algorithms:

names = ['XGBoost (custom)', 'XGBoost (original)', 'Gradient boosting', 'Histogram-based gradient boosting', 
         'Logistic regression']
models = [XGBClassifier(), 
          xgboost.XGBClassifier(), 
          GradientBoostingClassifier(random_state=0), 
          HistGradientBoostingClassifier(random_state=0),
          LogisticRegression()]

for name, model in zip(names, models):
    evaluate_model(name, model, X_train, y_train, X_test, y_test)

The results we get are:

XGBoost (custom)
-----------------------------------
Training time: 39.782 sec
Accuracy (train): 1.0000
Accuracy (test): 0.6900

XGBoost (original)
-----------------------------------
Training time: 1.111 sec
Accuracy (train): 1.0000
Accuracy (test): 0.6850

Gradient boosting
-----------------------------------
Training time: 0.409 sec
Accuracy (train): 0.9213
Accuracy (test): 0.6550

Histogram-based gradient boosting
-----------------------------------
Training time: 0.810 sec
Accuracy (train): 1.0000
Accuracy (test): 0.6600

Logistic regression
-----------------------------------
Training time: 0.007 sec
Accuracy (train): 0.5962
Accuracy (test): 0.5300

Our model achieves the highest accuracy on the test set! On the down side, it is significantly slower than the other models (about 36 times slower than the XGBoost library).


Exercises

The provided implementation can be improved and extended in many ways. For example, you can try to add the following features to the codebase:

  1. Add support for multi-class classification.
  2. Add support for categorical features.
  3. Implement early stopping, i.e., stop the boosting once the score on the validation set does not improve for a specified number of rounds.
  4. Add support for subsampling, i.e., train each tree only on a random subset of the training data.
  5. Add additional hyperparameters, such as _min_classweight that defines the minimum sum of instance weight (Hessian) needed in a child node.

In the next article we will discuss the various hardware and software optimizations implemented by the XGBoost library that allow it to run so fast.

Thanks for reading!

References

[1] Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.

California housing data set info: Citation: Pace, R. Kelley and Ronald Barry (1997), Sparse Spatial Autoregressions, Statistics and Probability Letters, 33, 291–297. License: Creative Commons CC0: Public Domain.


Related Articles