
Gradient Boosting: To Early Stop or Not To Early Stop

How early stopping halves training time for models like LightGBM, XGBoost and CatBoost.

Leveraging early stopping for LightGBM, XGBoost, and CatBoost

Photo by Julian Berengar Sölter

Gradient-boosted decision trees (GBDTs) currently outperform deep learning in tabular-data problems, with popular implementations such as LightGBM, XGBoost, and CatBoost dominating Kaggle competitions [1]. Early stopping, a popular technique in deep learning, can also be used when training and tuning GBDTs. However, it is common to see practitioners explicitly tune the number of trees in GBDT ensembles instead of using early stopping. In this article, I show that early stopping can halve training time while maintaining the same performance as explicitly tuning the number of trees.

By reducing training time, early stopping can lower computational costs and decrease practitioner downtime while waiting for models to run. Such savings are of utmost value in industries with large-scale GBDT applications, such as content recommendation, financial fraud detection, or credit scoring. But how does early stopping reduce training time without harming performance? Let’s dive in.

Gradient-Boosted Decision Trees

Gradient-boosted decision trees (GBDTs) currently achieve state-of-the-art performance in classification and regression problems based on (heterogeneous) tabular data (two-dimensional datasets with diverse column types). Deep learning techniques, although performant in natural language processing and computer vision, are yet to steal the crown in the tabular data domain [2, 3, 4, 5].

Gradient-boosted decision trees (GBDTs).

GBDTs work by sequentially adding decision trees to an ensemble. Unlike with random forests, trees in GBDTs are not independent. Instead, they are trained to correct the mistakes of previous trees. As such, given enough trees, a GBDT model can achieve perfect performance in the training set. Nevertheless, this behavior – referred to as overfitting – is known to harm the model’s ability to generalize to unseen data.
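To make the "each tree corrects the previous ones" idea concrete, here is a minimal sketch of boosting with squared-error loss on a one-dimensional toy problem. It is not how LightGBM, XGBoost, or CatBoost are implemented: the depth-1 trees (stumps), the helper names `fit_stump` and `boost`, the toy data, and the learning rate are all assumptions made for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals by least squares.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Return the training MSE after each boosting round."""
    preds = [0.0] * len(xs)
    history = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual,
        # so each new stump is trained on what the ensemble still gets wrong.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 0.2, 1.1, 0.0, 1.0, 0.3, 0.8]  # noisy toy targets
train_mse = boost(xs, ys, n_trees=200)
```

In this sketch the training error never increases from one round to the next, so the ensemble keeps fitting the noise in `ys` for as long as trees are added. That is why the number of trees needs to be controlled in some way.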

Hyperparameter Tuning and Early Stopping

To optimize the degree of fitting to the training data, practitioners tune several key hyperparameters, such as the number of trees, the learning rate, and the maximum depth of each tree. To find the optimal set of values, several configurations are evaluated on a separate validation dataset; the model performing best on this holdout data is chosen as the final model.

Another tool that helps fight overfitting is early stopping. Common in deep learning, early stopping is a technique where the learning process is halted if the performance on holdout data is not improving. In GBDTs, this implies not building more trees beyond that point.
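The stopping rule can be sketched in a few lines of plain Python. This is an illustration rather than any library's implementation: the function name `early_stopping_round`, the toy loss values, and the tolerance for non-improving rounds (commonly called patience) are assumptions.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, rounds_trained) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds fail to improve on the
    best loss seen so far; the model would then be rolled back to `best_round`.
    Rounds are numbered from 1.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx  # stopped early
    return best_round, len(val_losses)  # budget exhausted without stopping


# Validation loss after each added tree (made-up numbers).
losses = [0.70, 0.55, 0.48, 0.45, 0.46, 0.47, 0.45, 0.49, 0.50, 0.52]
best_round, rounds_trained = early_stopping_round(losses, patience=3)
# best_round == 4, rounds_trained == 7
```

Here the loss bottoms out at round 4, three further rounds fail to beat it, and training stops at round 7 instead of running all 10 rounds.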

Early stopping halts training at the point where loss in the validation set stops decreasing.

Although ubiquitous in deep learning, early stopping is not as popular among GBDT users. Instead, it is common to see practitioners tune the number of trees through the aforementioned search process. But what if using early stopping amounts to the same as explicitly tuning the number of trees? After all, both mechanisms aim to find the optimal size of the GBDT ensemble, given the learning rate and other hyperparameters. If that were the case, it could mean that the same performance could be achieved at greatly reduced search time by using early stopping, since it halts the training of time-consuming, unpromising iterations. Let’s test this hypothesis.

Experimental Setup

To this end, with the authors’ permission, I use the public bank-account-fraud dataset recently published at NeurIPS ’22 [6]. It consists of a synthetic replica of a real fraud-detection dataset, having been generated by a privacy-preserving GAN. For an implementation of GBDTs, I opt for LightGBM for its speed and state-of-the-art performance [1, 7]. All the code used in this experiment can be found in this Kaggle notebook.

As mentioned above, to find the optimal set of hyperparameters, the most common approach is to experiment with several configurations. Ultimately, the model that performs best in the validation set is chosen as the final model. I follow this approach, randomly sampling hyperparameters from sensible distributions at each iteration.
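In its simplest form, random search looks like the sketch below. Everything in it is a stand-in chosen for illustration: the two hyperparameters, their ranges, the helper names, and especially `toy_validation_loss`, which replaces the expensive step of training a model and scoring it on the validation set.

```python
import math
import random


def sample_config(rng):
    """Draw one hypothetical configuration; the ranges are illustrative only."""
    return {
        # Log-uniform: sample the exponent uniformly, then exponentiate.
        "learning_rate": math.exp(rng.uniform(math.log(0.01), math.log(0.5))),
        "max_depth": rng.randint(1, 15),  # inclusive on both ends
    }


def random_search(validation_loss, n_trials, seed=0):
    """Return (loss, config) for the sampled configuration with the lowest loss."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(rng)
        trials.append((validation_loss(config), config))
    return min(trials, key=lambda trial: trial[0])


def toy_validation_loss(config):
    """Stand-in for 'train a model, then score it on the validation set'."""
    return (math.log10(config["learning_rate"]) + 1.0) ** 2 + 0.01 * config["max_depth"]


best_loss, best_config = random_search(toy_validation_loss, n_trials=50)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude, which is the usual choice for parameters whose sensible values span several decades.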

To test my hypothesis, I run two parallel random search processes:

  1. Without early stopping, the number of trees is sampled uniformly between 10 and 4000.
  2. With early stopping, the maximum number of trees is set to 4000, but the final number is determined by the early stopping criterion. Early stopping monitors cross-entropy loss in the validation set. Training is only halted after 100 non-improving iterations (the patience parameter), at which point the model is reset to its best iteration.

The following function is used to run each random search trial within an Optuna study (truncated for clarity; full version in the aforementioned notebook):

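import lightgbm as lgb


def _objective(t, dtrain, dval, early_stopping):
    # Hyperparameters sampled at random for this trial (t is an Optuna trial).
    params = {
        'boosting_type': t.suggest_categorical('boosting_type', ['gbdt', 'goss']),
        'learning_rate': t.suggest_float('learning_rate', 0.01, 0.5, log=True),
        'min_split_gain': t.suggest_float('min_split_gain', 0.00001, 2, log=True),
        'num_leaves': t.suggest_int('num_leaves', 2, 1024, log=True),
        'max_depth': t.suggest_int('max_depth', 1, 15),
        'min_child_samples': t.suggest_int('min_child_samples', 2, 100, log=True),
        'bagging_freq': t.suggest_categorical('bagging_freq', [0, 1]),
        'pos_bagging_fraction': t.suggest_float('pos_bagging_fraction', 0, 1),
        'neg_bagging_fraction': t.suggest_float('neg_bagging_fraction', 0, 1),
        'reg_alpha': t.suggest_float('reg_alpha', 0.00001, 0.1, log=True),
        'reg_lambda': t.suggest_float('reg_lambda', 0.00001, 0.1, log=True),
    }
    # With early stopping, 4000 is only an upper bound on the number of trees;
    # without it, the number of trees is sampled like any other hyperparameter.
    num_boost_round = (
        4000 if early_stopping
        else t.suggest_int('num_boost_rounds', 10, 4000)
    )
    model = lgb.train(
        params, dtrain,
        num_boost_round=num_boost_round,
        valid_sets=[dval] if early_stopping else None,
        callbacks=(
            [lgb.early_stopping(stopping_rounds=100)] if early_stopping
            else None
        ),
    )
    # Truncated here; see the notebook for the rest of the function.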


Since early stopping monitors performance on the validation set, all models are evaluated on an unseen test set, thus avoiding biased results.

Results on the test set. Bottom 20% of trials removed for visualization clarity.

To early stop or not to early stop? Both approaches achieve similar results. This outcome is consistent both when measuring cross-entropy loss (the metric monitored by early stopping) and recall at 5% FPR (a binary classification metric especially relevant in this dataset’s domain) [6]. On the first criterion, the no-early-stopping strategy achieves marginally better results, whereas on the second criterion, it is the early-stopping strategy that has the edge.
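For readers unfamiliar with the second metric, the sketch below shows one common way to compute recall at a fixed false positive rate: try each score as the decision threshold and keep the highest recall whose FPR stays within the budget. The function name `recall_at_fpr` and the toy labels and scores are assumptions, and the notebook may compute the metric differently.

```python
def recall_at_fpr(labels, scores, max_fpr=0.05):
    """Highest recall achievable while keeping the false positive rate <= max_fpr.

    labels: 1 for positive (e.g. fraud), 0 for negative.
    scores: model outputs, where higher means more likely positive.
    Assumes at least one positive and one negative example.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_recall = 0.0
    # Use each distinct score as a threshold: predict positive if score >= threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        if fp / n_neg <= max_fpr:
            best_recall = max(best_recall, tp / n_pos)
    return best_recall


# 4 positives and 20 negatives (made-up scores): a 5% FPR budget allows 1 false positive.
labels = [1, 1, 1, 1] + [0] * 20
scores = [0.9, 0.8, 0.4, 0.2] + [0.85, 0.5] + [0.1] * 18
recall = recall_at_fpr(labels, scores)
# recall == 0.5: two of the four positives are caught before a second negative is flagged
```

Unlike cross-entropy, this metric only depends on how the model ranks examples near the top of the score distribution, which may explain why the two criteria can slightly disagree on which strategy is ahead.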

In sum, the results of this experiment fail to reject my hypothesis that there is no significant difference between employing early stopping and explicitly tuning the number of trees in GBDTs. Naturally, a more robust evaluation would require experimenting with several datasets, hyperparameter search spaces and random seeds.

Training Time

Part of my hypothesis was also that early stopping reduces average training time by stopping the addition of unpromising trees. Can a meaningful difference be measured?

Distribution of training time in seconds.

Results confirm the second part of my hypothesis: training times are substantially lower when using early stopping. Using this strategy, even with a high patience value of 100 iterations, roughly halves the average training time, from 122 seconds to 58 seconds. This implies a reduction of total training time from 3 hours and 23 minutes to 1 hour and 37 minutes.
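The totals can be checked with a few lines of arithmetic. The sketch assumes 100 trials per search; that number is not stated above and is inferred from the reported means and totals, so treat `N_TRIALS` as an assumption.

```python
N_TRIALS = 100  # assumption: inferred from the reported totals, not stated in the text


def format_duration(seconds):
    """Format a duration as hours and minutes, rounded to the nearest minute."""
    total_minutes = round(seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


# Average training time per trial, in seconds, as reported above.
mean_seconds = {"no early stopping": 122, "early stopping": 58}

totals = {name: format_duration(mean * N_TRIALS) for name, mean in mean_seconds.items()}
# totals == {"no early stopping": "3 h 23 min", "early stopping": "1 h 37 min"}

relative_saving = 1 - mean_seconds["early stopping"] / mean_seconds["no early stopping"]
# relative_saving is about 0.52, i.e. slightly more than half
```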

This reduction comes in spite of the additional computation required by the early stopping mechanism to monitor cross-entropy loss on the validation set, which is accounted for in the measurements presented above.


Gradient-boosted decision trees (GBDTs) are currently state of the art in problems involving tabular data. I find that using early stopping when training these models halves training time while maintaining the same performance as explicitly tuning the number of trees. This makes popular GBDT implementations like LightGBM, XGBoost, and CatBoost that much more powerful for applications in large industries, such as digital marketing and finance.

In the future, it would be important to corroborate the findings presented here in other datasets and across other GBDT implementations. Tuning the patience parameter could also prove beneficial, although its optimal value will likely vary for each dataset.

Except where otherwise noted, all images are by the author.


[1] H. Carlens, The State of Competitive Machine Learning in 2022, ML Contests (2023).
[2] Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko, Revisiting Deep Learning Models for Tabular Data, 35th Conference on Neural Information Processing Systems (NeurIPS 2021).
[3] R. Shwartz-Ziv and A. Armon, Tabular Data: Deep Learning is Not All You Need, Information Fusion 81 (2022): 84–90.
[4] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, Deep Neural Networks and Tabular Data: A Survey, IEEE Transactions on Neural Networks and Learning Systems (2022).
[5] L. Grinsztajn, E. Oyallon, and G. Varoquaux, Why do tree-based models still outperform deep learning on typical tabular data?, 36th Conference on Neural Information Processing Systems – Datasets and Benchmarks Track (NeurIPS 2022).
[6] S. Jesus, J. Pombal, D. Alves, A. Cruz, P. Saleiro, R. Ribeiro, J. Gama, and P. Bizarro, Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation, 36th Conference on Neural Information Processing Systems – Datasets and Benchmarks Track (NeurIPS 2022).
[7] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 31st Conference on Neural Information Processing Systems (NIPS 2017).
