Binary Classification: XGBoost Hyperparameter Tuning Scenarios by Non-exhaustive Grid Search and Cross-Validation

Practical example of balancing model performance and computational resource limitations — with code and visualization

Daniel J. TOTH
Towards Data Science


Introduction

XGBoost, or eXtreme Gradient Boosting, is one of the most widely used machine learning algorithms nowadays. It is famously effective at winning Kaggle competitions. Many articles praise it and address its advantages over alternative algorithms, so it is a must-have skill for anyone practicing machine learning.

Although XGBoost is relatively fast, running a script on a standard laptop can still be challenging: fitting a machine learning model usually involves hyperparameter tuning and, although not necessarily, cross-validation. Tuning may increase the computational resource demand exponentially if all combinations of hyperparameters are taken into account. Consequently, an exhaustive search for optimal parameters may not be a viable option.

This article is about finding a good balance between model performance and computational resource limitations by presenting alternative methods for hyperparameter tuning. I proudly own an Intel Core i5–8250U CPU, which ranks below the median of submitted benchmarks according to CPUBenchmark. For anyone working with similar limitations, my code might be interesting.

Technical Terms Overview

Although I take a practical point of view by sharing the code for the process, I would like to add an overview of the terms and parameter names used. Please skip this section if you are already familiar with them and proceed to the data set description and code results.

Autocorrelation
“In statistics, the autocorrelation of a real or complex random process is the Pearson correlation between values of the process at different times, as a function of the two times or of the time lag” [1]

Decision tree learning
“It uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves). Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels” [2]

Gradient boosting
“Gradient boosting is a machine learning technique for regression, classification and other tasks, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. When a decision tree is the weak learner, the resulting algorithm is called gradient boosted trees, which usually outperforms random forest. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function” [3] Note: XGBoost is distinguished from other gradient boosting techniques by its regularization mechanism to prevent overfitting.

Cross-validation
“Cross-validation combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance” by “partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set)…multiple rounds of cross-validation are performed using different partitions, and the validation results are combined (e.g. averaged) over the rounds to give an estimate of the model’s predictive performance” [4]

Grid search
“Grid search is a process that searches exhaustively through a manually specified subset of the hyperparameter space of the targeted algorithm…and evaluate(s) the cost function based on the generated hyperparameter sets” [5]

Randomized search
“Random search…selects a value for each hyperparameter independently using a probability distribution…and evaluate(s) the cost function based on the generated hyperparameter sets” [5]

Bayesian search
“…build a probability model of the objective function and use it to select the most promising hyperparameters to evaluate in the true objective function” [6]

Eta (learning rate, [0,1], default=0.3)
“Step size shrinkage used in update to prevent overfitting.” [7]
“It is analogous to learning rate in GBM” [8]

Gamma ([0,∞], default=0):
“Minimum loss reduction required to make a further partition on a leaf node of the tree.” [7]
“A node is split only when the resulting split gives a positive reduction in the loss function” [8]

Maximum depth ([0,∞], default=6):
“Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit” [7]

Lambda (no bounds provided, default=1):
“L2 regularization term on weights.” [7]
“…analogous to Ridge regression” [8]

Alpha (no bounds provided, default=0):
“L1 regularization term on weights.” [7]
“…analogous to Lasso regression” [8]

Tree method ([auto, exact, approx, hist, gpu_hist], default=auto):
exact — “Exact greedy algorithm. Enumerates all split candidates”
approx — “Approximate greedy algorithm using quantile sketch and gradient histogram”
hist — “Faster histogram optimized approximate greedy algorithm”
gpu_hist — “GPU implementation of hist algorithm” [7]

Grow policy ([depthwise, lossguide], default=depthwise):
“Controls a way new nodes are added to the tree.
depthwise — split at nodes closest to the root.
lossguide — split at nodes with highest loss change” [7]

Accuracy
“…a ratio of correctly predicted observations to the total observations” [9]

Precision
“Precision is the ratio of correctly predicted positive observations to the total predicted positive observations” [9]

Recall
“Recall is the ratio of correctly predicted positive observations to the all observations in actual class” [9]

F1 score
“F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false negatives into account…Accuracy works best if false positives and false negatives have similar cost. If the cost of false positives and false negatives are very different, it’s better to look at both Precision and Recall” [9]

Confusion matrix
“…a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known” [9]

ROC AUC
“An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: True Positive Rate and False Positive Rate…AUC stands for ‘Area under the ROC Curve.’ That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1)…AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example” [10]

Problem Statement Elaborated

Machine learning models come with default parameters: if you do not assign a specific value or string to an optional parameter, the algorithm falls back on a preset value or string. However, the optimal parameters may vary from one data set to another, so the defaults are usually not optimal. One needs to find the optimal parameters by grid search, where the grid represents the experimental values of each parameter (an n-dimensional space).

The exponential increase in computing power demand mentioned above appears when the brute-force method is applied and every combination is searched exhaustively. An exhaustive search yields the global minimum of the loss function, so it is appealing to run every trial. Alternative, faster methods exist (the applied methods are listed above), at the cost of a probability of ending up with a set of parameter values corresponding to only a local minimum.
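To get a feel for the blow-up, a toy calculation is enough (the grid sizes below are illustrative only, not the ones used later in this article):

# Toy illustration: the candidate values per hyperparameter multiply,
# so an exhaustive grid quickly becomes intractable.
grid_sizes = {
    "learning_rate": 10,
    "max_depth": 10,
    "gamma": 10,
    "reg_lambda": 10,
    "reg_alpha": 10,
}

n_combinations = 1
for n_values in grid_sizes.values():
    n_combinations *= n_values

print(n_combinations)  # 100000 fits, before cross-validation folds multiply it further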

simplified representation of global and local loss function minimum (source: Author)

Options for consideration:

  1. step-by-step search, a.k.a. coordinate descent — optimizing one parameter at a time. This probably leads to a local minimum; with several hyperparameters, many local minima may exist, and which one is reached depends on the order in which the hyperparameters are tuned. There can also be cases where a combined modification of two or more hyperparameters would yield a better score, while modifying either one alone would not [11]
  2. randomized search — choosing hyperparameter values randomly from the search space. A random point in the search space is probably not a minimum, but it has a chance of being close to the global minimum
  3. Bayesian search — “an extremely powerful technique when the mathematical form of the function is unknown or expensive to compute. The main idea behind it is to compute a posterior distribution over the objective function based on the data (using the famous Bayes theorem), and then select good points to try with respect to this distribution” [12]

The goal of my investigation is to provide a method for comparing the above-mentioned approaches by examining the model scores and the elapsed time, and to select the most suitable method when resources are limited. It is important to note that each data set may need a separate investigation on a limited subsample; the most suitable approach can then be applied to the whole data set when the best results are required.

Data set

The data chosen for demonstration is a subsample of the CERN Higgs Boson Challenge data, published as ‘Higgs Boson and a Background Process’ on Kaggle [13].

The original publication was made by the CERN ATLAS team, and the famous Higgs boson was confirmed by the analysis of this data.

Briefly, the data contains 21 features (kinematic properties) measured by ATLAS, 7 high-level features derived from the low-level ones, and a binary feature indicating whether the event is the result of a Higgs process or background noise.

Code and results

Since a picture is worth a thousand words, let me present the code in the following process flow chart, casually made in Paint (just kidding, made with Sketchviz). Please see the full code at my Jovian profile here.

process flowchart (made by Author with Sketchviz)

Note: for each method, the wall time is stored and the number of test cases is kept equal to allow a meaningful comparison.

Several evaluation functions are imported, along with standard libraries:
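A sketch of what those imports might look like (the exact list in the notebook may differ slightly):

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
)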

Data are loaded from CSV files and the train data are inspected. The ID column is dropped, since it makes no contribution to classification, being a unique value for each process:
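A minimal sketch of this step; the file names and the ID column label are assumptions (the notebook linked above contains the exact code):

# Load the data; file names and the "id" column label are assumptions.
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

train_df.info()                            # inspect dtypes and missing values
train_df = train_df.drop(columns=["id"])   # unique per row, no predictive value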

Inspecting the test dataframe reveals that some columns contain inappropriate characters, as they are formatted as objects. It turns out some missing values are represented by ‘?’, hence the object format. The question marks are replaced by 0 as a standard value for representing missing values:
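Roughly like this (a sketch; the original notebook may handle the affected columns individually):

# Replace the '?' placeholders with 0 and cast the affected columns back to numeric.
test_df = test_df.replace("?", 0)
object_cols = test_df.select_dtypes(include="object").columns
test_df[object_cols] = test_df[object_cols].astype(float)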

As a standard process, autocorrelation and partial autocorrelation have been evaluated.
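One way to run such a check with statsmodels (a sketch; "feature_1" stands in for an actual column name):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the (partial) autocorrelation of a feature to rule out time-series structure.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(train_df["feature_1"], ax=axes[0], lags=40)
plot_pacf(train_df["feature_1"], ax=axes[1], lags=40)
plt.show()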

Since autocorrelation is not present in any feature, it is safe to say we are not dealing with time series data. Consequently, shuffling the data frame does not make a difference; however, for comparison with the published subsample analysis results, it is applied during the train/test split, with extra care taken to keep the class frequency ratios the same after the split (stratification).
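A sketch of the stratified split; the target column name ("class") is an assumption:

# Shuffle and split with stratification so that class frequencies are preserved.
X = train_df.drop(columns=["class"])
y = train_df["class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)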

Before executing the grid search algorithms, a benchmark model has to be fitted. By calling the fit() method, the default parameters are obtained and stored for later use. Since GridSearchCV takes its inputs as lists, single parameter values also have to be wrapped. Calling fit() on the GridSearchCV instance performs the cross-validation; the results are extracted and the scores are computed and stored in a dictionary.

Important note: apart from searching for the best grid search algorithm, some parameters might be set for speed. One such parameter is tree_method: when set to hist, continuous features are organized into buckets (bins) and reading the train data becomes significantly faster [14]. Please read the reference for more speed tips for XGBoost.
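A minimal sketch of the benchmark fit, combining the default-parameter grid with tree_method="hist"; the selected parameter names are illustrative:

# Fit once with defaults, then wrap the default values in single-element lists
# so GridSearchCV runs the same cross-validation procedure as the later searches.
base_model = XGBClassifier(tree_method="hist")
base_model.fit(X_train, y_train)

default_params = {k: [v] for k, v in base_model.get_params().items()
                  if k in ("learning_rate", "gamma", "max_depth",
                           "reg_lambda", "reg_alpha")}

benchmark = GridSearchCV(
    XGBClassifier(tree_method="hist"),
    param_grid=default_params,
    scoring="roc_auc",
    cv=5,
)
benchmark.fit(X_train, y_train)

results = {}
results["benchmark"] = {
    "auc": roc_auc_score(y_test, benchmark.predict_proba(X_test)[:, 1])
}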

It takes a long time to iterate over the whole parameter grid, so setting the verbosity to 1 helps to monitor the process. However, the wall time does not equal the printed fitting time, hence the loop cycle time is also tracked and printed.
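A small helper for the timing (the helper name is mine, not from the notebook):

def timed_fit(search, X, y):
    """Fit a grid/randomized/Bayesian search object and return its wall time in seconds."""
    start = time.time()
    search.fit(X, y)
    return time.time() - start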

Coordinate descent is a special case in the sense that the performance of subsequent iterations is tracked and can be visualized for selecting the best parameter settings (a simplified sketch of the loop follows the list below). The plots are as follows:

  1. evaluation scores for each trial of the selected hyperparameter grid, along with a confusion matrix for the best hyperparameter value
  2. evolution of the evaluation scores for the best parameter values of subsequent iterations
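The coordinate-descent loop itself might look roughly like this (the parameter grids are illustrative, not the ones from the notebook):

# One hyperparameter grid is searched at a time; the best value found is fixed
# before moving on to the next grid, and the test AUC of each stage is recorded.
param_grids = [
    {"max_depth": [3, 6, 9, 12]},
    {"learning_rate": [0.01, 0.05, 0.1, 0.3]},
    {"gamma": [0, 1, 5]},
    {"reg_lambda": [0.1, 1, 10]},
]

best_params = {"tree_method": "hist"}
cd_history = []

for grid in param_grids:
    search = GridSearchCV(
        XGBClassifier(**best_params),
        param_grid=grid,
        scoring="roc_auc",
        cv=5,
        verbose=1,
    )
    elapsed = timed_fit(search, X_train, y_train)
    best_params.update(search.best_params_)
    cd_history.append({
        "params": dict(best_params),
        "auc": roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]),
        "time": elapsed,
    })

results["coordinate descent"] = {
    "auc": cd_history[-1]["auc"],
    "time": sum(step["time"] for step in cd_history),
}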

Important note: test scores do not necessarily increase with each subsequent iteration, since cross-validation is performed on the train data, while the manual AUC score is computed on the test data. Consequently, the final model parameters are picked manually.

Randomized search and Bayesian search are executed in a similar fashion, capturing the results in the same results dictionary.
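A sketch of the two searches; BayesSearchCV from scikit-optimize (skopt) is one way to run Bayesian search, and whether the original notebook uses exactly this class and these distributions is my assumption:

from scipy.stats import randint, uniform
from skopt import BayesSearchCV

N_TRIALS = 20  # set this to match the number of coordinate-descent trials

random_search = RandomizedSearchCV(
    XGBClassifier(tree_method="hist"),
    param_distributions={
        "max_depth": randint(3, 13),
        "learning_rate": uniform(0.01, 0.3),
        "gamma": uniform(0, 5),
        "reg_lambda": uniform(0.1, 10),
    },
    n_iter=N_TRIALS,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)

bayes_search = BayesSearchCV(
    XGBClassifier(tree_method="hist"),
    search_spaces={
        "max_depth": (3, 12),
        "learning_rate": (0.01, 0.3, "log-uniform"),
        "gamma": (1e-6, 5.0),
        "reg_lambda": (0.1, 10.0),
    },
    n_iter=N_TRIALS,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)

for name, search in [("randomized", random_search), ("bayesian", bayes_search)]:
    elapsed = timed_fit(search, X_train, y_train)
    results[name] = {
        "auc": roc_auc_score(y_test, search.predict_proba(X_test)[:, 1]),
        "time": elapsed,
    }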

Finally, the AUC scores and the elapsed search times are compared. All methods did the same number of jobs and, eventually, Bayesian search came out as the winner. Both randomized and Bayesian search finished in a shorter time than coordinate descent.
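A sketch of the comparison step, assuming each method's AUC score and wall time were collected into the results dictionary shown earlier:

# Collect the stored scores and wall times into a single table for comparison.
comparison = pd.DataFrame(results).T
print(comparison.sort_values("auc", ascending=False))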

Important note: the execution time and scores may vary when the searches are performed many times, e.g. embedded in a loop. Testing the significance of the differences in scores is out of the scope of this article. However, I reran the notebook several times and Bayesian search always won, although the execution time varied and peaked at around 45 minutes.

Summary

It is interesting to see how some hyperparameter values are similar or different across the three methods. For example, the maximum tree depth is set at the top grid value for both CD and Bayesian search, but the lambda parameter is totally different for each. The learning rate was kept at low levels in each case. Of course, only 68 trials have been performed out of the 631 800 possible combinations, but the model has been improved while saving at least 256 448 of the expected 256 478 minutes of a brute-force grid search.

One can try extending the number of jobs to cover a larger share of the total search space and possibly increase the AUC score further.

It is important to note that the original paper reported AUC scores of 0.841 and 0.883 using a hyperparameter-tuned NN (shallow neural network) and DNN (deep neural network), respectively. The worst-performing CD algorithm resulted in a score of 0.8033/0.7241 (AUC/accuracy) on unseen data, while the publisher of the data set achieved an accuracy of 0.6831 using a Decision Tree Classifier and 0.6429 using a Support Vector Machine (SVM). This puts the XGBoost algorithm and its results into context, considering the hardware used.

References

[1] https://en.wikipedia.org/wiki/Autocorrelation
[2] https://en.wikipedia.org/wiki/Decision_tree_learning
[3] https://en.wikipedia.org/wiki/Gradient_boosting
[4] https://en.wikipedia.org/wiki/Cross-validation_(statistics)
[5] Simon Chan, Philip Treleaven, Chapter 5 — Continuous Model Selection for Large-Scale Recommender Systems, Editor(s): Venu Govindaraju, Vijay V. Raghavan, C.R. Rao, Handbook of Statistics, Elsevier, Volume 33, 2015, Pages 107–124, ISSN 0169–7161, ISBN 9780444634924
[6] https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f
[7] https://xgboost.readthedocs.io/en/latest/parameter.html
[8] https://www.kaggle.com/prashant111/a-guide-on-xgboost-hyperparameters-tuning
[9] https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
[10] https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
[11] https://en.wikipedia.org/wiki/Coordinate_descent
[12] https://cloud.google.com/blog/products/ai-machine-learning/hyperparameter-tuning-cloud-machine-learning-engine-using-bayesian-optimization
[13] https://www.kaggle.com/mragpavank/higs-bonsons-and-background-process/tasks?taskId=3370
[14] https://towardsdatascience.com/do-you-use-xgboost-heres-how-to-make-it-200x-faster-16cb6039a16e
