Pair-Wise Hyperparameter Tuning with the Native XGBoost API

Search Global Minimum while addressing Bias-Variance Trade-off

Michio Suginoo
Towards Data Science



Since Boosting Machines have a tendency to overfit, XGBoost places an intense focus on addressing the bias-variance trade-off and lets users apply a variety of regularization techniques through hyperparameter tuning.

This post walks you through a code implementation of hyperparameter tuning with the native XGBoost API to address the bias-variance trade-off. The entire code of this project is posted in my GitHub repository.

After reading this post, you can replicate or modify the process that I demonstrate below to conduct hyperparameter tuning for your own regression project with the native XGBoost API.

INTRODUCTION

Our objective here is to tune the hyperparameters of the native XGBoost API in order to improve its regression performance while addressing the bias-variance trade-off, especially to alleviate the Boosting Machine's tendency to overfit.

In order to conduct hyperparameter tuning, this analysis uses the grid search method. In other words, we select a search-grid of hyperparameter values and calculate the model performance at every datapoint on the grid. Then, we identify the datapoint that yields the best performance (the minimum value of the objective function) as the best hyperparameter values for the tuned model.

Computational Constraint

Hyperparameter tuning can be computationally very expensive, depending on how you set the search-grid, simply because it iterates the performance calculation over every datapoint you place on the grid. The more datapoints you have, the more expensive the computation becomes.

Unfortunately, my notebook has very limited computational capacity. The good news is that Google Colab provides one GPU per user for its free account, and the native XGBoost API has built-in GPU support. Altogether, I can speed up the tuning process by running the native XGBoost API on Google Colab's GPU.

All that said, one GPU is still not sufficient when the selected search-grid contains an enormous number of datapoints. Therefore, I still need to address the constraints of the computational resource.

Ideally, one single joint tuning over all the selected hyperparameters would accurately capture all the possible interactions among them. Nevertheless, such a joint tuning spans an enormous number of search-grid datapoints and would therefore be computationally expensive. In this context, I will perform multiple pair-wise hyperparameter tunings rather than a single joint tuning over all the hyperparameters. This not only reduces the number of hyperparameter datapoints to evaluate, but also allows us to render a 3D visualization of the performance landscape for each pair-wise tuning. Of course, there is a catch in this approach. We will discuss the issue later.

Now, let’s start coding. First, we need to download the data for the analysis.

ANALYSIS

A. Data

For the regression analysis, we use the California Housing dataset built into the sklearn API. The source of the dataset is houses.zip in StatLib's Datasets Archive. The license of the dataset, "License: BSD 3 clause", can be found in line 22 of the code here. For an explanation of the dataset, please read this link.

We can download the California Housing dataset from sklearn.datasets using the fetch_california_housing function and store it in the variable housing_data.

from sklearn.datasets import fetch_california_housing
housing_data = fetch_california_housing()

Now, we need to convert the imported dataset into a Pandas dataframe. We separate the features and the target variable into df1 and df2, respectively.

import pandas as pd

descr = housing_data['DESCR']
feature_names = housing_data['feature_names']
data = housing_data['data']
target = housing_data['target']

# Features into df1, renaming the integer column indices to the feature names.
df1 = pd.DataFrame(data=data)
df1.rename(columns={i: name for i, name in enumerate(feature_names)}, inplace=True)

# Target into df2.
df2 = pd.DataFrame(data=target)
df2.rename(columns={0: 'Target'}, inplace=True)

# Combine features and target into one dataframe.
housing = pd.concat([df1, df2], axis=1)
print(housing.columns)

Now, let’s see the summary information of the features of the dataframe.

df1.info()

Here is the output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB

B. Libraries to Import

Next, let’s go over the libraries that we are going to use.

  • First, we need the native XGBoost API itself, and its plot_importance function to display the feature importance of the selected model after the tuning.
import xgboost as xgb
from xgboost import plot_importance
  • We also import mean_squared_error from sklearn.metrics for evaluation metrics.
from sklearn.metrics import mean_squared_error
  • For the train-test dataset split, train_test_split from sklearn.model_selection helps us split the dataframe.
from sklearn.model_selection import train_test_split
  • For visualization, we use the following items from Matplotlib to render 3D images.
from matplotlib import pyplot
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D

C. Hyperparameters & Pair-Wise Tuning

1. Selection of Hyperparameters for Tuning

There are many hyperparameters built into the native XGBoost API. Given the limited computational resources available, we have to do some cherry-picking.

Jason Brownlee outlines 5 types of hyperparameters for regularization: iteration control, shrinkage, tree constraints, random sampling, and L1 & L2 regularization. Following his framework, this analysis selects the following hyperparameters to address the bias-variance trade-off. For the details of each hyperparameter, please refer to the official XGBoost documentation.

a) Iteration control: The following 2 hyperparameters operate the iteration control: num_boost_round and early_stopping_rounds.

The rest of the hyperparameters need to be set in one dictionary, param_dict. We can use this parameter dictionary to manipulate the hyperparameters' values when we iterate the k-fold cross-validation over the datapoints on the search-grid.

b) Shrinkage: eta

c) Tree-Booster Constraints: max_depth, min_child_weight, and gamma

d) Random Subsampling: subsample and colsample_bytree

e) L1 & L2 regularizations: reg_alpha and reg_lambda

f) Additional entries not for tuning:

Furthermore, we need to specify the following 4 additional hyperparameters in the parameter dictionary. As a precaution, these hyperparameters will be fixed, not tuned.

  • objective: selects the objective function for regression.
  • eval_metric: selects the performance evaluation metric for regression.
  • gpu_id: 0, as assigned by Google Colab.
  • tree_method: gpu_hist, to select the GPU setting.

So far, we have covered the selection of the hyperparameters for tuning. The initial setting of the parameter dictionary and the initial (pre-tuned) performance curves, which we use to assess the initial bias-variance trade-off, are covered in section D below, where we will identify substantial variance, or overfitting.

3. Pair-Wise Hyperparameter Tuning

As noted in the introduction, one single joint tuning over all these hyperparameters would accurately capture all the possible interactions among the selected hyperparameters in the parameter dictionary, but its search-grid would contain an enormous number of datapoints and would be computationally expensive.

In order to economize the computational cost of the analysis, I formed 4 parameter-pairs out of the 8 hyperparameters in the parameter dictionary, namely:

  • (max_depth, eta)
  • (subsample, colsample_bytree)
  • (min_child_weight, gamma), and
  • (reg_alpha, reg_lambda)

Then, I conducted a pair-wise tuning for each of these 4 hyperparameter pairs. After each tuning, the best tuning result was used to replace the initial values of the hyperparameter pair.

In this way, every pair-wise tuning incrementally improved the performance profile of the tuned model. Altogether, this reduced the overall number of search-grid datapoints and thus economized the computational cost of the tuning process, at the expense of the impartiality of the grid search. Of course, the underlying assumption here was that the result obtained through the 4 pair-wise tunings would not materially deviate from the result of the ideal single joint tuning over all 8 hyperparameters.

D. Pre-tuning Regression: Performance Curves on Train and Test Datasets

Before tuning, we first need to initialize the values of our parameter dictionary and perform the first regression on both the training dataset and the test dataset, so that we can compare the initial performance curves on those datasets along the training iterations. This is important for two reasons.

  • First, we want to compare the training performance and the test performance to assess bias-variance trade-off, the problem of underfitting-overfitting trade-off, at the initial state.
  • Second, at the end of the tuning, we want to compare pre-tuning performance curves (both training and test) and post-tuning performance curves to assess the regularization impact of the tuning. We want to assess whether the tuning is effectively performing regularization on the model to address bias-variance trade-off.

Here is the initialization of our parameter dictionary.

param_dict = {'objective': 'reg:squarederror',  # fixed. pick an objective function for Regression.
              'max_depth': 6,
              'subsample': 1,
              'eta': 0.3,
              'min_child_weight': 1,
              'colsample_bytree': 1,
              'gamma': 0,
              'reg_alpha': 0.1,
              'reg_lambda': 1,
              'eval_metric': 'mae',  # fixed. pick an evaluation metric for Regression.
              'gpu_id': 0,
              'tree_method': 'gpu_hist'  # XGBoost's built-in GPU support to use Google Colab's GPU
              }

The values assigned here are solely for the initialization purpose. These values (except for objective, eval_metric, gpu_id, and tree_method) will change during the hyperparameter tuning.

Besides those hyperparameters specified in param_dict, we need to initialize 2 other hyperparameters, num_boost_round and early_stopping_rounds, outside of param_dict.

num_boost_round = 1000
early_stopping_rounds = 10
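The custom function defined below relies on DM_train and DM_test, the xgb.DMatrix objects built from the train-test split mentioned in the library section. Their construction is not shown in the excerpts above; here is a minimal sketch of how they might be created (the split ratio and random_state are my assumptions, not necessarily the ones used in the original notebook):

# Split features (df1) and target (df2) into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    df1, df2, test_size=0.2, random_state=42)

# Wrap the splits into the native XGBoost data structure, DMatrix.
DM_train = xgb.DMatrix(data=X_train, label=y_train)
DM_test = xgb.DMatrix(data=X_test, label=y_test)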

Now, using the initialized parameter dictionary, param_dict, we can use the built-in native training function, xgb.train(), to define a custom function, call it bias_variance_performance(). Its purpose is to assess the bias-variance trade-off along the training iterations by comparing the initial model performance curves between the training dataset and the test dataset.

def bias_variance_performance(params):

    evals_result = {}
    model = xgb.train(
        params,
        dtrain=DM_train,
        num_boost_round=num_boost_round,
        evals=[(DM_train, "Train"), (DM_test, "Test")],
        early_stopping_rounds=early_stopping_rounds,
        evals_result=evals_result
    )
    train_error = evals_result["Train"]["mae"]
    test_error = evals_result["Test"]["mae"]

    print("Best MAE before the Tuning: {:.4f} with {} rounds".format(model.best_score, model.best_iteration+1))
    return model, evals_result

For the details of the built-in training function, xgb.train(), please refer to the documentation.

Here is the output:

Best MAE before the Tuning: 0.3103 with 121 rounds

The output shows the best value of the test performance, not the training performance. The best test performance was detected at the 121st iteration round. In other words, the initial (pre-tuned) model (with the initial hyperparameter setting) was determined at 121 rounds of iteration.

Let’s compare the initial (pre-tuned) model performances between the train dataset and the test dataset at the 121st iteration.

[120]	Train-mae:0.17322	Test-mae:0.31031

The initial (pre-tuned) model (with the initial hyperparameter setting) determined at 121 rounds of the iteration has its training performance at 0.17322 and the test performance at 0.3103. The substantial gap suggests variance, or over-fitting.

The chart below shows the two initial (pre-tuned) performance curves along the iteration rounds: one in blue for the training performance and the other in red for the test performance.

Produced by Michio Suginoo
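A chart like the one above can be reproduced from the evals_result dictionary returned by bias_variance_performance(). Here is a minimal sketch; the variable names and the styling are my own and may differ from the original notebook:

# Run the pre-tuning regression and collect the evaluation history.
model_0, evals_result_0 = bias_variance_performance(param_dict)

# Plot the training and test MAE along the boosting rounds.
plt.figure(figsize=(10, 6))
plt.plot(evals_result_0["Train"]["mae"], color='blue', label='Train')
plt.plot(evals_result_0["Test"]["mae"], color='red', label='Test')
plt.xlabel('Boosting round')
plt.ylabel('mae')
plt.title('Pre-tuning Performance Curves')
plt.legend()
plt.show()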

As the iteration went on, the initial (pre-tuned) training performance continued improving, while the initial (pre-tuned) test performance stagnated at some point. This suggests a salient variance, or over-fitting, in the initial (pre-tuned) model.

Now, in order to address bias-variance trade-off, we impose regularization of the model by tuning the hyperparameters of the native XGBoost API.

E. Tuning Method

In terms of the tuning methodology, I would like to highlight a few important notes here.

  • I set the search-grid for each hyperparameter pair and conducted the grid search method for pair-wise parameter tuning.
  • I iterated k-fold cross validation to calculate the performance metric on the training dataset for all the datapoints on the pre-determined search-grid.
  • The visualization used triangulation-interpolation technique to fill the gaps in-between the search-grid datapoints and render a smoothed-out surface of the performance landscape.

a) k-fold Cross Validation

In order to run k-fold Cross Validation, we use the built-in cross-validation function, xgb.cv(). As a precaution, k-fold Cross Validation is performed only on the training dataset (from which the validation folds are drawn): do not pass the test dataset into the cross-validation function, xgb.cv().

We can iterate k-fold cross validation over all the datapoints on the pre-determined search-grid for each hyperparameter pair to search for the global minimum of the regression performance metric, the mean absolute error.

Custom function of k-fold cross validation: (Optional)

In order to facilitate the tuning iteration, we can customize a k-fold cross-validation function using the native XGBoost API's built-in cross-validation function, xgb.cv(). For our case, we set 5 folds for the analysis. Alternatively, other values, e.g. 10, can be explored.

Here is the code of my custom cross-validation function, cross_validation(). This customization is a matter of personal preference, thus an optional step. Regarding the hyperparameters to be passed, please check the documentation.

def cross_validation(param_dict, DM_train, metrics, num_boost_round, early_stopping_rounds):

    cv_results = xgb.cv(
        params=param_dict,
        dtrain=DM_train,
        num_boost_round=num_boost_round,
        seed=42,  # seed for randomization
        nfold=5,  # k of k-fold cross-validation
        metrics=metrics,
        early_stopping_rounds=early_stopping_rounds
    )

    return cv_results

Pair-Wise Grid Search: pair_wise_gridsearch_CV

Now, to perform the pair-wise grid search, we define a function that iterates the k-fold cross-validation custom function, cross_validation(), over all the datapoints of the search-grid of a hyperparameter pair. The next function, pair_wise_gridsearch_CV, takes 3 arguments:

  • param1_name and param2_name: the names of the pair of hyperparameters to be tuned.
  • gridsearch_params: the search-grid of the hyperparameter pair to be tuned.
def pair_wise_gridsearch_CV(param1_name, param2_name, gridsearch_params):

    min_mae = float("Inf")
    best_params = None
    x = []
    y = []
    z = []
    for param1, param2 in gridsearch_params:
        # Update our parameter dictionary for the tuning
        param_dict[param1_name] = param1
        param_dict[param2_name] = param2
        print("CV with {}={}, {}={}".format(param1_name, param1,
                                            param2_name, param2))
        # calculate cross_validation
        cv_results = cross_validation(param_dict, DM_train, metrics={'mae'},
                                      num_boost_round=1000, early_stopping_rounds=10)
        mean_mae = cv_results['test-mae-mean'].min()
        boost_rounds = cv_results['test-mae-mean'].argmin()

        x.append(param1)
        y.append(param2)
        z.append(mean_mae)

        print("\tMAE {} for {} rounds".format(mean_mae, boost_rounds))

        if mean_mae < min_mae:
            min_mae = mean_mae
            best_params = (param1, param2)

    return x, y, z, min_mae, best_params

The function, pair_wise_gridsearch_CV, iterates the custom cross-validation function, cross_validation, over all the datapoints on the search-grid of the hyperparameter pair defined in gridsearch_params to search for the global minimum of the performance landscape.
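As a side note, the "Best params" lines quoted in the tuning results below come from printing the returned values after each call, along the lines of the following hypothetical one-liner (it is not part of the function itself):

print("Best params: {}, {}, MAE: {}".format(best_params1[0], best_params1[1], min_mae1))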

gridsearch_params needs to be pre-determined before each pair-wise hyperparameter tuning. This is to be shown later.

b) 3D Visualization of the performance landscape for hyperparameter pair

Now, we can define a visualization utility function, Trisurf(), to render a 3D visualization of the performance landscape of a hyperparameter pair. We will use Matplotlib's plot_trisurf().

plot_trisurf() uses a triangulation interpolation technique to fill the gaps between the datapoints on the search-grid and render a smoothed-out surface of the performance landscape.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

def Trisurf(xname, yname, x, y, z, min_mae, best_params):
    # (xname, yname): the names of the hyperparameter pair to be displayed in the visualization
    # (x, y): all the hyperparameter datapoints on the search grid
    # (z): the values of the evaluation metric over those datapoints
    # (min_mae, best_params): the minimum mae found by the k-fold cross-validation
    #                         and the corresponding hyperparameter datapoint
    fig = plt.figure(figsize=(12, 12))
    ax = fig.add_subplot(111, projection="3d")

    # Creating color map
    my_cmap = plt.get_cmap('RdPu')

    # Creating dot plot of the minimum
    ax.scatter(best_params[0], best_params[1], min_mae, color='red')

    # Tri-Surface Plot
    ax.plot_trisurf(x, y, z, cmap=my_cmap,
                    linewidth=0.2,
                    antialiased=True,
                    edgecolor='grey')
    ax.set_xlabel(xname)
    ax.set_ylabel(yname)
    ax.set_zlabel('mae')

    # Write the Title
    ax.set_title("Hyperparameters Performance Landscape (" + str(xname) + ' & ' + str(yname) + ")")

    # Save the figure as a jpeg file.
    figname = str(xname) + '_' + str(yname)
    fig.savefig(str(figname) + '.jpg')

For further details of the code, please visit my Github repo.

F. Tuning Implementation

Up to now, we have prepared all those custom utility functions we need for tuning.

Now, we are ready to implement the tuning process.

We will perform a pair-wise hyperparameter tuning for one pair at a time. After every pair-wise tuning, we will update the values of the hyperparameter pair with the best result (which yields the lowest value of the loss function). Then we move on to the next pair of hyperparameters and repeat the process. In this way, we can incrementally improve the performance of the model under tuning.

Now, we will go over this process for 4 pairs of hyperparameters.

a) The 1st pair: max_depth and eta

Now, in order to tune the 1st hyperparameter pair, max_depth and eta, we set the search-grid for them as follows.

pair_wise_gridsearch_params1 = [
    (max_depth, eta)
    for max_depth in range(5, 15)
    for eta in [i/10. for i in range(1, 10)]
]

Now, let’s pass this search grid into the custom function, pair_wise_gridsearch_CV(), to tune the model over the hyperparameter pair. This is the code.

x1, y1, z1, min_mae1, best_params1 = pair_wise_gridsearch_CV('max_depth', 'eta', pair_wise_gridsearch_params1)

Here is the description of the output variables on the left-hand side of the assignment.

  • x1 and y1 store all the datapoints of the hyperparameter pair on the search-grid;
  • z1 stores the performance value at each search-grid datapoint (x1, y1);
  • min_mae1 stores the best performance value, the minimum value of the loss function.
  • best_params1 stores the search-grid datapoint of the hyperparameter pair that yielded the best performance.

min_mae1 and best_params1 are the result of the pair-wise tuning.

All these variables will be passed into the custom visualization utility function that we just defined earlier to render the 3D visualization of the performance landscape.

Here is the result.

Best params: 7, 0.1, MAE: 0.2988302932498683

The output says that the best performance at 0.2988 was generated at the search-grid of max_depth = 7 and eta = 0.1.

Now, for the visualization, we will pass all the tuning outputs (x1, y1, z1, min_mae1, and best_params1) together with the names of the hyperparameter pair into the custom visualization function.

Trisurf('max_depth', 'eta', x1, y1, z1, min_mae1, best_params1)

Here is the visualization of the performance landscape over all the datapoints of the hyperparameter pair on the search-grid.

3D Visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo

The performance landscape of the first pair, max_depth and eta, demonstrated a reasonable consistency in the shape of the performance landscape without multiple dips and bumps. It appears to indicate a reasonable predictability of the model performance around the vicinity of the best values of these two hyperparameters.

Now, let’s update the values of the hyperparameter pair with the best result of the pair-wise tuning, which is max_depth = 7 and eta = 0.1. Since these values were stored in the variable, best_params1, we can update the parameter dictionary regarding the hyperparameter pair of max_depth and eta as follows.

param_dict["max_depth"] = best_params1[0]
param_dict["eta"]= best_params1[1]

This incrementally improves the performance of the model under tuning.

Now, we can repeat the same process over the 3 other hyperparameter pairs to tune the model. As a precaution, the outputs of the custom functions will carry different numerical suffixes going forward to distinguish among the hyperparameter pairs. Nevertheless, the process goes through the exact same steps that we just followed above.

For the 3 remaining hyperparameter pairs, I will skip the details of the process and show only the relevant steps and outputs.

b) The 2nd pair: subsample and colsample_bytree

Next, we have two hyperparameters associated with subsampling, subsample and colsample_bytree. Since both deal with subsampling, they might demonstrate some interactions between each other. For this reason, we tune the pair together to see the interaction.

First, let’s set the search-grid for the pair as follows.

pair_wise_gridsearch_params2 = [
    (subsample, colsample_bytree)
    for subsample in [i/10. for i in range(1, 10)]
    for colsample_bytree in [i/10. for i in range(1, 10)]
]

Now, let’s pass this search grid into the custom function, pair_wise_gridsearch_CV(), to tune the model over the 2nd hyperparameter pair.

x2, y2, z2, min_mae2, best_params2= pair_wise_gridsearch_CV('subsample', 'colsample_bytree', pair_wise_gridsearch_params2)

Here is the result.

Best params: 0.9, 0.8, MAE: 0.2960672089507568

The output says that the best performance at 0.296 was generated at the search-grid datapoint of subsample = 0.9 and colsample_bytree = 0.8. As intended, the performance was incrementally improved from the last tuning, which yielded the performance of 0.2988.

Here is the visualization of the performance landscape over all the datapoints of the search-grid of the hyperparameter pair.

3D Visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo

The performance landscape of the second pair, subsample and colsample_bytree, demonstrated a reasonable consistency in the shape of the performance landscape without multiple dips and bumps. It appears to indicate a reasonable predictability of the model performance around the vicinity of the best values of these two hyperparameters.

Now, let’s update the values of the hyperparameter pair with the best result of the pair-wise tuning, which is subsample = 0.9 and colsample_bytree = 0.8.

param_dict["subsample"] = best_params2[0]
param_dict["colsample_bytree"]= best_params2[1]

This again incrementally improves the performance of the model.

c) The 3rd pair: min_child_weight & gamma

Next, the hyperparameters of the 3rd pair, min_child_weight and gamma, are both associated with the partitioning of trees. So, let's tune them together to see the interactions between them.

First, let’s set the search-grid for the pair as follows.

pair_wise_gridsearch_params3 = [
    (min_child_weight, gamma)
    for min_child_weight in range(0, 10)
    for gamma in range(0, 10)
]

Now, let’s pass this search grid into the custom function, pair_wise_gridsearch_CV(), to tune the model over all the datapoints on the search-grid of the hyperparameter pair.

x3, y3, z3, min_mae3, best_params3 = pair_wise_gridsearch_CV('min_child_weight', 'gamma', pair_wise_gridsearch_params3)

Here is the result.

Best params: 3, 0, MAE: 0.29524631108655486

The output says that the best performance at 0.2952 was generated at the search-grid datapoint of min_child_weight = 3 and gamma = 0. As intended, the performance was incrementally improved from the last tuning, which yielded the performance of 0.296.

Here is the visualization of the performance landscape over the different values of the hyperparameter pair.

3D Visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo

The performance landscape of these two partition-associated hyperparameters, min_child_weight and gamma, demonstrated a reasonable consistency in the shape of the performance landscape without multiple dips and bumps. It appears to indicate a reasonable predictability of the model performance around the vicinity of the best values of these two hyperparameters.

Now, let’s update the values of the hyperparameter pair with the best result of the pair-wise tuning, which is min_child_weight = 3 and gamma =0.

param_dict["min_child_weight"] = best_params3[0]
param_dict["gamma"]= best_params3[1]

d) The 4th pair: reg_alpha and reg_lambda

At last, we will conduct the last pair-wise hyperparameter tuning over the following two regularization hyperparameters, reg_alpha and reg_lambda.

First, let’s set the search-grid for the pair as follows.

pair_wise_gridsearch_params4 = [
    (reg_alpha, reg_lambda)
    for reg_alpha in [0, 1e-2, 0.1, 1, 2, 3, 4, 8, 10, 12, 14]
    for reg_lambda in [0, 1e-2, 0.1, 1, 2, 3, 4, 8, 10, 12, 14]
]

Now, let’s pass this search grid into the custom function, pair_wise_gridsearch_CV(), to tune the model over the hyperparameter pair.

x4, y4, z4, min_mae4, best_params4 = pair_wise_gridsearch_CV('reg_alpha', 'reg_lambda', pair_wise_gridsearch_params4)

Here is the result.

Best params: 0.1000, 8.0000, MAE: 0.2923583947526392

The output says that the best performance at 0.292358 was generated at the search-grid datapoint of reg_alpha = 0.1 and reg_lambda = 8.0. As intended, the performance was incrementally improved from the last tuning, which yielded the performance of 0.2952.

Here is the visualization of the performance landscape over the different values of the hyperparameter pair.

3D Visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo

Unlike the 3 earlier performance landscapes, the pair-wise performance landscape of reg_alpha and reg_lambda rendered multiple local minima, as you can see in the chart. It demonstrated a rugged performance landscape with various dips and bumps. This is evidence that the performance of the model during the tuning was very sensitive to small changes in the values of these 2 regularization hyperparameters, reg_alpha and reg_lambda.

In other words, a slight change in their values can greatly influence the result of their performance. This might well translate into a model instability, when we pass a new type of dataset into the tuned model in the deployment domain.

This needs to be closely studied before making a final conclusion.

For now, we update the parameter dictionary with the current best values for reg_alpha and reg_lambda and calculate the performance of the tuned model on the test dataset.

param_dict["reg_alpha"] = best_params4[0]
param_dict["reg_lambda"]= best_params4[1]

G. Post 1st Pair-Wise Tuning Test Result

Since we have updated the parameter dictionary, param_dict, on a piecemeal basis, its current values already reflect all the 8 tuned hyperparameters. Let's print it to confirm the updated values of the hyperparameters after the tuning.

print("Parameters after the Pair Wise Tuning:", param_dict)

Here is the output.

Parameters after the Pair Wise Tuning: {'objective': 'reg:squarederror', 'max_depth': 7, 'subsample': 0.9, 'eta': 0.1, 'min_child_weight': 3, 'colsample_bytree': 0.8, 'gamma': 0, 'reg_alpha': 0.1, 'reg_lambda': 8, 'eval_metric': 'mae', 'gpu_id': 0, 'tree_method': 'gpu_hist'}

Now, we can fit the tuned model, with the updated values of those hyperparameters, on the test dataset to assess the improvement in the test performance. Let's pass the updated parameter dictionary, param_dict, into the custom function bias_variance_performance().

pw_model_1, pw_evals_result_1=bias_variance_performance(param_dict)

Here is the result.

Best MAE after the Pair-Wise Tuning: 0.2903 with 315 rounds

Let’s see the results at the 315th round.

[314]	Train-mae:0.15790	Test-mae:0.29034

The post-tuned model (with the tuned hyperparameter setting) was determined at 315 rounds of iteration, with a training performance of 0.15790 and a test performance of 0.29034.

Compared with the pre-tuned model performance (0.17322 on the training dataset and 0.3103 on the test dataset), there was an improvement.

Here is the visualization of the performance curves on both the training dataset and the test dataset.

Produced by Michio Suginoo

The result yielded a slight improvement in bias-variance trade-off.

H. Multiple Local Minima and Model Instability: Highly Sensitive Model Performance regarding reg_alpha and reg_lambda

As noted, the tuning process of these 2 regularization parameters, reg_alpha and reg_lambda, generated a rugged performance landscape with various dips and bumps and multiple local minima. This is evidence that the performance during the tuning process was very sensitive to small changes in the values of these 2 regularization hyperparameters. A slight change in their values can dramatically influence the resulting performance, the value of the objective function (loss function).

Now, we can at least make a closer observation on the current performance landscape and assess the sensitivity of the model to small changes in the values of the hyperparameter pair.

  • First, we can rank the top 10 tuning results according to their performance metric and identify the top 10 datapoints.
  • Second, based on the observation on the top 10 tuning results, we can slice the performance landscape to have a closer look at its consistency, or the performance stability.

a) Check the top 10 datapoints with the lowest values of the evaluation metric, mae

The sorted dataframe ranks the top 10 performance datapoints in ascending order of the evaluation metric, mae.
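The ranked dataframe itself is not reproduced here; below is a minimal sketch of how it could be built from the tuning outputs x4, y4, and z4 (the exact dataframe in the original notebook may differ):

# Collect the grid-search results of (reg_alpha, reg_lambda) into a dataframe
# and rank the datapoints by their cross-validated MAE.
results_df = pd.DataFrame({'reg_alpha': x4, 'reg_lambda': y4, 'mae': z4})
top10 = results_df.sort_values(by='mae', ascending=True).head(10)
print(top10)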

Looking at reg_alpha, the top 10 tuning result datapoints are distributed within the range (0.1, 4.0) along reg_alpha.

For reg_lambda, the top 10 values are scattered across a wide range (0.00 to 12).

In order to gain better insight, we can slice the 3D performance landscape at fixed hyperparameter values that appear in the top 2 performance datapoints in the list. Here are the sliced performance curves:

Produced by Michio Suginoo
Produced by Michio Suginoo
Produced by Michio Suginoo
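Slices like the ones above can be rendered by filtering the grid-search results at a fixed value of reg_alpha and plotting MAE against reg_lambda. Here is a minimal sketch, assuming the results_df dataframe built in the previous snippet:

# Slice the performance landscape at a fixed reg_alpha, e.g. 3.0.
slice_df = results_df[results_df['reg_alpha'] == 3.0].sort_values('reg_lambda')

plt.figure(figsize=(10, 6))
plt.plot(slice_df['reg_lambda'], slice_df['mae'], marker='o')
plt.xlabel('reg_lambda')
plt.ylabel('mae')
plt.title('Performance slice at reg_alpha = 3.0')
plt.show()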

As an example, the 2nd slice of the performance landscape at (reg_alpha = 3.0) identifies an acute abyss formed between (reg_lambda =0.01) and (reg_lambda =1.0).

The intense asperity in the shape of the abyss was captured only because the search grid is locally finer along reg_lambda, at 0.01, 0.1, and 1, than in any other part of the grid. In other words, if we had missed the single datapoint of reg_lambda = 0.1 in our search-grid, we would have failed to capture the presence of the local minimum between reg_lambda = 0.01 and reg_lambda = 1.0.

A finer granularity in the search-grid setting might reveal an even more intense asperity in the rugged performance landscape. That is to say, we might discover more dips and bumps with a more granular search-grid. It might uncover many other hidden local minima that are invisible in the current performance landscape, and it could even locate a better best-performance datapoint somewhere the current search-grid cannot capture.

b) 2nd round pair-wise tuning on reg_alpha and reg_lambda

Now, let’s define a new search-grid with a much finer granularity. Here is the new search-grid setting.

pair_wise_gridsearch_params4 = [
    (reg_alpha, reg_lambda)
    for reg_alpha in [0, 1e-2, 2e-2, 3e-2, 0.1, 0.5, 1, 1.5, 1.8, 1.9, 2, 2.1, 2.2, 2.5, 3, 3.5, 4]
    for reg_lambda in [0, 1e-2, 2e-2, 3e-2, 0.1, 0.5, 1, 1.5, 1.8, 1.9, 2, 2.1, 2.2, 2.5, 3, 4, 5, 6,
                       7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5, 12, 12.5, 13, 13.5, 14, 14.5, 15, 15.5]
]

We can run the pair-wise tuning again on the pair of reg_alpha and reg_lambda with the new search-grid setting above.

Here is the result of the 2nd tuning:

Best params: 0.5000, 13.5000, MAE: 0.29122008437950175

The output says that the best performance at 0.29122 was generated at the search-grid datapoint of reg_alpha = 0.5 and reg_lambda = 13.5. As speculated, the much finer granularity of the search-grid identified a better best-performance datapoint at a totally different location from the first result (reg_alpha = 0.1 and reg_lambda = 8.0, which yielded the performance of 0.292358).

Here is the visualization of the performance landscape with the new search-grid setting.

3D Visualization with Matplotlib’s plot_trisurf: Produced by Michio Suginoo

As speculated, the 2nd tuning captured a more rugged performance landscape with more dips and bumps. It located the best-performance datapoint, reg_alpha = 0.5 and reg_lambda = 13.5, at a totally different location from the first result of reg_alpha = 0.1 and reg_lambda = 8.0, and it yielded a better best performance, 0.29122 versus 0.292358.

I. Model Selection

Now, can we say that the 2nd tuned model is better than the 1st tuned model?

We still do not know yet.

Remember, we are comparing results based on the training dataset, which includes the validation folds. We need to test these tuned models on the test dataset to assess the bias-variance trade-off, the problem of the underfitting-overfitting trade-off.

a) Bias-Variance Trade-off, the risk of over-fitting

Now, let’s run the newly tuned model over the test dataset. We can pass the latest update of the parameter dictionary, param_dict, to the custom function, bias_variance_performance().

pw_model_2, pw_evals_result_2=bias_variance_performance(param_dict)

Here is the result.

Best MAE after the Pair-Wise Tuning: 0.2922 with 319 rounds

That says that the best test performance, 0.2922, was found at the 319th round.

[318]	Train-mae:0.17208	Test-mae:0.29225

Now, let’s compare the result of the 2nd post-tuning model with the pre-tuning model and the 1st post-tuning model.

The comparison table above shows that in terms of the test performance, the 1st tuning yielded the best performance among three. This implies that the better training performance of the 2nd tuning was the result of overfitting the model on the training dataset (including the validation dataset). This is a typical bias-variance trade-off.

Since the overall objective of the tuning during the model development is to pursue the model stability in the deployment domain, we need to address bias-variance trade-off in the selection of the tuned model.

For this reason, we are better off selecting the result of the 1st tuning, rather than the 2nd tuning.

b) Plot the Feature Importance

Now, we want to know how much each feature contributed to the tuned model. We can plot the feature importance using the 1st tuned model.

from xgboost import plot_importance
plot_importance(pw_model_1)
pyplot.show()

Here is the output.

Produced by Michio Suginoo

CONCLUSION

This post walked you through the steps of the code implementation of the hyperparameter tuning of the native XGBoost API.

The native XGBoost API enables users to tune hyperparameters and apply various regularization techniques to address the bias-variance trade-off. This analysis iterated the built-in cross-validation function to tune the hyperparameters.

Now you can replicate or modify the process for your own projects.

I will share my further observation on the result of this particular case separately in the following post: Risk Implications of Excessive Multiple Local Minima during Hyperparameter Tuning.

Also, if you are interested in a quick overview of XGBoost, you are welcome to read another post of mine: XGBoost: its Genealogy, its Architectural Features, and its Innovation.

Thanks for reading this long post.

ACKNOWLEDGEMENT

I would like to express my gratitude to TDS's editing team, especially Katherine Prairie, for their invaluable editing advice during the editing stage.

References

Brownlee, J. A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning. (2016). Retrieved from Machine Learning Mastery: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/

databricks. xgboost-linux64. (2017). Retrieved from GitHub: https://github.com/databricks/xgboost-linux64/blob/master/doc/model.md

Hunter, J., Dale, D., Firing, E., Droettboom, M., & the Matplotlib development team. The mplot3d Toolkit. (2019). Retrieved from matplotlib.org: https://matplotlib.org/3.1.0/tutorials/toolkits/mplot3d.html#mpl_toolkits.mplot3d.Axes3D.plot_trisurf

jeeteshgavande30, singghakshay, & anikakapoor. Tri-Surface Plot in Python using Matplotlib. (2021). Retrieved from geeksforgeeks: https://www.geeksforgeeks.org/tri-surface-plot-in-python-using-matplotlib/

Leon, D. M. _california_housing.py. (2022). Retrieved from https://github.com: https://github.com/scikit-learn/scikit-learn/blob/36958fb24/sklearn/datasets/_california_housing.py#L53

Pace, R. K., & Barry, R. (1997). Sparse Spatial Autoregressions. Statistics and Probability Letters, 33, 291-297. StatLib Datasets Archive: http://lib.stat.cmu.edu/datasets/

scikit-learn developers. sklearn.datasets.fetch_california_housing. (n.d.) Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

Wikipedia. Bias–variance tradeoff. (2022). Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#:~:text=In%20statistics%20and%20machine%20learning,bias%20in%20the%20estimated%20parameters.

Wikipedia. Cross-validation (statistics). (2022). Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#k-fold_cross-validation

Wikipedia. Surface triangulation. (2022). Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Surface_triangulation

xgboost developers. XGBoost Documentation. (n.d.). Retrieved from https://xgboost.readthedocs.io/: https://xgboost.readthedocs.io/en/stable/
