Faster Hyperparameter Tuning with Scikit-Learn’s HalvingGridSearchCV

Comparing Halving Grid Search to the Exhaustive GridSearchCV

Kyle Gilde, ML Data Scientist
Towards Data Science


If you are a Scikit-Learn fan, Christmas came a few days early in 2020 with the release of version 0.24.0. Two experimental hyperparameter optimizer classes in the model_selection module are among the new features: HalvingGridSearchCV and HalvingRandomSearchCV.

Like their close cousins GridSearchCV and RandomizedSearchCV, they use cross-validation to find optimal hyperparameters. However, instead of evaluating each hyperparameter candidate independently, their successive halving search strategy “starts evaluating all the candidates with a small number of resources and iteratively selects the best candidates, using more and more resources.” The default resource is the number of samples, but the user can set it to any positive-integer model parameter, such as the number of gradient boosting rounds. Thus, the halving approach has the potential to find good hyperparameters in less time.

My Experiment

I read through Scikit-Learn’s “Comparison between grid search and successive halving” example, but because it takes a grand total of 11 seconds to run, I was still unclear about the real-world impact of using the halving versus the exhaustive approach. So I decided to set up an experiment to answer the following questions:

  1. How much faster is HalvingGridSearchCV compared to GridSearchCV?
  2. Does HalvingGridSearchCV still select the same hyperparameter set that GridSearchCV does?

I’m going to run and compare 3 searches:

  1. GridSearchCV
  2. HalvingGridSearchCV using the default “n_samples” resource
  3. HalvingGridSearchCV using CatBoost’s “n_estimators” as the resource

Upgrade Scikit-Learn

The first step is to upgrade your version of Scikit-Learn to 0.24.0 and make sure you can import the correct version.

# !! pip install scikit-learn --upgrade
import sklearn
print(sklearn.__version__)
0.24.0

Loading the Dataset

I ran my tests using Kaggle’s Ames, Iowa house prices dataset. It has 1,460 observations and 79 features. The dependent variable is the home’s SalePrice. If you are interested in some exploratory data analysis on the dataset, I recommend reading this notebook.

import numpy as np  
import pandas as pd

DEP_VAR = 'SalePrice'
train_df = pd.read_csv('../kaggle/input/house-prices-advanced-regression-techniques/train.csv')\
    .set_index('Id')

y_train = train_df.pop(DEP_VAR)

Creating a Pipeline & Model

I also wrote a script called pipeline_ames.py. It instantiates a Pipeline containing some feature transformations and the CatBoostRegressor. I’ve plotted its visual representation below. (You can read more about my approach to feature engineering in my previous post.)

from sklearn import set_config        
from sklearn.utils import estimator_html_repr
from IPython.core.display import display, HTML

from pipeline_ames import pipe
set_config(display='diagram')
display(HTML(estimator_html_repr(pipe)))
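
Since pipe is defined outside of this post, here is a rough, hypothetical sketch of its general shape. The actual transformers live in pipeline_ames.py, and the imputation and encoding choices below are stand-ins; the only detail grounded in what follows is that the final step must be named 'model' to match the param_grid keys used later.

# Hypothetical sketch of pipe's general shape -- the real feature
# transformations are defined in pipeline_ames.py
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from catboost import CatBoostRegressor

transformers = ColumnTransformer(
    [('num', SimpleImputer(strategy='median'),
      make_column_selector(dtype_include='number')),
     ('cat', make_pipeline(SimpleImputer(strategy='most_frequent'),
                           OneHotEncoder(handle_unknown='ignore',
                                         sparse=False)),
      make_column_selector(dtype_include='object'))])

# the final step must be named 'model' to match the param_grid below
pipe = Pipeline([('transform', transformers),
                 ('model', CatBoostRegressor(verbose=0))])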

Experimental Controls

The grid_search_params dictionary contains the control parameters that were used in all 3 searches. I performed 3-fold cross-validation on param_grid, which contains 4 CatBoost hyperparameters with 3 values each, i.e. 81 candidate hyperparameter sets. The results were measured in root mean squared log error (RMSLE). Note that because the scorer is built with greater_is_better=False, Scikit-Learn negates the scores internally, which is why best_score_ is negated below to recover the RMSLE.

from sklearn.metrics import mean_squared_log_error, make_scorer

np.random.seed(123) # set a global seed
pd.set_option("display.precision", 4)

rmsle = lambda y_true, y_pred: \
    np.sqrt(mean_squared_log_error(y_true, y_pred))
scorer = make_scorer(rmsle, greater_is_better=False)

param_grid = {'model__max_depth': [5, 6, 7],
              'model__learning_rate': [.01, .03, .06],
              'model__subsample': [.7, .8, .9],
              'model__colsample_bylevel': [.8, .9, 1]}

grid_search_params = dict(estimator=pipe,
                          param_grid=param_grid,
                          scoring=scorer,
                          cv=3,
                          n_jobs=-1,
                          verbose=2)

Tests

1. GridSearchCV

The baseline exhaustive grid search took nearly 33 minutes to perform 3-fold cross-validation on our 81 candidates. We will see if the HalvingGridSearchCV process can find the same hyperparameters in less time.

%%time
from sklearn.model_selection import GridSearchCV
full_results = GridSearchCV(**grid_search_params)\
    .fit(train_df, y_train)

pd.DataFrame(full_results.best_params_, index=[0])\
    .assign(RMSLE=-full_results.best_score_)
Fitting 3 folds for each of 81 candidates, totalling 243 fits
Wall time: 32min 53s

2. HalvingGridSearchCV with n_samples

In the first halving grid search, I used the default ‘n_samples’ for the resource and set the min_resources to use 1/4 of the total resources, which was 365 samples. I did not use the default min_resources calculation of 22 samples because it produced terrible results.

For both halving searches, I used a factor of 2. This parameter determines the n_candidates and n_resources used in the successive iterations and indirectly determines the total number of iterations utilized in the search.

  1. The reciprocal of the factor determines the proportion of n_candidates retained - in this case, one half. All other candidates are discarded. Hence, as you can see in the logs below, the 3 iterations in my search had 81, 41 and 21 candidates.
  2. The product of the factor and the previous iteration’s n_resources determines the n_resources. My 3-iteration search used 365, 730 and 1460 samples.
  3. The total number of iterations is determined by how many times n_resources can be multiplied by the factor without exceeding max_resources. If you want the final iteration to use all of the resources, you need to choose min_resources and factor so that min_resources multiplied by a power of factor lands exactly on max_resources (see the sketch below).
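
Here is a small, self-contained illustration of that arithmetic (my own sketch, not Scikit-Learn’s internal code):

# Sketch of the successive-halving schedule: n_candidates shrinks by
# `factor` (rounding up) while n_resources grows by `factor` until
# max_resources is reached
def halving_schedule(n_candidates, min_resources, max_resources, factor=2):
    iteration, resources = 0, min_resources
    while resources <= max_resources and n_candidates > 0:
        print(f'iter {iteration}: n_candidates={n_candidates}, '
              f'n_resources={resources}')
        n_candidates = -(-n_candidates // factor)  # ceiling division
        resources *= factor
        iteration += 1

halving_schedule(n_candidates=81, min_resources=365, max_resources=1460)
# iter 0: n_candidates=81, n_resources=365
# iter 1: n_candidates=41, n_resources=730
# iter 2: n_candidates=21, n_resources=1460

Now for the actual search: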
%%time
# this import is required to enable the experimental halving searches
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

FACTOR = 2
MAX_RESOURCE_DIVISOR = 4

n_samples = len(train_df)
halving_results_n_samples =\
    HalvingGridSearchCV(resource='n_samples',
                        min_resources=n_samples // MAX_RESOURCE_DIVISOR,
                        factor=FACTOR,
                        **grid_search_params)\
    .fit(train_df, y_train)
n_iterations: 3
n_required_iterations: 7
n_possible_iterations: 3
min_resources_: 365
max_resources_: 1460
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 81
n_resources: 365
Fitting 3 folds for each of 81 candidates, totalling 243 fits
----------
iter: 1
n_candidates: 41
n_resources: 730
Fitting 3 folds for each of 41 candidates, totalling 123 fits
----------
iter: 2
n_candidates: 21
n_resources: 1460
Fitting 3 folds for each of 21 candidates, totalling 63 fits
Wall time: 34min 46s

This first halving search did not produce good results. It actually took a little longer than the exhaustive search. Using my compare_cv_best_params function, we see that it found only the ninth-best hyperparameter set.

from compare_functions import *

compare_cv_best_params(full_results, *[halving_results_n_samples])\
    .style.applymap(lambda cell: 'background: pink' if cell == 9 else '')
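
For context, compare_cv_best_params lives in my compare_functions script. A rough reconstruction of what such a helper might look like is below; this is my sketch, not the original code. It ranks each search’s best hyperparameter set against the exhaustive search’s candidates, with rank 1 being the exhaustive optimum.

# Hypothetical reconstruction of compare_cv_best_params -- the real
# version is defined in compare_functions.py
import pandas as pd

def compare_cv_best_params(full_cv, *other_cvs):
    full_df = pd.DataFrame(full_cv.cv_results_)
    # higher mean_test_score (negated RMSLE) is better, so rank descending
    full_df['rank'] = full_df['mean_test_score']\
        .rank(ascending=False, method='min').astype(int)
    keys = full_cv.best_params_.keys()
    rows = []
    for cv in (full_cv, *other_cvs):
        # drop any resource parameter (e.g. model__n_estimators) so the
        # halving results match the exhaustive search's param_grid keys
        best = {k: cv.best_params_[k] for k in keys}
        match = full_df['params'].apply(lambda params: params == best)
        rows.append({**best,
                     'RMSLE': -cv.best_score_,
                     'best param rank': full_df.loc[match, 'rank'].iloc[0]})
    return pd.DataFrame(rows)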

3. HalvingGridSearchCV with n_estimators

In the second halving search, I used CatBoost’s n_estimators as the resource and set the first iteration’s min_resources to use a quarter of those estimators while keeping the factor set at 2.

%%time
halving_results_n_estimators =\
    HalvingGridSearchCV(resource='model__n_estimators',
                        max_resources=1000,
                        min_resources=1000 // MAX_RESOURCE_DIVISOR,
                        factor=FACTOR,
                        **grid_search_params)\
    .fit(train_df, y_train)
n_iterations: 3
n_required_iterations: 7
n_possible_iterations: 3
min_resources_: 250
max_resources_: 1000
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 81
n_resources: 250
Fitting 3 folds for each of 81 candidates, totalling 243 fits
----------
iter: 1
n_candidates: 41
n_resources: 500
Fitting 3 folds for each of 41 candidates, totalling 123 fits
----------
iter: 2
n_candidates: 21
n_resources: 1000
Fitting 3 folds for each of 21 candidates, totalling 63 fits
Wall time: 22min 59s

This halving search produced the results that we were hoping to see. It finished about 10 minutes earlier than the exhaustive grid search, making it roughly 30% faster. Importantly, it also found the best set of hyperparameters.

compare_cv_best_params(full_results,
                       *[halving_results_n_samples,
                         halving_results_n_estimators])\
    .style.apply(lambda row:
                 row.apply(lambda col:
                           'background: lightgreen' if row.name == 2 else ''),
                 axis=1)

Conclusion

The results of my HalvingGridSearchCV experiment were mixed. Using the default “n_samples” resource yielded slow and suboptimal results. If you are not using a large number of samples, limiting them may not save you any time.

However, using CatBoost’s n_estimators as the resource yielded the optimal results in less time. This tracks with my own experience manually tuning gradient boosting hyperparameters. I can usually tell pretty quickly from the validation logs whether the hyperparameter set is worth boosting for many more rounds.

Let me know if you found this post helpful. The original notebook for this blog post can be found here.

Follow me and stay tuned for further posts on training models with Scikit-Learn. Thanks!
