Faster Hyperparameter Tuning with Scikit-Learn’s HalvingGridSearchCV
Comparing Halving Grid Search to the Exhaustive GridSearchCV
If you are a Scikit-Learn fan, Christmas came a few days early in 2020 with the release of version 0.24.0. Two experimental hyperparameter optimizer classes in the model_selection module are among the new features: HalvingGridSearchCV and HalvingRandomSearchCV.
Like their close cousins GridSearchCV and RandomizedSearchCV, they use cross-validation to find optimal hyperparameters. However, instead of independently searching the hyperparameter set candidates, their successive halving search strategy “starts evaluating all the candidates with a small number of resources and iteratively selects the best candidates, using more and more resources.” The default resource is the number of samples, but the user can set it to any positive-integer model parameter like gradient boosting rounds. Thus, the halving approach has the potential of finding good hyperparameters in less time.
My Experiment
I read through Scikit-Learn’s “Comparison between grid search and successive halving” example, but because it takes a grand total of 11 seconds to run, I was still unclear about the real-world impact of using the halving approach versus the exhaustive one. So I decided to set up an experiment to answer the following questions:
- How much faster is HalvingGridSearchCV compared to GridSearchCV?
- Does HalvingGridSearchCV still select the same hyperparameter set that GridSearchCV does?
I’m going to run and compare 3 searches:
- GridSearchCV
- HalvingGridSearchCV using the default “n_samples” resource
- HalvingGridSearchCV using CatBoost’s “n_estimators” as the resource
Upgrade Scikit-Learn
The first step is to upgrade Scikit-Learn to version 0.24.0 and make sure you can import the correct version.
# !! pip install scikit-learn --upgrade
import sklearn
print(sklearn.__version__)

0.24.0
Loading the Dataset
I ran my tests using Kaggle’s Ames, Iowa house prices dataset. It has 1,460 observations and 79 features. The dependent variable is the SalePrice of the home. I recommend reading this notebook if you are interested in some exploratory data analysis on the dataset.
import numpy as np
import pandas as pd
DEP_VAR = 'SalePrice'
train_df = pd.read_csv('../kaggle/input/house-prices-advanced-regression-techniques/train.csv')\
    .set_index('Id')
y_train = train_df.pop(DEP_VAR)
Creating a Pipeline & Model
I also wrote a script called pipeline_ames.py. It instantiates a Pipeline containing some feature transformations and the CatBoostRegressor. I’ve plotted its visual representation below. (You can read more about my approach to feature engineering in my previous post.)
from sklearn import set_config
from sklearn.utils import estimator_html_repr
from IPython.display import display, HTML
from pipeline_ames import pipe
set_config(display='diagram')
display(HTML(estimator_html_repr(pipe)))
Experimental Controls
The grid_search_params dictionary contains the control parameters that were used in the 3 searches. I performed 3-fold cross-validation on param_grid, which contains 4 CatBoost hyperparameters with 3 values each. The results were measured in root mean squared log error (RMSLE).
from sklearn.metrics import mean_squared_log_error, make_scorer
np.random.seed(123) # set a global seed
pd.set_option("display.precision", 4)
rmsle = lambda y_true, y_pred:\
    np.sqrt(mean_squared_log_error(y_true, y_pred))
scorer = make_scorer(rmsle, greater_is_better=False)
param_grid = {"model__max_depth": [5, 6, 7],
              'model__learning_rate': [.01, 0.03, .06],
              'model__subsample': [.7, .8, .9],
              'model__colsample_bylevel': [.8, .9, 1]}
grid_search_params = dict(estimator=pipe,
                          param_grid=param_grid,
                          scoring=scorer,
                          cv=3,
                          n_jobs=-1,
                          verbose=2)
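To make the metric concrete, here is a minimal standard-library sketch of RMSLE and of the sign flip that greater_is_better=False introduces. The sale prices are hypothetical values made up for illustration, not rows from the Ames data, and the function is a plain re-implementation rather than Scikit-Learn's own code.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared log error, written out with the standard library."""
    # log1p(x) = log(1 + x), so a target of zero does not break the log.
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Hypothetical sale prices, for illustration only.
actual = [200_000, 150_000, 310_000]
predicted = [210_000, 140_000, 300_000]
error = rmsle(actual, predicted)

# A scorer built with greater_is_better=False negates the metric, so that
# "higher is better" still holds inside the search.
search_score = -error
print(f"RMSLE = {error:.4f}, score seen by the search = {search_score:.4f}")
```

This is why the best_score_ of each search is negated again before it is reported as RMSLE.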
Tests
1. GridSearchCV
The baseline exhaustive grid search took nearly 33 minutes to perform 3-fold cross-validation on our 81 candidates. We will see if the HalvingGridSearchCV process can find the same hyperparameters in less time.
%%time
from sklearn.model_selection import GridSearchCV
full_results = GridSearchCV(**grid_search_params)\
    .fit(train_df, y_train)
pd.DataFrame(full_results.best_params_, index=[0])\
    .assign(RMSLE=-full_results.best_score_)

Fitting 3 folds for each of 81 candidates, totalling 243 fits
Wall time: 32min 53s
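The 81 candidates and 243 fits in that log follow directly from the grid. As a quick check, this sketch enumerates the combinations with itertools instead of Scikit-Learn; the grid is copied from param_grid so the snippet stands alone.

```python
from itertools import product

# Same grid as param_grid, copied here so the snippet is self-contained.
param_grid = {
    "model__max_depth": [5, 6, 7],
    "model__learning_rate": [0.01, 0.03, 0.06],
    "model__subsample": [0.7, 0.8, 0.9],
    "model__colsample_bylevel": [0.8, 0.9, 1],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)      # 3 ** 4 combinations
n_fits = n_candidates * cv_folds    # each candidate is fitted once per fold

print(n_candidates, n_fits)  # 81 243
```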
2. HalvingGridSearchCV with n_samples
In the first halving grid search, I used the default ‘n_samples’ for the resource and set the min_resources to use 1/4 of the total resources, which was 365 samples. I did not use the default min_resources calculation of 22 samples because it produced terrible results.
For both halving searches, I used a factor of 2. This parameter determines the n_candidates and n_resources used in the successive iterations and indirectly determines the total number of iterations utilized in the search.
- The reciprocal of the factor determines the proportion of n_candidates retained, in this case one half. All other candidates are discarded. Hence, as you can see in the logs below, the 3 iterations in my search had 81, 41 and 21 candidates.
- The product of the factor and the previous iteration’s n_resources determines the n_resources. My 3-iteration search used 365, 730 and 1460 samples.
- The total number of iterations is determined by how many times n_resources can increase by the factor while not exceeding the max_resources. If you want the final iteration to use all of the resources, you will need to choose min_resources and factor so that min_resources multiplied by a power of factor equals max_resources (here, 365 × 2 × 2 = 1460).
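The three rules above can be sketched in a few lines of plain Python. This is a simplified reading of the schedule (no aggressive elimination), not Scikit-Learn's implementation; halving_schedule and int_log_floor are hypothetical helper names introduced for illustration.

```python
import math

def int_log_floor(x, base):
    """Largest k with base**k <= x, using integers only to avoid float error."""
    k = 0
    while base ** (k + 1) <= x:
        k += 1
    return k

def halving_schedule(n_candidates, min_resources, max_resources, factor):
    """Return [(n_candidates, n_resources), ...], one tuple per iteration."""
    # Iterations needed to whittle the candidates down.
    n_required = 1 + int_log_floor(n_candidates, factor)
    # Iterations that fit before n_resources would exceed max_resources.
    n_possible = 1 + int_log_floor(max_resources // min_resources, factor)

    schedule = []
    n_resources = min_resources
    for _ in range(min(n_required, n_possible)):
        schedule.append((n_candidates, n_resources))
        n_candidates = math.ceil(n_candidates / factor)  # keep 1/factor
        n_resources *= factor                            # grow by factor
    return schedule

# The n_samples search from this post: 1460 rows, starting at a quarter.
print(halving_schedule(81, 365, 1460, 2))
# [(81, 365), (41, 730), (21, 1460)]

# A made-up case where the last iteration stops short of max_resources.
print(halving_schedule(81, 10, 60, 3))
# [(81, 10), (27, 30)]
```

In the second call the next step would need 90 resources, which exceeds 60, so the search ends at 30. The final iteration reaches max_resources only when min_resources times a power of factor equals it.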
%%time
# the halving searches are experimental in 0.24 and must be enabled explicitly
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
FACTOR = 2
MAX_RESOURCE_DIVISOR = 4
n_samples = len(train_df)
halving_results_n_samples =\
    HalvingGridSearchCV(resource='n_samples',
                        min_resources=n_samples // MAX_RESOURCE_DIVISOR,
                        factor=FACTOR,
                        **grid_search_params
                        )\
    .fit(train_df, y_train)

n_iterations: 3
n_required_iterations: 7
n_possible_iterations: 3
min_resources_: 365
max_resources_: 1460
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 81
n_resources: 365
Fitting 3 folds for each of 81 candidates, totalling 243 fits
----------
iter: 1
n_candidates: 41
n_resources: 730
Fitting 3 folds for each of 41 candidates, totalling 123 fits
----------
iter: 2
n_candidates: 21
n_resources: 1460
Fitting 3 folds for each of 21 candidates, totalling 63 fits
Wall time: 34min 46s
This first halving search did not produce good results. It actually took a little longer than the exhaustive search. Using my compare_cv_best_params function, we see that it found only the ninth-best hyperparameter set.
from compare_functions import *
compare_cv_best_params(full_results, *[halving_results_n_samples])\
    .style.applymap(lambda cell: 'background: pink' if cell == 9 else '')
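One possible contributor to the longer wall time is simply the number of fits. The sketch below adds up the fits reported in the halving log and compares them with the exhaustive search. Treating the fit count as a proxy for cost is my assumption; it ignores that early fits use fewer samples.

```python
# (n_candidates, n_resources) per iteration, copied from the halving log.
iterations = [(81, 365), (41, 730), (21, 1460)]
cv_folds = 3

# Each iteration fits every surviving candidate once per fold.
halving_fits = sum(n_candidates * cv_folds for n_candidates, _ in iterations)
exhaustive_fits = 81 * cv_folds

print(halving_fits, exhaustive_fits)  # 429 243
```

On a dataset this small, the cheaper early fits may not be cheap enough to offset running 429 fits instead of 243.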
3. HalvingGridSearchCV with n_estimators
In the second halving search, I used CatBoost’s n_estimators as the resource and set the first iteration’s min_resources to use a quarter of those estimators while keeping the factor set at 2.
%%time
halving_results_n_estimators =\
    HalvingGridSearchCV(resource='model__n_estimators',
                        max_resources=1000,
                        min_resources=1000 // MAX_RESOURCE_DIVISOR,
                        factor=FACTOR,
                        **grid_search_params
                        )\
    .fit(train_df, y_train)

n_iterations: 3
n_required_iterations: 7
n_possible_iterations: 3
min_resources_: 250
max_resources_: 1000
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 81
n_resources: 250
Fitting 3 folds for each of 81 candidates, totalling 243 fits
----------
iter: 1
n_candidates: 41
n_resources: 500
Fitting 3 folds for each of 41 candidates, totalling 123 fits
----------
iter: 2
n_candidates: 21
n_resources: 1000
Fitting 3 folds for each of 21 candidates, totalling 63 fits
Wall time: 22min 59s
This halving search produced the results that we were hoping to see. It finished about 10 minutes sooner, which makes it roughly 30% faster than the exhaustive grid search. Importantly, it also found the best set of hyperparameters.
compare_cv_best_params(full_results, *[halving_results_n_samples,
                                       halving_results_n_estimators])\
    .style.apply(lambda row:
                 row.apply(lambda col:
                           'background: lightgreen' if row.name == 2 else ''),
                 axis=1)
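The timing claims can be checked against the three wall times reported in the logs. This is plain arithmetic on those numbers; the dictionary keys are labels I chose for readability.

```python
def to_seconds(minutes, seconds):
    """Convert a 'Xmin Ys' wall time to seconds."""
    return minutes * 60 + seconds

# Wall times copied from the three runs in this post.
wall_times = {
    "GridSearchCV": to_seconds(32, 53),
    "Halving (n_samples)": to_seconds(34, 46),
    "Halving (n_estimators)": to_seconds(22, 59),
}
baseline = wall_times["GridSearchCV"]

# Percentage change relative to the exhaustive search (negative = faster).
pct_change = {
    name: 100 * (seconds - baseline) / baseline
    for name, seconds in wall_times.items()
}
saved = baseline - wall_times["Halving (n_estimators)"]

for name, pct in pct_change.items():
    print(f"{name}: {pct:+.1f}% vs. exhaustive")
print(f"n_estimators search saved {saved // 60} min {saved % 60} s")
```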
Conclusion
The results of my HalvingGridSearchCV experiment were mixed. Using the default “n_samples” resource yielded slow and suboptimal results. If you are not using a large number of samples, limiting them may not save you any time.
However, using CatBoost’s n_estimators as the resource yielded the optimal results in less time. This tracks with my own experience manually tuning gradient boosting hyperparameters. I can usually tell pretty quickly from the validation logs whether the hyperparameter set is worth boosting for many more rounds.
Let me know if you found this post helpful. The original notebook for this blog post can be found here.
Follow me and stay tuned for further posts on training models with Scikit-Learn. Thanks!