GridSearchCV or RandomizedSearchCV?

Comparing two sklearn hyperparameter tuning tools…

Brunna Torino
Towards Data Science


Hyperparameter tuning is a powerful tool for enhancing your supervised learning models: it can improve accuracy, precision, and other important metrics by searching for the model hyperparameters that are optimal under a chosen scoring method. There are two main options available from sklearn: GridSearchCV and RandomizedSearchCV. I will go through each one quickly and then do a full comparison of the two methods!

Since this is a Python-based tutorial, enjoy this amazing close-up of a real Python:

Photo by Divide By Zero on Unsplash

GridSearchCV

GridSearchCV implements the most obvious way of finding an optimal value for anything: it simply tries every combination of the values you pass and returns whichever one yielded the best model results, based on the scoring that you want, such as cross-validated accuracy.
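To see what "every combination" means in practice, sklearn's ParameterGrid (the same expansion GridSearchCV performs internally) can list all the combinations of a small grid. The parameter values below are just placeholders:

from sklearn.model_selection import ParameterGrid

# a tiny grid with 2 x 2 = 4 combinations
small_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}

# ParameterGrid expands the grid into every individual combination
for params in ParameterGrid(small_grid):
    print(params)
# {'max_depth': 5, 'n_estimators': 100}
# {'max_depth': 5, 'n_estimators': 200}
# {'max_depth': 10, 'n_estimators': 100}
# {'max_depth': 10, 'n_estimators': 200}

GridSearchCV fits and scores the model once per combination (times the number of cross-validation folds), which is why large grids get expensive quickly.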

Using GridSearchCV is also very simple, with a few customizable parameters. Here's a breakdown of them, followed by a minimal example of how they fit together:

1. estimator: the model you are using.
2. param_grid: the dictionary that maps each hyperparameter name to the values you want to test.
3. scoring: the evaluation metric.
4. cv: the number of cross-validation folds used for each set of hyperparameters.
5. verbose: the higher the value, the more messages are printed.
6. n_jobs: the number of jobs to run in parallel.
7. pre_dispatch: controls how many jobs are dispatched during parallel execution (to avoid memory issues).
8. iid: assumes the data is identically and independently distributed; the default is False (this parameter has been removed in recent versions of sklearn).
9. refit: once the best parameters are found, refit the estimator on the whole training set.
10. error_score: the value to assign to the score if an error happens when fitting the estimator.
11. return_train_score: whether to include training scores in cv_results_.
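As a minimal sketch of where each of these arguments goes (the grid values here are placeholders, not a recommendation):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    estimator=RandomForestClassifier(),                              # 1. the model
    param_grid={'n_estimators': [100, 200], 'max_depth': [5, 10]},   # 2. hyperparameters to test
    scoring='accuracy',        # 3. evaluation metric
    cv=5,                      # 4. 5-fold cross-validation
    verbose=1,                 # 5. print progress messages
    n_jobs=-1,                 # 6. use all available cores
    refit=True,                # 9. refit the best estimator on the whole training set
    return_train_score=True,   # 11. keep training scores in cv_results_
)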

Implementing GridSearchCV when fitting your model is just as simple:

  1. First, define the possible values of all the hyperparameters, using np.linspace for example, or just a list of values.
  2. Then, build a "grid" with all the parameter names (as in the documentation) and the possible values you would like to test.
  3. Finally, instantiate GridSearchCV and fit it to your data (X_train and y_train). Here, I am using rf_grid as the parameters to be tested, an accuracy score, and 5-fold cross-validation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_features = ['auto', 'sqrt']
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]

rf_grid = {'n_estimators': n_estimators,
           'max_features': max_features,
           'max_depth': max_depth,
           'min_samples_split': min_samples_split,
           'min_samples_leaf': min_samples_leaf}

model = GridSearchCV(RandomForestClassifier(), rf_grid, scoring='accuracy', cv=5)

# fit the model
model.fit(X_train, y_train)

There is a long list of different scoring methods that you can specify for your GridSearchCV, accuracy being the most popular for classification problems.
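Once the search has finished, the fitted object (model in the snippet above) exposes the winning combination and its cross-validated score directly:

# best combination of hyperparameters found by the search
print(model.best_params_)

# mean cross-validated accuracy of that combination
print(model.best_score_)

# full per-combination results, useful for further analysis
results = model.cv_results_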

RandomizedSearchCV

RandomizedSearchCV has the same purpose as GridSearchCV: both are designed to find the parameters that give the best model results. However, here not every combination is tested. Instead, a fixed number of parameter settings is sampled at random from the values (or distributions) you provide.

Practically, the implementation of RandomizedSearchCV is very similar to that of GridSearchCV:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_features = ['auto', 'sqrt']
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]

rf_grid = {'n_estimators': n_estimators,
           'max_features': max_features,
           'max_depth': max_depth,
           'min_samples_split': min_samples_split,
           'min_samples_leaf': min_samples_leaf}

model = RandomizedSearchCV(estimator=RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5, n_iter=100)

# fit the model
model.fit(X_train, y_train)

The main difference in the practical implementation of the two methods is that RandomizedSearchCV lets us use n_iter to specify how many parameter settings we want to sample and test.

There is an obvious trade-off between n_iter and the running time, but (depending on how many possible values you are passing) it is recommended to set n_iter to at least 100 so that we can have higher confidence in the results of the search.
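To put the trade-off in concrete numbers, the grid defined above contains 10 × 11 × 2 × 3 × 3 = 1,980 combinations, so the two searches do very different amounts of work with cv = 5:

# number of combinations in rf_grid: 10 n_estimators values x 11 max_depth values
# x 2 max_features values x 3 min_samples_split values x 3 min_samples_leaf values
n_combinations = 10 * 11 * 2 * 3 * 3
print(n_combinations)        # 1980

# total model fits with 5-fold cross-validation
print(n_combinations * 5)    # 9900 fits for GridSearchCV
print(100 * 5)               # 500 fits for RandomizedSearchCV with n_iter=100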

RandomizedSearchCV vs. GridSearchCV

With GridSearchCV, by inspecting the best_params_ attribute you are guaranteed to get the best model results (according to your scoring) within your test values, since it will test every single combination of the values you passed.

However, with RandomizedSearchCV, the more samples you draw from the set of possible values, the more confident you can be in the result, but you will never have 100% certainty (unless you end up testing every possible combination). Statistically speaking, because the sampling is completely random, with enough iterations we can be fairly confident that the best parameters found are close to the optimal combination within the grid.

The running times of RandomizedSearchCV and GridSearchCV, on the other hand, are widely different. Depending on the n_iter chosen, RandomizedSearchCV can be two, three, or four times faster than GridSearchCV. However, the higher the n_iter, the slower RandomizedSearchCV becomes and the closer it gets to GridSearchCV.
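If you want to see the difference on your own data, a rough timing comparison (assuming the rf_grid, X_train, and y_train from the snippets above) could look like this sketch:

import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# time the exhaustive search
start = time.time()
GridSearchCV(RandomForestClassifier(), rf_grid, scoring='accuracy', cv=5).fit(X_train, y_train)
grid_seconds = time.time() - start

# time the randomized search with 100 sampled combinations
start = time.time()
RandomizedSearchCV(RandomForestClassifier(), param_distributions=rf_grid,
                   n_iter=100, cv=5, scoring='accuracy').fit(X_train, y_train)
random_seconds = time.time() - start

print(f"GridSearchCV:       {grid_seconds:.1f} s")
print(f"RandomizedSearchCV: {random_seconds:.1f} s")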

In conclusion…

If you have a small number of hyperparameters but large sets of possible values, along with a model that needs a lot of processing time, then RandomizedSearchCV will save you a lot of time while still giving you a good estimate of the optimal parameters.

Furthermore, you can use the results of the search to run RandomizedSearchCV again with a smaller set of possible values, or better yet, run GridSearchCV on that smaller set once you have a rough idea of where the optimal parameters are.
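As a sketch of that coarse-to-fine workflow (the offsets below are arbitrary placeholders, assuming the rf_grid, X_train, and y_train defined earlier):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# step 1: broad randomized search over the full grid
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=rf_grid,
                                   n_iter=100, cv=5, scoring='accuracy')
random_search.fit(X_train, y_train)
best = random_search.best_params_

# step 2: exhaustive search over a small neighbourhood of the values found above,
# keeping the other winning hyperparameters fixed on the estimator
fine_grid = {
    'n_estimators': [best['n_estimators'] - 100, best['n_estimators'], best['n_estimators'] + 100],
    'max_depth': [max(best['max_depth'] - 10, 1), best['max_depth'], best['max_depth'] + 10],
}
grid_search = GridSearchCV(
    RandomForestClassifier(max_features=best['max_features'],
                           min_samples_split=best['min_samples_split'],
                           min_samples_leaf=best['min_samples_leaf']),
    fine_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)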

If your model does not take a lot of time to train, or if you already have a rough idea of where the optimal values lie (from inference or theoretical knowledge), you should definitely use GridSearchCV, as it will tell you with 100% certainty which of the parameters you passed produce the optimal model results.

Other Options

If you are still not happy with your hyperparameter tuning, make sure to check out the GPOPY algorithm: https://github.com/domus123/gpopy

GPOPY implements a Genetic Algorithm, which generates a few combinations of parameters, at each step chooses the best combinations (also called "parents"), and crosses over between them, generating new combinations with characteristics from both parent combinations. Faster than GridSearchCV, but a little bit smarter than RandomizedSearchCV!
