Examples of local minima and maxima (Image by Author — Mayrhofen 2019)

Hyperparameter Tuning Methods - Grid, Random or Bayesian Search?

A practical guide to hyperparameter optimization using three methods: grid, random, and Bayesian search (with skopt)

Maria Gorodetski
Towards Data Science
4 min read · Aug 28, 2021


When I was working on my last project, I got a new chunk of data after I had trained the first version of the model. That day, I felt a little lazy and tried to retrain my model on the new data using the same model type and hyperparameters. Unfortunately, it didn't meet my expectations. Instead of improving with the additional data, the model performed slightly worse. The added data was distributed differently than the data I had before, and, contrary to my expectation, its amount was not negligible compared to the original.

So, I found myself in unknown territory, without any hunch about which hyperparameters to use. I tried some options manually, but there were too many possible combinations, and I had trouble managing my experiments. At that point, I decided to dive deeper into the world of hyperparameter tuning.

What should you expect from this post?

  1. Introduction to hyperparameter tuning.
  2. Explanation about hyperparameter search methods.
  3. Code examples for each method.
  4. Comparison and conclusions.

For all the examples in the post, I used Kaggle’s heart-attack-analysis-prediction dataset.

I have prepared a simple pipeline that is used in all the examples.
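Below is a minimal sketch of the kind of setup I mean. It assumes the Kaggle CSV is saved locally as heart.csv, that the target column is named output, and that a scikit-learn GradientBoostingClassifier is the model being tuned; the original pipeline may differ in these details.

```python
# Shared setup used by all the search examples below.
# Assumptions: the Kaggle file is saved as "heart.csv" and the target
# column is named "output" (as in the heart-attack-analysis dataset).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("heart.csv")
X, y = df.drop(columns="output"), df["output"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaling followed by a gradient-boosting classifier; the searches below
# tune the classifier's learning_rate, max_depth, and n_estimators.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=42)),
])
```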

What is hyperparameter tuning, and why is it important?

Hyperparameters are the variables of an algorithm that control its overall behavior. They affect its speed, resolution, structure, and ultimately its performance. Sometimes their effect is small, but in other cases it is crucial. A good example is the learning rate: when it is too large, training overshoots and the results are unstable; when it is too small, the model learns very slowly and might get stuck.

When the algorithm has many parameters, it is very hard to try all the possible combinations to find the best set. For that reason, we would like to do hyperparameter tuning efficiently and in a manageable way.

Types of Hyperparameter Search

There are three main methods for performing a hyperparameter search:

  1. Grid search
  2. Random search
  3. Bayesian search

Grid Search

The basic way to perform hyperparameter tuning is to try all the possible combinations of parameters. For example, if you want to tune the learning_rate and the max_depth, you need to specify all the values you think will be relevant for the search. Then, when you run the hyperparameter tuning, every combination from both lists is tried.

In the following example, I tried to find the best values for learning_rate (5 values), max_depth (5 values), and n_estimators (5 values) — 125 iterations in total.
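A sketch of what that search could look like with scikit-learn's GridSearchCV, reusing the pipeline from above; the specific value lists here are illustrative, not the exact ones from my run.

```python
from sklearn.model_selection import GridSearchCV

# Illustrative value lists: 5 values per hyperparameter.
param_grid = {
    "model__learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
    "model__max_depth": [2, 3, 4, 5, 6],
    "model__n_estimators": [50, 100, 150, 200, 250],
}

# 5 * 5 * 5 = 125 candidate combinations, each evaluated with 5-fold CV.
grid_search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)
```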

Random Search

Unlike grid search, a randomized search tries out only part of the parameter values. The values are sampled from a given list or from a specified distribution, and the number of parameter settings that are sampled is given by n_iter. When the parameters are given as a list (as in grid search), sampling without replacement is performed; when a parameter is given as a distribution, sampling with replacement is used (recommended).

The advantage of randomized search, in my experience, is that you can extend your search limits without increasing the number of iterations (which is time-consuming). You can also use it to find narrower limits and then continue with a more thorough search over a smaller area.

In the following example, I used parameters distributions for sampling with replacement.
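Here is a sketch using scikit-learn's RandomizedSearchCV with scipy distributions; the ranges are illustrative, and the 70-iteration budget matches the comparison described later in the post.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Illustrative distributions for sampling with replacement.
param_distributions = {
    "model__learning_rate": uniform(0.001, 0.1),  # continuous range [0.001, 0.101)
    "model__max_depth": randint(2, 7),            # integers 2..6
    "model__n_estimators": randint(50, 300),      # integers 50..299
}

random_search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=70,           # number of sampled parameter settings
    scoring="accuracy",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)
print(random_search.best_params_, random_search.best_score_)
```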

Bayesian Search

The main difference between Bayesian search and the other methods is that the tuning algorithm optimizes its parameter selection in each round based on the scores of the previous rounds. Instead of choosing the next set of parameters at random, the algorithm optimizes the choice and will likely reach the best parameter set faster than the previous two methods. In other words, this method focuses on the relevant search space and discards the ranges that will most likely not deliver the best solution. It can therefore be beneficial when you have a large amount of data, learning is slow, and you want to minimize the tuning time.

As in the random search example, I used parameter distributions for sampling:
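A sketch with scikit-optimize's BayesSearchCV over the same three hyperparameters; the search-space bounds are illustrative, and the iteration budget again matches the 70 used in the comparison.

```python
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Illustrative search space; log-uniform prior for the learning rate.
search_spaces = {
    "model__learning_rate": Real(0.001, 0.1, prior="log-uniform"),
    "model__max_depth": Integer(2, 6),
    "model__n_estimators": Integer(50, 300),
}

bayes_search = BayesSearchCV(
    pipeline,
    search_spaces,
    n_iter=70,           # same budget as the randomized search
    scoring="accuracy",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
bayes_search.fit(X_train, y_train)
print(bayes_search.best_params_, bayes_search.best_score_)
```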

Visualization of parameter search — learning rate

The best learning rate in this comparison is 0.008 (found by the Bayesian search).

Visualization of the mean score for each iteration

We can see that the Bayesian search slightly outperforms the other methods. The effect is much more noticeable with larger datasets and more complex models.

Discussion and Conclusions

I ran the three search methods on the same parameter ranges. The grid search ran 125 iterations, while the random search and the Bayesian search ran 70 iterations each. This dataset is relatively simple, so the variations in scores are not very noticeable. Still, the random search and the Bayesian search performed better than the grid search with fewer iterations, and the Bayesian search found the hyperparameters that achieved the best score.

For further reading on the subject, I recommend the following excellent post: https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f.

Good luck!


I'm a data scientist at DayTwo, where I combine my programming and bioinformatics skills to solve healthcare problems.