Many Data Scientists ignore hyperparameters. Hyperparameter tuning is a highly experimental activity, and that kind of uncertainty can make anyone deeply uncomfortable, something we naturally try to avoid.
"Hyperparameter tuning relies more on experimental results than theory, and thus the best method to determine the optimal settings is to try many different combinations […]." – Will Koehrsen, Hyperparameter tuning the Random Forest in Python
Unfortunately, it has to be done. We don’t walk into a store, pick a pair of trainers off the shelf, and buy them. We first select a shoe we believe will solve our problem, whether that’s a wardrobe malfunction or the fact that, for whatever reason, we’ve lost all our trainers. Next, we tune the hyperparameters, such as the size of the shoe and the color we want, before we make the purchase.
If we are willing to do this in the real world, it shouldn’t be skipped in Data Science.
Understanding Hyperparameters
Hyperparameter optimization is the problem of selecting the optimal set of hyperparameters for a learning algorithm. Determining the right combination of hyperparameters maximizes the model’s performance, meaning our learning algorithm makes better decisions when given unseen instances.
The values we select as hyperparameters control the learning process, which makes them different from ordinary parameters: they are chosen before the learning algorithm is trained.
Formally, model hyperparameters are parameters that cannot be estimated by the model from the data, so they need to be set beforehand in order to estimate the model’s parameters. Model parameters, in contrast, are estimated by the learning algorithm from the provided data.
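To make the distinction concrete, here is a minimal sketch using Scikit-Learn’s LogisticRegression (my own illustrative example, not one taken from the libraries discussed below): the regularization strength C is a hyperparameter we choose up front, while the coefficients are parameters the model estimates from the data.

```python
# Hyperparameters vs. parameters, illustrated with scikit-learn's LogisticRegression.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# C (inverse regularization strength) is a hyperparameter: we set it before training.
model = LogisticRegression(C=0.5, max_iter=1000)

# The coefficients and intercept are model parameters: they are estimated from the data.
model.fit(X, y)
print(model.coef_)       # learned parameters
print(model.intercept_)  # learned parameters
```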
Approaches
There are a number of approaches to performing hyperparameter optimization efficiently; see Hyperparameter Optimization on Wikipedia for a full breakdown. Mind Foundry conducted a survey on Twitter to gauge practitioners’ sentiments about these techniques.
Let’s learn more about each of them and how to perform them in Python.
Bayesian Optimization
Wikipedia describes Bayesian Optimization as "a global optimization method for noisy black-box functions. Applied to hyperparameter optimization, Bayesian optimization builds a probabilistic model of the function mapping from hyperparameter values to the objective evaluated on a validation set. By iteratively evaluating a promising hyperparameter configuration based on the current model, and then updating it, Bayesian optimization aims to gather observations revealing as much information as possible about this function and, in particular, the location of the optimum. It tries to balance exploration (hyperparameters for which the outcome is most uncertain) and exploitation (hyperparameters expected close to the optimum). In practice, Bayesian optimization has been shown to obtain better results in fewer evaluations compared to grid search and random search, due to the ability to reason about the quality of experiments before they are run." [Source: Wikipedia].
Scikit-Optimize (skopt) is an optimization library that has a Bayesian optimization implementation. I’d recommend using this implementation rather than rolling your own, although writing one yourself is a worthwhile exercise if you’d like to dive deeper into how Bayesian optimization works. See the code below for an example.
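Below is a minimal sketch using skopt’s BayesSearchCV; the RandomForestClassifier, the iris dataset, and the specific search ranges are illustrative assumptions on my part, not requirements of the method.

```python
# Bayesian optimization of a random forest's hyperparameters with Scikit-Optimize.
from skopt import BayesSearchCV
from skopt.space import Categorical, Integer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Each hyperparameter gets a search space rather than a fixed list of values.
search_space = {
    "n_estimators": Integer(50, 500),
    "max_depth": Integer(2, 20),
    "max_features": Categorical(["sqrt", "log2"]),
}

opt = BayesSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    search_spaces=search_space,
    n_iter=32,            # number of configurations evaluated in total
    cv=5,                 # 5-fold cross-validation scores each configuration
    scoring="accuracy",
    random_state=42,
)

opt.fit(X, y)
print(opt.best_params_)
print(opt.best_score_)
```

Each new configuration is chosen based on the results of the previous ones, which is why Bayesian optimization typically needs fewer evaluations than grid or random search.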
Grid Search
Grid search was the first technique I learned for performing hyperparameter optimization. It consists of exhaustively searching through a manually specified subset of the hyperparameter space of a learning algorithm. Performing Grid search requires a performance metric to guide the search.
Rather than implementing Grid search from scratch, it’s highly recommended that you use the Scikit-Learn implementation, GridSearchCV. See the code below for an example.
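Here is a minimal sketch with GridSearchCV; as before, the RandomForestClassifier, the iris dataset, and the parameter grid are illustrative choices of mine.

```python
# Exhaustive grid search over a hand-picked set of hyperparameter values.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Every combination in the grid is evaluated: 3 x 3 x 2 = 18 candidates per fold.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [5, 10, None],
    "max_features": ["sqrt", "log2"],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",   # the performance metric guiding the search
    n_jobs=-1,            # use all available CPU cores
)

grid_search.fit(X, y)
print(grid_search.best_params_)
print(grid_search.best_score_)
```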
Random Search
Instead of exhaustively enumerating every combination you list, as Grid search does, Random search selects combinations at random. When only a small number of hyperparameters actually affect the final model’s performance, Random search can outperform Grid search. Despite its low rating in the survey above, Random search is still an important technique to have in your toolkit.
Like Grid search, Random search has a Scikit-Learn implementation, RandomizedSearchCV, that is better to use than rolling your own solution. See the code below.
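A minimal sketch with RandomizedSearchCV follows; the estimator, dataset, sampling distributions, and budget of 32 iterations are again my own illustrative assumptions.

```python
# Random search samples hyperparameter combinations instead of enumerating them all.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions (or lists) to sample from; only n_iter combinations are tried.
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 20),
    "max_features": ["sqrt", "log2"],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=32,            # budget: number of random configurations to evaluate
    cv=5,
    scoring="accuracy",
    random_state=42,
    n_jobs=-1,
)

random_search.fit(X, y)
print(random_search.best_params_)
print(random_search.best_score_)
```

The interface mirrors GridSearchCV, so swapping between the two approaches is straightforward.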
Final Thoughts
In my experience, having a good understanding of the learning algorithm you’re using, and of how its hyperparameters affect its behavior, helps when performing hyperparameter optimization. Although it’s one of the most important tasks, I feel hyperparameter tuning doesn’t get the recognition it deserves, or maybe I simply don’t see it discussed enough. Nevertheless, it’s an extremely important part of your project and should never be overlooked.
Thanks for Reading!
If you enjoyed this article, connect with me by subscribing to my **FREE** weekly newsletter. Never miss a post I make about Artificial Intelligence, Data Science, and Freelancing.