State-of-the-Art Machine Learning Hyperparameter Optimization with Optuna

Optuna is an advanced hyperparameter optimization framework with visualizations for interpretability

Yenwee Lim
Towards Data Science



Introduction

In this article, we will discuss Optuna, a hyperparameter optimization software framework designed particularly for machine learning pipelines. Optuna lets users adopt state-of-the-art algorithms for sampling hyperparameters and pruning unpromising trials. Compared to traditional methods such as grid search, this greatly reduces optimization time and improves performance. It also allows users to plot optimization histories for a better understanding of the model.

Before we jump right into it, let’s have a short refresher on hyperparameters and discuss the differences between traditional optimization methods and the latest and more advanced optimization methods that can be applied in Optuna’s framework.

Overview of hyperparameters

Hyperparameters are variables that control a machine learning model’s learning process. They ultimately dictate how the model learns a specific relationship between input and predictions.

A very simple example: whether or not to fit an intercept in a simple linear regression is a hyperparameter of the linear regression model.
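For instance, here is a minimal scikit-learn sketch of that hyperparameter (the model objects are purely illustrative):

from sklearn.linear_model import LinearRegression

# fit_intercept is a hyperparameter: it is chosen before training,
# unlike the coefficients, which are learned from the data.
model_with_intercept = LinearRegression(fit_intercept=True)
model_without_intercept = LinearRegression(fit_intercept=False)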


Optimizing a model’s hyperparameters for a specific problem is often necessary. Here are some reasons why:

  • Machine learning hyperparameters are not one-size-fits-all: different problems and different out-of-sample data patterns can require different constraints for a model to generalize well.
  • Hyperparameter optimization yields an optimal model with the best set of hyperparameters, i.e. one capable of producing optimal results that minimize the loss function.

Overview of optimization methods

The difficulty with these optimization problems is that the search space can be effectively unbounded while the budget for the search is constrained (a maximum amount of time or number of iterations). Therefore, to find the global minimum efficiently, an algorithm has to use the budget wisely.

Traditional methods


The traditional way of performing hyperparameter optimization has been exhaustive and uninformed by previous results. Some examples are:

  • Grid search exhaustively searches through a predefined subset of the hyperparameter space.
  • Random search samples randomly from a predefined subset of the hyperparameter space.

As data volume and hyperparameter space grow, these traditional search methods deteriorate rapidly: they either take too much time or fail to locate the minimum at all.

Bayesian method


Bayesian optimization methods search for the global optimum by iteratively building a probabilistic model of the function that maps hyperparameter values to the objective. This probabilistic model captures beliefs about the behavior of the function and forms a posterior distribution over the objective function.

The posterior distribution is then used to form an acquisition function that determines the next point with the highest probability of improvement.

However, because such methods rely on the information gathered so far to make decisions, an exploration vs. exploitation trade-off arises.

  • Exploration is similar to a global search; we are interested in exploring the hyperparameter search space looking for even better solutions.
  • Exploitation is similar to a local search; we want to refine our current solution and try to avoid wasting precious resources on undesirable search spaces.

Multiple algorithms have been proposed to balance the exploration-exploitation dilemma optimally. Some examples are the Tree-structured Parzen Estimator (TPE) and Gaussian process regression (e.g. Kriging).

Early stopping method


Early stopping-based hyperparameter optimization algorithms use statistical tests to discard regions of the search space that perform poorly and are unlikely to contain the global minimum.

It works by checking the intermediate scores of a model trained with a specific set of hyperparameters. For example, suppose 16 different sets of hyperparameters are selected for a neural network. After five epochs, we check all the intermediate scores and discard the hyperparameter sets that perform badly.

Two popular early stopping optimization algorithms are successive halving (SHA) and Hyperband.

However, for models that do not yield intermediate results, early-stopping-based methods are not practical.

Evolutionary method


Evolutionary optimization methods search the hyperparameter space using evolutionary algorithms, inspired by Charles Darwin’s concept of evolution. The process generally follows the biological concept of evolution:

  • An initial population of hyperparameter sets is sampled from the search space.
  • The hyperparameter sets are evaluated using a fitness function and ranked by their relative fitness.
  • The worst-performing hyperparameter sets are discarded, and new ones are generated through crossover and mutation.
  • The evolutionary process is repeated until a budget constraint is reached or performance stops improving.

Introducing Optuna


Optuna is a hyperparameter optimization software framework that makes it easy to apply different state-of-the-art optimization methods and perform hyperparameter optimization quickly and effectively.

By default, Optuna uses a Bayesian optimization algorithm (TPE), but it can easily be switched to any of the other algorithms available in the package.
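For example, the sampler can be chosen when creating a study; the snippet below is a minimal sketch using samplers that ship with Optuna:

import optuna

# TPE (the default Bayesian sampler) is used when no sampler is given.
study_tpe = optuna.create_study(sampler=optuna.samplers.TPESampler())

# Switching strategies only requires changing this argument.
study_random = optuna.create_study(sampler=optuna.samplers.RandomSampler())
study_cmaes = optuna.create_study(sampler=optuna.samplers.CmaEsSampler())  # needs the cmaes package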

Optimization algorithms in Optuna

Optuna groups its optimization algorithms into two categories: sampling strategies and pruning strategies.

  • Sampling strategies: algorithms that select the next hyperparameter combinations by concentrating on areas of the search space where hyperparameters have given better results.
  • Pruning strategies: early-stopping-based optimization methods, as discussed above.

We will briefly discuss the intuition behind one example of each of the three types of advanced algorithms discussed in the previous section, so that we have a general understanding of how they work.

However, we will not go too deep, as that would take considerable time to explain. These algorithms are all readily defined in the package and available for production use.

Sampling Strategy


TPESampler (Tree-Structured Parzen Estimator):

A Bayesian optimization algorithm that works as follows (a toy sketch of the selection step appears after the list):

  1. First, a set of hyperparameter configurations is sampled randomly and sorted by their objective scores.
  2. The observations are divided into two groups at a predefined quantile.
  3. The two groups are modeled as estimated densities l(x) (the well-performing group) and g(x) (the poorly-performing group) using Parzen estimators (kernel density estimators).
  4. The hyperparameters with the highest expected improvement [highest l(x)/g(x)] are located.
  5. These hyperparameters are evaluated, and the observations are sorted and divided again. The process repeats until the budget is exhausted, and the best hyperparameters are returned.
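To make this intuition concrete, here is a toy sketch of the selection step for a single hyperparameter (an illustration using SciPy kernel density estimators, not Optuna’s actual implementation; the example data at the bottom is made up):

import numpy as np
from scipy.stats import gaussian_kde

def tpe_style_selection(observed_x, observed_scores, candidates, gamma=0.25):
    # Pick the candidate value with the highest l(x) / g(x) ratio.
    observed_x = np.asarray(observed_x, dtype=float)
    candidates = np.asarray(candidates, dtype=float)

    order = np.argsort(observed_scores)[::-1]         # best scores first
    n_good = max(2, int(gamma * len(observed_x)))     # need >= 2 points per KDE
    good = observed_x[order[:n_good]]                 # well-performing group -> l(x)
    bad = observed_x[order[n_good:]]                  # poorly-performing group -> g(x)

    l_density = gaussian_kde(good)
    g_density = gaussian_kde(bad)

    ratio = l_density(candidates) / (g_density(candidates) + 1e-12)
    return candidates[int(np.argmax(ratio))]

# Pretend we have already evaluated 20 values of some hyperparameter.
rng = np.random.default_rng(0)
tried = rng.uniform(0.0, 1.0, size=20)
scores = -(tried - 0.3) ** 2                          # objective peaks near 0.3
print(tpe_style_selection(tried, scores, rng.uniform(0.0, 1.0, size=50)))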

NSGAIISampler (Non-dominated sorting genetic algorithm II)

An evolutionary optimization algorithm for problems with multiple objective functions (a minimal usage sketch in Optuna follows the process steps below).

An individual A is said to dominate another individual B if:

  • no objective of A is worse than the corresponding objective of B, and
  • at least one objective of A is better than the corresponding objective of B.

Main process of the algorithm:

  1. Initially, a random parent population is sampled, and each evaluated solution is assigned a fitness rank equal to its nondomination level: all non-dominated solutions in population P are assigned rank 1, then all non-dominated solutions among the remainder are assigned rank 2, and so on, until every individual has been assigned a rank.
  2. Two random trials are picked and the better one becomes Parent 1. The process is repeated to select Parent 2 [binary tournament selection].
  3. The two parents recombine to produce offspring, which go into the child population [recombination]. Each child then undergoes mutation, changing some of its values [mutation]. This step is repeated until the combined population is twice the initial population size.
  4. The combined population is sorted by nondomination again, and the new generation is filled in order of rank. If a rank can only be partially included, crowding-distance sorting is used to measure the density of solutions, and the less crowded trials are chosen until the population returns to its initial size.
  5. New generations are produced and culled iteratively in this way until the maximum number of generations is reached, and the best hyperparameters are returned.
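In Optuna, NSGA-II is typically used for multi-objective studies. The sketch below shows how the sampler is selected; the two objective functions are made up purely for illustration:

import optuna

def objective(trial):
    x = trial.suggest_float("x", -5.0, 5.0)
    y = trial.suggest_float("y", -5.0, 5.0)
    # Two competing objectives: NSGA-II searches for the Pareto front between them.
    return x ** 2 + y ** 2, (x - 2) ** 2 + (y - 2) ** 2

study = optuna.create_study(
    directions=["minimize", "minimize"],   # one direction per objective
    sampler=optuna.samplers.NSGAIISampler(),
)
study.optimize(objective, n_trials=50)
print(len(study.best_trials))              # Pareto-optimal trials found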

Some other popular sampling strategies: CMA-ES Sampler, MOTPE Sampler

Pruning Strategy


SuccessiveHalvingPruner (Asynchronous Successive Halving)

  1. Randomly select an initial set of hyperparameter values.
  2. Train each trial for one epoch in the first rung until the defined maximum number of trials is reached.
  3. At the same time, a trial is asynchronously promoted to the next rung (similar to a rank) to train for more epochs whenever its score is within the top 1/η of trials in its rung, where η is a predefined reduction factor (divisor).

Note that this differs from synchronous successive halving, where the algorithm waits for all trials in a rung to finish their epochs and only then decides which trials have the top scores and should be promoted to train for more epochs in the next rung.
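In Optuna, this behaviour is configured through the pruner’s constructor; the values below are only illustrative, not a recommendation:

import optuna

# Asynchronous successive halving: a trial is promoted to the next rung
# as soon as it ranks within the top 1/reduction_factor of its rung.
pruner = optuna.pruners.SuccessiveHalvingPruner(
    min_resource=1,        # budget (e.g. epochs) a trial gets in the first rung
    reduction_factor=4,    # keep roughly the top quarter of trials at each rung
)
study = optuna.create_study(direction="maximize", pruner=pruner)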

Some other popular pruning strategies: MedianPruner, HyperbandPruner

Code Implementation


The objective function

Before we start implementing optimization with Optuna, we need to define an objective function.
The objective function contains the entire logic of a regular model definition, training, and evaluation process. After evaluating the model, it should return the evaluation metric chosen by the user.

The trial

The Trial class stores the information about one specific combination of hyperparameters that the machine learning model will later use.
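As a brief illustration, the trial object exposes suggest_* methods for different parameter types; the names and ranges below are hypothetical, and the full example that follows uses the same API with a real model:

def objective(trial):
    # Each suggest_* call asks the sampler for a value of the given type and range.
    n_layers = trial.suggest_int("n_layers", 1, 5)
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True)
    optimizer_name = trial.suggest_categorical("optimizer", ["adam", "sgd"])
    # A real objective would build, train, and evaluate a model here;
    # a dummy score keeps this sketch self-contained.
    return learning_rate * n_layers if optimizer_name == "adam" else learning_rate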

import optuna
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Load the digits data set.
    digits = datasets.load_digits()
    x, y = digits.data, digits.target

    # Ask Optuna to suggest values for this trial's hyperparameters.
    max_depth = trial.suggest_int("rf_max_depth", 2, 64, log=True)
    max_samples = trial.suggest_float("rf_max_samples", 0.2, 1)

    rf_model = RandomForestClassifier(
        max_depth=max_depth,
        max_samples=max_samples,
        n_estimators=50,
        random_state=42,
    )
    # Return the mean 3-fold cross-validation accuracy as the objective value.
    score = cross_val_score(rf_model, x, y, cv=3).mean()
    return score

The study

A Study object can then be called to optimize the objective function and find the best hyperparameter combination. It runs trials iteratively until a user-defined maximum number of trials or amount of time is reached. The trial with the best hyperparameters is stored in study.best_trial.

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

trial = study.best_trial
print("Best Score: ", trial.value)
print("Best Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Code output by author

After that, we can plot some of the visualizations provided by the Optuna package to understand the relationships between the hyperparameters.

optuna.visualization.plot_contour(study)
Contour plot image by author
optuna.visualization.plot_param_importances(study)
Hyperparameter importances plot by author

We can see that the main hyperparameter affecting the model’s performance is max_depth. The contour plot shows that performance increases as max_depth increases, regardless of the value of max_samples.
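Other built-in visualizations can be generated in the same way if we want to dig further; for example, a slice plot and a parallel coordinate plot (optional extras, not shown here):

optuna.visualization.plot_slice(study)
optuna.visualization.plot_parallel_coordinate(study)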

Implementing Pruning

In cases where a machine learning model does not yield any intermediate results, implementing pruning is not practical. However, where intermediate values are available, such as when training neural networks, pruning is a great technique for reducing computation time.

One example of a model that does produce intermediate results is a classifier trained with stochastic gradient descent (scikit-learn’s SGDClassifier).

First, we initialize an objective function.

import optuna
from optuna.pruners import SuccessiveHalvingPruner
from optuna.samplers import TPESampler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
def objective(trial):
    # Load the data set and split it into train and validation sets.
    iris = load_iris()
    classes = list(set(iris.target))
    train_x, valid_x, train_y, valid_y = train_test_split(
        iris.data, iris.target, test_size=0.25
    )

    # Prompt Optuna to suggest a hyperparameter value.
    alpha = trial.suggest_float("alpha", 1e-5, 1e-1, log=True)
    sgd = SGDClassifier(alpha=alpha, random_state=42)

    # Train incrementally and report the intermediate score at every step.
    for step in range(100):
        sgd.partial_fit(train_x, train_y, classes=classes)

        # Report the intermediate objective value.
        intermediate_value = sgd.score(valid_x, valid_y)
        trial.report(intermediate_value, step)

        # Prune the trial if necessary.
        if trial.should_prune():
            raise optuna.TrialPruned()

    return sgd.score(valid_x, valid_y)

After that, we can create a study with Optuna and optimize the objective function.

study = optuna.create_study(
    sampler=TPESampler(),
    pruner=SuccessiveHalvingPruner(),
    direction="maximize",
)
study.optimize(objective, n_trials=20)

pruned_trials = study.get_trials(states=[optuna.trial.TrialState.PRUNED])
complete_trials = study.get_trials(states=[optuna.trial.TrialState.COMPLETE])
print("# Pruned trials: ", len(pruned_trials))
print("# Complete trials: ", len(complete_trials))

trial = study.best_trial
print("Best Score: ", trial.value)
print("Best Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))
Code output by author

We can then look at two more plots provided by Optuna that help us check the optimization history and see which trials were pruned.

optuna.visualization.plot_optimization_history(study)
Optimization History Plot by author
optuna.visualization.plot_intermediate_values(study)

We can see how the pruner works: it stops trials whose intermediate values are poor so that training focuses on the more promising hyperparameters.

Takeaway

Optuna allows us to implement state-of-the-art optimization algorithms to speed up the hyperparameter tuning process that is essential in a machine learning pipeline.

However, we must understand that these advanced algorithms are built to search efficiently for the best objective value when each iteration of the model training process is expensive.

  • If we are only training on a small amount of data with a less complex model, it is possible for grid search and random search to beat these algorithms in terms of speed and performance.
  • That said, as data volume grows and models become more complex, the cost of blindly training a set of hyperparameters without any prior information increases tremendously, and these advanced algorithms greatly outperform traditional methods.

Thank you so much for taking the time to read this piece.


References

[1] T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama. Optuna: A Next-generation Hyperparameter Optimization Framework (2019), arXiv.

[2] L. Li, K. Jamieson, A. Rostamizadeh, E. Gonina, M. Hardt, B. Recht, A. Talwalkar. A System for Massively Parallel Hyperparameter Tuning (2018), arXiv.

[3] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization (2011), In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS’11). Curran Associates Inc., Red Hook, NY, USA, 2546–2554.

[4] K. Deb, A. Pratap, S. Agarwal and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II (2002), IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197. doi: 10.1109/4235.996017.
