
Getting Accurate Scikit-learn Models Using Optuna: A Hyper-parameter Framework


Hyper-parameter frameworks have been a hot topic of discussion over the past couple of months. With several packages released and still in development, picking one has become a tough choice. Such frameworks not only help fit an accurate model but can boost a data scientist's efficiency to the next level. Here I show how a recently popular framework, Optuna, can be used to find the best hyper-parameters for any scikit-learn model. I have implemented only Random Forest and Logistic Regression as examples, but other algorithms can be implemented in the same way shown here.

Why Optuna?

Optuna can become one of your work-horse tools if you integrate it into your everyday experimentation. I was deeply impressed by how little effort it took to implement Logistic Regression with Optuna. Here are a few reasons why I like it:

  • Easy-to-use API
  • Great documentation
  • Flexibility to accommodate almost any algorithm
  • Features like pruning and great built-in visualization modules

Documentation: https://optuna.readthedocs.io/en/stable/index.html

Github: https://github.com/optuna/optuna


Before we start looking at the functionality, we need to make sure the prerequisite packages are installed:

  1. Optuna
  2. Plotly
  3. Pandas
  4. Scikit-Learn

Defining the basic parameters:

Setting up the basic framework is pretty simple and straightforward. It can be divided broadly into 4 steps:

  1. Define an objective function (Step 1)
  2. Define a set of hyper-parameters to try (Step 2)
  3. Define the variable/metric you want to optimize (Step 3)
  4. Finally, run the function. Here you need to specify:
  • whether the scoring function/variable you are trying to optimize is to be maximized or minimized
  • the number of trials you want to make. The more hyper-parameters and the more trials you define, the more computationally expensive the search is (unless you have a beefy machine or a GPU!)

In the Optuna world, the term Trial means a single call of the objective function, and multiple such trials together are called a Study.

The following is a basic implementation of Random Forest and Logistic Regression from the scikit-learn package:
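The original gist is not reproduced in this copy, so below is a minimal sketch of what such an objective function can look like. The dataset (load_breast_cancer), the search ranges, and the parameter names are my own assumptions; the four steps map onto the list above:

```python
import optuna
import sklearn.datasets
import sklearn.ensemble
import sklearn.linear_model
import sklearn.model_selection


# Step 1: the objective function Optuna calls once per trial
def objective(trial):
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)

    # Step 2: the hyper-parameter search space
    classifier_name = trial.suggest_categorical("classifier", ["LogReg", "RandomForest"])
    if classifier_name == "LogReg":
        logreg_c = trial.suggest_float("logreg_c", 1e-10, 1e10, log=True)
        classifier_obj = sklearn.linear_model.LogisticRegression(C=logreg_c, max_iter=1000)
    else:
        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32)
        rf_n_estimators = trial.suggest_int("rf_n_estimators", 10, 200)
        classifier_obj = sklearn.ensemble.RandomForestClassifier(
            max_depth=rf_max_depth, n_estimators=rf_n_estimators
        )

    # Step 3: the metric to optimize (mean cross-validated accuracy)
    score = sklearn.model_selection.cross_val_score(classifier_obj, X, y, cv=3)
    return score.mean()


# Step 4: run the study; accuracy should be maximized
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
```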

When you run the above code, the output will look something like the following:

Output in terminal or Jupyter notebook

As we can see above, the selection of Logistic Regression or Random Forest, along with their respective parameters, varies in each run. Each trial can use a different algorithm with different parameters. The study object stores a variety of outputs, which can be retrieved as follows:
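A few of the most useful ones, using attributes from the Optuna Study API:

```python
# Best score achieved across all trials
print(study.best_value)

# Hyper-parameters (and algorithm choice) of the best trial
print(study.best_params)

# The full FrozenTrial object for the best trial
print(study.best_trial)
```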

As we can see here, Random Forest with n_estimators of 153 and max_depth of 21 works best for this dataset.

Defining parameter spaces:

If we look at Step 2 (basic_optuna.py), we defined our hyper-parameter C to take a log-distributed range of float values. Similarly, for Random Forest we defined max_depth and n_estimators as parameters to optimize. Optuna supports five ways in which we can define the parameters:
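The original snippet listing them is not shown in this copy; as a sketch, the five suggest methods in the classic Optuna API are the following (note that recent Optuna versions fold the three float variants into a single suggest_float with log and step arguments; the names and ranges below are illustrative):

```python
def objective(trial):
    # 1. Categorical: pick one value from a list of choices
    classifier = trial.suggest_categorical("classifier", ["RandomForest", "LogReg"])
    # 2. Integer: uniform integers in [low, high]
    n_estimators = trial.suggest_int("n_estimators", 10, 200)
    # 3. Uniform float: floats sampled uniformly in [low, high]
    subsample = trial.suggest_uniform("subsample", 0.0, 1.0)
    # 4. Log-uniform float: floats sampled on a log scale (as we did for C)
    c = trial.suggest_loguniform("C", 1e-10, 1e10)
    # 5. Discrete uniform float: floats in [low, high] with a fixed step q
    max_features = trial.suggest_discrete_uniform("max_features", 0.1, 1.0, 0.1)
    ...
```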

Historical Studies:

Photo by Tanner Mardis on Unsplash

I feel one of the essential needs of a data scientist is keeping track of all their experiments. This helps not only to compare any two, three, or more of them, but also to understand how the model behaves with a change in hyper-parameters, the addition of new features, and so on. Optuna has built-in functionality to keep a record of all experiments. Before accessing old experiments, we need to store them. The code below shows how to do both:

  1. You can create an experiment with a name of your choice
  2. Store it in Relational Database (RDB) form. I am using SQLite here; the other options are PostgreSQL and MySQL. You can also store and load the study with joblib as a local .pkl file
  3. Continue running the study after that
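A minimal sketch, assuming a local SQLite file and the objective function defined earlier (the study and file names are placeholders):

```python
import joblib
import optuna

# 1. Name the experiment and 2. back it with RDB (SQLite) storage
study = optuna.create_study(
    study_name="rf_logreg_experiment",
    storage="sqlite:///optuna_experiments.db",
    direction="maximize",
    load_if_exists=True,  # resume the study if it already exists
)

# 3. Continue optimizing; new trials are appended to the stored history
study.optimize(objective, n_trials=50)

# Alternative: persist/restore the study with joblib, like a trained model
joblib.dump(study, "study.pkl")
study = joblib.load("study.pkl")
```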

Here storage is an extra parameter to be passed to the create_study function if you would like to use the RDB storage option. Also, setting load_if_exists=True will load an already existing study. With joblib it is similar to how a trained model is stored and loaded. Running again will continue optimizing for the given number of trials, picking up from the last stored trial.

One of the best ways to get a detailed overview of all the trials in the study is:
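The method is trials_dataframe, which exports the study history as a pandas DataFrame:

```python
df = study.trials_dataframe()
print(df.head())
```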

Results of all the trials

As you can see from the results above, there are NaN values in columns/parameters that do not apply to a given algorithm. Beyond this, we can also calculate the total time each trial takes to compute. The time consideration is sometimes essential because it tells us whether particular sets of parameters take much longer to fit than others.
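For instance, here is a small sketch that derives per-trial run time from the timestamp columns trials_dataframe produces:

```python
# Wall-clock time each trial took to complete
df["elapsed"] = df["datetime_complete"] - df["datetime_start"]
print(df[["number", "value", "elapsed"]].sort_values("elapsed", ascending=False).head())
```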

Distributed Optimization:

You can run multiple optimization jobs on your machine, and running distributed hyper-parameter optimization with Optuna is pretty simple. I consider this one of the boons of using Optuna as a hyper-parameter optimization framework. Consider the same objective function we defined, stored in a Python file called optimize.py. The parameter suggestions will be based on the shared trial history and updated whenever we run the script in multiple terminals:
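Here is a sketch of what optimize.py can look like. Because the study lives in shared RDB storage, every process that loads it contributes trials to the same history (the file and study names are the placeholders assumed above, and the objective is shortened to Random Forest only):

```python
# optimize.py -- run "python optimize.py" in several terminals at once
import optuna
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection


def objective(trial):
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    clf = sklearn.ensemble.RandomForestClassifier(
        max_depth=trial.suggest_int("rf_max_depth", 2, 32),
        n_estimators=trial.suggest_int("rf_n_estimators", 10, 200),
    )
    return sklearn.model_selection.cross_val_score(clf, X, y, cv=3).mean()


if __name__ == "__main__":
    # All processes share one study through the SQLite file, so parameter
    # suggestions account for trials finished in the other terminals
    study = optuna.create_study(
        study_name="rf_logreg_experiment",
        storage="sqlite:///optuna_experiments.db",
        direction="maximize",
        load_if_exists=True,
    )
    study.optimize(objective, n_trials=20)
```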

Running it in two separate terminals, the output will look something like this:

Terminal 1 Output
Terminal 2 Output

Comparing the Terminal 1 and Terminal 2 outputs, we can see that different parameters are selected for Random Forest and Logistic Regression. In Terminal 1, only Random Forest was selected for all the trials; in Terminal 2, only one trial used Logistic Regression. You can also see that the trial numbers differ between the two outputs. Moreover, if you have multiple cores available, you can increase the number of parallel jobs with the n_jobs parameter, making it even faster.

Adding Attributes:

Photo by Patrick Perkins on Unsplash

Making notes or attaching attributes to an experiment can help a lot when evaluating historical experiments. Users can add key-value pairs using the set_user_attr method on both trials and studies, as shown below:
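A minimal sketch of setting and reading attributes; the keys and values are arbitrary examples:

```python
import optuna
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection


def objective(trial):
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    clf = sklearn.ensemble.RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 10, 200)
    )
    trial.set_user_attr("cv_folds", 3)  # note attached to this single trial
    return sklearn.model_selection.cross_val_score(clf, X, y, cv=3).mean()


study = optuna.create_study(direction="maximize")
study.set_user_attr("dataset", "breast_cancer")  # note attached to the whole study
study.optimize(objective, n_trials=5)

print(study.user_attrs)             # {'dataset': 'breast_cancer'}
print(study.best_trial.user_attrs)  # {'cv_folds': 3}
```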

One can access the attributes using the user_attrs property on a study object, as shown in the code above. I consider this a nice accessory to have when using Optuna.

Pruning:

Image source: https://cornellfarms.com/blogs/resources/pruning-fruit-trees

As shown in the picture, pruning removes unwanted or extra branches from a tree. Similarly, in the world of machine learning algorithms, pruning is the process of removing sections that provide little power to a classifier, which helps reduce overfitting and, in return, provides better accuracy. Machine learning practitioners will be familiar with the term early stopping, which is similar to how pruning works in Optuna: unpromising trials are stopped early based on their intermediate results.
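The pruning.py referenced below is not reproduced in this copy, so here is a minimal sketch in the same spirit: an SGDClassifier is trained incrementally so that intermediate validation scores can be reported and unpromising trials pruned (the model choice, alpha range, and step count are my assumptions):

```python
import numpy as np
import optuna
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection


def objective(trial):
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    X_train, X_valid, y_train, y_valid = sklearn.model_selection.train_test_split(
        X, y, random_state=0
    )
    alpha = trial.suggest_float("alpha", 1e-5, 1e-1, log=True)
    clf = sklearn.linear_model.SGDClassifier(alpha=alpha, random_state=0)

    for step in range(100):
        clf.partial_fit(X_train, y_train, classes=np.unique(y_train))
        # Report the intermediate validation score for this step
        trial.report(clf.score(X_valid, y_valid), step)
        # Stop the trial early if it compares poorly with earlier trials
        if trial.should_prune():
            raise optuna.TrialPruned()

    return clf.score(X_valid, y_valid)


study = optuna.create_study(direction="maximize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=20)
```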

Output after running pruning.py (code above)

As shown above, if a trial's intermediate result is not better than those of earlier trials, the trial gets pruned (here Trial 7 and Trial 10). Other types of pruners are also available in Optuna, like MedianPruner, NopPruner, PercentilePruner, SuccessiveHalvingPruner, etc. You can try them and get more information here.

Visualization:

Apart from the capabilities shown above, Optuna offers pre-written visualization functions. This visualization module is the cherry on the cake and makes it even easier to understand how an algorithm fits. I have plotted some of them below:

Optimization History Plot:

The optimization history plot shows the objective value obtained for each trial along with the best value so far. The best value stays a flat line until the next best value is achieved, as shown in the plot below:
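Assuming the study object from earlier, the call is a one-liner (the figure renders via Plotly):

```python
optuna.visualization.plot_optimization_history(study).show()
```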

Optimization History Plot

Parallel Coordinate Plot:

The parallel coordinate plot helps in understanding the high-dimensional parameter relationships in a study. Here we pick just Random Forest, so we see the parameters max_depth and n_estimators on the x-axis versus the objective value on the y-axis. We can see that the trial with the best optimization value has max_depth ~ 23 and n_estimators ~ 200.
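The corresponding call; the params argument narrows the plot to the Random Forest parameters (the parameter names here match the ones assumed in my earlier sketch):

```python
optuna.visualization.plot_parallel_coordinate(
    study, params=["rf_max_depth", "rf_n_estimators"]
).show()
```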

Parallel Coordinate Plot

Slice Plot:

This plot shows the relationship between each parameter passed to the optimizer and the objective value. The slice plot is similar to the parallel coordinate plot shown above.
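Again a one-line call, with the same assumed parameter names:

```python
optuna.visualization.plot_slice(study, params=["rf_max_depth", "rf_n_estimators"]).show()
```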

Slice Plot

All the visualizations shown here use just two parameters. They become even more useful when multiple parameters are involved: the plots grow denser and can paint a clearer picture of the relationships between the parameters. Many other visualizations are implemented and can be found here.

Conclusion:

Optuna is not limited to scikit-learn algorithms. Neural-network libraries like TensorFlow and Keras, gradient-boosting libraries like XGBoost and LightGBM, and many more can also be optimized with this fantastic framework. Some examples by Optuna contributors can already be found here. Optuna is one of the most versatile frameworks I have come across. As I mentioned before, what sets it apart is its excellent documentation, support for virtually all algorithms, and the flexibility to modify it as needed. Beyond this, the Optuna community has already built many wrappers on top of the framework, and it is still growing, taking care of a lot of the heavy lifting. Overall, Optuna is a tremendous hyper-parameter framework to include in your data science toolkit, and I would highly recommend it to any data scientist.

