
MLOps – Hyperparameter Tuning with MLflow and Hydra Sweeps

Learn how to build an efficient pipeline with Hydra and MLflow

Photo by Leo_Visions on Unsplash

Introduction

When we develop Machine Learning models, we usually need to run lots of experiments to figure out which hyperparameter setting is best for a given algorithm. This can often lead to dirty code and losing track of which result matches which setting. I’ve often seen people hard-code the hyperparameters, launch the experiment, and write down the result in an Excel file. I am sure we can improve this workflow.

In my last article, I talked about pipelining using MLflow.

MlOps – A gentle introduction to Mlflow Pipelines

Today I want to add a layer of complexity and explain how to also integrate Hydra, a fantastic open-source tool that, among other things, allows you to run experiments with different model settings.

I run the scripts of this article using Deepnote: a cloud-based notebook that’s great for collaborative data science projects and prototyping.

Let’s code!

So let’s start coding the first Python file to train a simple ML model such as a Random Forest. We will use the Titanic dataset and import it through the Seaborn library.

The Titanic dataset is released under an open-source (MIT) license. You can find it on GitHub at this URL.

So within our project, let’s create a subfolder that will be a component of the MLflow pipeline and call it "random_forest". Inside this component, we then create the script "run.py" and write the code to do the model training.

In this snippet, we will implement standard preprocessing and training steps that I think need no explanation.
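Since the snippet itself is not reproduced here, below is a minimal sketch of what run.py could look like at this stage. The feature selection and preprocessing choices (columns, encoding, train/test split) are my own illustrative assumptions, not necessarily the article’s exact code:

import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def go():
    # Load the Titanic dataset through Seaborn
    df = sns.load_dataset("titanic")

    # Illustrative preprocessing: keep a few features and drop rows with missing values
    df = df[["survived", "pclass", "sex", "age", "fare"]].dropna()
    df["sex"] = df["sex"].map({"male": 0, "female": 1})

    X = df.drop("survived", axis=1)
    y = df["survived"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Hyperparameters are hardcoded for now
    model = RandomForestClassifier(
        n_estimators=100,
        criterion="gini",
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
    )
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Accuracy: {accuracy:.3f}")


if __name__ == "__main__":
    go()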

One thing you may have noticed is that the Random Forest accepts many hyperparameters as input, which we have hardcoded for now. To improve our code, we can take these parameters as arguments to the go() function. We can then create a YAML file in which we specify these hyperparameters. Afterwards, we will ask the user to pass the path of the YAML file they want to use via the CLI, and read the model settings from there.

This is how we can modify the script. We are going to leverage the argparse library to allow users to specify some input parameters (like the model config file) from the CLI.
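Here is a hedged sketch of the modified run.py, again with my own illustrative preprocessing. One way to read the config file is with omegaconf, which is indeed listed in the conda file below:

import argparse

import seaborn as sns
from omegaconf import OmegaConf
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def go(args):
    # Read the random forest settings from the YAML file passed via the CLI
    rf_config = OmegaConf.to_container(OmegaConf.load(args.model_config), resolve=True)

    df = sns.load_dataset("titanic")
    df = df[["survived", "pclass", "sex", "age", "fare"]].dropna()
    df["sex"] = df["sex"].map({"male": 0, "female": 1})

    X = df.drop("survived", axis=1)
    y = df["survived"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # The hyperparameters now come from the config file instead of being hardcoded
    model = RandomForestClassifier(**rf_config)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"Accuracy: {accuracy:.3f}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a random forest on the Titanic dataset")
    parser.add_argument(
        "--model_config",
        type=str,
        required=True,
        help="Path to a YAML file with the random forest hyperparameters",
    )
    go(parser.parse_args())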

As you learned in my previous article about MLflow pipelines, an MLflow component also needs a conda.yaml in which we specify the development environment. So here is the YAML that we can use:

name: random_forest
channels:
  - conda-forge
  - defaults
dependencies:
  - pandas
  - pip
  - scikit-learn
  - matplotlib
  - plotly
  - pillow
  - mlflow
  - seaborn
  - pip:
      - omegaconf

We also know that an MLflow component needs an MLproject file as well:

name: random_forest
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      model_config:
        description: Path to the YAML file containing the configuration for the random forest
        type: str
    command: >-
      python run.py --model_config {model_config}

In this MLproject we name the component, define the conda file to use, and declare the input parameter for the model configuration.
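If you want to sanity-check this component in isolation before wiring it into the full pipeline, you can run it on its own with the MLflow CLI. Here config.yml is a hypothetical local test file containing just the random forest settings:

cd random_forest
mlflow run . -P model_config=config.yml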

Now, outside the component, at the main entry point of the project, we create the other files:

  • another conda.yaml to specify the environment of the entry point component
  • an MLproject to define the MLflow pipeline steps
  • a config file where we enter the hyperparameters to be used
  • a main.py script from where our project starts
Project Structure

Let’s start by defining the config.yaml file. There are many hyperparameters you can try; I have chosen only a few, so feel free to experiment with others.

random_forest:
  n_estimators: 100
  criterion: "gini"
  max_depth: null
  min_samples_split: 2
  min_samples_leaf: 1

We now specify the conda file for this entry point.

name: main_component
channels:
  - conda-forge
  - defaults
dependencies:
  - requests
  - pip
  - mlflow
  - hydra-core
  - pip:
      - wandb
      - hydra-joblib-launcher

Let us now finally define how main.py must be coded to use Hydra easily.

To use the config file within our Python code, we simply use the Hydra decorator where we specify the name of the config file, and Python will automatically read it. Let’s look at an example.
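The full script is not reproduced here, so what follows is a minimal sketch, assuming the layout above: config.yaml next to main.py, the component in the random_forest folder, and a random_forest_config.yml written on the fly. The version_base argument is my addition for recent Hydra versions:

import os

import hydra
import mlflow
from hydra.utils import get_original_cwd
from omegaconf import DictConfig, OmegaConf


# The decorator tells Hydra to load config.yaml from the same folder as this script.
# version_base=None keeps Hydra's legacy behavior (requires hydra-core >= 1.2).
@hydra.main(config_path=".", config_name="config", version_base=None)
def go(config: DictConfig):
    # Hydra changes the working directory per run, so resolve the project root explicitly
    root_path = get_original_cwd()

    # Dump only the random_forest section of the config to a new YAML file
    rf_config_path = os.path.abspath("random_forest_config.yml")
    with open(rf_config_path, "w") as fp:
        fp.write(OmegaConf.to_yaml(config["random_forest"]))

    # Launch the "random_forest" component of the pipeline with mlflow.run()
    mlflow.run(
        os.path.join(root_path, "random_forest"),
        entry_point="main",
        parameters={"model_config": rf_config_path},
    )


if __name__ == "__main__":
    go()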

As you can see, the configuration file is read thanks to the decorator. After that, I create on the fly a new YAML file called random_forest_config.yml to pass to the run.py script. If you remember, this script expects a YAML file with the random forest configuration.

Why create a second YAML instead of passing config.yaml directly? Simply because config.yaml might contain settings for other things besides the random forest. In this toy example there are none, so we could have skipped this step.

We launch the "random_forest" component of the pipeline with the mlflow.run() command.

Now we can run the main.py script with Hydra’s override syntax to change settings in the random forest configuration and experiment quickly.

But we know very well that we will not run "python main.py" by hand; we will define this command in the MLproject. So in the MLproject we have to specify that the command "python main.py" may be accompanied by other parameters accepted by Hydra.

name: run_pipeline
conda_env: conda.yaml

entry_points:
  main:
    parameters:
      hydra_options:
        description: Hydra parameters to override
        type: str
        default: ""
    command: >-
      python main.py $(echo {hydra_options})

If you look carefully, the command we run has an extra piece of code:

python main.py $(echo {hydra_options})

In this way, we add small changes to the commands that are accepted by Hydra. We will see examples in a moment.

Now if we launch the whole thing with "mlflow run .", the entire pipeline will run.

mlflow run .

It will take some time because MLflow will have to generate the necessary environments specified in the conda.yaml files.

The second time you launch the pipeline it will be faster, because MLflow is smart enough to figure out that the environment already exists and there is no need to create it again.

As you can see, in my case everything ran without errors, and I can see the accuracy achieved by the model.

Image by Author

Having such a pipeline is very convenient because now I can test a model with different hyperparameters just by changing a yaml file.

config.yaml (Image by Author)

Hydra Sweeps

Let’s take advantage of the code we wrote to launch more experiments in an easy way. In the MLproject in the entry point, we said that we accept input parameters that are useful for changing the behavior of Hydra.

We can change some fields in the config file directly from the CLI. If you read the MLproject carefully, you will see that we can pass a parameter called "hydra_options". Within this parameter we define the values in the config file that we want to override.

For example, if I want to run a test with a number of estimators equal to 30, I can run the MLflow pipeline in the following way:

mlflow run . -P hydra_options="random_forest.n_estimators=30"

Of course, we can specify multiple parameter changes at the same time. In the next example, I change both n_estimators and min_samples_split:

mlflow run . -P hydra_options="random_forest.n_estimators=100 random_forest.min_samples_split=5"

Simple, isn’t it? Now, to launch various experiments, we just have to change one string from the CLI!

However, we have not yet seen how this can greatly speed up the hyperparameter tuning phase. Suppose that the result of each experiment is saved to a file instead of just printed to the console (you can try to implement this yourself).
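As a hint, one simple way to do it is to append each run’s hyperparameters and accuracy to a CSV file at the end of go() in run.py. The log_result helper below is a hypothetical sketch, not part of the article’s code:

import csv
import os


def log_result(results_path, rf_config, accuracy):
    # Append the hyperparameters and the accuracy of one run to a CSV file
    file_exists = os.path.exists(results_path)
    with open(results_path, "a", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=list(rf_config.keys()) + ["accuracy"])
        if not file_exists:
            writer.writeheader()
        writer.writerow({**rf_config, "accuracy": accuracy})


# Inside go(), after computing the accuracy:
# log_result("results.csv", rf_config, accuracy)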

We can tell Hydra, via CLI, to try combinations of different values for each parameter with one command. After that, we will check the logs of the results and choose the best hyperparameters.

To do this we use the multi-run function of Hydra because many runs will have to be launched for each of the combinations of the hyperparameters.

The various values to be tested for each individual parameter should be separated by commas. At the end of the string, we add a "-m" to indicate that it is a multirun. Here is a practical example:

mlflow run . -P hydra_options="random_forest.n_estimators=10,50,100 random_forest.min_samples_split=3,5,7 -m"

Here we have finally launched our hyperparameter tuning with Hydra! 🥳

If we want to try a whole range of numbers, we can use range(x1,x2), which, like Python’s range, excludes the upper bound. For example, if for min_samples_split I want to try all the numbers from 1 to 4, I can use range(1,5) like this:

mlflow run . -P hydra_options="random_forest.n_estimators=10,50,100 random_forest.min_samples_split=range(1,5) -m"

Conclusion

MLflow and Hydra are fantastic tools for working on Data Science projects. After some initial setup effort, they allow us to launch experiments with minimal friction. This way we can dedicate ourselves to understanding why some experiments work better than others, and draw our own conclusions without having to tweak the code every time.

In a future tutorial, I will also show how to save the results of the experiments to an external tool to keep track of them. I usually use Weights & Biases, but MLflow itself also provides this facility.

If you found this article interesting, follow me on Medium! 😁

💼 Linkedin ️| 🐦 X (Twitter) | 💻 Website

