Illustration photo by Mikhail Nilov from Pexels

Hyperparameter Optimization with Grid.ai and No Code Change

This post presents a hyperparameter sweep using the Grid.ai platform, parallelizing training over multiple spot instances and observing performance live.

--

In a previous post, we showed how to convert scientific notebooks to a standard Python package that is ready to share. We then demonstrated how simple it is to convert a vanilla Python script to a training script with a command-line interface (CLI) for easier interaction with hyperparameters.

With our CLI configured, we can now pass different hyperparameters such as the learning rate or model architecture to our model without needing to change any code… but how do we identify which combinations of these parameter values result in the best-performing model?
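For context, such a CLI-driven training script could look roughly like the minimal sketch below, based on PyTorch Lightning's LightningCLI; the package, model, and datamodule names are placeholders, not necessarily those used in the referenced repository.

# train.py - minimal sketch of a CLI entry point (assumed names; PyTorch Lightning ~1.4+)
from pytorch_lightning.utilities.cli import LightningCLI

# hypothetical imports standing in for the package built in the previous post
from plant_pathology.models import LitPlantPathology
from plant_pathology.data import PlantPathologyDataModule

if __name__ == "__main__":
    # LightningCLI exposes model, data, and trainer options as command-line arguments,
    # e.g. `python train.py --model.lr 0.001 --trainer.max_epochs 5`
    LightningCLI(LitPlantPathology, PlantPathologyDataModule)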

Hyper-parameter Optimization with Grid.ai

How to pick an optimal set of hyperparameters is one of the most common questions in machine learning. Many Kaggle masters claim that the right set of parameter values is the difference between a model that ranks on the leaderboard and an irrelevant one, and even a simple model can gain a 5–15% performance boost from basic hyperparameter optimization.

On the other hand, for most beginners, parameter tuning can be a time-consuming and overwhelming task. A common but naive approach to hyperparameter tuning is to sequentially run multiple experiments and write the configuration values and results into an Excel table.

More advanced Kagglers may use a simple parallelized hyperparameter search and track results with MLflow, Weights & Biases, Neptune.ai, etc. However, if you take this approach, you need to train on a powerful machine with multiple GPUs or set up and orchestrate your own small cluster of cloud machines.

Note that there are several other options for hyperparameter search if you have your own powerful machine, such as Optuna, Ray Tune, etc.
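For comparison, a local search with a library like Optuna could look roughly like the sketch below; train_and_score is a hypothetical helper that trains one configuration and returns its validation F1 score (it is not part of the code from this series).

# Local hyperparameter search sketch with Optuna (an alternative to the cloud workflow below).
import optuna

def objective(trial: optuna.Trial) -> float:
    # sample one configuration per trial
    lr = trial.suggest_float("lr", 1e-4, 1e-3, log=True)
    backbone = trial.suggest_categorical("backbone", ["resnet18", "resnet34", "resnet50"])
    # train_and_score is a hypothetical function: trains the model and returns validation F1
    return train_and_score(backbone=backbone, lr=lr)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=9)
print(study.best_params)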

Full disclosure: I currently work as a Sr. Research Engineer at Grid.ai.
Note that there are other alternatives you can use to leverage these best practices, such as Kaggle kernels or Colab, but Grid.ai is my platform of choice as it enables me to easily scale model training in the cloud.

Luckily, Grid Runs enables quick hyperparameter tuning without worrying about orchestrating complex infrastructure or adding external libraries. Grid Runs automatically collects logs and manages parallel experimentation… without requiring any changes to our code!

Creating a Datastore

In most large-scale projects, sharing data is a significant bottleneck for fast experimentation, as the data needs to be accessed securely and in parallel by the compute resources. Luckily, Grid Datastores are optimized to let your models train at peak speed without taking on technical debt or navigating the complexities of optimizing cloud storage. Moreover, Grid Datastores come with integrated dataset versioning, which ensures that our hyperparameter experiments are comparable and reproducible!

Earlier, we preprocessed/downscaled the dataset in a Grid interactive session. Now, we will upload this dataset to a newly created datastore; uploading directly from the session ensures faster uploads. Later, we will access this datastore with Grid Runs to run a hyperparameter search and get the best performance from our model.

# Upload the local "plant-pathology" folder as a datastore named "kaggle_plant-pathology"
grid datastore create \
  --name kaggle_plant-pathology \
  --source plant-pathology

For more details about datastores, see docs.

Creating/uploading a Grid datastore.

When the creation is finished, we can quickly check basic information about all our datastores in the web UI, such as their size or last update.

The created datastore in the web UI.

Hyper-Parameter Finetuning

Now we can pick up where we left off with the training script. We will show how to run nine experiments covering all combinations of three learning rates and three ResNet backbones. What’s more, you can easily scale the number of experiments and hyperparameters in the command line or UI without needing to change a single line of code.

The following steps assume that we have already created the GitHub repository from the data science kernel. Now we can link the training script on which we want to run the hyperparameter search. If you haven’t created a GitHub repo yet, please check this post, or you can follow along with this repo that I have prepared for this post.

We then select the machine type we want to use. In our case, we need a GPU to train our model, so I chose a single T4 per experiment and used spot instances to significantly lower costs. The nine experiments (each taking about 30 minutes to train) cost less than $1 of free credits to tune a model that can rank on the leaderboard.

The process of creating a grid search with Grid Runs.

To prepare our hyperparameter search, all we need to do is replace the single values of our CLI arguments with a range (list, sampling, etc.) of Pythonic values we want to explore:

# Grid expands each bracketed list into separate experiments:
# 3 backbones x 3 learning rates = 9 runs.
# "grid:kaggle_plant-pathology:1" mounts version 1 of the datastore created above.
--model.model "['resnet18', 'resnet34', 'resnet50']" \
--model.lr "[0.0001, 0.0005, 0.001]" \
--data.base_path grid:kaggle_plant-pathology:1 \
--trainer.max_epochs 5

For each experiment, we mount the data from the datastore created in the previous step, so there is no need to download the data locally, which significantly shortens the overall training time.

Monitor Training in Real-Time and Save the Best Model

Monitoring experiments enables you to debug them and to terminate models that are not converging. This helps to reduce costs and saves resources that can be reinvested in training better models.

PyTorch Lightning uses TensorBoard as its default logger, which is natively integrated with the Grid.ai platform. TensorBoard allows us to effortlessly monitor and compare results aggregated over all experiments.

Browsing aggregated results from all experiments to find the model with the best F1 score.
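For reference, the default logging behaviour is roughly equivalent to configuring the logger explicitly, as in the sketch below (not required for the workflow in this post; the log directory and experiment name are placeholders).

# Explicit TensorBoard logger configuration - PyTorch Lightning does this by default.
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(save_dir="lightning_logs", name="plant-pathology")
trainer = Trainer(logger=logger, max_epochs=5)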

PyTorch Lightning provides automatic checkpointing by default. When the best experiment is found, we can open its detail view and navigate to the artifacts tab to download the saved checkpoint, which we will later use for inference.

Downloading artifacts from a particular experiment.
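Once the checkpoint is downloaded, loading it back for inference is a one-liner in PyTorch Lightning; the class and file names in the sketch below are placeholders.

# Load the downloaded checkpoint for inference (placeholder class and path names).
import torch
from plant_pathology.models import LitPlantPathology  # hypothetical LightningModule

model = LitPlantPathology.load_from_checkpoint("best-checkpoint.ckpt")
model.eval()  # switch to inference mode

# `image_batch` would be a tensor of preprocessed images with shape [N, 3, H, W]:
# with torch.no_grad():
#     predictions = model(image_batch)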

In this post, we have talked about the need for hyperparameter searches, and we have walked through Grid Runs, which simplifies fine-tuning in the cloud. We can even monitor our training online and terminate selected experiments at will, all with the default configuration.

In a future post, we will take our best-trained model and show how to prepare a Kaggle submission kernel that runs offline for scoring in the competition.

Stay tuned and follow me to learn more!

About the Author

Jirka Borovec has been working in machine learning and data science for several years at a few different IT companies. In particular, he enjoys exploring interesting world problems and solving them with state-of-the-art techniques. He has also developed several open-source Python packages and actively participates in other well-known projects. He works at Grid.ai as a Research Engineer and serves as a lead contributor to PyTorchLightning.ai.
