Effortless distributed training for PyTorch models with Azure Machine Learning and PyTorch-accelerated

A comprehensive guide to help you get up and running with the AzureML CLI v2

Chris Hughes
Towards Data Science


When training Deep Learning models, there often comes a point when a single machine just isn’t enough anymore, usually because you need more GPUs to speed up training! Even for experienced practitioners, moving from training a model on your local machine to a cluster in the cloud can be a daunting prospect. Usually, this process involves having to update your PyTorch training script to include steps such as initialising a process group and wrapping the model with `DistributedDataParallel`, as well as provisioning and managing the infrastructure: ensuring that environments are consistent, that the correct environment variables are set, and so on. Thankfully, tools such as PyTorch-accelerated and Azure Machine Learning (AzureML) make this process as painless as possible, requiring minimal, if any, code changes to your scripts!

The purpose of this article is to demonstrate how simple it can be to run distributed training jobs using AzureML, and to help you get training as quickly as possible! As an example, we shall explore how to train an image classification model on the Imagenette dataset; Imagenette is a subset of 10 easily classified classes from ImageNet. However, whilst we shall use this training script as an example, our focus is primarily on the overall workflow, as opposed to trying to maximise performance on the given task. The task itself is the least important part of this article, as the idea is that you will feel comfortable changing the script to one of your own!

Disclaimer: Whilst I work for Microsoft, I am not asked to, or compensated for, promoting Azure Machine Learning in any way. In the Data Intelligence & Design team, we pride ourselves on using what we feel are the best tools for the job depending on the situation and the customer that we are working with. In cases that we choose not to use Microsoft products, we provide detailed feedback to the product teams on the reasons why, and the areas where we feel things are missing or could be improved; this feedback loop usually results in Microsoft products being well suited for our needs. Here, I am choosing to promote the features of Azure Machine Learning because the CLI v2 is my personal tool of choice for cloud-based training.

Running Training Locally

Before we look at AzureML, let’s understand, and verify, how we can run the training script locally.

Downloading the dataset

First, let’s create a folder for our data and download the Imagenette dataset:
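One way to do this, assuming wget and tar are available and using the archive URL published by fastai, is:

mkdir -p data
wget https://s3.amazonaws.com/fast-ai-imageclas/imagenette2-320.tgz -P data
tar -xzf data/imagenette2-320.tgz -C data

This extracts the dataset to data/imagenette2-320, which contains train and val folders.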

Defining the training script

Now that we have our data downloaded, let’s create folders for our training task and outputs, and write our script:
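A minimal sketch of what train_imagenette/train.py could look like is shown below; the scheduler wiring via `create_scheduler_fn` and `TrainerPlaceholderValues` follows the pattern described in the pytorch-accelerated documentation, and the hyperparameter defaults are illustrative rather than taken from the original script.

# train_imagenette/train.py - a sketch of the training script
import argparse
from functools import partial
from pathlib import Path

import timm
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torchvision import transforms
from torchvision.datasets import ImageFolder

from pytorch_accelerated import Trainer
from pytorch_accelerated.trainer import TrainerPlaceholderValues


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", required=True)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=3e-4)
    args = parser.parse_args()

    data_dir = Path(args.data_dir)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)
    )
    # Imagenette is extracted as train/ and val/ folders, one subfolder per class
    train_dataset = ImageFolder(
        data_dir / "train",
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]),
    )
    eval_dataset = ImageFolder(
        data_dir / "val",
        transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ]),
    )

    # ResNet-RS50 from timm, with a 10-class head for Imagenette
    model = timm.create_model("resnetrs50", pretrained=True, num_classes=10)
    loss_func = nn.CrossEntropyLoss()
    optimizer = AdamW(model.parameters(), lr=args.lr)

    trainer = Trainer(model, loss_func=loss_func, optimizer=optimizer)

    trainer.train(
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        num_epochs=args.epochs,
        per_device_batch_size=args.batch_size,
        # one-cycle schedule; the placeholders are resolved by the Trainer at runtime
        create_scheduler_fn=partial(
            OneCycleLR,
            max_lr=args.lr,
            epochs=TrainerPlaceholderValues.NUM_EPOCHS,
            steps_per_epoch=TrainerPlaceholderValues.NUM_UPDATE_STEPS_PER_EPOCH,
        ),
    )


if __name__ == "__main__":
    main()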

Here, we are training a Resnet-RS50 model from the excellent timm library — this is an improved version of the standard ResNet architecture, and I’d recommend using this in place of a regular ResNet50 — with an AdamW optimizer and one-cycle learning rate schedule; I find this configuration to be a good default for most image classification tasks.

So that we don’t need to write the training loop, or manage moving data to different devices, we are using the Trainer from PyTorch-accelerated to handle these concerns for us. This way, our code will remain the same regardless of whether we are training on a single GPU or on multiple GPUs distributed across different nodes. As it is recommended to launch PyTorch-accelerated scripts using the Accelerate CLI, the launch command that we use will also remain consistent, regardless of our hardware.

If you are unfamiliar with PyTorch-accelerated and would like to learn more about it before diving into this article, please check out the introductory blog post or the docs; alternatively, it’s very simple and a lack of knowledge in this area should not impair your understanding of the content explored here!

Running training locally

Let’s verify that the training script runs on our local machine. First, let’s create a config file to specify our hardware options using the accelerate CLI. We can do this by running the following command, and answering the questions:

accelerate config --config_file train_imagenette/accelerate_config.yaml

Answering the prompts generates a yaml file similar to the following:
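The exact contents depend on the answers given and the version of accelerate installed, but for a single machine with GPUs it will look something like this:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
fp16: false
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 1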

We can now launch our training run using the accelerate launch command:

accelerate launch --config_file train_imagenette/accelerate_config.yaml \
train_imagenette/train.py --data_dir data/imagenette2-320 --epochs 1

which will run a single epoch of training and evaluation on a single GPU, displaying its progress as it runs.

To use an additional GPU for training, we need to modify our accelerate config file. One way of doing this would be to use the same process as before. Alternatively, as it is only the number of processes that we would like to change, we can override this attribute directly from the command line, as demonstrated below. To avoid having to re-run the `accelerate config` command between different runs, we will make use of this approach again later when using AzureML.

accelerate launch --config_file train_imagenette/accelerate_config.yaml --num_processes 2 \
train_imagenette/train.py --data_dir data/imagenette2-320 --epochs 1

We can see from the output that, as two GPUs were used, half as many steps were needed during the training and validation epochs.

Now that we have verified that we can run our script locally, let’s explore how we can use AzureML to scale up our training.

If you prefer to use vanilla PyTorch for your script, you can still follow the approach outlined in this article, but you will have to handle all the intricacies of distributed training within your script manually and make sure that you use the correct launch command for your hardware!

Distributed Training on AzureML

AzureML offers two different approaches for training models:

  • The Python SDK, which is typically used interactively from notebooks
  • The CLI v2, where jobs are defined in yaml configuration files and submitted from the command line

Here, we shall focus on using the CLI v2.

The AzureML CLI v2 offers a command line-based approach to model training, where our configuration is defined in yaml files. Whilst this may be a little daunting if you are used to working with SDKs in notebooks, it is straightforward once you understand what the available options are!

Before we can jump into training, there are some prerequisites that need to be taken care of. To follow this guide, you will need:

  • An Azure Subscription — it is easy to sign up for a subscription here, which includes free credits for the first 30 days, and is then free for usage under certain thresholds.
  • To install the Azure CLI
  • To create an Azure Machine Learning workspace — This is straightforward to do and can be done using either the portal or the CLI
  • To install the AzureML CLI v2

The versions used at the time of writing are displayed below:

Note: it is important that the extension for the CLI v2 displays as ‘ml’; if you are seeing ‘azure-cli-ml’, this is the CLI v1 and the following steps will not work!

To get an overview of the available functionality in the Azure ML CLI v2, we can run the command:
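For example, the built-in help lists the available command groups:

az ml -h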

Once everything is installed, we can get started!

Adding logging to the training script (optional)

Although our script is ready to run as-is, the only output that will be captured is the standard out logs, which will be written to a text file when running on AzureML. To help us track and compare different experiment runs, without having to look through the logs, we can set up a logger to record certain metrics and artefacts to display on the AzureML portal.

Using the azureml-mlflow package, we can log using the MLflow fluent API; more information on using MLflow with AzureML can be found here.

With PyTorch-accelerated, we can create a callback for this, as illustrated below:
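As a rough, self-contained sketch of such a callback, deriving from the library’s base `TrainerCallback` and logging at the end of each training and evaluation epoch (the exact hook names and the `run_config` / `run_history` attributes used below should be checked against the version of pytorch-accelerated you have installed):

import mlflow
from pytorch_accelerated.callbacks import TrainerCallback


class AzureMLLoggingCallback(TrainerCallback):
    def on_training_run_start(self, trainer, **kwargs):
        # Record the Trainer's run configuration (epochs, batch size, mixed precision, ...)
        # as tags, from the single global main process only
        if trainer.run_config.is_world_process_zero:
            mlflow.set_tags(trainer.run_config.to_dict())

    def on_train_epoch_end(self, trainer, **kwargs):
        self._log_latest_metrics(trainer)

    def on_eval_epoch_end(self, trainer, **kwargs):
        self._log_latest_metrics(trainer)

    def _log_latest_metrics(self, trainer):
        # Only log from the global main process, to avoid recording duplicate metrics
        if trainer.run_config.is_world_process_zero:
            metrics = {
                name: trainer.run_history.get_latest_metric(name)
                for name in trainer.run_history.get_metric_names()
            }
            mlflow.log_metrics(metrics, step=trainer.run_history.current_epoch)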

As we can see, this doesn’t require much code at all! Here, we have subclassed a PyTorch-accelerated callback and overridden its logging behaviour to log to MLflow; this is done only from the global main process, to stop the same metrics being recorded multiple times across processes and nodes. Usually, when using MLflow, we would have to start and end runs explicitly, but AzureML handles this for us, as well as making sure that an environment variable containing the tracking URI is set. In addition to recording metrics, we are also going to record the Trainer’s run configuration, which contains information such as the number of training epochs and whether mixed precision was used. Capturing this information will help us to replicate training conditions in the future if we need to re-run the same experiment.

Additionally, as we saw earlier, PyTorch-accelerated displays a progress bar by default. Let’s remove this, to keep the logs as clear as possible. We can do this by removing the ProgressBarCallback.

Let’s update our script to reflect these changes.
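A sketch of how the Trainer creation could be updated; the built-in callback names and the `save_path` argument to `SaveBestModelCallback` are based on the pytorch-accelerated callbacks module and should be verified against your installed version:

from pytorch_accelerated.callbacks import (
    PrintProgressCallback,
    SaveBestModelCallback,
    TerminateOnNaNCallback,
)

trainer = Trainer(
    model,
    loss_func=loss_func,
    optimizer=optimizer,
    callbacks=[
        TerminateOnNaNCallback(),
        PrintProgressCallback(),
        AzureMLLoggingCallback(),
        # note: no ProgressBarCallback, to keep the AzureML logs clean;
        # anything written to ./outputs is captured by AzureML
        SaveBestModelCallback(save_path="./outputs/best_model.pt"),
    ],
)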

Here we can see that we have updated the list of callbacks that are passed to the Trainer. As we are saving our model checkpoint to the ‘./outputs’ folder, it will be captured and stored for us by AzureML.

Defining our training environment

Now that our training code is ready, we need to define the environment where our code will be executed. AzureML provides curated environments — intended to be used as-is — as well as the option to create environments by specifying a conda dependencies file, which defines the packages to install on top of a default base docker image.

Whilst these approaches can be a great way to get started, because they are built on top of ‘default’ docker images, environments created this way often include many packages that you haven’t specified and don’t need; installing additional packages into these environments can create conflicts that you aren’t aware of.

An alternative to these approaches is to define our own docker image. Personally, especially in a production setting, I like to have complete control over all aspects of the environment and tend to find that the small amount of time that I spend creating a docker image when first setting up an experiment can save me from longer times spent debugging environment issues later!

For this experiment, we can use an official PyTorch base image to define a custom Dockerfile, as presented below. As our base image contains PyTorch and all the necessary CUDA libraries, the only packages that we need to specify are the AzureML components, MLflow, and the dependencies needed to run our training script.

This is demonstrated below:
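A sketch of such a Dockerfile; the base image tag is an example, and you would want to pin package versions for reproducibility (add torchvision to the install list if your chosen base image does not already include it):

# Dockerfile
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime

# The base image provides PyTorch and the CUDA libraries; we only add the
# AzureML/MLflow components and the dependencies needed by the training script
RUN pip install --no-cache-dir \
    azureml-mlflow \
    mlflow \
    accelerate \
    pytorch-accelerated \
    timm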

We could also copy our training script into the docker image at this point, but instead we will let AzureML do this for us later! To aid with this, we are going to use a separate directory within our experiment folder to store our Dockerfile, which will be used as the docker build context; AzureML will only trigger the image build process when something in this folder changes. By letting AzureML copy our script into the environment, we avoid triggering the process to build and push the image each time we make a change to our script.

More information about creating Dockerfiles can be found here.

Register Dataset

Another thing that we must do is to upload our dataset to the cloud. Whilst there are multiple ways that we can do this - depending on where the data is located - we will do this by registering a dataset from a local folder. We can do this by defining the following yaml file:
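A sketch of data/register_dataset.yaml, using the current data asset schema; the exact field names have changed between versions of the CLI v2, so check them against the version you have installed:

# data/register_dataset.yaml - paths are relative to this file
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: imagenette
version: 1
description: Imagenette (320px variant), a 10-class subset of ImageNet
type: uri_folder
path: imagenette2-320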

We can now use the CLI to register this dataset in our workspace and upload the files with the command:

az ml data create -f data/register_dataset.yaml \
--resource-group myResourceGroup --workspace-name myWorkspace

Whilst there are other options for accessing data within a training job, registering and versioning datasets is a recommended approach to enable tracking and reproducibility of experiments.

Create compute target

The next thing to do is to define the compute target where we would like to run our training script. We can understand the options available either by visiting the Azure documentation or by running the following command:
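For example, assuming the `list-sizes` subcommand available in recent versions of the ml extension:

az ml compute list-sizes --resource-group myResourceGroup --workspace-name myWorkspace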

Here, we shall use a Standard_NV24, which has 4 NVIDIA Tesla M60 GPUs.

We can define and provision our compute cluster as follows:
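A sketch of infra/create_compute_target.yaml; the cluster name `gpu-cluster` is an arbitrary choice that the job configuration will reference later, and the instance limits are illustrative:

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: gpu-cluster
type: amlcompute
size: Standard_NV24
min_instances: 0
max_instances: 2
idle_time_before_scale_down: 1800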

az ml compute create -f infra/create_compute_target.yaml \
--resource-group myResourceGroup --workspace-name myWorkspace

As we have defined a range of instances, AzureML will manage scaling the cluster based on demand.

Define training job configuration

Now, it’s time to define the config for our training run! This is done using a job, which specifies the following:

  • What to run
  • Where to run it
  • How to run it

Let’s define our job config file and then break it down:
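A sketch of what train_imagenette/train_config.yaml could look like is shown below. The field names follow the commandJob schema and may differ slightly between CLI v2 versions; the environment variables referenced in the command ($NODE_RANK, $MASTER_ADDR, $MASTER_PORT) are those set by AzureML for PyTorch distribution, and the compute name `gpu-cluster`, the dataset name/version, and the default input values are assumptions carried over from the earlier steps.

# train_imagenette/train_config.yaml - paths are relative to this file
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
experiment_name: train_imagenette

# what to run
command: >-
  accelerate launch --config_file accelerate_config.yaml
  --machine_rank $NODE_RANK
  --main_process_ip $MASTER_ADDR
  --main_process_port $MASTER_PORT
  --num_machines ${{inputs.num_machines}}
  --num_processes ${{inputs.num_processes}}
  train.py --data_dir ${{inputs.data_dir}} --epochs ${{inputs.epochs}}

inputs:
  num_machines: 1
  num_processes: 1
  epochs: 5
  data_dir:
    type: uri_folder
    path: azureml:imagenette:1

# where to run it
code: .
environment:
  build:
    path: environment
compute: azureml:gpu-cluster

# how to run it: one launcher process per node; accelerate spawns one process per GPU
distribution:
  type: pytorch
  process_count_per_instance: 1
resources:
  instance_count: 1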

First, we have specified what we would like to run, which is very similar to the accelerate launch command that we used to run training locally; the key difference is that we have overridden some of the properties of the accelerate config file.

Here, we have used the environment variables which are automatically set by AzureML to specify the values for machine rank, main process IP address and main process port, which are required for distributed training. This will ensure that the correct values are used on each node, so that we don’t need to create separate config files for each machine.

We have also defined some inputs, which are passed as arguments to our script. By defining an inputs section, we have the option of overriding these values from the command line, so that we don’t have to modify our yaml script for different hyperparameter configurations.

As our dataset is registered in our workspace, we can access it using the `azureml` prefix, followed by the name that it was registered under and the dataset version; AzureML will mount this dataset to each compute instance and pass the correct data path to our training script.

Next, we have defined where we would like to run the code:

Here, we specify the environment — which is the directory containing the Dockerfile we created earlier — the local path to our code, and the compute target that we provisioned. On submission, AzureML will build the docker image and push it to the workspace’s container registry; the image will be cached and only rebuilt when the build context changes. For each run, everything from the ‘code’ path will be copied into the docker container and executed on the specified compute target.

Finally, we have defined how we would like to run the job, which in this case, is specifying our distributed training configuration.

As the accelerate launch command will handle the task of creating a process for each GPU, we only need to execute a single process per machine. In the resources section, we have specified how many instances we would like to use for training.

To understand all the available options for configuring a job, you can check out the documentation or inspect the command job schema.

Launching training on AzureML

Now, let’s launch our training job with the following command. By default, our config specifies a single GPU on a single machine, so we override some of these values to use eight GPUs across two machines.

az ml job create -f train_imagenette/train_config.yaml --set \
inputs.num_machines=2 \
inputs.num_processes=8 \
resources.instance_count=2 \
--resource-group myResourceGroup --workspace-name myWorkspace

Now, if we navigate to the experiments section in our AzureML workspace, we should be able to see our training run!

Here we can see that some graphs have been plotted based on the values that we logged during training; this view can be customised to your liking. As we didn’t specify a display name in our config file, a random one has been assigned by AzureML; if we use the same display name for multiple runs, it will make it more difficult to associate the metrics displayed on the graphs with the correct run.

We can also select our run to see additional information, including the tags that were logged from the Trainer’s run configuration.

Exploring the tabs, we can see a more granular view of the metrics that were logged, as well as a snapshot of the code used.

Looking in the `Outputs + Logs` tab, we can see the logs that have been generated by our script, as well as the model checkpoint which we saved to the outputs folder:

Conclusion

Hopefully that has demonstrated how easy it can be to get started with distributed training for PyTorch models on Azure using the powerful combination of Azure Machine Learning and PyTorch-accelerated!

Chris Hughes is on LinkedIn.
