Single-line distributed PyTorch training on AWS SageMaker

How to iterate faster on your data science project, so your brilliant idea can see the light of day

Ariel Shiftan
Towards Data Science


Image by Author

It doesn’t matter what research project I’m working on: having the right infrastructure in place is always critical to success. It’s very simple: assuming all other “parameters” are equal, the faster and cheaper one can iterate, the better the results that can be achieved within the same time and cost budget. As these budgets are always limited, good infrastructure is usually make or break for a breakthrough.

When it comes to data science, and deep learning specifically, the ability to easily distribute the training process across strong hardware is key to achieving fast iterations. Even the brightest idea requires a few iterations to get polished and validated, e.g. to check different pre-processing options, network architectures, and standard hyperparameters such as batch size and learning rate.

In this post, I’d like to show you how easy (and cheap, if you want) it is to distribute existing distribution-ready PyTorch training code on AWS SageMaker using simple-sagemaker (assuming the distribution is built on top of the supported gloo or nccl backends). Moreover, you’ll see how to easily monitor and analyze resource utilization and key training metrics in real time during the training.

Simple-sagemaker

Simple-sagemaker is a thin wrapper around AWS SageMaker that makes distributing work on any supported instance type very simple and cheap. A quick introduction can be found in this blog post.

Distributing PyTorch model training in a single line

I’ll be basing the demonstration on PyTorch’s official ImageNet example. In fact, I’ll be running it as-is on AWS by using just a single command!

To be more precise, I assume:

  1. You’ve already installed simple-sagemaker:
    pip install simple-sagemaker
  2. Your AWS account credentials and default region are configured for boto3, as explained in the Boto3 docs.
  3. The ImageNet data is already stored (and extracted, more on that later) in an S3 bucket, e.g. s3://bucket/imagenet-data.
  4. You’ve downloaded the training code, main.py, to the current working directory.

Now, to run the training code on a single p3.2xlarge instance, just run the following command:
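A sketch of that command, assembled from the flags explained below, looks roughly like this (the exact form of the --iis argument, the location of main.py on the instance, and the worker command line passed to --cmd_line are assumptions, and the bucket name is a placeholder):

# A sketch, not a verbatim command; see the flag-by-flag explanation below
ssm shell -p imagenet -t 1-node \
    -o ./output1 --download_state \
    --iis imagenet s3://bucket/imagenet-data \
    --it ml.p3.2xlarge \
    -d main.py \
    -v 280 --no_spot \
    --cmd_line 'MAIN=$PWD/main.py; cd $SSM_INSTANCE_STATE && python $MAIN --epochs 10 --resume checkpoint.pth.tar $SM_CHANNEL_IMAGENET'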

That’s it; take a seat while the training job is running. You may actually want to get some sleep, as it should take ~8 hours for 10 epochs on the complete dataset (~1.2M images in total). Don’t worry, the trained model will be waiting for you under output/state by the end.

Some explanation of the ssm command:

  1. shell — Run a shell task.
  2. -p imagenet -t 1-node — Name the project and task.
  3. -o ./output1 --download_state — A local path to download the output to at the end of the training, plus a request to also download the state directory (logs are downloaded by default as well).
  4. --iis — Use the imagenet data from s3://bucket/imagenet-data as an input channel mapped to the $SM_CHANNEL_IMAGENET path.
  5. --it ml.p3.2xlarge — Set the instance type to ml.p3.2xlarge.
  6. -d main.py — Add main.py as a dependency.
  7. -v 280 — Set the EBS volume size to 280 GB.
  8. --no_spot — Use on-demand instances instead of spot instances (the default). More expensive, but time is what we’re trying to save here.
  9. --cmd_line — Execute the training script.

A few notes:

  • Each task is assigned a dedicated S3 path under [Bucket name]/[Project name]/[Task name]. A few directories exist under that path, but the relevant one for us now is state, which gets continuously synchronized to S3 and persists across consecutive executions of the same task.
  • As the script keeps checkpoints in the current working directory, we first change it to the state directory, $SSM_INSTANCE_STATE.
  • World size and node rank are set based on the environment variables accessible to the worker code (see the sketch after this list). A complete list of these variables can be found here.
  • --resume — Resume the training in case it was stopped, based on the saved checkpoint in the state directory.
  • If the data was already on your local machine, you could use the -i argument (instead of --iis) to get it automatically uploaded to S3 and used as the “data” input channel.
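To make the last two notes concrete, here is a sketch of a hypothetical wrapper script, run_dist.sh, that a multi-node --cmd_line could invoke. It derives the world size and node rank from the standard SageMaker environment variables SM_HOSTS and SM_CURRENT_HOST; simple-sagemaker exposes its own variables as well (see the list linked above), so the exact variable names and the location of main.py are assumptions:

#!/bin/bash
# run_dist.sh -- a hypothetical wrapper, sketched here for illustration only
set -e
MAIN=$PWD/main.py    # assumes the -d dependencies sit in the start directory
NODES=$(python -c 'import json, os; print(len(json.loads(os.environ["SM_HOSTS"])))')
RANK=$(python -c 'import json, os; h = sorted(json.loads(os.environ["SM_HOSTS"])); print(h.index(os.environ["SM_CURRENT_HOST"]))')
MASTER=$(python -c 'import json, os; print(sorted(json.loads(os.environ["SM_HOSTS"]))[0])')
cd "$SSM_INSTANCE_STATE"    # checkpoints are written to the persistent state directory
python "$MAIN" --epochs 10 \
    --multiprocessing-distributed --dist-backend nccl \
    --dist-url "tcp://$MASTER:23456" \
    --world-size "$NODES" --rank "$RANK" \
    --resume checkpoint.pth.tar \
    "$SM_CHANNEL_IMAGENET"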

For more information and options, e.g. if you need to customize the docker image or to use a local dataset for the training, read the documentation, run ssm shell -h to get the help on the command line parameters, and/or review the introduction blog post.

Let's take a look at the end of the output logs at ./output1/logs/log0:

* Acc@1 46.340 Acc@5 72.484
Total time: 26614 seconds

We achieved 46.3 top-1 and 72.484 top-5 accuracy, with a total training time of 26,614 seconds, i.e. ~7:23 hours. The gap up to the total ~8 hours of running time is overhead, mostly from downloading (and extracting) the input data (~150 GB). Roughly 3–4 minutes of it are the “standard SageMaker overhead”: launching and preparing the instance, and downloading the input data and the training image. This can get a bit longer when training on spot instances.

Distributing the training

Going back to the time budget you have, ~8 hours may be too long to wait. As explained above, in some cases it may even mean that your brilliant idea isn’t going to see the light of day :(.

Luckily, the ImageNet example code is well written, and can easily be accelerated by distributing the training on a few instances. Moreover, it’s going to be just a single additional argument!
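As a sketch again rather than the verbatim command, the call is identical to the single-node one except for the task name, the output directory, the added instance count, and the use of the hypothetical run_dist.sh wrapper from the notes above as the worker command (whether -d can be repeated is an assumption):

# A sketch: the single-node command with the instance count raised to 3
ssm shell -p imagenet -t 3-nodes --ic 3 \
    -o ./output2 --download_state \
    --iis imagenet s3://bucket/imagenet-data \
    --it ml.p3.2xlarge \
    -d main.py -d run_dist.sh \
    -v 280 --no_spot \
    --cmd_line './run_dist.sh'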

That’s it again. Setting the instance count to 3 (--ic 3) is all you need to get the same job done in ~3:51 hours!

Looking at the output logs at ./output2/logs/log0, we see that the same accuracy is achieved in less than half of the training time: 11,605 seconds = ~3:13 hours!

* Acc@1 46.516 Acc@5 72.688
Total time: 11605 seconds

Monitoring and optimizing the training process

So, you’ve saved time, and your brilliant idea is closer to seeing the light of day. But how can you easily watch and monitor the progress to raise the chances even higher?

First, take a look at the SageMaker training jobs console and select the 3-nodes job. There you should be able to get all the available information about it:

Training job information

There’s much more information available there, including links to the state (checkpoints) and the output on S3; feel free to explore it.

The most interesting part for us now is the “Monitor” section, where you can get graphs of instance utilization (CPU, RAM, GPU, GPU memory, disk) and algorithm metrics (more about these in a second).

Links to the full logs and to dynamic CloudWatch graphs for both the instance and the algorithm metrics are at the top of that section, and they are the way to go for the analysis.

Here are the instance metrics from the two training cycles:

The graphs are updated in real time and let you make sure everything is going as expected, and spot any crucial bottlenecks that should be taken care of.

The rule of thumb is to make sure the GPU load is close to 100% most of the time. Taking our 3-node example in the graph above, it can be seen that the GPU is only ~70% loaded, which means we may be able to get more out of that hardware. A good guess would be to use more data loading workers, to push data toward the GPU faster.
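For example, the official ImageNet script exposes the number of data loading workers as --workers (or -j, default 4), so one experiment could be to raise it inside the worker command line; the value below is illustrative only:

# Illustrative only: the main.py invocation from the wrapper sketched earlier,
# with more data loading workers per node
python "$MAIN" --epochs 10 --workers 16 \
    --multiprocessing-distributed --dist-backend nccl \
    --dist-url "tcp://$MASTER:23456" --world-size "$NODES" --rank "$RANK" \
    --resume checkpoint.pth.tar "$SM_CHANNEL_IMAGENET"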

Algorithm metrics

To further simplify the monitoring of the training process, we can use SageMaker metrics to get training-specific metrics in real time. Again, we just need to add a few more parameters:
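A sketch of the updated command follows; the only change relative to the previous run is the --md definitions. Whether --md can be repeated, and the exact regular expressions (which must match the progress lines main.py prints), are assumptions:

# A sketch: the previous 3-node command plus metric definitions
ssm shell -p imagenet -t 3-nodes --ic 3 \
    -o ./output2 --download_state \
    --iis imagenet s3://bucket/imagenet-data \
    --it ml.p3.2xlarge \
    -d main.py -d run_dist.sh \
    -v 280 --no_spot \
    --md "loss" "Loss ([0-9.e+-]+)" \
    --md "acc1" "Acc@1 +([0-9.]+)" \
    --cmd_line './run_dist.sh'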

--md is used to define the metrics: the first argument is the name, e.g. "loss", and the second is a regular expression that extracts the numeric value from the output logs.

Fast-forwarding, here are the algorithm metrics from the two training cycles:

A complete distributed ImageNet training pipeline

The ILSVRC2012 ImageNet data can be downloaded from the image-net.org website. The simple steps described by Google here are very helpful.

As the full dataset is ~1.2M images with a total size of almost 150 GB, downloading it locally and then uploading it to an S3 bucket would take a lot of time. In addition, synchronizing that many files from S3 to the training instances before training would be very slow as well. The following strategy is used to overcome these issues:

  1. The data is downloaded using a dedicated processing task, equipped with a much faster S3 connection.
  2. The data on S3 is kept in ~1000 .tar files and gets extracted on the training instance right before the training.

The processing task can easily be launched using the following ssm command:
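A sketch of such a processing task is shown here; download.sh and the $SSM_OUTPUT/data target come from the description below, while the task name, instance type, volume size, and script path are illustrative assumptions:

# A sketch of the data-preparation (processing) task; the instance type,
# volume size, task name, and download.sh path are assumptions
ssm shell -p imagenet -t prepare-data \
    -d download.sh \
    --it ml.m5.xlarge -v 300 \
    --cmd_line './download.sh $SSM_OUTPUT/data'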

Here download.sh is a bash script that downloads and arranges the data into the path it receives as its first argument ($SSM_OUTPUT/data). Once that task is completed, the data directory, with train and val subfolders, is placed under the output directory of the task’s dedicated S3 path, [Bucket name]/[Project name]/[Task name]. The training command should now be updated as well:
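A sketch of the updated training command follows; the exact form of the --iit arguments (channel name, source task, subfolder) and the way extract.sh is attached and invoked are assumptions, and the data path given to main.py would now have to point at the locally extracted directory rather than at an input channel:

# A sketch: chaining the processing task's output into the training task and
# extracting the tar files before the training starts
ssm shell -p imagenet -t 3-nodes-chained --ic 3 \
    -o ./output3 --download_state \
    --iit train prepare-data train \
    --iit val prepare-data val \
    --it ml.p3.2xlarge \
    -d main.py -d run_dist.sh -d extract.sh \
    -v 280 --no_spot \
    --cmd_line './extract.sh && ./run_dist.sh'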

Two changes were introduced here:

  1. In order to “chain” the two tasks, making the output of the processing task the input of the training task, the --iit argument is now used instead of --iis. This maps the train and val subfolders to input channels with the same names, accessible through the SSM_CHANNEL_TRAIN and SSM_CHANNEL_VAL environment variables.
  2. The shell script extract.sh is used to extract the data right before the training (a sketch follows this list).

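For reference, here is a sketch of what extract.sh might do (the actual scripts are part of the complete pipeline referenced below); the destination path is illustrative, and arranging the validation images into class subfolders is omitted:

#!/bin/bash
# A sketch: unpack the per-class training tar files from the input channel
# onto the local volume before the training starts
set -e
DST=/tmp/imagenet/train                # illustrative local destination
mkdir -p "$DST"
for f in "$SSM_CHANNEL_TRAIN"/*.tar; do
    d="$DST/$(basename "$f" .tar)"     # one directory per class
    mkdir -p "$d"
    tar -xf "$f" -C "$d"
done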
A complete pipeline that downloads the data and executes a few training alternatives can be found in the simple-sagemaker repository.

Summary

An easy distributed model training setup is a must-have tool for your data science projects. It’s now within your reach, simpler and easier than you ever thought. Use it!
