
Distributed Training on AWS SageMaker

Learn about data parallelism and model parallelism options available for distributed training on AWS SageMaker.

Photo by Alina Grubnyak on Unsplash

In today's world, with access to humongous datasets and ever deeper and bigger deep learning models, training on a single GPU on a local machine can quickly become a bottleneck. Some models won't even fit on a single GPU, and even when they do, training can be painfully slow. In such a setting, i.e. large training data and a large model, running a single experiment can take weeks or months. This hampers research and development and increases the time needed to build POCs. To our relief, cloud compute lets you set up remote machines and configure them to the requirements of your project. They scale both up and down, and you don't have to maintain them; that overhead is handled by the service provider. Some of the major providers are Amazon AWS, Google Cloud, IBM Cloud, and Microsoft Azure (in alphabetical order). We will look at distributed training using AWS SageMaker to tackle this scaling problem.

Distributed training

It helps to address the challenges of scaling model size and training data [1]. As discussed earlier, while increasing model size and complexity can improve performance (depending on the problem statement), there is a limit to the model size that can fit on a single GPU. Moreover, scaling up the model results in more computation and longer training times. The same is true when you have very large datasets.

In distributed training, the workload to train the model is split up and shared among multiple mini processors, called worker nodes [2]. These worker nodes work in parallel to speed up model training.

Based on how we choose to split the work, there are two main types of distributed training: data parallelism and model parallelism.

Data parallelism

This is the most common approach to distributed training. The idea is that we have a lot of training data, so we batch it up and send blocks of data to multiple CPUs or GPUs (nodes) to be processed by the neural network [1]. The number of blocks/partitions of data is equal to the number of available nodes in the compute cluster. The model is copied to each of these worker nodes, and each worker operates on its own subset of the data. Note that each node must have enough computing power to support the model being trained, i.e. the model has to fit entirely on each node [2]. Each node holds an exact replica of the model and independently computes the forward pass, backward pass, and gradients for its own data batch. The weight updates are shared with the other nodes for synchronization before moving on to the next batch and, ultimately, the next epoch.

Data Parallelism: The same model is replicated and shared across the different worker nodes. Each worker gets access to a distinct subset of data to train on. The parameter updates after every step are aggregated across the different nodes via communication. (Source: By the Author)
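To make the mechanics concrete, here is a minimal data parallelism sketch using plain PyTorch's DistributedDataParallel (not SageMaker-specific). The tiny linear model, random dataset, and hyperparameters are placeholders, and it assumes one process per GPU launched with torchrun.

```python
# Minimal data-parallel training sketch with PyTorch DDP.
# Assumes launch via: torchrun --nproc_per_node=<num_gpus> train_ddp.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    dist.init_process_group(backend="nccl")          # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each worker holds an exact replica of the model ...
    model = nn.Linear(10, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])      # gradients are all-reduced automatically

    # ... but only sees its own shard of the data.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)            # splits indices across workers
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()                          # DDP synchronizes gradients here
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```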

SageMaker Distributed Data Parallel Library: The AWS SageMaker API lets you run data-parallel distributed training without having to modify your training scripts much, and it handles cluster creation for you. It also addresses the communication overhead inherent in this kind of training in two ways (a minimal launch sketch follows the list below):

  1. The library performs AllReduce, a key operation during distributed training that is responsible for a large portion of the communication overhead. AllReduce aggregates the gradients across workers using a reduction operation (typically a sum or average) and relays the result back to all of them. You can read more about it in Technologies behind Distributed Deep Learning: AllReduce | Preferred Networks Research & Development.
  2. The library performs optimized node-to-node communication by fully utilizing AWS's network infrastructure and Amazon EC2 instance topology.
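As a rough illustration of how little the launch code changes, here is a hedged sketch of enabling the library through the SageMaker Python SDK's PyTorch estimator. The entry-point script name, execution role, bucket path, and version strings are placeholders; check the SageMaker documentation for currently supported instance types and framework versions.

```python
# Hedged sketch: enabling SageMaker's data parallel library via the PyTorch estimator.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",             # your training script (hypothetical name)
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p3.16xlarge",     # the library supports multi-GPU instance types
    instance_count=2,                   # e.g. 2 nodes x 8 GPUs = 16 workers
    framework_version="1.8.1",          # placeholder; use a supported version
    py_version="py36",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://<your-bucket>/<training-data-prefix>/")
```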

The following is a benchmark comparing AWS SageMaker's data parallel library with PyTorch's (I have not personally validated these numbers). It shows a speedup for the SageMaker library.

Source: Introduction to SageMaker’s Distributed Data Parallel Library – Amazon SageMaker

Model parallelism

In model parallelism (also called network parallelism), the model is split into parts that can run concurrently on different nodes, and each part runs on the same data. Its scalability depends on the degree to which the algorithm can be task-parallelized, and it is more complex to implement than data parallelism. Worker nodes only need to synchronize the shared parameters, usually once per forward and backward propagation step. There are a few factors to consider in this kind of training [1].

Model Parallelism: The model is split into different partitions, either manually or automatically by a partitioning algorithm. Every partition is given the same batch of data. The execution of the model partitions is based on a scheduling policy. Activations/layer outputs are communicated amongst the partitions depending on the computational-graph relationships. (Source: By the Author)
  1. How you split your model across devices: The computational graph of your model (computational graphs in PyTorch and TensorFlow), the sizes of model parameters and activations, and your resource constraints (e.g. memory vs. time) determine the best partitioning strategy. To reduce the time and effort required to split your model efficiently, you can use the automated model splitting feature offered by Amazon SageMaker's distributed model parallel library. (A toy two-GPU splitting sketch follows this list.)
  2. Achieving parallelization: Deep learning training is highly sequential: we run the forward computation and then backpropagation to get the gradients, and each operation must wait for its inputs to be computed by another operation. The forward and backward passes are therefore not easily parallelizable, and naively splitting a model across multiple GPUs can lead to poor device utilization. For example, a layer on GPU i+1 has to wait for the output of a layer on GPU i, so GPU i+1 remains idle during this waiting period. The model parallel library can achieve true parallelization by implementing pipelined execution, building an efficient computation schedule where different devices work on forward and backward passes for different data samples at the same time.
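The toy sketch below (plain PyTorch, not SageMaker-specific) shows the naive splitting described in point 2: two halves of a hypothetical model are placed on different GPUs, so the second GPU must wait for activations from the first. The layer sizes are placeholders, and it assumes a machine with at least two GPUs.

```python
# Toy illustration of naive model parallelism across two GPUs.
# Without pipelining, cuda:1 sits idle while cuda:0 computes, and vice versa.
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(10, 64), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(64, 1)).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))   # activations move between devices

model = TwoGPUNet()
out = model(torch.randn(32, 10))            # requires at least 2 GPUs
```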

To further read about these strategies, I would encourage you to read Scalable Deep Learning on Parallel and Distributed Infrastructures by Jordi TORRES.AI.

AWS SageMaker's distributed model parallel library [5]: It makes model parallelism more accessible by providing automated model splitting and sophisticated pipelined execution scheduling. The model splitting algorithm can optimize for speed or for memory consumption, and the library also supports manual partitioning. This functionality is built into the PyTorch/TensorFlow estimator classes.
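As a hedged sketch of what that integration looks like, the estimator's distribution argument takes a modelparallel section. The parameter names below follow the SageMaker documentation at the time of writing; the script name, role, instance settings, and version strings are placeholders.

```python
# Hedged sketch: enabling the model parallel library through the PyTorch estimator.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_mp.py",          # your training script (hypothetical name)
    role="<your-sagemaker-execution-role>",
    instance_type="ml.p3.16xlarge",
    instance_count=1,
    framework_version="1.8.1",          # placeholder; use a supported version
    py_version="py36",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": 2,        # number of model partitions
                    "microbatches": 4,      # micro-batches per mini-batch for pipelining
                    "optimize": "speed",    # or "memory"
                },
            }
        },
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

estimator.fit("s3://<your-bucket>/<training-data-prefix>/")
```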

Automated Model Splitting: As the name suggests, the library handles the model splitting for you. It uses a partitioning algorithm that balances memory, minimizes communication between devices, and optimizes performance; you can configure it to optimize for speed or for memory. Auto-partitioning happens during the first training step, when the smp.step-decorated function is first called. During that call, the library first constructs a version of the model in CPU RAM, then analyzes the model graph and makes the partitioning decision. Based on that decision, each model partition is loaded onto a GPU, and only then is the first step executed.
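Below is a hedged sketch of what a training script using the library's PyTorch API might look like, based on its documented smp.init / smp.DistributedModel / smp.step pattern; the tiny model, random data, and optimizer are placeholders.

```python
# Hedged sketch of the SageMaker model parallel PyTorch API with automated splitting.
import torch
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()                                       # reads the config passed via the estimator
torch.cuda.set_device(smp.local_rank())
device = torch.device("cuda", smp.local_rank())

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
model = smp.DistributedModel(model)              # wraps the model for partitioning
optimizer = smp.DistributedOptimizer(torch.optim.Adam(model.parameters(), lr=1e-3))
loss_fn = nn.MSELoss()

@smp.step                                        # auto-partitioning happens on the first call
def train_step(model, x, y):
    loss = loss_fn(model(x), y)
    model.backward(loss)                         # use model.backward, not loss.backward()
    return loss

for _ in range(10):                              # placeholder loop over random mini-batches
    x, y = torch.randn(32, 10).to(device), torch.randn(32, 1).to(device)
    optimizer.zero_grad()
    loss_mb = train_step(model, x, y)            # returns per-micro-batch losses
    loss = loss_mb.reduce_mean()                 # average across micro-batches
    optimizer.step()
```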

Manual Model Splitting: If you want to specify manually how the model is partitioned across devices, you can do so using smp.partition context managers.
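A minimal sketch of what manual partitioning might look like; the layer sizes and partition indices are placeholders, and it assumes automated splitting has been disabled in the library's parameters (e.g. "auto_partition": False).

```python
# Hedged sketch of manual partitioning with smp.partition context managers.
import torch.nn as nn
import smdistributed.modelparallel.torch as smp

smp.init()

class ManuallySplitNet(nn.Module):
    def __init__(self):
        super().__init__()
        with smp.partition(0):                   # these layers go to partition 0
            self.encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
        with smp.partition(1):                   # and these to partition 1
            self.head = nn.Linear(256, 10)

    def forward(self, x):
        return self.head(self.encoder(x))

model = smp.DistributedModel(ManuallySplitNet())
```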

Pipeline Execution Schedule: A fundamental feature of SageMaker's distributed model parallel library is pipelined execution, which determines the order in which computations are performed and data is processed across devices during training. Pipelining is a technique for achieving true parallelization in model parallelism, overcoming the performance loss from sequential computation by having GPUs compute simultaneously on different data samples. It is based on splitting a mini-batch into micro-batches, which are fed into the training pipeline one by one and follow an execution schedule defined by the library's runtime. There are two types of pipelines:

  1. Interleaved Pipeline: In this pipeline, backward execution of micro-batches is prioritized whenever possible. It allows for quicker release of the memory used for activations, using memory more efficiently.
  2. Simple Pipeline: In this pipeline, the forward pass for each microbatch is finished before starting the backward pass. This means that it only pipelines the forward pass and backward pass stages within themselves.
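As a hedged sketch, the pipeline schedule and the number of micro-batches are chosen through the same "parameters" block used in the estimator sketch above (key names per the SageMaker documentation at the time of writing).

```python
# Hedged sketch: pipeline-related entries of the model parallel "parameters" block.
smp_parameters = {
    "partitions": 2,
    "microbatches": 4,             # each mini-batch is split into 4 micro-batches
    "pipeline": "interleaved",     # or "simple" to run forward passes before backward ones
}
```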

Conclusion

We saw that data parallelism and model parallelism are the two major types of distributed training techniques for deep learning. Data parallelism is easier to implement than model parallelism. AWS SageMaker offers both, nicely integrated into its PyTorch/TensorFlow estimators. I will walk through a full code example in another article. I hope you liked this one; follow for more. If you have a favorite tool for distributed training, please share it in the comments. You can connect with me on LinkedIn.

References

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html#distributed-training-get-started

[2] What is distributed training? – Azure Machine Learning

[3] PyTorch Distributed Overview – PyTorch Tutorials 1.9.0+cu102 documentation

[4] Understanding Data Parallelism in Machine Learning

[5] Core Features of SageMaker Distributed Model Parallel – Amazon SageMaker
