Smart Distributed Training on Amazon SageMaker with SMD: Part 1

How Choosing a Distribution Algorithm that is Aligned with the Capabilities of your Training Instances can Increase Throughput and Reduce Cost

Chaim Rand
Towards Data Science

--

Photo by Janik Fischer on Unsplash

A critical step in optimizing the runtime performance of your training jobs is tuning your algorithms so as to maximize the utilization of the resources in your training environment. This requires a thorough understanding of your resources (the number and types of computation devices, the available memory, communication bandwidths, etc.), as well as appropriate tools for analyzing their utilization. Tuning your training algorithms to take best advantage of your resources can increase your training speed and reduce your training costs. This is true, in particular, of your distributed training algorithms, which rely heavily on high-speed communication between multiple processes running on multiple processors (e.g., GPUs). Failure to account for the specifics of the training environment can result in communication bottlenecks, high latency, decreased training speed, and higher costs.

This is the first part of a three-part post on the topic of optimizing distributed training. In part one we will briefly review the different methods for performing distributed training while emphasizing their dependence on the underlying training environment. In parts two and three we will show two examples of how the choice of the distributed training method and algorithm can impact the training speed. There are many popular frameworks for running distributed training. In this post we will use Amazon SageMaker’s Distributed Training Libraries (SMD) for our demonstrations. Although our focus will be on the Amazon SageMaker environment, the content of this post is just as applicable to other training environments and distribution algorithms. The framework choices made in this post should not be viewed as an endorsement. The right choices for you will be highly dependent on the details of your project and should take into account the most up-to-date techniques and utilities available.

A Brief Overview of Distributed Training

For a more detailed overview of distributed training, see here and here.

In a distributed training job, training is performed on multiple workers. In this post we will assume that the workers are GPUs. There are two primary reasons for distributing your training job:

  1. Training Acceleration: By combining the power of multiple GPUs we can speed up the training process.
  2. Large Model Size: When a model is too large to fit into the memory of a single GPU, we require multiple GPUs in order to train.

There are two types of distributed training, data parallel training (a.k.a. data distribution) and model parallel training (a.k.a. model distribution).

In data parallel distributed training, each GPU maintains its own copy of the full model and performs each training step on a different subset (local batch) of the training data. After each training step, it shares its resultant gradients and updates its own model taking into account the combined knowledge learned by all of the GPUs. Denoting the number of GPUs by k and the local batch size by b, the result of performing distributed training on k GPUs is that at each training step the model trains on a global batch of k*b samples. Ideally, performing data distributed training on k GPUs would accelerate training by a factor of k. However, such a linear-scale speed-up should not be taken for granted due to the additional overhead of gradient sharing. The actual training speed-up will depend on a number of factors, including the inter-GPU communication bandwidth, the algorithm used for sharing gradients, and the model architecture. Check out this blog post for more info on data distributed training.
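
To make this a bit more concrete, here is a minimal sketch of a data parallel training loop using PyTorch’s built-in DistributedDataParallel (DDP), which implements the gradient sharing scheme described above. SMD’s data parallel library, which we will use in part two, follows a similar flow. The toy model, random data, and hyperparameters below are placeholders chosen for illustration only; the script assumes one process per GPU (e.g., launched with torchrun).

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Each of the k processes (one per GPU) runs this script, e.g. via torchrun.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 10).cuda(local_rank)       # toy model (placeholder)
model = DDP(model, device_ids=[local_rank])              # wraps the gradient all-reduce
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

b = 32                                                   # local batch size; global batch = k*b
for step in range(100):
    inputs = torch.randn(b, 128).cuda(local_rank)        # each rank draws its own local batch
    labels = torch.randint(0, 10, (b,)).cuda(local_rank)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()                                      # DDP averages gradients across all k GPUs
    optimizer.step()                                     # every rank applies the same averaged update

dist.destroy_process_group()
```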

In model parallel distributed training the model is distributed across several GPUs. There are several ways in which the model can be distributed, including vertically (pipeline parallelism), horizontally (tensor parallelism), and through model sharding.

In a pipeline parallel solution, a single copy of the model will be divided into N portions, each of which will contain one or more layers. Each portion will be stored on one of the GPUs. The training data will be fed to the GPU that hosts the input layer and will flow forward through the N portions of the model and then backward for the gradient calculation. During the forward and backward passes, the data flowing between model portions is communicated across the hosting GPUs. One of the main challenges of pipeline parallelization is reducing GPU idle time. In a naïve implementation, you would find that only one GPU is active at any given point during the training step. In practice, modern pipelining algorithms will divide the input samples into micro-batches and use elaborate scheduling algorithms to reduce idle time.
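
The sketch below demonstrates the naïve form of pipeline parallelism on a toy two-stage model split across two GPUs of the same instance: activations are copied from GPU 0 to GPU 1 in the forward pass, their gradients travel back across the same link in the backward pass, and each GPU is idle while the other computes. The model, data, and two-way split are arbitrary placeholders; real pipelining libraries add micro-batching and scheduling on top of this basic structure.

```python
import torch

# Naive pipeline split of a toy model across two GPUs on the same instance.
# stage0 holds the first layers, stage1 the rest; activations hop between GPUs.
stage0 = torch.nn.Sequential(torch.nn.Linear(128, 256), torch.nn.ReLU()).to("cuda:0")
stage1 = torch.nn.Sequential(torch.nn.Linear(256, 10)).to("cuda:1")
params = list(stage0.parameters()) + list(stage1.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 128, device="cuda:0")           # toy batch (placeholder)
labels = torch.randint(0, 10, (32,), device="cuda:1")

optimizer.zero_grad()
activations = stage0(inputs)                             # GPU 0 computes while GPU 1 sits idle
outputs = stage1(activations.to("cuda:1"))               # activations are copied across the GPU link
loss = loss_fn(outputs, labels)
loss.backward()                                          # gradients flow back across the same link
optimizer.step()
```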

In a tensor parallel solution, sometimes referred to as tensor slicing, a subset of the model’s tensors will be divided across GPUs, and appropriate communication operations will be added to transfer input and output data across GPUs. Note that although categorized as a model parallel technique, tensor parallelism shares some properties with data parallelism, namely that each GPU is fed its own unique mini batch of data and that tensors that are not parallelized are duplicated. See here for more details on how tensor parallelism works.
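
The following toy sketch illustrates the core idea of tensor slicing on a single linear layer, using two GPUs within one process for simplicity. A real tensor parallel implementation would run one process per GPU and use collective communication for the gather step; the shapes and data here are arbitrary placeholders.

```python
import torch

# Toy illustration of tensor slicing for one linear layer: the weight matrix W
# of shape (out_features, in_features) is split along the output dimension,
# so each GPU computes its own slice of the layer's output.
in_features, out_features, batch = 128, 256, 32
full_weight = torch.randn(out_features, in_features)

w0 = full_weight[: out_features // 2].to("cuda:0")       # rows of W producing the first half of the output
w1 = full_weight[out_features // 2 :].to("cuda:1")       # rows producing the second half

x = torch.randn(batch, in_features)                      # the same mini batch is fed to both GPUs
y0 = x.to("cuda:0") @ w0.T                               # each GPU computes a partial output
y1 = x.to("cuda:1") @ w1.T

# A gather/concat communication step reassembles the full activation;
# y matches the output of the full (unsplit) layer up to floating-point error.
y = torch.cat([y0, y1.to("cuda:0")], dim=1)              # shape (batch, out_features)
```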

In sharded data parallelism, sometimes called ZeRO-powered data parallelism, the model parameters are sharded across all of the GPUs. Each parameter will reside on, and be owned and updated by, a single GPU. As in standard data parallelism, the full training step will be performed on each GPU on an independent mini batch. When a GPU reaches parameter_i, which is stored on gpu_j, it will need to fetch its weights from gpu_j to perform the required calculations, but it will then immediately delete them so that they do not take up any local memory. This occurs in both the forward and backward pass. Once the gradient updates are calculated, each GPU needs to communicate them to the owners of their respective parameters. Note that this method is sometimes (e.g., here) categorized as a data parallel method rather than a model parallel method.
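
PyTorch’s FullyShardedDataParallel (FSDP) is one of several implementations of this idea (alongside DeepSpeed ZeRO and the sharded data parallelism feature of SageMaker’s model parallel library). A minimal sketch with a toy model and random data, assuming one process per GPU (e.g., launched with torchrun), might look as follows.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# One process per GPU, e.g. launched with torchrun.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(                              # toy model standing in for a large one
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10)
).cuda(local_rank)

# FSDP shards parameters, gradients, and optimizer state across the GPUs;
# full weights are gathered just-in-time for each forward/backward computation
# and released immediately afterwards.
model = FSDP(model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for step in range(10):
    inputs = torch.randn(32, 1024).cuda(local_rank)       # each rank uses its own mini batch
    labels = torch.randint(0, 10, (32,)).cuda(local_rank)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()                                       # each gradient shard is reduced to its owner
    optimizer.step()

dist.destroy_process_group()
```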

3D parallelism is a term introduced in this paper to refer to any combination of the data parallel and model parallel strategies discussed above. The objective of 3D parallelism is to combine the strengths of the individual techniques.
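
The sketch below is a toy illustration of how a 3D strategy factors a cluster of GPUs into a grid of data, pipeline, and tensor parallel groups. The degrees and the rank-to-coordinate mapping convention are arbitrary choices made for illustration; each library defines its own process-group layout.

```python
# A toy illustration of arranging 16 GPUs in a 3D grid of
# data x pipeline x tensor parallel groups (degrees chosen arbitrarily).
DP, PP, TP = 4, 2, 2          # 4 * 2 * 2 = 16 GPUs in total


def grid_coords(rank):
    """Map a global GPU rank to its (data, pipeline, tensor) parallel coordinates."""
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp


for rank in range(DP * PP * TP):
    print(rank, grid_coords(rank))
```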

Common to all of the methods is that data is passed between GPUs. However, the number and details of the communication paths between the GPUs, the number of data hops per training step, and the amount of data that is passed, can vary greatly.

Now you might be thinking to yourself, “Wow, what a great overview of all the many ways to perform distributed training but how the heck am I supposed to know which one to use?”. I want you to know that I feel you, my friend. The truth is that figuring this out is no simple matter. (I want to say that it is NP-hard, but I don’t have a source at hand to back that up.) The optimal choice will depend on the details of your model and on the details of your underlying training environment. If your model size is small enough to fit into the memory of a single GPU, then you are likely to opt for a data parallel distributed algorithm. If your model is too large to fit into a single GPU but the communication bandwidth between GPUs is especially small, then you may find the best option to be pipeline parallelization.

There are a number of resources (such as here and here) that attempt to provide guidance on making this decision based on the details of the model and the environment. Some libraries, including SMD, will automate some of the specific configurations once a high-level strategy is chosen. As of the time of this writing, there are no APIs (that we are aware of) that will automate the full process of finding and building an optimal distributed training topology. We can only hope that somewhere, someone is working on such a solution.
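
For example, here is a sketch of how the high-level strategy is expressed when launching a SageMaker training job with the data parallel flavor of SMD enabled, using the SageMaker Python SDK. The entry point, IAM role, instance type, and framework versions are illustrative placeholders and should be checked against the currently supported configurations in the SMD documentation.

```python
from sagemaker.pytorch import PyTorch

# Sketch of a SageMaker training job with SMD's data parallel library enabled.
estimator = PyTorch(
    entry_point="train.py",                  # your training script (placeholder)
    role="<sagemaker-execution-role>",       # placeholder IAM role
    instance_count=2,                        # two multi-GPU instances
    instance_type="ml.p4d.24xlarge",         # 8 GPUs per instance
    framework_version="1.12",                # illustrative version
    py_version="py38",
    distribution={
        "smdistributed": {"dataparallel": {"enabled": True}}
    },
)
estimator.fit()
```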

In the next section we will dive into one specific aspect of the underlying training environment that should be taken into consideration when choosing a distributed training method and algorithm.

Some GPU-to-GPU Links are not Like the Others

As noted above, all distributed training algorithms rely on the communication of data between GPUs. This data could be parameter weights, gradients, and/or activations. The communication could be direct or via some form of intermediary (e.g., a parameter server). It could be over several different media such as Ethernet, PCIe, or NVLink. On most modern training systems, NVLink supports the highest rates of data transfer between GPUs on the same instance. As a result, many distributed training algorithms will opt for direct GPU-to-GPU data transfer. In many cases, certainly if all GPUs are on a single instance, this is indeed likely to be the optimal choice. However, when training on multiple instances, each of which has multiple GPUs, it is important to be aware of the fact that the speed and latency of data transfer between two GPUs on two different instances can be very different from that between two GPUs on the same instance. This is for several reasons, including:

  1. The network bandwidth between compute instances tends to be much lower than the bandwidth of the direct GPU-to-GPU link on a single instance.
  2. The network bandwidth is shared with other types of data communication unrelated to the distributed training algorithm such as training sample loading, checkpoint saving, etc.
  3. The physical distance between the instances can incur latency, which could have a negative impact on the speed of the distributed training algorithm. (Note that, as of the time of this writing, the SageMaker APIs do not enable you to force a single cluster placement group for all of the instances.)

In some cases, you may have a dedicated network interface for boosting inter-node communication such as Amazon EFA. However, it is still no match for the single-instance direct GPU-to-GPU link.
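
One simple way to get a feel for this gap is to time a collective operation in both settings. The sketch below uses torch.distributed with the NCCL backend; the message size and iteration counts are arbitrary choices. Running it once with all processes on a single multi-GPU instance and once with processes spread across two instances is likely to reveal a large difference in the reported rate.

```python
import time
import torch
import torch.distributed as dist

# Rough sketch for comparing collective-communication throughput in different
# topologies: launch one process per GPU (e.g. via torchrun), first on a single
# multi-GPU instance and then across two instances, and compare the output.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

numel = 64 * 1024 * 1024                        # 256 MB of float32 data per all-reduce
tensor = torch.randn(numel, device=f"cuda:{local_rank}")

for _ in range(5):                              # warm-up iterations
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    gb = numel * 4 * iters / 1e9
    print(f"all_reduce of 256 MB x {iters}: {elapsed:.3f} s, "
          f"~{gb / elapsed:.1f} GB/s of tensor data processed")

dist.destroy_process_group()
```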

Ideally, we would like our distributed training algorithm to account for this distinction between the different types of GPU-to-GPU connections. We would expect such an approach to be superior to a naïve approach in which all GPU-to-GPU connections are treated the same way. In the next parts of the post we will test this out in two ways, first using SageMaker distributed data parallel (SDP) and then using SageMaker distributed model parallel (SMP).

Next Steps

Be sure to check out the second part of the post in which we will demonstrate how Amazon SageMaker’s distributed data parallel library supports data distribution in a way that distinguishes between intra-node and inter-node GPU pairs.

--

I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.