
Deep Learning At Scale: Parallel Model Training

Concept and a PyTorch Lightning example

Image created by the author using Midjourney.

Training in parallel on a large number of GPUs is the state of the art in deep learning. The open source image generation algorithm Stable Diffusion was trained on a cluster of 256 GPUs. Meta’s AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.

By using multiple GPUs, machine learning experts reduce the wall time of their training runs. Training Stable Diffusion took 150,000 GPU hours, which corresponds to more than 17 years on a single GPU. Parallel training reduced that to 25 days.

There are two types of parallel deep learning:

  • Data parallelism, where a large dataset is distributed across multiple GPUs.
  • Model parallelism, where a deep learning model that is too large to fit on a single GPU is distributed across multiple devices.

We will focus here on data parallelism, as model parallelism only becomes relevant for very large models beyond 500M parameters.

Beyond reducing wall time, there is an economic argument for parallel training: Cloud compute providers such as AWS offer single machines with up to 16 GPUs. Parallel training can take advantage of all available GPUs, and you get more value for your money.


Parallel computing

Parallelism is the dominant paradigm in high performance computing. Programmers identify tasks that can be executed independently of each other and distribute them across a large number of devices. The serial parts of the program distribute the tasks and gather the results.

Parallel computing reduces computation time. A program parallelized across four devices could ideally run four times faster than the same program running on a single device. In practice, communication overhead limits this scaling.
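
One standard way to quantify this limit is Amdahl's law (stated here for context, it is not specific to deep learning): if a fraction p of the program runs in parallel on N devices and the rest stays serial, the achievable speedup is at most

S(N) = \frac{1}{(1 - p) + p/N}

With p = 1 the speedup equals the ideal factor N; any serial work or communication overhead pushes it below that.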

As an analogy, think of a group of painters painting a wall. The communication overhead occurs when the foreman tells everyone what color to use and what area to paint. Painting can be done in parallel, and only the finishing is again a serial task.

Training deep learning models

Training a deep learning algorithm requires data and an optimization routine like stochastic gradient descent.

The training data is shuffled and split into batches of fixed size. Batch by batch, the training routine calculates the loss between the actual and predicted labels. The model parameters are adjusted according to the gradient of the loss with respect to the model parameters.

Stochastic gradient descent. Image created by the author.

An epoch of training is complete when the model has seen all training data batches once.
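
A minimal, non-distributed sketch of this loop in plain PyTorch could look as follows; the toy dataset and linear model are stand-ins for a real training setup:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset and model, just to make the sketch self-contained
dataset = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)  # shuffled, fixed batch size
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# One epoch: the model sees every training batch exactly once
for batch, labels in dataloader:
    optimizer.zero_grad()                 # reset gradients from the previous step
    loss = loss_fn(model(batch), labels)  # loss between predicted and actual labels
    loss.backward()                       # gradient of the loss w.r.t. the parameters
    optimizer.step()                      # adjust the parameters along the gradient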


Distributed data parallelism

The following diagram describes data parallelism for training a deep learning model using 3 GPUs. A replica of the untrained model is copied to each GPU. The dataset is split into 3 parts, and each part is processed in parallel on a separate GPU.

Distributed data, same model. Image created by the author.

Note that each model replica sees different subsets of the training data. After completing the epoch, the weights and biases in each model replica are different.

The three model replicas are harmonized by averaging the weights across all model instances. The updated model is broadcast across all 3 GPUs, and training continues with the next epoch.

Distributed training changes the effective batch size to number of devices * original batch size. With a per-device batch size of 256 on 8 GPUs, for example, the effective batch size is 2,048. This can affect training, convergence, and model performance.


Implementation in PyTorch Lightning

We need several ingredients for data parallelism (a code sketch follows the list):

  • A dataloader that can handle distributed training
  • An all-reduce function that harmonizes the model replicas
  • A framework for the different parallel parts to communicate with each other
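
In plain PyTorch, these ingredients could be sketched as follows; this assumes the distributed process group has already been initialized by a launcher such as torchrun, something Lightning otherwise handles for you:

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Assumes dist.init_process_group() has already been called by the launcher
dataset = TensorDataset(torch.randn(1024, 10), torch.randint(0, 2, (1024,)))

# 1) A dataloader that gives each process its own, non-overlapping shard of the data
sampler = DistributedSampler(dataset)
dataloader = DataLoader(dataset, batch_size=256, sampler=sampler)

# 2) An all-reduce that averages a tensor across all processes
def average_across_processes(tensor: torch.Tensor) -> torch.Tensor:
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # 3) uses the communication backend, e.g. NCCL
    return tensor / dist.get_world_size()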

In PyTorch Lightning, the Lightning Trainer handles the entire training process. It can perform distributed data parallel training out of the box when you specify the following keyword arguments:

  • Number of nodes – the machines in the cloud compute cluster that you want to use
  • Number of devices (GPUs) per node
  • Training strategy, here distributed data parallelism. Different strategies are available for distributed training; they take care of adjusting the training procedure.

For distributed data parallelism (ddp) training on 2 nodes with 2 GPUs each, use:

import lightning as L

trainer = L.Trainer(num_nodes=2, devices=2, strategy='ddp')
trainer.fit(model, dataloader)

By default, the Lightning Trainer uses all GPUs that are available in the computing environment. This means you have to restrict the devices explicitly if you want to use only a single GPU on a multi-GPU machine.
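
For example, to restrict training to a single GPU on a multi-GPU machine:

import lightning as L

# Use exactly one GPU, even if more are available on this machine
trainer = L.Trainer(accelerator='gpu', devices=1)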

Experiment: Parallel training with the Country211 dataset

The Country211 dataset consists of geo-tagged images from 211 countries. In my experiment, I use a pre-trained vision transformer and fine-tune this model on the Country211 dataset. The code is available on GitHub.

How to Fine-Tune a Pretrained Vision Transformer on Satellite Data

There are 31,650 training images. I used a batch size of 256 throughout and experimented with the number of GPUs, going from a single GPU on a single node up to eight GPUs distributed across four nodes. I applied an early stopping condition, where training stops if the validation accuracy does not improve for five epochs.
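
In Lightning, such a condition can be expressed with the EarlyStopping callback. The sketch below assumes the model logs its validation accuracy under the hypothetical name 'val_acc':

import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

# Stop training if the validation accuracy has not improved for five epochs;
# 'val_acc' is a placeholder for whatever metric name the model actually logs
early_stopping = EarlyStopping(monitor='val_acc', mode='max', patience=5)
trainer = L.Trainer(devices=1, callbacks=[early_stopping])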

The training wall time, which is the total time it took to train the model, decreases with the number of GPUs used. On a single GPU, the wall time is 45 minutes, which drops to 11 minutes when I use eight GPUs.

The ideal scaling would be 45 minutes / 8 GPUs = 5.6 minutes, but communication overhead and a larger number of epochs when training on more GPUs keep us from reaching this optimum.

Wall time for different numbers of devices. Image created by the author.

To compare the generalization capabilities, I calculated the test set accuracy for all trained models. The best test set accuracy was obtained when using a single GPU on a single node, but even there, the model had issues with generalization: the test set accuracy is only 15%.

The panel shows the decrease in accuracy, relative to the optimal result, for different numbers of GPUs. The more devices were used in training, the less accurate the model became when applied to the test set.

Test set accuracy. Image created by the author.

The changing accuracy could be due to the changing effective batch size. Note that the trained model depends on the training strategy; we do not end up with exactly the same model on different numbers of devices.

In this experiment, there is a tradeoff between generalization accuracy and speed. Training on 8 GPUs was four times faster, but 11% less accurate than training on a single device.

A look at parallel training

In the training log file, you will see lines like these:

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

The parallel training processes must be able to communicate with each other. The communication backend, here the NVIDIA Collective Communications Library (nccl), is automatically initialized depending on the platform.

We have requested two nodes with two GPUs each, for a total of four parallel processes. They are enumerated by their global rank (0–3). On each node, the local rank identifies the GPU that is used (0 or 1).
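
For illustration, each process can query these ranks itself once the distributed environment is up; LOCAL_RANK is an environment variable that launchers such as torchrun (and Lightning) typically set:

import os
import torch.distributed as dist

# Inside an initialized distributed run (as started by Lightning or torchrun)
global_rank = dist.get_rank()                       # 0..3 for the 4 processes above
world_size = dist.get_world_size()                  # 4 processes in total
local_rank = int(os.environ.get('LOCAL_RANK', 0))   # GPU index on this node, 0 or 1
print(f'global rank {global_rank}/{world_size}, local rank {local_rank}')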

When the model is large, the all-reduce operation can take a long time. Compute clusters connect the GPUs with fast interconnects, but sending full model replicas across the network puts a strain on communication.

Model parallelism

Lightning provides strategies for model parallelism. Note that this should only be explored for very large models with billions of parameters, where the model parameters and gradients exceed GPU memory.
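
As one example, the Trainer can be switched to fully sharded data parallel (FSDP) training, which shards model parameters, gradients, and optimizer states across devices. This is a minimal sketch, not a tuned configuration:

import lightning as L

# Shard parameters, gradients, and optimizer states across 4 GPUs on one node
trainer = L.Trainer(devices=4, accelerator='gpu', strategy='fsdp')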


Summary

Parallel training can speed up the training process and help you make the most of your computational resources.

We distinguish between data parallelism, where replicas of the model see subsets of the data during training, and model parallelism, where the model itself is distributed across multiple devices.

With PyTorch Lightning, data parallel training is as simple as specifying the distributed data parallelism strategy and the number of nodes and devices (GPUs) per node.

Data parallel training reduces the wall time and changes the effective batch size. Models trained on different numbers of devices may perform differently, and careful evaluation is recommended.

Model parallel training should only be considered for very large models exceeding 500M parameters.

The code to replicate the experiment can be found on my GitHub:

GitHub – crlna16/pretrained-vision-transformer: Pretrained Vision Transformer with PyTorch…


References

  • Tal Ben-Nun et al, "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis", 2018, https://arxiv.org/abs/1802.09941
  • OpenAI Country211 dataset: source (subset of YFCC100M) and license
  • Vision transformer model: Dosovitskiy et al, An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, ICLR, 2021. Paper and code
  • PyTorch Lightning project for data-parallel training: docs
  • Will Falcon on distributed training (TDS, 2019)
