
Parallel training on a large number of GPUs is state of the art in Deep Learning. The open source image generation algorithm Stable Diffusion was trained on a cluster of 256 GPUs. Meta’s AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.
By using multiple GPUs, machine learning experts reduce the wall time of their training runs. Training Stable Diffusion took 150,000 GPU hours, equivalent to more than 17 years on a single GPU. Parallel training reduced that to 25 days.
There are two types of parallel deep learning:
- Data parallelism, where a large dataset is distributed across multiple GPUs.
- Model parallelism, where a deep learning model that is too large to fit on a single GPU is distributed across multiple devices.
We will focus here on data parallelism, as model parallelism only becomes relevant for very large models beyond 500M parameters.
Beyond reducing wall time, there is an economic argument for parallel training: Cloud compute providers such as AWS offer single machines with up to 16 GPUs. Parallel training can take advantage of all available GPUs, and you get more value for your money.
Parallel computing
Parallelism is the dominant paradigm in high performance computing. Programmers identify tasks that can be executed independently of each other and distribute them across a large number of devices. The serial parts of the program distribute the tasks and gather the results.
Parallel Computing reduces computation time. A program parallelized across four devices could ideally run four times faster than the same program running on a single device. In practice, communication overhead limits this scaling.
As an analogy, think of a group of painters painting a wall. The communication overhead occurs when the foreman tells everyone which color to use and which area to paint. The painting itself happens in parallel; only the finishing touches are again a serial task.
Training deep learning models
Training a deep learning algorithm requires data and an optimization routine like stochastic gradient descent.
The training data is shuffled and split into batches of fixed size. Batch by batch, the training routine calculates the loss between the actual and predicted labels. The model parameters are adjusted according to the gradient of the loss with respect to the model parameters.

An epoch of training is complete when the model has seen all training data batches once.
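As a minimal sketch, one training epoch in plain PyTorch could look like this (assuming that model, dataloader, and loss_fn are defined elsewhere):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for batch, labels in dataloader:         # one pass over all batches = one epoch
    optimizer.zero_grad()                # reset gradients from the previous step
    predictions = model(batch)           # forward pass
    loss = loss_fn(predictions, labels)  # loss between actual and predicted labels
    loss.backward()                      # gradient of the loss w.r.t. the parameters
    optimizer.step()                     # adjust the parameters along the gradient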
Distributed data parallelism
The following diagram describes data parallelism for training a deep learning model using 3 GPUs. A replica of the untrained model is copied to each GPU. The dataset is split into 3 parts, and each part is processed in parallel on a separate GPU.

Note that each model replica sees different subsets of the training data. After completing the epoch, the weights and biases in each model replica are different.
The three model replicas are harmonized by averaging the weights across all model instances. The updated model is broadcast across all 3 GPUs, and training continues with the next epoch.
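As a sketch, this averaging step can be written with PyTorch's collective communication primitives; the snippet assumes that the distributed process group has already been initialized. Note that PyTorch's built-in DistributedDataParallel averages gradients after every batch rather than weights after every epoch, but the all-reduce operation itself looks the same:

import torch.distributed as dist

def average_model_weights(model):
    # Average the parameters of this model replica across all processes
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)  # sum over all replicas
        param.data /= world_size                           # turn the sum into a mean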
Distributed training changes the effective batch size to the number of devices times the original batch size: for example, a per-GPU batch size of 256 on 4 GPUs yields an effective batch size of 1024. This can affect training, convergence, and model performance.
Implementation in PyTorch Lightning
We need several ingredients for data parallelism:
- A dataloader that can handle distributed training (see the sampler sketch after this list)
- An all-reduce function that harmonizes the model replicas
- A framework for the different parallel parts to communicate with each other
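As a sketch of the first ingredient, PyTorch ships a DistributedSampler that hands each process a disjoint shard of the dataset; Lightning inserts this sampler automatically when you train with the ddp strategy, so you rarely have to write it yourself (dataset here is a placeholder):

from torch.utils.data import DataLoader, DistributedSampler

# Each process receives a different, non-overlapping shard of the dataset.
# Requires an initialized distributed process group.
sampler = DistributedSampler(dataset, shuffle=True)
dataloader = DataLoader(dataset, batch_size=256, sampler=sampler)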
In PyTorch Lightning, the Lightning Trainer handles the entire training process. It can perform distributed data parallel training out of the box when you specify the following keyword arguments:
- Number of nodes – the machines in the cloud compute cluster that you want to use
- Number of devices (GPUs) per node
- Training strategy, here distributed data parallelism. Different strategies are available for distributed training; they take care of adjusting the training procedure accordingly.
For distributed data parallelism (ddp) training on 2 nodes with 2 GPUs each, use:
import lightning as L

# Two nodes with two GPUs each, synchronized with distributed data parallelism
trainer = L.Trainer(num_nodes=2, devices=2, strategy='ddp')
trainer.fit(model, dataloader)
By default, the Lightning Trainer uses all GPUs available in the current computing environment. This means you need to be explicit if you want to use only a single GPU on a multi-GPU machine.
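For example, to restrict training to a single GPU on a multi-GPU machine:

import lightning as L

# Use exactly one GPU, even if more are available on the machine
trainer = L.Trainer(accelerator='gpu', devices=1)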
Experiment: Parallel training with the Country211 dataset
The Country211 dataset consists of geo-tagged images from 211 countries. In my experiment, I use a pre-trained vision transformer and fine-tune this model on the Country211 dataset. The code is available on GitHub.
There are 31,650 training images. I used a batch size of 256 throughout and experimented with the number of GPUs, going from a single GPU on a single node up to eight GPUs distributed across four nodes. I applied an early stopping condition, where training stops if the validation accuracy does not improve for five epochs.
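One way to express such a condition is Lightning's EarlyStopping callback; a sketch, where the metric name val_acc is an assumption about what the LightningModule logs:

import lightning as L
from lightning.pytorch.callbacks import EarlyStopping

# Stop when the logged validation accuracy ("val_acc" is an assumed metric name)
# has not improved for five consecutive epochs
early_stopping = EarlyStopping(monitor='val_acc', mode='max', patience=5)
trainer = L.Trainer(devices=1, callbacks=[early_stopping])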
The training wall time, which is the total time it took to train the model, decreases with the number of GPUs used. On a single GPU, the wall time is 45 minutes, which drops to 11 minutes when I use eight GPUs.
The ideal scaling would be 45 minutes / 8 GPUs = 5.6 minutes, but communication overhead and a larger number of epochs when training on more GPUs keep us from reaching this optimum.
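As a quick back-of-the-envelope calculation in Python:

# Speedup and parallel efficiency from the measured wall times
t_single, t_parallel, n_gpus = 45, 11, 8     # minutes on 1 GPU, minutes on 8 GPUs, GPUs

speedup = t_single / t_parallel              # ~4.1x instead of the ideal 8x
efficiency = speedup / n_gpus                # ~0.51, i.e. about 51%
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")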

To compare the generalization capabilities, I calculated the test set accuracy with all trained models. The best test set accuracy was obtained when using a single GPU on a single node, but even there, the model had issues with generalization: the test set accuracy is only 15%.
The panel shows the decrease in accuracy, relative to the optimal result, for different numbers of GPUs. The more devices were used in training, the less accurate the model became when applied to the test set.

The drop in accuracy could be due to the larger effective batch size; note that the trained model depends on the training strategy, so we do not end up with exactly the same model.
In this experiment, there is a tradeoff between generalization accuracy and speed. Training on 8 GPUs was four times faster, but 11% less accurate than training on a single device.
A look at parallel training
In the training log file, you will see lines like these:
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
The parallel training processes must be able to communicate with each other. The communication backend, here the NVIDIA Collective Communications Library (NCCL), is initialized automatically depending on the platform.
We have requested two nodes with two GPUs each, for a total of four parallel processes. They are enumerated by their global rank (0–3). On each node, the local rank identifies the GPU that is used (0 or 1).
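As a sketch, assuming every node hosts the same number of GPUs, the global rank follows from the node index and the local rank:

# Global rank of a process, assuming the same number of GPUs on every node
def global_rank(node_index, local_rank, gpus_per_node):
    return node_index * gpus_per_node + local_rank

# Example: second GPU (local_rank=1) on the second node (node_index=1)
assert global_rank(1, 1, gpus_per_node=2) == 3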
When the model is large, the all-reduce operation can take a long time. Compute clusters connect their GPUs with fast interconnects, but sending full model replicas back and forth puts a strain on communication.
Model parallelism
Lightning provides strategies for model parallelism. Note that this should only be explored for very large models with billions of parameters, where the model parameters and gradients exceed GPU memory.
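For example, Lightning registers a fully sharded data parallel strategy (FSDP) that shards parameters, gradients, and optimizer states across devices; a minimal sketch, assuming model and dataloader are defined as before:

import lightning as L

# Shard the model itself across 8 GPUs instead of replicating it
trainer = L.Trainer(devices=8, strategy='fsdp')
trainer.fit(model, dataloader)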
Summary
Parallel training can speed up the training process and help you make the most of your computational resources.
We distinguish between data parallelism, where replicas of the model see subsets of the data during training, and model parallelism, where the model itself is distributed across multiple devices.
With PyTorch Lightning, data parallel training is as simple as specifying the distributed data parallelism strategy and the number of nodes and devices (GPUs) per node.
Data parallel training reduces the wall time and changes the effective batch size. Models trained on different numbers of devices may perform differently, and careful evaluation is recommended.
Model parallel training should only be considered for very large models exceeding 500M parameters.
The code to replicate the experiment can be found on my GitHub: crlna16/pretrained-vision-transformer.
References
- Tal Ben-Nun et al, "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis", 2018, https://arxiv.org/abs/1802.09941
- OpenAI Country211 dataset: source (subset of YFCC100M) and license
- Vision transformer model: Dosovitskiy et al, An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale, ICLR, 2021. Paper and code
- PyTorch Lightning project for data-parallel training: docs
- Will Falcon on distributed training (TDS, 2019)