Deep Learning At Scale: Parallel Model Training
Concept and a PyTorch Lightning example
7 min read · Apr 5, 2024
Parallel training across large numbers of GPUs is the state of the art in deep learning. The open-source image generation model Stable Diffusion was trained on a cluster of 256 GPUs, and Meta’s AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs, which are used to train models such as Llama 3.