Deep Learning At Scale: Parallel Model Training

Concept and a PyTorch Lightning example

Caroline Arnold
Towards Data Science
7 min read · Apr 5, 2024


Eight parallel neon bulbs in rainbow colors against a dark background. Image created by the author using Midjourney.

Parallel training on a large number of GPUs is the state of the art in deep learning. The open-source image generation model Stable Diffusion was trained on a cluster of 256 GPUs. Meta’s AI Research SuperCluster contains more than 24,000 NVIDIA H100 GPUs that are used to train models such as Llama 3.
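To give a sense of how little code multi-GPU training requires in PyTorch Lightning, here is a minimal sketch. The toy model, the random dataset, and the 8-GPU setting are illustrative assumptions, not a real workload; the point is that the Trainer takes care of the distribution.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L


# Toy model: a single linear layer wrapped in a LightningModule.
class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


# Random tensors stand in for a real dataset.
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64)

# With strategy="ddp", Lightning launches one process per GPU; each process
# holds a full model replica, sees its own shard of the data, and gradients
# are averaged across processes after every backward pass.
trainer = L.Trainer(accelerator="gpu", devices=8, strategy="ddp", max_epochs=1)
trainer.fit(ToyModel(), loader)
```

Going from one GPU to eight is a matter of changing the `devices` and `strategy` arguments; the model and data code stay the same.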

