Making Sense of Big Data
After graduating from the sandpit dream-world of MNIST and CIFAR, it’s time to move on to ImageNet experiments. Perhaps you too are standing and staring at that million-plus-image dataset, wondering from which direction to approach the beast. Here I’ll give the step-by-step approach I took, in the hope it helps you wrestle with the monster.
First, the warning:
Do not underestimate the compute needed for running ImageNet experiments: multiple GPUs and multiple hours per experiment are often required.
If you’re reading this line, then you’ve decided you have enough compute and patience to continue. Let’s look at the core steps we need to take. My approach uses multiple GPUs on a compute cluster managed by SLURM (my university cluster), PyTorch, and Lightning. This tutorial assumes a basic ability to navigate them all ❤
The Key Steps
1. Set up DDP in Lightning
2. Access the ImageNet dataset
3. Write the bash script instructions for SLURM
Setting up DDP in Lightning
Wait, what is DDP?
Good question. DDP stands for Distributed Data-Parallel and is a method that allows communication between the different GPUs and different nodes within the cluster you’ll be running on. There are lots of options for doing this, but we’re only going to cover DDP since it is the recommended approach and comes implemented out of the box with Lightning.
DDP trains a copy of the model on each of the GPUs you have available and breaks up a mini-batch into exclusive slices for each GPU. The forward pass is pretty simple. Each GPU predicts on its sub-mini-batch and the predictions are merged. Simples.

The backward pass is a bit more tricky. The non-distributed version of DDP (called, you guessed it, DP) requires a ‘master’ GPU that collects all the outputs, calculates the gradients, and then communicates these back to all of the model copies.
But, DDP says no to the centralised bureaucracy. Instead, each GPU is responsible for sending the model weight gradients – calculated using its sub-mini-batch – to each of the other GPUs. Upon receiving a full set of gradients, each GPU aggregates the results. The outcome? Each model copy on each GPU has the same update. The name for this is an all-reduce operation.
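To make the idea concrete, here’s a rough sketch of the all-reduce step in plain PyTorch. This is purely illustrative: DDP and Lightning do this for you, with extra optimisations such as gradient bucketing, so you never write this yourself.

import torch.distributed as dist

def all_reduce_gradients(model):
    # Each process (one per GPU) contributes its local gradients.
    # After all_reduce, every process holds the sum across GPUs,
    # which we then divide by the world size to get the average update.
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= dist.get_world_size()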

Ok, so you’re telling me Lightning helps me set this up?
Yes, it certainly does. I’m assuming you have a bit of Lightning experience, reader, so I’ll just concentrate on the key things to do:
Synchronise logging
Just like making sure the gradient updates are the same, you also need to update any metric logging to account for the need to communicate across GPUs. If you don’t, your logged accuracy will depend on which GPU computed it, since each GPU only sees its own subset of the data.
It’s pretty simple to convert for multiple GPUs: just add sync_dist=True to all of your logging calls.
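For example, a validation step in your LightningModule might look something like this (the metric names and model details here are just placeholders):

import torch.nn.functional as F

# Inside your LightningModule
def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.cross_entropy(logits, y)
    acc = (logits.argmax(dim=1) == y).float().mean()
    # sync_dist=True reduces each metric across all GPUs before logging
    self.log('val_loss', loss, sync_dist=True)
    self.log('val_acc', acc, sync_dist=True)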
Setting GPU device and DDP backend
Now we need to update our Trainer to match the number of GPUs we’re using. Or you can just set the number of GPUs to -1 and let Lightning figure out how many you’ve got.
As mentioned earlier, I’m using DDP as my distributed backend, so I set my accelerator as such. Nothing much to do here:
trainer = Trainer(gpus=-1, accelerator='ddp')
That’s it for the Python code. Depending on how you set up your model, you might also need to remove any .to() or .cuda() calls, which will cause issues now that Lightning is managing device placement.
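For instance, if you create new tensors inside your LightningModule, let Lightning decide where they live instead of hard-coding a device (the tensor names below are just illustrative):

# Before: hard-coded device, breaks under DDP
# mask = torch.ones(len(x)).cuda()

# After: device-agnostic alternatives
mask = torch.ones(len(x), device=self.device)  # Lightning sets self.device for you
mask = x.new_ones(len(x))                      # or inherit device/dtype from an existing tensor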
If you hit any snags: https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html
ImageNet
With the Lightning code ready, it’s time to grab ImageNet. The dataset is no longer quite as simple to download as it once was via torchvision. Instead, I’ll give the two options I found that worked.
Long version
Go to the official ImageNet website (image-net.org) and request access. This can take a few days before it’s granted for non-commercial uses.
Short version
Go to Kaggle, join the competition, and download the data using the bash command below.
kaggle competitions download -c imagenet-object-localization-challenge
In both cases, when downloading to your cluster instance you’ll likely want to download to scratch rather than your main filespace since, well, ImageNet is a beast and will soon overrun even the most generous storage allowance.
Creating your Lightning data module
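Here’s a minimal sketch of such a data module. The data path and loader settings are placeholders for your own setup, and it assumes the usual ImageNet folder layout with class sub-directories under train/ and val/.

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class ImageNet_Module(pl.LightningDataModule):
    def __init__(self, data_dir='/scratch/imagenet', batch_size=128, num_workers=8):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.num_workers = num_workers
        # Standard ImageNet normalisation statistics
        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
        self.train_transform = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ])
        self.val_transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize,
        ])

    def setup(self, stage=None):
        self.train_set = datasets.ImageFolder(f'{self.data_dir}/train',
                                              transform=self.train_transform)
        self.val_set = datasets.ImageFolder(f'{self.data_dir}/val',
                                            transform=self.val_transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size,
                          num_workers=self.num_workers, shuffle=True,
                          pin_memory=True)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size,
                          num_workers=self.num_workers, pin_memory=True)

    def test_dataloader(self):
        return self.val_dataloader()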
You can connect this data module in the same way you would any other, so that training becomes something along the lines of:
data = ImageNet_Module()
model = YourModel()
trainer = Trainer(gpus=-1, accelerator='ddp', max_epochs=90)
trainer.fit(model, data)
trainer.test()
Of course, you’ll want to put this into a nice Python file with all the bells, whistles, and custom models you want ready to be called by the bash script. Ok, I think we’re ready for the final piece of glue, the SLURM script.
SLURM script
At this point, all the hard work is done. Below is the script I run on my university cluster:
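Something along these lines should work; the partition, environment, and script names below are placeholders for your own setup, and Lightning picks up the GPU allocation from the SLURM environment, with one task per GPU.

#!/bin/bash
#SBATCH --job-name=imagenet
#SBATCH --partition=gpu                # placeholder: use your cluster's GPU partition
#SBATCH --nodes=1
#SBATCH --gres=gpu:4                   # request 4 GPUs
#SBATCH --ntasks-per-node=4            # one task per GPU for DDP
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=48:00:00

# Activate the conda environment you set up on the cluster
source ~/miniconda3/etc/profile.d/conda.sh
conda activate my-lightning-env        # placeholder environment name

# Launch training; srun starts one process per task/GPU
srun python train_imagenet.py          # placeholder script name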
Of course, you’ll be constrained by the resources and limits you have allocated, but this should give you a basic outline to get started. To use this outline you’ll need to have set up your conda environment and installed the libraries you require on the cluster.
That’s all folks
Ok that’s a wrap.
If all has gone to plan you should now be in the process of training.
For my setup, an out-of-the-box ResNet18 model using 4x RTX 8000s takes approximately 30 minutes per epoch with a batch size of 128.
This has been an n=1 example of how to get going with ImageNet experiments using SLURM and Lightning, so I’m sure snags and hitches will occur with slightly different resources, libraries, and versions. But hopefully this will help you get started taming the beast.
Thank you for reading ❤
The Tools used
- PyTorch (1.7)
- PyTorch Lightning (1.2)
- SLURM manager (university compute cluster)
- 4 pristine Quadro RTX 8000s