Making Sense of Big Data
Why Parallelized Training Might Not be Working for You
A guide to the motivations, benefits and caveats of parallelized training of Neural Networks
Introduction
There are a few methods for parallelized training of Neural Networks.
- Inter-Model Parallelism a.k.a Parallelized Hyperparameter Search
- Data Parallelism
- Intra-Model Parallelism a.k.a Model Parallelism
- Pipelined Parallelism
In this post we will explore the two most common methods for parallelism, Inter-Model Parallelism and Data Parallelism. We will use a simple CNN model on the CIFAR10 dataset to demonstrate these techniques. We’re going to train this model on a 4-GPU machine running the AWS Deep Learning AMI with Pytorch (1.7.1)
Advantages of Parallelized Training
The most obvious advantage of parallelized training is speed. In the case of a hyperparameter search, simultaneously evaluating multiple configurations allows us to quickly narrow down the most promising options.
With distributed data parallel (DDP) training, a copy of the model’s parameters are placed on each available GPU and each copy is fed a different subset of the entire dataset.
After every batch evaluation, the gradients of the the copies are synced and averaged. The weights of the copies of the model are updated based on these synced gradients.
This increases the effective batch size being used by the model, allowing it to train on batches that are larger than what would fit in memory on a single GPU. As the size of training datasets increase, DDP serves as a way to keep training times reasonable.
Significance of Batch Size
Using larger batch sizes results in faster training since the network is able to iterate over the entire dataset in fewer steps. However, empirical evidence suggests that larger batch sizes tend to converge to sharp minima in the loss surface, which lead to poor generalization [1]. In contrast, smaller batch sizes result in wider, flatter minima that have good generalization capabilities.
Let’s test this out by running a hyperparameter search over a set of candidate batch sizes. We will train our CNN for 10 epochs over CIFAR10 using a constant learning rate of 0.001.
We’re going to run the Comet Optimizer in Parallel and feed in an Optimizer Config file as a command line argument.
comet optimize -j 4 comet-pytorch-parallel-hpo.py optim.config
Here j
is the number of parallel processes we want to start. You can find more details about the Comet Optimizer here
The optim.config
file is just a JSON object that contains our parameter search grid.
{
# We pick the Bayes algorithm:
"algorithm": "random",
# Declare your hyperparameters in the Vizier-inspired format:
"parameters": {
"batch_size": {
"type": "discrete",
"values": [8, 32, 128, 512]
},
},
# Declare what we will be optimizing, and how:
"spec": {
"metric": "val_loss",
"objective": "minimize",
},
}
When looking at the line chart, we see that larger batch sizes lead to a shorter runtime for training. However, the larger batch sizes also result in a much worse train error, and test accuracy. In fact, the smaller the batch size, the better the test accuracy. This is a problem, since the smallest batch size takes nearly 8X longer to complete.
The smallest batch size takes nearly 8X longer to complete.
Scaling the Learning Rate
A key aspect of using large batch sizes involves scaling the learning rate. A general rule of thumb is to follow a Linear Scaling Rule [2]. This means that when the batch size increases by a factor of K the learning rate must also increase by a factor of K.
Let’s investigate this in our hyperparameter search. Let’s linearly scale our learning rate depending on the batch size. We will use a batch size of 8 as our scaling constant.
That seemed to do the trick. Scaling the learning rate based on batch size allows us to mitigate the generalization gap! Let’s combine this idea with our distributed data parallel approach.
Using a Larger Effective Batch Size
With DDP training the dataset is divided amongst the number of available GPUs.
Lets run a set of experiments with using the Pytorch Distributed Data Parallel Module. The Module handles copying the model to each GPU as well as synchronizing the gradients and updating the weights across GPU processes.
Using the ddp module is quite straight forward. Wrap your existing model within the DDP module, and assign it to a GPU
model = Net()
model.cuda(gpu_id)
ddp_model = DDP(model, device_ids=[gpu_id])
We will use the DistributedSampler object to ensure that the data is distributed properly across each GPU processes
# Load training data
trainset, testset = load_data()
test_abs = int(len(trainset) * 0.8)
train_subset, val_subset = random_split(
trainset, [test_abs, len(trainset) - test_abs]
)
train_sampler = torch.utils.data.distributed.DistributedSampler(
train_subset, num_replicas=world_size, rank=global_process_rank
)
trainloader = torch.utils.data.DataLoader(
train_subset,
batch_size=PER_REPLICA_BATCH_SIZE,
sampler=train_sampler,
num_workers=8,
)
valloader = torch.utils.data.DataLoader(
val_subset, batch_size=PER_REPLICA_BATCH_SIZE, shuffle=True, num_workers=8
)
Finally run the training loop using the DDP wrapped model
for epoch in range(args.epochs):
train_loss = train(ddp_model, optimizer, criterion, trainloader, epoch, gpu_id)
val_loss, val_acc = evaluate(ddp_model, criterion, valloader, epoch, gpu_id)
In our case, each GPU receives a quarter of the dataset. When we run training in this manner our effective batch size is the product of the number of GPUs and the batch size per GPU. So when we set a batch size per GPU of 8, our effective batch size is actually 32. We can verify this by comparing the DDP training run to a single GPU training run with batch size 32. Notice how the curves look similar, and run for a similar number of training steps. These curves show that even though we are using a smaller batch size per GPU process, our model performance is still dependent on the effective batch size.
Putting it All Together
Lets combine what we’ve learned about learning rate scaling and rerun DDP training.
It seems that even learning rate scaling has its limits. Training improves as we increase our batch size up to a certain value, after which we see our test accuracy starting to deteriorate.
Caveats
Theoretically, distributing the training over multiple GPUs should speed up the overall time taken to train a model. In practice, this is not always the case. Synchronizing the gradients across multiple GPU processes incurs a communication overhead that is not trivial to minimize.
In the Panel below we compare the runtime duration for a single process training run with batch size 32 to a DDP training run with the same effective batch size. The single process run takes 73 seconds to complete, while the DDP training run is almost eight times slower, taking 443 seconds to complete.
This is likely due to the fact that the gradients are being synchronized every time we call loss.backward()
in our training code. The constant communication between processes causes the overall runtime to increase. This is something to be mindful of when setting up DDP training. If you are running your experiments in a multi-machine setting, ensure that the network bandwidth is sufficient to handle each process sending and receiving the entire model's checkpoint data.
Another way to speed up this process is to change the synchronization schedule of the gradients, but that is out of scope for this report.
Conclusion
We hope you found this introductory guide to parallelized training useful. In subsequent posts we will go over how to optimize your code for distributed training in order to get the most out of your compute infrastructure.
Read the full report with interactive visualizations here
Comet.ml — easily track, compare and debug your models!
If you haven’t tried out Comet it’s an awesome tool that allows you to track, compare and debug your ML models!
It works with Colab, Jupyter notebooks and scripts and most importantly it’s 100% free!
Get started with Experiment Tracking today!
References
[1] Keskar, Nitish Shirish, et al. “On large-batch training for deep learning: Generalization gap and sharp minima.” arXiv preprint arXiv:1609.04836 (2016).
[2] Goyal, Priya, et al. “Accurate, large minibatch sgd: Training ImageNet in 1 hour.” arXiv preprint arXiv:1706.02677 (2017).