Making Sense of Big Data

Why Parallelized Training Might Not be Working for You

A guide to the motivations, benefits and caveats of parallelized training of Neural Networks

Dhruv Nair

Published in

Towards Data Science

7 min readApr 21, 2021

Introduction

There are a few methods for parallelized training of Neural Networks.

Inter-Model Parallelism a.k.a Parallelized Hyperparameter Search
Data Parallelism
Intra-Model Parallelism a.k.a Model Parallelism
Pipelined Parallelism

In this post we will explore the two most common methods for parallelism, Inter-Model Parallelism and Data Parallelism. We will use a simple CNN model on the CIFAR10 dataset to demonstrate these techniques. We’re going to train this model on a 4-GPU machine running the AWS Deep Learning AMI with Pytorch (1.7.1)

Advantages of Parallelized Training

The most obvious advantage of parallelized training is speed. In the case of a hyperparameter search, simultaneously evaluating multiple configurations allows us to quickly narrow down the most promising options.

With distributed data parallel (DDP) training, a copy of the model’s parameters are placed on each available GPU and each copy is fed a different subset of the entire dataset.

After every batch evaluation, the gradients of the the copies are synced and averaged. The weights of the copies of the model are updated based on these synced gradients.

This increases the effective batch size being used by the model, allowing it to train on batches that are larger than what would fit in memory on a single GPU. As the size of training datasets increase, DDP serves as a way to keep training times reasonable.

Significance of Batch Size

Using larger batch sizes results in faster training since the network is able to iterate over the entire dataset in fewer steps. However, empirical evidence suggests that larger batch sizes tend to converge to sharp minima in the loss surface, which lead to poor generalization [1]. In contrast, smaller batch sizes result in wider, flatter minima that have good generalization capabilities.

Let’s test this out by running a hyperparameter search over a set of candidate batch sizes. We will train our CNN for 10 epochs over CIFAR10 using a constant learning rate of 0.001.

We’re going to run the Comet Optimizer in Parallel and feed in an Optimizer Config file as a command line argument.

comet optimize -j 4 comet-pytorch-parallel-hpo.py optim.config

Source Code for Parallelized Hyperparameter Optimization

Here j is the number of parallel processes we want to start. You can find more details about the Comet Optimizer here

The optim.config file is just a JSON object that contains our parameter search grid.

{
   # We pick the Bayes algorithm:
   "algorithm": "random",
   # Declare your hyperparameters in the Vizier-inspired format:
   "parameters": {
      "batch_size": {
         "type": "discrete", 
         "values": [8, 32, 128, 512]
      },
   },
   # Declare what we will be optimizing, and how:
   "spec": {
     "metric": "val_loss",
     "objective": "minimize",
   },
}

Left: Runtime (seconds) for each Batch-Size Configuration, Right: Corresponding Test Accuracy for each Batch-Size Configuration. Source: *Comet.ml*

When looking at the line chart, we see that larger batch sizes lead to a shorter runtime for training. However, the larger batch sizes also result in a much worse train error, and test accuracy. In fact, the smaller the batch size, the better the test accuracy. This is a problem, since the smallest batch size takes nearly 8X longer to complete.

The smallest batch size takes nearly 8X longer to complete.

Scaling the Learning Rate

A key aspect of using large batch sizes involves scaling the learning rate. A general rule of thumb is to follow a Linear Scaling Rule [2]. This means that when the batch size increases by a factor of K the learning rate must also increase by a factor of K.

Let’s investigate this in our hyperparameter search. Let’s linearly scale our learning rate depending on the batch size. We will use a batch size of 8 as our scaling constant.

Source Code for these Scaled Learning Rate Experiments

Left: Train Loss Curves for different batch sizes with Scaled Learning Rate, Right: Test Accuracy for configurations with Scaled Learning Rates. *Source:* *Comet.ml*

That seemed to do the trick. Scaling the learning rate based on batch size allows us to mitigate the generalization gap! Let’s combine this idea with our distributed data parallel approach.

Using a Larger Effective Batch Size

With DDP training the dataset is divided amongst the number of available GPUs.

Lets run a set of experiments with using the Pytorch Distributed Data Parallel Module. The Module handles copying the model to each GPU as well as synchronizing the gradients and updating the weights across GPU processes.

Using the ddp module is quite straight forward. Wrap your existing model within the DDP module, and assign it to a GPU

model = Net()
model.cuda(gpu_id)
ddp_model = DDP(model, device_ids=[gpu_id])

We will use the DistributedSampler object to ensure that the data is distributed properly across each GPU processes

# Load training data
    trainset, testset = load_data()
    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs]
    )

    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_subset, num_replicas=world_size, rank=global_process_rank
    )

    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=PER_REPLICA_BATCH_SIZE,
        sampler=train_sampler,
        num_workers=8,
    )
    valloader = torch.utils.data.DataLoader(
        val_subset, batch_size=PER_REPLICA_BATCH_SIZE, shuffle=True, num_workers=8
    )

Finally run the training loop using the DDP wrapped model

for epoch in range(args.epochs):
   train_loss = train(ddp_model, optimizer, criterion, trainloader,       epoch, gpu_id)
   val_loss, val_acc = evaluate(ddp_model, criterion, valloader, epoch, gpu_id)

Source Code for DDP Examples

In our case, each GPU receives a quarter of the dataset. When we run training in this manner our effective batch size is the product of the number of GPUs and the batch size per GPU. So when we set a batch size per GPU of 8, our effective batch size is actually 32. We can verify this by comparing the DDP training run to a single GPU training run with batch size 32. Notice how the curves look similar, and run for a similar number of training steps. These curves show that even though we are using a smaller batch size per GPU process, our model performance is still dependent on the effective batch size.

Putting it All Together

Lets combine what we’ve learned about learning rate scaling and rerun DDP training.

Left: Train Loss with Scaled Learning Rate and DDP Training, Right: Test Acc with Scaled Learning Rate and DDP Training. *Source:* *Comet.ml*

It seems that even learning rate scaling has its limits. Training improves as we increase our batch size up to a certain value, after which we see our test accuracy starting to deteriorate.

Caveats

Theoretically, distributing the training over multiple GPUs should speed up the overall time taken to train a model. In practice, this is not always the case. Synchronizing the gradients across multiple GPU processes incurs a communication overhead that is not trivial to minimize.

In the Panel below we compare the runtime duration for a single process training run with batch size 32 to a DDP training run with the same effective batch size. The single process run takes 73 seconds to complete, while the DDP training run is almost eight times slower, taking 443 seconds to complete.

This is likely due to the fact that the gradients are being synchronized every time we call loss.backward() in our training code. The constant communication between processes causes the overall runtime to increase. This is something to be mindful of when setting up DDP training. If you are running your experiments in a multi-machine setting, ensure that the network bandwidth is sufficient to handle each process sending and receiving the entire model's checkpoint data.

Another way to speed up this process is to change the synchronization schedule of the gradients, but that is out of scope for this report.

Training Loss Curves Single Process Training vs Distributed Training. *Source:* *Comet.ml*

Conclusion

We hope you found this introductory guide to parallelized training useful. In subsequent posts we will go over how to optimize your code for distributed training in order to get the most out of your compute infrastructure.

Read the full report with interactive visualizations here
Comet.ml — easily track, compare and debug your models!
If you haven’t tried out Comet it’s an awesome tool that allows you to track, compare and debug your ML models!
It works with Colab, Jupyter notebooks and scripts and most importantly it’s 100% free!
Get started with Experiment Tracking today!
Try Comet for Free Today!

References

[1] Keskar, Nitish Shirish, et al. “On large-batch training for deep learning: Generalization gap and sharp minima.” arXiv preprint arXiv:1609.04836 (2016).

[2] Goyal, Priya, et al. “Accurate, large minibatch sgd: Training ImageNet in 1 hour.” arXiv preprint arXiv:1706.02677 (2017).