A deep dive into how the SageMaker distributed data parallel library speeds up training of the state-of-the-art EfficientNet model by up to 30%
Convolutional Neural Networks (CNNs) are now pervasively used to perform computer vision tasks. Domains such as autonomous vehicles, security systems, and healthcare are moving towards adopting CNNs in their application workflows. These use cases typically require high-accuracy models with varying computational requirements depending on where they are deployed. For instance, the computing infrastructure available to an edge-based security system is very different from that of a cloud-based medical imaging system.
However, training and integrating machine learning models into applications can be cumbersome, which is why we at AWS have developed Amazon SageMaker, a fully managed end-to-end machine learning (ML) platform. SageMaker provides tooling and manages infrastructure so that ML scientists and developers can focus solely on model development. You can browse the SageMaker Examples GitHub repository for insights into how SageMaker can simplify your machine learning pipeline.
SageMaker also allows you to train models faster by distributing training across multiple GPUs. To help you train your models faster and cheaper, our team at AWS has developed the SageMaker distributed data parallel (SMDDP) library to achieve near-linear scaling efficiency with minimal code changes. SMDDP performs optimized inter-node communication by leveraging AWS’s specialized network infrastructure and Amazon EC2 topology information to speed up your distributed training workload. For your reference, we have published a paper describing the internal design and the science behind SMDDP.
In this post, you will see how SMDDP can deliver up to a 30% speedup when training EfficientNet, a state-of-the-art model for computer vision tasks, in comparison to Horovod. We will first go through an overview of EfficientNet and SMDDP, and then walk step by step through adapting existing EfficientNet code that uses Horovod with TensorFlow to use SMDDP instead. We will wrap up with performance measurements that illustrate the benefits of SMDDP. By the end of this post, you should be able to use SMDDP to speed up the training of your own models!
EfficientNet Overview
One of the key challenges with convolutional neural networks is scaling the networks, i.e., increasing the number of model parameters to yield higher accuracy. The most common strategy for scaling CNNs has been to build deeper models with more layers. Not surprisingly, winning entries in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) employed increasingly deep CNNs over the years, from the 8-layer AlexNet in 2012 to the 152-layer ResNet in 2015.
However, scaling up CNNs in this manner is tedious, requiring a lot of fine tuning and experimentation to arrive at a network with the required accuracy and resource footprint. Researchers at Google addressed this problem in their ICML'19 paper, where they developed a principled approach to scaling up CNNs, which they termed compound scaling. The key insight behind compound scaling is that neural networks can be scaled along three dimensions:
- Depth: Increasing the number of layers in the network, which is the principal scaling method used in ResNets.
- Width: Increasing the number of neurons in a single layer or, more specifically, the number of filters employed in a convolutional layer.
- Resolution: Increasing the width and height of the input image.

Compound scaling scales a network uniformly along the above three dimensions by a constant ratio, which the authors denote as the compound coefficient ɸ. Using larger compound coefficients yields more accurate, but also more computationally expensive, models.
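Concretely, the paper fixes constants α, β, and γ (found by a small grid search on the baseline network) and scales depth, width, and resolution as

$$
d = \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi},
\quad \text{subject to} \quad \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha, \beta, \gamma \ge 1,
$$

so that the total FLOPS grow by roughly a factor of two for each unit increase of ɸ.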
While the authors showed that compound scaling can be applied to any baseline network architecture, its efficacy is heavily influenced by the choice of that baseline. To this end, the authors used Neural Architecture Search to build an optimized baseline network, EfficientNet-B0. The main building block of this baseline network is the mobile inverted bottleneck block used in MobileNetV2. EfficientNet-B0 achieves 77.1% accuracy on ImageNet with only 5.3 million parameters; in contrast, ResNet-50 offers 76% accuracy while using five times as many parameters. This makes EfficientNet a prime candidate for systems that need high accuracy with low computational overhead, such as autonomous vehicles and security systems. Further, EfficientNet-B0 can be scaled up for higher accuracy using compound scaling to yield EfficientNet-B1 through EfficientNet-B7, with the integer at the end of the name indicating the compound coefficient ɸ. You can read more technical details about EfficientNet in this detailed blogpost.
Distributed Data Parallel Training and SMDDP
The amount of training data available to train models has grown over the years and will continue to grow in the future. As an example, in this post we train EfficientNet using the ImageNet dataset, which has more than a million training images. In our experience working with AWS customers, training data sizes can be much larger with training jobs often using more than 10–15 million training images! With such large training data, the time to run a single epoch (one full cycle over the entire training dataset) of training on a single GPU increases, making training prohibitively long and not amenable to business needs.
One can reduce the training time by using multiple GPUs and leveraging a technique termed data parallelism. The data parallel workflow is as follows:
- A worker process residing on each GPU has its own copy of the model and the training data is sharded amongst the workers.
- On each worker, we run one iteration of training on the corresponding shard of training data and compute gradients using the backpropagation algorithm.
- At the end of each iteration, all workers exchange their locally computed gradients using an AllReduce algorithm and compute globally averaged gradients, which are then used to update the local copies of the model.
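To make the workflow concrete, here is a minimal, purely illustrative Python sketch of one data parallel iteration for a linear model. It is not how SMDDP or Horovod are implemented; it only simulates the data sharding and the gradient averaging that an AllReduce step performs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers = 4
w = np.zeros(3)                                       # each worker holds an identical model copy
X, y = rng.normal(size=(64, 3)), rng.normal(size=64)  # full training set
shards = np.array_split(np.arange(64), num_workers)   # shard the data across workers

def local_gradient(w, idx):
    """Gradient of the mean squared error on one worker's shard."""
    Xs, ys = X[idx], y[idx]
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

# One iteration: each worker computes a local gradient, then AllReduce
# averages them so that every worker applies the same update.
local_grads = [local_gradient(w, idx) for idx in shards]
global_grad = np.mean(local_grads, axis=0)            # what AllReduce computes
w -= 0.1 * global_grad                                # identical update on every worker
```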
During distributed training, the AllReduce step that we saw above involves communication of gradients across workers over the network. For state-of-the-art models such as EfficientNet that have more than a million parameters, exchanging gradients over the network induces significant communication overhead, due to large model gradients contending for limited network bandwidth between instances. Apart from slowing down training, the communication overhead limits the scalability of distributed data parallel training.
Ideally, we would all love linear scaling efficiency, where training speed improves proportionally to the number of GPUs used for training. In practice, communication overhead becomes a barrier to achieving linear scaling efficiency and also leaves expensive GPU resources under-utilized.
Our team at AWS recognized this key issue and developed the SageMaker distributed data parallel library (SMDDP) to provide near-linear scaling efficiency, achieving faster training with minimal code changes. The library leverages AWS infrastructure such as Elastic Fabric Adapter (EFA) and Amazon EC2 topology information to implement a custom, optimized AllReduce algorithm. SMDDP also uses CPUs instead of GPUs (other communication libraries like NCCL only use GPUs) to perform AllReduce, affording more GPU cycles for computing gradients. This allows a greater degree of overlap between the backward pass and the communication of gradients, reducing training time. For a deep dive on the custom AllReduce algorithm, refer to our publication on SageMaker distributed data parallelism.
Training EfficientNet with SMDDP
For this blog post, we fork a multi-GPU implementation of EfficientNet provided by NVIDIA that uses Horovod with TensorFlow. On top of this, we make minor code changes to use the AWS SMDDP library instead of Horovod. SMDDP exposes an API similar to Horovod's, which makes adapting a Horovod training script to SMDDP straightforward for users who are familiar with the Horovod APIs. For your convenience, we have published the entire training script in our SMDDP-Examples GitHub repository. Below, we take you through an overview of the key changes required.
- Import SMDDP’s TensorFlow client instead of Horovod’s TensorFlow client and initialize it.
# Import SMDDP client instead of Horovod's TensorFlow client
# import horovod.tensorflow as hvd
import smdistributed.dataparallel.tensorflow as sdp
# Initialize the SMDDP client instead of Horovod client
# hvd.init()
sdp.init()
- The EfficientNet training script uses the rank() API provided by Horovod to obtain the global rank (the logical global process number) of the worker. This is needed by several operations such as sharding the dataset, checkpointing the model, and logging performance metrics. An example is shown below.
# Checkpoint only on rank 0
# Replace hvd.rank() calls with sdp.rank() as illustrated below
# if model_checkpoint and hvd.rank() == 0:
if model_checkpoint and sdp.rank() == 0:
    ckpt_full_path = os.path.join(model_dir, 'model.ckpt-{epoch:04d}')
    callbacks.append(tf.keras.callbacks.ModelCheckpoint(
        ckpt_full_path, save_weights_only=True, verbose=1,
        save_freq=save_checkpoint_freq))
- Wrap the optimizer in SMDDP's DistributedOptimizer class instead of Horovod's DistributedOptimizer class.
# Replace Horovod's DistributedOptimizer class with SMDDP's DistributedOptimizer
# optimizer = hvd.DistributedOptimizer(optimizer,
#                                      compression=hvd.Compression.fp16)
optimizer = sdp.keras.DistributedOptimizer(optimizer,
                                           compression=sdp.Compression.fp16)
- The training script uses a Keras callback to broadcast the initial model variables from the leader rank (rank 0) to all other workers. Replace Horovod's callback API with SMDDP's callback API.
# Replace Horovod's BroadcastGlobalVariablesCallback callback with
# SMDDP provided BroadcastGlobalVariablesCallback callback
# callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
callbacks=[sdp.keras.callbacks.BroadcastGlobalVariablesCallback(0)]
- The training script uses allreduce() calls, particularly in the validation phase, to distribute the evaluation of the trained model and collect statistics such as accuracy. Replace Horovod's allreduce() calls with SMDDP's oob_allreduce() (out-of-band AllReduce) calls. Note that SMDDP offers both allreduce() and oob_allreduce() APIs. The allreduce() API must be used only on gradient tensors; for non-gradient tensors such as statistics, use the oob_allreduce() API.
# Replace Horovod's allreduce() call with SMDDP's oob_allreduce() call.
# SMDDP's oob_allreduce() does an average reduce operation by default.
# stats['training_accuracy_top_1'] = float(hvd.allreduce(tf.constant(
#     train_hist['categorical_accuracy'][-1], dtype=tf.float32),
#     average=True))
stats['training_accuracy_top_1'] = float(sdp.oob_allreduce(tf.constant(
    train_hist['categorical_accuracy'][-1], dtype=tf.float32)))
Training EfficientNet Using SMDDP on SageMaker
Now that we have adapted the EfficientNet training script to use SMDDP, we next proceed to train EfficientNet on Amazon SageMaker. For your convenience, we have developed a detailed example notebook to walk you through the entire process of training EfficientNet on SageMaker. We recommend launching a SageMaker notebook instance to run the example notebook without having to do any setup. The following is an overview of some of the most important steps.
- Prepare the ImageNet dataset as a collection of TFRecords. A TFRecord file is a sequence of binary records encompassing the training data, serialized using Google's protocol buffers format (a minimal sketch of writing one record is shown below). You can follow the steps to download and convert the ImageNet dataset to TFRecords format and upload it to an Amazon S3 bucket. For large datasets like ImageNet, we recommend using Amazon FSx as your file system. An FSx file system significantly cuts down training start-up time on SageMaker because it avoids downloading the training data each time you start a training job (as happens with S3 input for a SageMaker training job), and it also provides better data I/O throughput. The example notebook has steps to create an FSx file system linked with the S3 bucket that holds the ImageNet TFRecords.
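As a point of reference, here is a self-contained sketch of serializing one image and its label into a TFRecord with TensorFlow. The feature keys and output file name below are illustrative assumptions; the conversion script you use may choose different ones.

```python
import tensorflow as tf

def image_example(image_bytes, label):
    # Feature keys follow a common ImageNet-to-TFRecord convention; your
    # conversion script's keys may differ.
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "image/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

# Encode a dummy image so the snippet is self-contained; in practice you
# would read the raw JPEG bytes of each ImageNet image from disk.
dummy_jpeg = tf.io.encode_jpeg(tf.zeros([224, 224, 3], dtype=tf.uint8)).numpy()

with tf.io.TFRecordWriter("train-00000-of-01024") as writer:  # illustrative file name
    writer.write(image_example(dummy_jpeg, label=0).SerializeToString())
```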
- By default, SageMaker uses the latest AWS Deep Learning Containers (DLC) image for training. The example notebook has a script that uses the DLC for TensorFlow 2.6 as a base image, installs additional dependencies required for training the EfficientNet model based on NVIDIA NGC containers, and pushes the custom-built Docker container to Amazon ECR. Using the custom Docker container's image URI, you can construct a SageMaker estimator in the next step.
- Use the SageMaker Estimator classes that the SageMaker Python SDK provides to launch a training job. The estimator class allows you to configure parameters to specify the Docker image to use, the number and type of instances, and hyperparameters. See the following example of setting up a SageMaker TensorFlow estimator.
import sagemaker
from sagemaker import get_execution_role
from sagemaker.tensorflow import TensorFlow

sagemaker_session = sagemaker.Session()
role = get_execution_role()  # IAM role assumed by the SageMaker training job
# Configure the hyper-parameters
hyperparameters = {
    "mode": "train",
    "arch": "efficientnet-b4",
    "use_amp": "",
    "use_xla": "",
    "max_epochs": 5,
    "train_batch_size": 64,
    "lr_init": 0.005,
    "batch_norm": "syncbn",
    "mixup_alpha": 0.2,
    "weight_decay": 5e-6,
}
estimator = TensorFlow(
    entry_point="main.py",
    role=role,
    image_uri=docker_image,  # URI of the Docker image pushed to Amazon ECR
    source_dir="./tensorflow/efficientnet",
    instance_count=2,  # number of instances
    instance_type="ml.p4d.24xlarge",
    # Other supported instance types: ml.p3.16xlarge, ml.p3dn.24xlarge
    framework_version="2.6",  # TensorFlow 2.6
    py_version="py38",
    sagemaker_session=sagemaker_session,
    hyperparameters=hyperparameters,
    subnets=["<SUBNET_ID>"],
    # Should be the same subnet used for FSx. Example: subnet-0f9XXXX
    security_group_ids=["<SECURITY_GROUP_ID>"],
    # Should be the same security group used for FSx. Example: sg-03ZZZZZZ
    debugger_hook_config=False,
    # Train using the SMDDP distributed training library
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
# Submit SageMaker training job
# data_channels is the FSx input
estimator.fit(inputs=data_channels, job_name=job_name)
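For reference, data_channels in the fit() call above can be constructed as a dictionary of SageMaker file system inputs. The sketch below shows one way to do this with the SageMaker Python SDK; the file system ID and directory path are placeholders, and the example notebook walks through the exact setup for your FSx file system.

```python
from sagemaker.inputs import FileSystemInput

# Placeholder IDs/paths; point these at the FSx for Lustre file system that
# is linked to the S3 bucket holding your ImageNet TFRecords.
fsx_input = FileSystemInput(
    file_system_id="fs-0123456789abcdef0",
    file_system_type="FSxLustre",
    directory_path="/<mount-name>/imagenet/tfrecords",
    file_system_access_mode="ro",
)
data_channels = {"train": fsx_input}
```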
Performance Comparison
We compare the performance of SMDDP against Horovod for training EfficientNet on SageMaker. We use multiple ml.p4d.24xlarge instances for training. Each ml.p4d.24xlarge instance comes with 8 NVIDIA A100 GPUs and 400 Gbps of instance networking with support for EFA and GPUDirect RDMA (Remote Direct Memory Access). We pick training hyperparameters such as batch size and number of epochs based on the scripts provided in the NVIDIA DeepLearningExamples repository. Note that training for the same number of epochs with either Horovod or SMDDP yields the same set of parameters, since the libraries orchestrate only the communication of gradients. We present performance results for two variants of EfficientNet: EfficientNet-B0 with 5.3 million parameters and EfficientNet-B4 with 19 million parameters.
NVIDIA A100 GPUs support training with Automatic Mixed Precision (AMP). SMDDP supports AMP out of the box: when gradients are generated in FP16, SMDDP automatically reduces them in FP16 as well. When EfficientNet-B0 is trained with AMP, SMDDP offers up to 25% better performance than Horovod. With 8 ml.p4d.24xlarge instances, Horovod's scaling efficiency drops to 94%, whereas SMDDP maintains a scaling efficiency of more than 97%.
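The training script toggles AMP through its use_amp hyperparameter. For reference, the sketch below shows one common way to enable mixed precision directly in TensorFlow Keras; it is a generic illustration, not the script's exact mechanism.

```python
import tensorflow as tf

# Compute in float16 while keeping variables in float32. When this policy is
# active, Keras wraps the optimizer in a LossScaleOptimizer at compile time.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
print(model.layers[0].compute_dtype)  # float16
print(model.layers[0].dtype)          # float32 (variable dtype)
```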

When XLA (Accelerated Linear Algebra) is used to train EfficientNet-B0, the performance benefit of SMDDP over Horovod drops to about 7%. A key design aspect of data parallel libraries like Horovod and SMDDP is to overlap the communication of gradients with their computation during backpropagation, which hides the high communication overhead and improves performance. Since XLA fuses GPU kernels to optimize performance, an unintended consequence for data parallel training is that it reduces the opportunity to overlap computation and communication. We recommend that ML scientists and developers evaluate training performance with and without XLA compilation to identify the best choice for the model at hand.
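The EfficientNet script exposes XLA through its use_xla hyperparameter. More generally, XLA can be toggled per function in TensorFlow via jit_compile; the snippet below is a small illustrative sketch of compiling the same computation with and without XLA so you can compare throughput on your own model.

```python
import tensorflow as tf

dense = tf.keras.layers.Dense(128)
x = tf.random.normal([64, 256])
_ = dense(x)  # build the layer's weights eagerly before compiling

@tf.function(jit_compile=True)  # XLA-compiled version
def step_xla(inputs):
    return tf.reduce_sum(dense(inputs))

@tf.function  # regular (non-XLA) version
def step(inputs):
    return tf.reduce_sum(dense(inputs))

# Time both variants on your hardware to decide whether XLA helps your model.
print(step_xla(x).numpy(), step(x).numpy())
```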

We observe similar results with EfficientNet-B4 with SMDDP yielding about 16% better performance than Horovod along with better scaling efficiencies.

However, when XLA is used for training EfficientNet-B4, the performance benefit of SMDDP over Horovod increases to nearly 30%.

The results show that SMDDP can achieve up to a 30% improvement in training throughput compared to Horovod. This means you can train your model to convergence faster and reduce the billable time spent on expensive GPU resources. Best of all, this is possible with the straightforward changes to the training script that we went through above.
Conclusion
In this blog post, you learned how to use the SageMaker distributed data parallel (SMDDP) library to speed up and scale the training of EfficientNet, a state-of-the-art neural network architecture for computer vision tasks. SageMaker and SMDDP simplify and speed up model training, enabling ML scientists and developers to innovate faster. We showed how you can adapt an existing EfficientNet training script to SMDDP with a few lines of code changes and achieve up to a 30% improvement in performance.
We have several other PyTorch and TensorFlow examples available for you to further experiment with SMDDP. We also encourage you to take what you have learned here and use SMDDP to accelerate the training of your own models. For any issues or feedback, you can raise an issue in the SMDDP-Examples GitHub repository.