Cloud ML Performance Checklist

A Guideline for Optimizing Cloud Based Training

Chaim Rand
Towards Data Science

There are many advantages to training in the cloud. These include access to a wide variety of training instance types, the ability to scale training to multiple instances and multiple parallel experiments without limitation, and a rich ecosystem of features and services that facilitate ML workloads. However, if not managed properly, cloud ML can run up some pretty high costs. While the primary responsibility for developing and enforcing governance policies, monitoring the use and cost of cloud services, negotiating pricing strategies with the cloud service provider (CSP), and so on may fall on your organization’s cloud support team, algorithm developers also have a responsibility to do their part to increase efficiency and reduce waste.

This document describes a checklist of items that we have found useful in guiding algorithm developers towards increasing training efficiency and by extension reducing training costs.

Performance Measurement

Our primary measure of performance is Samples per Dollar, i.e. the number of samples that are traversed by the training loop for every dollar spent. The formula for Samples per Dollar is:

Samples per Dollar = (samples per second * 3600) / (training instance cost per hour)

where samples per second = batch size * batches per second. The hourly training instance cost can be found on your CSP’s pricing page.
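
For example, the computation can be scripted as follows (a minimal sketch; the batch size, throughput, and hourly cost values below are illustrative placeholders):

    # Minimal sketch of the Samples per Dollar computation. The values below
    # are illustrative; substitute your own measurements and your CSP's pricing.
    batch_size = 256              # samples per training step
    batches_per_second = 4.0      # measured over a stable window of training
    hourly_instance_cost = 3.06   # USD per hour, from your CSP's pricing page

    samples_per_second = batch_size * batches_per_second
    samples_per_dollar = samples_per_second * 3600 / hourly_instance_cost
    print(f"Samples per Dollar: {samples_per_dollar:,.0f}")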

While your goal should be to maximize the Samples per Dollar, you sometimes need to account for other considerations. For example, you may decide to distribute training over multiple GPUs in order to accelerate training even though this may decrease the Samples per Dollar. Another example is when maximizing the samples per dollar would hurt the rate of convergence of your training job (e.g. a model that is sensitive to an increase in batch size).

Performance optimization is an iterative process as described in the following image:

Performance Optimization Flow (By Author)

This iterative process should accompany the entire lifecycle of your project.

Performance Analysis Tools

There are many tools and methodologies for analyzing performance. Some of the basic ones include the resource utilization metrics reported by your CSP and the TensorBoard profiler plugin.

This image shows an example of low GPU utilization as reported by the Amazon CloudWatch monitoring tools:

GPU Under-utilization in Amazon CloudWatch (by Author)

The TensorBoard profiler plugin is a powerful tool supporting analysis of many popular machine learning frameworks (including TensorFlow, PyTorch, and JAX). The profiler can appear a bit intimidating at first, but it is pretty easy to set up and is an extremely useful tool for identifying performance bottlenecks. See the profiler documentation for more details.

The following image from the TensorBoard profiler trace viewer shows long periods of GPU idle time indicating a significant bottleneck in the training step.

A Performance Bottleneck Shown with the TensorBoard Profiler Plugin (by Author)

Performance Analysis Checklist

This section includes basic guidance on measuring the current performance of your training job.

  1. Does your training program include periodic reporting of the average number of training samples per second? (See the sketch after this list.)
  2. What is your current measure of Samples per Dollar?
  3. Is the Samples per Dollar measurement relatively stable over the course of training?
  4. Does your training script support an option to enable profiling, e.g. via the TensorBoard callback in TensorFlow? (See the sketch after this list.)
  5. What is the current average GPU memory utilization reported by the CSP’s cloud metrics? Generally speaking (but not always), it is a good idea to maximize GPU memory use by increasing the training batch size. This typically increases the samples per second, as it amortizes the cost of fixed, per-step operations (such as kernel loading and gradient sharing) over more samples. In some cases, the choice of batch size will have implications for GPU memory alignment, which can also impact performance.
  6. What is the current average GPU utilization reported by the CSP’s cloud metrics? Your goal should be to try to increase this as much as possible. 95% utilization is a good target. Anything below 80% implies a bottleneck in your training step and should keep you up at night.
  7. Other system metrics, including CPU and network utilization, may provide some hints into potential performance issues, but are usually a little harder to derive helpful information from. Do any of these metrics indicate a possible bottleneck?
  8. A useful way to assess the performance of your training pipeline is to run your training loop on cached input. This effectively isolates out all of the input pipeline computation, so the resulting Samples per Dollar represents the performance that can be reached when there are no bottlenecks on the data input pipeline. If you are working to solve such a bottleneck, this number should be viewed as your target performance. Does your script support the option of calculating Samples per Dollar on cached input? In TensorFlow, caching the input can be easily accomplished by applying dataset = dataset.take(1).cache().repeat() at the end of the dataset creation (before prefetching). Note that additional insights into potential sources of CPU contention can be derived from applying the cache at other stages of the dataset, which is useful when analyzing a bottleneck on the input pipeline.
  9. The TensorBoard profiler is an extremely helpful tool for performance analysis. It is essential that you make it a regular part of your development cycle. A full overview of its capabilities is beyond the scope of this document, but it is highly advised (and less difficult than you think) to learn how to use it. Did you run (and analyze) the profiler?
  10. Review the profiler summary report. Does it report an input pipeline bottleneck? (Note that if a lot of time is spent on “kernel loading”, this might be an indication of high CPU contention, i.e. a CPU bottleneck.)
  11. Open up the trace viewer. Are there any troublesome patterns such as long periods of idle GPU or frequent CPU to GPU copies?
  12. Review the top 10 TensorFlow GPU ops. Are they what you would have expected? These can sometimes hint at potential optimizations to the model.
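
The following sketch relates to items 1 and 4 above: a simple, hypothetical Keras callback that periodically reports the average number of training samples per second, and a TensorBoard callback configured to profile a short range of training steps. It assumes a tf.keras training loop; adapt as needed for other frameworks.

    import time
    import tensorflow as tf

    # Hypothetical callback that periodically reports training throughput.
    class ThroughputCallback(tf.keras.callbacks.Callback):
        def __init__(self, batch_size, log_every_n_batches=100):
            super().__init__()
            self.batch_size = batch_size
            self.log_every = log_every_n_batches

        def on_train_batch_begin(self, batch, logs=None):
            if batch % self.log_every == 0:
                self.window_start = time.perf_counter()

        def on_train_batch_end(self, batch, logs=None):
            if batch % self.log_every == self.log_every - 1:
                elapsed = time.perf_counter() - self.window_start
                samples_per_second = self.log_every * self.batch_size / elapsed
                print(f"batch {batch}: {samples_per_second:,.0f} samples/sec")

    # TensorBoard callback with profiling enabled for training steps 100-120.
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir='/tmp/tb_logs', profile_batch=(100, 120))

    # model.fit(train_dataset, epochs=10,
    #           callbacks=[ThroughputCallback(batch_size=256), tensorboard_callback])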

Instance Choice

The Samples per Dollar measurement can differ substantially based on the training instance you choose. Choosing the wrong training instance can have dire consequences on cost. Generally speaking, newer generation instances (such as the Amazon EC2 g5 and p4 instance types) include hardware enhancements that improve performance. Another consideration is the ratio of CPU to GPU compute power; a lower ratio can increase contention for CPU resources and result in a CPU bottleneck.

  1. What is your current training instance? Did you consider other alternatives? If you are using older generation GPUs, such as Amazon EC2 p2 or p3, you should have a good reason for it.
  2. Alternative (non-GPU) accelerators such as Amazon EC2 DL1 (based on Habana Gaudi) and Google Cloud TPU often (but not always) offer better price performance than GPUs. However, they sometimes require some adaptation and tuning of your model. Have you attempted to run your model on a dedicated DNN training accelerator? What obstacles are there to using these alternatives?

Training Optimizations

The image below shows the stages of a typical training step. Each stage is a potential source of a performance bottleneck:

The Stages of a Typical Training Step (by Author)

Storage to CPU

The first stage of a training step is loading the raw training samples onto the CPU. The manner in which your data is stored requires careful attention so that this stage does not become a bottleneck. This is especially true if your data is in remote (cloud) object storage such as Amazon S3. Data starvation can be caused by limitations of the instance type’s network-in capacity, the S3 outbound capacity, or CPU contention. The training speed can also be impacted by your choice of data format (e.g. sequential vs. columnar), the size of your data files, the method used for streaming the data (PipeMode, FFM, direct download, etc.), the use of data compression, and more.

One way to analyze the speed of the data streaming is by directly iterating over your dataset and measuring the Samples per Dollar without running the training step. Ideally, the result will be significantly higher than when running with the training step.
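
Here is a minimal sketch of such a measurement, assuming a tf.data.Dataset; the batch size, number of batches, and variable names are placeholders:

    import time
    import tensorflow as tf

    def measure_input_pipeline_throughput(dataset, num_batches=1000, batch_size=256):
        # Iterate over the input pipeline alone (no training step) and report
        # its raw throughput in samples per second.
        start = time.perf_counter()
        for _ in dataset.take(num_batches):
            pass
        elapsed = time.perf_counter() - start
        return num_batches * batch_size / elapsed

    # samples_per_second = measure_input_pipeline_throughput(train_dataset)
    # samples_per_dollar = samples_per_second * 3600 / hourly_instance_cost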

  1. Are your file sizes on the order of ~100 MB?
  2. If you are using Amazon SageMaker, are you taking advantage of FFM or PipeMode to stream your data?
  3. What is the Samples per Dollar when iterating over the dataset without training?
  4. What is the Samples per Dollar when applying data caching at the beginning of the input data pipeline? If it is significantly higher than the value measured without caching, this may indicate a bottleneck in streaming the raw data from storage.
  5. Using data compression can reduce your network bandwidth requirements and also reduce storage costs. On the other hand, data decompression could increase CPU contention. Have you looked into the potential impact of data compression on your training speed?

CPU Bottlenecks

One of the most common causes of GPU under-utilization is a bottleneck on the CPU. Here are some options to consider in case of a bottleneck due to CPU contention:

  1. Are you using basic parallelization techniques such as batch prefetching and num_parallel_calls? (See the sketch after this list.)
  2. Are there any computations that can be moved to the data preparation phase?
  3. Are there any computation layers that can be moved onto the GPU? Note that this will increase overall GPU compute but may result in lower step time.
  4. Might offloading some of the input data pre-processing to auxiliary CPUs help? Note that there are a lot of considerations to take into account, such as the overall cost, the effect on network data traffic, etc.
  5. Have you explored options for optimizing the CPU operations (e.g. using XLA)?
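
As a reference for item 1 above, here is a minimal tf.data sketch that parallelizes per-sample processing and prefetches batches so that input preparation overlaps with the GPU computation. The file list and parsing function are hypothetical placeholders for your own pipeline:

    import tensorflow as tf

    AUTOTUNE = tf.data.AUTOTUNE
    filenames = ['data/train-0000.tfrecord']   # hypothetical list of training files

    def parse_example(record):
        # Hypothetical parsing function; replace with your own record schema.
        features = {'image': tf.io.FixedLenFeature([], tf.string),
                    'label': tf.io.FixedLenFeature([], tf.int64)}
        parsed = tf.io.parse_single_example(record, features)
        image = tf.image.resize(tf.io.decode_jpeg(parsed['image'], channels=3), [224, 224])
        return image, parsed['label']

    dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
    dataset = dataset.map(parse_example, num_parallel_calls=AUTOTUNE)  # parallel CPU work
    dataset = dataset.batch(256)
    dataset = dataset.prefetch(AUTOTUNE)   # prefetch batches ahead of the GPU step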

Increasing GPU Utilization

Even if your GPU is fully utilized, there are steps you can take to increase its performance:

  1. Use mixed precision floats. Current versions of TensorFlow and PyTorch support 16-bit precision floats. When you use TensorFlow mixed precision, the model parameters are stored in full precision float representation whereas the arithmetic operations are performed with low precision floats. Using mixed precision can reduce GPU memory utilization substantially (enabling you to increase the batch size) and increase the samples per dollar. There are two types of 16-bit float representations, float16 and bfloat16. Bfloat16 is the preferred option since it has a dynamic range similar to that of float32. Bfloat16 is supported by modern GPUs (e.g. A100 and A10), Cloud TPU, and Habana Gaudi. If your instance supports bfloat16, it is likely that you can enable mixed precision without any effect on your model’s convergence. With float16, things can get a little trickier, as you may need to employ gradient scaling techniques to ensure convergence. Does your instance support bfloat16? Are you using mixed precision? (See the sketch after this list.)
  2. XLA is a compiler that optimizes GPU computations by fusing together individual operations. In many cases enabling XLA will result in increased samples per dollar. Are you using XLA? What is the impact on performance?
  3. Occasionally your computation graph may include an operation that is not supported by your accelerator. In this case, the op may be offloaded to your CPU. This can decrease your training speed substantially and should be avoided. In TensorFlow such an occurrence can be identified using tf.debugging.set_log_device_placement. This occurrence can also sometimes be identified in the TensorBoard profiler trace viewer where you will see frequent memory copies between the host and the device in the middle of the GPU computation. Did you ensure that all ops are running on the GPU?
  4. DNN dedicated accelerators such as Cloud TPU and HPU (Habana Gaudi) can increase the samples per dollar but may require some remodeling to get the best performance. For example, both TPU and HPU do not perform well on tensors with dynamic shapes (e.g. boolean_mask). Is your model optimized for the accelerator on which you are training?
  5. Are you using tf.py_function for any of your model layers? If so, you may want to look into better-performing alternatives.
  6. The memory format can impact training performance. See, for example, this post on the importance of programming your memory format to the underlying accelerator. What memory format are you using? Is your choice optimal for your training resources?
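
The following sketch demonstrates items 1-3 above in TensorFlow: device placement logging, a mixed precision policy, and XLA compilation. The toy model is a placeholder, and the jit_compile flag requires a recent TensorFlow release (tf.config.optimizer.set_jit is an older alternative):

    import tensorflow as tf

    # Log op placement so that any op silently falling back to the CPU is visible.
    tf.debugging.set_log_device_placement(True)

    # Use 'mixed_bfloat16' on accelerators with bfloat16 support (e.g. A100/A10,
    # Cloud TPU, Gaudi); otherwise use 'mixed_float16' together with loss scaling.
    tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

    # Toy model as a placeholder for your own architecture.
    model = tf.keras.Sequential([tf.keras.layers.Dense(128, activation='relu'),
                                 tf.keras.layers.Dense(10)])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  jit_compile=True)   # compile the training step with XLA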

Multi-GPU Training

Multi-GPU training is a common technique for increasing the overall speed of development. A common way to use multiple GPUs is to perform data-distributed training, in which the global batch size is divided over multiple GPUs, each of which maintains an identical replica of the model. The replicas are kept identical by sharing the gradients at the end of each training step. Ideally, your training performance will scale linearly, i.e. the samples per dollar will not decrease. However, more often than not, the gradient sharing, and particularly the increased network communication traffic it implies, will lower the samples per dollar. There are a number of ways in which this can be mitigated (a minimal data-distribution sketch appears after the checklist):

  1. What is the samples per dollar of your multi-GPU training? How does it compare with the single GPU samples per dollar?
  2. Generally speaking, you should always prefer a single instance with multiple GPUs over multiple instances. Is this the case?
  3. CSPs often include specialized network interfaces that increase the performance of inter-node network communication (e.g. Amazon EFA and Google FastSocket). Are you training on multiple instances? Do the instances that you chose support accelerated inter-node network communication?
  4. Did you explore alternative gradient sharing strategies? The common/default strategy for sharing gradients is Ring all-reduce. You might find that your model benefits from alternative strategies such as Herring.
  5. Modern gradient sharing strategies chunk the gradients and distribute them in parallel with the backward pass. If the TensorBoard profiler trace viewer shows all of the device-to-device communication bunched together at the end of the step, this may indicate a problem.
  6. Gradient sharing APIs often include an option for compressing the gradients to reduce network traffic. Did you consider this option?
  7. Since multi-GPU training is intended to accelerate training, and you may be less concerned with the evaluation time, you may want to consider offloading evaluation to separate, single-GPU instances.
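
As referenced above, here is a minimal sketch of data-distributed training on a single multi-GPU instance using tf.distribute.MirroredStrategy (for multiple instances you would typically use MultiWorkerMirroredStrategy). The toy model and batch sizes are placeholders:

    import tensorflow as tf

    # Data-distributed training on a single instance with multiple GPUs.
    strategy = tf.distribute.MirroredStrategy()
    print(f'Number of replicas: {strategy.num_replicas_in_sync}')

    # Scale the global batch size with the number of replicas.
    per_replica_batch_size = 256
    global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

    with strategy.scope():
        # Toy model as a placeholder for your own architecture.
        model = tf.keras.Sequential([tf.keras.layers.Dense(128, activation='relu'),
                                     tf.keras.layers.Dense(10)])
        model.compile(optimizer='adam',
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

    # model.fit(train_dataset.batch(global_batch_size), epochs=10)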

Project Optimizations

The previous section focused on optimizations to a single training job. Here, we propose optimizations at the project level:

  1. Does your project include callbacks for automated early stopping of failing experiments? (See the sketch after this list.)
  2. Are you using automated hyper-parameter tuning techniques?
  3. Does your development process include deleting unused storage?
  4. Are you using a tool (TensorBoard, Comet, ClearML, Neptune, etc.) to manage your experiments? Efficient experiment management can increase order and reduce duplication.
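
For item 1 above, here is a minimal sketch of automated early stopping using built-in Keras callbacks; the monitored metric and patience value are illustrative choices:

    import tensorflow as tf

    # Stop automatically when the validation loss stops improving.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=5,                   # epochs with no improvement before stopping
        restore_best_weights=True)

    # Stop immediately if the loss diverges (NaN), i.e. a clearly failing experiment.
    nan_stop = tf.keras.callbacks.TerminateOnNaN()

    # model.fit(train_dataset, validation_data=val_dataset,
    #           callbacks=[early_stop, nan_stop], epochs=100)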

Summary

The list shared on this page is by no means all inclusive. Naturally, it should be adapted to the specific needs of your own project. Please feel free to reach out with questions/comments/corrections and, in particular, with additional items that you feel should be included.

I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.