TensorFlow Performance Analysis

Chaim Rand
Towards Data Science
23 min read · Aug 23, 2020


In previous posts (here and here), I described how our team uses the Amazon SageMaker and Amazon S3 services to train our deep neural networks on large quantities of data.

In this blog, I would like to discuss how to profile the performance of a DNN training session running in TensorFlow. When speaking of the “performance” of a DNN training session, one may be referring to a number of different things. In the context of this blog, “performance” profiling will refer to analysis of the speed at which the training is performed (as measured, for example, by the training throughput in iterations per second), and the manner in which the session utilizes the system resources to achieve this speed. We will not be referring to the performance of the model being trained, often measured by the loss or metric evaluation on a test set. An additional measure of performance is the number of batches required until the training converges. This is also out of the scope of this blog.

In short, if you are trying to figure out why your training is running slowly, you have come to the right place. If you are looking for ways to improve the accuracy of your MNIST model, or for the optimizer settings that will accelerate convergence, you have not.

The examples we will review were written in TensorFlow and run in the cloud using the Amazon SageMaker service, but the discussion we will have is equally applicable to any other training environment.

Prelude

Any discussion of profiling your training performance requires that we be clear about what the goal is, or, in other words, what utility function we are trying to optimize. Your utility function will likely depend on a number of factors, including the number of training instances at your disposal, the cost of those instances, the number of models you need to train, project scheduling constraints and more.

In order to have a meaningful discussion, we will make some simplifying assumptions. Our goal will be to maximize the throughput of a training session, given a fixed training environment, without harming the quality of the resultant model, or increasing the number of training samples required for convergence.

The goal, as stated, includes some ambiguities that we will promptly explain.

Training Instance

Our first simplifying assumptions are that the training is being performed on a single instance/machine and that the instance type is fixed.

Naturally, different models perform differently on different types of machines. In an ideal situation, we would always be able to choose a machine that is optimal for the model being trained, that is, a machine on which all resources would be fully utilized. That way, we would avoid paying for resources that are not being used. Unfortunately, in the real world, we are usually faced with a fixed set of instance types to choose from. For example, Amazon SageMaker offers a wide variety of instance types that differ in the types and number of GPUs, the number of CPUs, the network properties, memory size and more. On the other hand, one does not have the ability to freely choose (based on the properties of their model) a machine with a specific number of CPUs, a specific GPU and a specific network bandwidth.

In order to choose the most appropriate training instance, one must carefully weigh how well a model is suited to different training instances against considerations such as the cost and availability of those instances, as well as scheduling requirements. This requires a comprehensive analysis of the maximum achievable training performance of the model on each of the different instance types, as we describe in this post.

We will limit our discussion to instance types with a single GPU. Specifically, we will work on machines with an NVIDIA® V100 Tensor Core GPU. If you are using the Amazon SageMaker service for training, this is the ml.p3.2xlarge instance type.

Training Framework

There are many different libraries and frameworks available for training DNN models. As before, the training performance of a fixed model, a fixed training algorithm and fixed HW will vary across different SW solutions. As highlighted in the title of the post, our focus will be on the TensorFlow training environment. Even within this environment, performance may depend on a number of factors, such as the framework version, whether you choose a high level API, such as tf.estimator or tf.keras.model.fit, or implement a custom training loop, and the manner in which you feed your training data into the training pipeline. Our assumptions in this post will be that the training is performed in TensorFlow 2.2, using the tf.keras.model.fit() API, and that the data is fed using the tf.data APIs.

Rules and Constraints

Were there no constraints, speeding up the training throughput would be a piece of cake. For example, we could simply reduce the size of our model architecture and the training would fly. Of course, this is likely to kill the ability of the training to converge, and the resultant model predictions may be useless. Obviously, we must ensure that any actions we take to accelerate the training do not harm the quality of the resultant model.

For our discussion, we introduce an additional constraint, which is, that we do not want to impact the overall time required for training convergence, as measured by the overall number of training samples fed into the network until convergence. In other words, we make the simplifying assumption that our model requires a fixed number of training samples to converge and that the options we have to optimize will not change that fixed number. Based on that assumption, and given a fixed instance and SW environment, our goal is to optimize the overall amount of time (in seconds) that the training takes, or, alternatively, the training throughput, measured in the number of samples being processed by the training loop per second.

We emphasize that this is an artificial constraint we introduce to simplify the discussion. There certainly may be situations where a change to the model architecture would increase the number of samples required for convergence, while reducing the overall time to convergence. For example, reducing the bit precision of some of the variables is likely to increase the throughput, but also might require more steps to converge. If the overall time to train is reduced, and the resultant model predicts just as well, then this would be a great optimization. Another important example is increasing the batch size. Increasing the training batch size is one of the basic ways to increase throughput. However, there are some models for which the training quality may be sensitive to the batch size. We make the simplifying assumption that such an impact is negligible and can be overcome by tuning other hyper-parameters (such as the learning rate).

Measurement Units

The unit we will use to measure throughput is the average number of samples processed by the training loop per second. An alternative, more common unit of measurement for throughput is the average number of training iterations per second, also called steps per second or batches per second. However, since this measurement depends on the chosen (global) batch size, it makes comparing runs that use different batch sizes difficult.

To extract this measurement, you can either rely on the logs that are printed by TensorFlow (at the end of every epoch), or implement the measurement on your own (e.g. with a custom TensorFlow callback).
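
For example, here is a minimal sketch of such a custom callback, assuming a fixed global batch size (the constant name and the once-per-epoch reporting scheme are my own, not part of any library):

import time
import tensorflow as tf

# a minimal sketch of a throughput-measuring callback;
# GLOBAL_BATCH_SIZE is an assumption - set it to your own batch size
GLOBAL_BATCH_SIZE = 64

class ThroughputCallback(tf.keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.num_batches = 0
        self.start_time = time.time()

    def on_train_batch_end(self, batch, logs=None):
        self.num_batches += 1

    def on_epoch_end(self, epoch, logs=None):
        elapsed = time.time() - self.start_time
        throughput = self.num_batches * GLOBAL_BATCH_SIZE / elapsed
        print('epoch {}: ~{:.1f} samples per second'.format(epoch, throughput))

# usage: model.fit(ds, epochs=..., callbacks=[ThroughputCallback()])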

Optimization Strategy

Our main method for increasing training throughput is by increasing the utilization of the resources on our training instance. In theory, any underutilized resource is a potential opportunity to increase throughput. However, practically speaking, once the GPU, the most expensive and important system resource, is fully utilized, and assuming we are satisfied with the performance of our GPU operations, we will consider our mission accomplished. We could, in theory, consider offloading some of the training to the CPU, but the performance gain would likely be negligible (if there is any at all) — certainly not worth the headache. Underutilization could be a result of the pipeline not reaching its maximum capacity (e.g. all of the resources are under-utilized). In this case, increasing the throughput is fairly easy, (e.g. simply increase the batch size). If the pipeline is at maximum capacity, under-utilization of a system resource is typically the result of a bottleneck in the pipeline. For example, the GPU might be idle while it waits for training data to be prepared by the CPU. In such cases, we will attempt to increase the throughput by removing the bottleneck. For example, we might introduce parallelization between the CPU and GPU so that the GPU never has to wait for data.

The general strategy is to perform the following two steps iteratively until we are satisfied with the throughput, or until the GPU is fully utilized:

1. Profile the training performance to identify bottlenecks in the pipeline and under-utilized resources

2. Address bottlenecks and increase resource utilization

Note that, at any given time, there may be more than one bottleneck in the pipeline. The CPU might be idle as it waits to receive data from the network, and the GPU might be idle as it waits for the CPU to prepare the data for training. Additionally, different bottlenecks might pop up at different stages of the training. For example, the GPU might be fully utilized but idle only during iterations when we save a checkpoint. Such bottlenecks, even if only periodic, could still potentially have a significant impact on the average throughput and should be addressed.

Why is Performance Profiling Important?

This should be abundantly clear by now, but sometimes the obvious requires stating. All too often, I have encountered developers who are perfectly content with 40% GPU utilization. It makes me want to scream. The GPU is (usually) the most expensive resource you are using; if you are happy with any less than 90% utilization, you are wasting your (your company’s) money. Not to mention that you could probably be delivering your product much sooner.

It’s all about the money. Training resources are expensive, and if you are not maximizing their utilization, you are committing a crime against humanity.

I hope you are convinced of the importance of profiling your training performance and maximizing resource utilization. In the next section we will review some of the potential bottlenecks in a typical training pipeline. We will then survey some of the tools at our disposal for identifying these bottlenecks.

The Training Pipeline and Potential Bottlenecks

In order to facilitate the discussion on the possible bottlenecks within a training session, we present the following diagram of a typical training pipeline. The training is broken down into eight steps, each of which can potentially impede the training flow. Note that while in this diagram training is performed on multiple GPUs, in this post, as mentioned above, we will assume that there is just one GPU.

Sample Training Pipeline (by author)

Raw Data Input

Unless you are auto-generating your training data, you are likely loading it from storage. This might be from local disk, or it might be over the network, from a remote storage location. In either case, you are using system resources that could potentially block the pipeline. If the amount of raw data per training sample is particularly large, if your IO interface has high latency, or if the network bandwidth of your training instance is low, you may find your CPU sitting idle as it waits for the raw data to come in. A classic example of this is if you train with Amazon SageMaker using “filemode”. In “filemode” all of the training data is downloaded to local disk before the training even starts. If you have a lot of data, you could be waiting a while. The alternative Amazon SageMaker option is to use “pipemode”. This allows you to stream data directly from S3 into your input data pipeline, thus avoiding the huge bottleneck to training startup. But even in the case of “pipemode” you can easily run up against resource limitations. For example, if your instance type supports network IO of up to 10 Gigabits per second, and each sample requires 100 Megabits of raw data, you will have an upper limit of 100 training samples per second, no matter how fast your GPU is. The way to overcome such issues is to reduce the amount of raw data, compress some of the data, or choose an instance type with a higher network IO bandwidth.

In the example we gave, the limitation came from the network IO bandwidth of the instance, but it could also come from a limit on the amount of data that you can pull from S3, or from somewhere else along the line. (Side note: if you are pulling data from S3 without using pipe mode, make sure to choose an instance type with the Elastic Network Adapter enabled.)

Data Preprocessing

The next step in the training pipeline is the data pre-processing. In this stage, typically performed on the CPU, the raw data is prepared for entry to the training loop. This might include applying augmentations to input data, inserting masking elements, batching, filtering and more. The tf.data APIs include built-in functionality for parallelizing the processing operations within the CPU (e.g. the num_parallel_calls argument of tf.data.Dataset.map), and also for running the CPU in parallel with the GPU (e.g. tf.data.Dataset.prefetch). However, if you are running heavy, or memory intensive, computation in this stage, you might still find yourself with your GPU idle as it waits for data input.
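
For example, a typical pattern looks something like the sketch below, where parse_and_augment and batch_size are placeholders for your own preprocessing function and batch size:

import tensorflow as tf

# ds is assumed to be an existing tf.data.Dataset of raw records
ds = ds.map(parse_and_augment,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.batch(batch_size)
# overlap the CPU preparation of upcoming batches with the GPU computation
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)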

Update: Be sure to check out my recent post, on the topic of performance bottlenecks in the data preprocessing pipeline.

CPU to GPU Data Transfer

In most cases, your CPU and GPU use different memory, and the training samples need to be copied from CPU memory to GPU memory before the training loop can run. This too could potentially result in a bottleneck, depending on the size of your data samples and the interface bandwidth. One thing you should consider, wherever possible, is to hold off on casting to higher bit representation (tf.cast()) or decompressing bit masks (tf.one_hot), until after the data is copied to GPU memory.
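
One way to do this, sketched below under the assumption of uint8 image inputs (the input shape and normalization are placeholders), is to keep the samples in their compact representation throughout the input pipeline and perform the cast in the first layer of the model, after the data has already been copied to GPU memory:

import tensorflow as tf

# keeping the samples in uint8 in the tf.data pipeline makes the
# host-to-device copy 4x smaller than it would be in float32
inputs = tf.keras.Input(shape=(512, 512, 3), dtype=tf.uint8)  # placeholder shape
# cast (and normalize) on the GPU, after the copy has completed
x = tf.keras.layers.Lambda(lambda t: tf.cast(t, tf.float32) / 255.0)(inputs)
# ... the rest of the model is built on top of x ...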

GPU Forward Backward

The heart of the training pipeline is the forward and backward passes. This step is performed on the GPU. Since the GPU is the most expensive resource, we want it to be constantly active and at peak performance. In most cases, the average throughput, in number of samples trained per second, increases as we increase the batch size, so we will try to increase the batch size up to the limit of the GPU’s memory capacity.

Obviously, the throughput of this step is a function of your model architecture and your loss function. There are quite a number of common methods for reducing the computation. Here is a very small sample of them:

· prefer conv layers over dense layers

· replace large convolutions with a series of smaller ones with the same receptive field

· use low precision or mixed precision variable types (see the sketch following this list)

· consider using TensorFlow native functions instead of tf.py_func

· prefer tf.where over tf.cond

· research how your model and layer settings, such as memory layout (channels first or last), and memory alignment (layer input and output size, number of channels, shapes of convolutional filters, etc) impact your GPU performance and design your model accordingly

· customize the graph optimization
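
As an illustration of the mixed precision item above, TensorFlow 2.2 exposes an experimental Keras mixed precision API; a minimal sketch is shown below. Whether it actually helps depends on your model and on Tensor Core support, and it typically requires keeping the output layer in float32:

from tensorflow.keras.mixed_precision import experimental as mixed_precision

# enable mixed precision (TF 2.2 experimental API); on a V100 this allows
# matmuls and convolutions to run on Tensor Cores in float16
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)

# the model's output layer should usually be kept in float32, e.g.:
# tf.keras.layers.Activation('softmax', dtype='float32')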

One additional thing to look out for is graph operations that are not supported by the GPU and are, therefore, offloaded back onto the CPU. Occasionally, you might find that in the middle of your train step, the GPU will send data back to the CPU for processing, and wait until the processed data is returned before resuming computation. Needless to say, this will kill your throughput. In such cases, you should use profiling tools to pinpoint the problematic operations and make the necessary adjustments so that they run on the GPU.
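
One simple way to spot such operations is to turn on device placement logging before building the model, which prints the device assigned to every op; a minimal sketch:

import tensorflow as tf

# log the device (CPU or GPU) on which each op is placed; call this before
# constructing the model, then inspect the log for unexpected CPU placements
tf.debugging.set_log_device_placement(True)

The profiler trace-viewer, discussed below, will also show such operations appearing on the CPU timeline in the middle of a GPU step.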

Gradient Sharing

This step is only performed when running distributed training on multiple GPUs, either on a single training instance or on multiple instances. We leave the discussion of distributed training for a future post, but will note that this step can also potentially introduce a bottleneck. During distributed training, each GPU must collect the gradients from all of the other GPUs. Depending on the distribution strategy, the number and size of the gradients, and the bandwidth of the communication channel between GPUs, a GPU may find itself idle while it collects the gradients. To solve such issues, one might reduce the bit precision of the gradients, tune the communication channel, or consider other distribution strategies.

GPU to CPU Data Transfer

During training, the GPU will return data to the CPU. Typically, this will include the loss and metric results, but may, periodically, also include more memory intensive output tensors or model weights. As before, this data transfer can potentially introduce a bottleneck at certain phases of the training, depending on the size of the data and the interface bandwidth.

Model Output Processing

The CPU might perform some processing on the output data received from the GPU. This processing typically occurs within TensorFlow callbacks. These can be used to evaluate tensors, create image summaries, collect statistics, update the learning rate and more. There are different ways in which this could reduce the training throughput:

· If the processing is computation or memory intensive, it may become a performance bottleneck. If the processing is independent of the model’s GPU state, you might want to try running it in a separate (non-blocking) thread.

· Running a large number of callbacks could also bottleneck the pipeline. You might want to consider combining them into a smaller number.

· If your callbacks are processing output on every iteration, you are also likely to be slowing down the throughput. Consider reducing the frequency of the processing, or adding the processing to the GPU model graph (e.g. using custom TensorFlow metrics, as sketched below).
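
To illustrate the last item, here is a minimal sketch of a custom metric that keeps its running statistic inside the graph, on the GPU, rather than in a Python callback (the statistic itself is just a placeholder):

import tensorflow as tf

class MeanPrediction(tf.keras.metrics.Metric):
    # a placeholder statistic, computed inside the graph on the GPU
    def __init__(self, name='mean_prediction', **kwargs):
        super(MeanPrediction, self).__init__(name=name, **kwargs)
        self.total = self.add_weight(name='total', initializer='zeros')
        self.count = self.add_weight(name='count', initializer='zeros')

    def update_state(self, y_true, y_pred, sample_weight=None):
        self.total.assign_add(tf.reduce_mean(tf.cast(y_pred, tf.float32)))
        self.count.assign_add(1.0)

    def result(self):
        return self.total / self.count

# usage: model.compile(..., metrics=[MeanPrediction()])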

CPU to Storage

During the training the CPU might periodically transfer event files, log files, or model checkpoints to storage. As before, a large quantity of data combined with a limited IO bandwidth could potentially lead to latency in the training pipeline. Even if you are careful to make the data transfer non-blocking (e.g. using dedicated CPU threads) you might be using network input and output channels that share the same limited bandwidth (as on Amazon SageMaker instances). In this case, the amount of raw training data being fed on the network input could drop.

One way this could happen is if we collect all of our TensorFlow summaries in a single event file which grows and grows during the course of the training. Each time the event file is uploaded to s3, the amount of data passing on the network increases. When the file becomes very large, the data upload can interfere with the training.

Performance Analysis Tools

Now that we have an understanding of some of the bottlenecks we might face in a training pipeline, let’s review some of the tools and techniques at your disposal for identifying and analyzing them.

Instance Metrics

A good place to start is to assess how the resources of the training instance are being utilized. In Amazon SageMaker these are provided as “Instance Metrics” which are displayed in the “Monitor” section of the training job page, as well as in Amazon SageMaker Studio. Here is an example of how this appears on the training job page:

Instance Metrics (from AWS console)

The Instance Metrics include graphs of the memory utilization, CPU utilization, network in, network out, GPU memory utilization, GPU utilization, and disk utilization, where measurements are displayed at five minute intervals. You can use this to make sure things are running as expected and to get a high level indication of what can be improved. (We refer to this as “high level” because of the relatively infrequent measurements.) For example, you can verify that you are, in fact, running on the GPU, and assess how well you are using it. In addition, you can identify anomalies in the different metrics such as:

· dips in the CPU/GPU utilization which could indicate a bottleneck

· rising CPU memory which could indicate a memory leak

· a choppy network-in could indicate an issue with how we are pulling data from s3

· if the network-in is at maximum capacity (when compared to the training instance properties) this could indicate a bottleneck on the input pipeline

· a rising network-out could indicate that we are uploading a single file that keeps growing instead of uploading small files

· a delay in GPU start time could indicate a lengthy model compilation period

One thing worth noting is that the default behavior of TensorFlow is to take up all of the GPU memory. So don’t be overly alarmed (or happy) that the GPU memory utilization shows 100%. To configure TensorFlow to use only the memory it actually needs, you need to apply the lines of code below.

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

Clicking on the “View Instance Metrics” link will take you to the CloudWatch Management Console, where you can play around a bit with the measurements and graphs. For example, you can zoom in on specific intervals of the training, decrease the measurement interval (to one minute), and display multiple metrics on a single graph.

SageMaker allows you to include your own, custom, metrics in this window as well. These are referred to as “Algorithm Metrics”. For details, see the SageMaker documentation. For example, you can define the training throughput as a metric and have it displayed in this window as well.

This method of performance analysis leaves much to be desired. Aside from the infrequency of the measurements, we receive little information regarding which parts of code require improvement. An underutilized GPU could stem from a low batch size or from a bottleneck. A highly utilized GPU may be a result of an inefficient loss function. Getting more information will require some heavier artillery.

If you are running on your own training instance and not within the SageMaker service, you will need to collect the “instance metrics” yourself using common, off-the-shelf system profilers. In a Linux environment, you can use command line tools such as nmon, perf, strace, top and iotop. If you are using an NVIDIA® GPU, you can use nvidia-smi (with the -l flag).

Performance Profilers

To get a more in-depth picture of how the different stages of your training session are performing, you should use a performance profiler. Different profilers are available for different development frameworks. TensorFlow offers the TensorFlow Profiler for profiling tf models. Your first encounter with a performance profiler can be a bit intimidating. There are often multiple graphs that might appear overloaded and seem confusing at first. It is not always immediately clear how to interpret the wealth of data that is reported. But once you get the hang of it, profilers can be very useful in evaluating resource utilization, identifying bottlenecks and improving model performance.

TensorFlow Profiler

Check out the TensorFlow Profiler Guide and the TensorBoard Profiler Tutorial for instructions on how to use the profiler. When training with tf.keras.model.fit(), the easiest way to integrate profiling is by using the TensorBoard Keras Callback as follows:

tbc = tf.keras.callbacks.TensorBoard(log_dir='/tmp/tb',
                                     profile_batch='20, 24')

This will capture statistics on iterations 20 to 24 and save them to a file that can be opened and analyzed in TensorBoard. (The size of the interval impacts the amount of data captured.)
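
The callback is then passed to fit() like any other Keras callback; for example (train_dataset and the number of epochs are placeholders):

model.fit(train_dataset,
          epochs=number_of_epochs,
          callbacks=[tbc])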

Note that the TensorFlow Profiler callback is limited in that it runs on a range of iterations, not on the whole training session. This means that to evaluate specific iterations (e.g. iterations where you save checkpoints) or to get a sense of how performance changes over time, you will need to run different profiling sessions and compare between them. Alternatively, you could use the more advanced profiling APIs that will allow you to collect statistics more freely.
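
For example, TensorFlow 2.2 includes a programmatic profiling API that lets you start and stop trace collection wherever you choose; a minimal sketch (the log directory is a placeholder):

import tensorflow as tf

# start collecting a trace, run the steps you care about, then stop;
# the resulting trace can be opened in TensorBoard like the callback output
tf.profiler.experimental.start('/tmp/tb')
# ... run the training steps you wish to profile ...
tf.profiler.experimental.stop()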

Programmatic Techniques

Sometimes the best way to get a good feel for how your training performs is to play around with your program and see how different changes impact the performance. Here is a small sample of things you can try:

· You can train with different batch sizes to test the maximum GPU memory capacity and assess the impact on GPU utilization.

· You can loop on the training dataset without performing training, as described here, to measure the maximum throughput of the input pipeline (see the sketch following this list).

· You can train with generated data to assess the maximum GPU throughput.

· You can add and remove different parts of the pre-processing pipeline to evaluate their impact on throughput.

· You can train with different, simple and complex loss functions to evaluate the impact of the loss function on throughput.

· You can train with and without callbacks.

· If you want to evaluate the impact of specific functions, replace them with simple dummy functions.
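
As an example of the dataset-only loop mentioned in the list above, the measurement might look something like this sketch (ds and the number of batches are placeholders):

import time

# iterate over the input pipeline without any training in order to
# measure the maximum throughput the pipeline alone can deliver
num_batches = 100
start = time.time()
for i, batch in enumerate(ds):
    if i + 1 == num_batches:
        break
elapsed = time.time() - start
print('input pipeline: ~{:.1f} batches per second'.format(num_batches / elapsed))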

Another useful programming technique is to simply add prints (e.g. using tf.print()) and timers (e.g. tf.timestamp()) to evaluate the performance of different blocks of code.
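
A minimal sketch of this technique (the timed function is just a placeholder):

import tensorflow as tf

# time a block of code using TensorFlow ops (eager mode);
# some_suspect_block is a placeholder for the code being measured
t0 = tf.timestamp()
result = some_suspect_block(batch)
tf.print('elapsed (seconds):', tf.timestamp() - t0)
# note: inside a compiled tf.function, ops without data dependencies may be
# reordered, so control dependencies may be needed to bracket the block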

Information-Interference Trade-off

The information-interference trade-off refers to the simple observation that the more we change the original pipeline in order to extract meaningful performance data, the less meaningful that data actually is. The more we increase the frequency at which we poll the system for utilization metrics, the more the activity of the profiling itself begins to overshadow the activity of the training loop, essentially rendering the captured data useless.

Finding the right balance is not always easy. A complete performance analysis strategy should include profiling at different levels of invasiveness in order to get as clear a picture as possible of what is going on.

Case Study

In this section, we will demonstrate, in action, some of the potential performance issues we have discussed. Note that many of the examples we will show were inspired by true events: real issues we encountered during our training on AWS. For each example, we will demonstrate how to identify the performance issue by selecting and analyzing some of the profiling measurements.

The model we will use is a deep convolutional network that learns to perform pixel-level segmentation on an input image. Our base model parallelizes the CPU and GPU processing and runs with a batch size of 64. In this case, the training throughput is roughly 100 samples per second. The SageMaker dashboard reports GPU memory utilization of 98% and GPU utilization of between 95% and 100%.

The efficiency of the GPU utilization can also be seen in the trace_viewer of the tf profiler, where we can see that the GPU is almost always active. Recall that the tf profiler was configured to run for five steps, steps 20–24.

tf profiler — trace-viewer (by author using TensorBoard)

For the sake of comparison, we will also note that the Instance Metrics reports network-in of 14.9 GBytes per minute, and network-out of under 100 MBytes per minute.

Small Batch Size

To evaluate the impact of the batch size on the performance, we reduce the batch size to 2, leaving all other model properties unchanged. The throughput drops all the way down to 40 samples per second. The effect on GPU utilization and GPU memory utilization is immediately noticeable from the “instance metrics” where we see a significant drop, down to around 60% and 23%, respectively. Naturally, each iteration takes a lot less time, but the percentage of the time during which the GPU is active is much lower.

The tf profiler step time graph shows that the small batch size leads to over half the time being spent loading kernels to the GPU.

Small Batch Size — tf profiler Step-time Graph (by author using TensorBoard)

The trace_viewer shows that while the GPU remains unblocked by the input, the tf ops appear to be sparsely scattered across the computation block of each step.

Small Batch Size — tf profiler trace-viewer (by author using TensorBoard)

In TensorFlow 2.3, a new Memory profiler tool was introduced that allows you to identify underutilization of the GPU memory and get an indication of whether you can safely increase the training batch size.

Network Input Bottleneck

In this example, we will artificially introduce a bottleneck on the network input. We will do this by dropping 9 out of every 10 input records, so that we require 10 times as much input data on the network to maintain the same throughput. Here is the code used to perform this exercise:

# id generator
def id_gen():
    e = 0
    while True:
        yield e
        e += 1

# add the id to the feature dict
def add_id_label(f, i):
    f['id'] = i
    return f

# create a dataset of indices and zip it with the input dataset
ids = tf.data.Dataset.from_generator(id_gen, tf.int64)
ds = tf.data.Dataset.zip((ds, ids))
# add the id to each feature dict
ds = ds.map(add_id_label,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
# filter out 9 out of every 10 records
ds = ds.filter(lambda f:
    tf.squeeze(tf.equal(tf.math.floormod(f['id'], 10), 0)))

From the instance metrics, we can see that the network-in caps out at 33.4 GBytes per minute, only a bit more than twice the volume of the normal run (14.9 GBytes) despite the fact that we need ten times as much data. Throughput drops to 22 samples per second. Unsurprisingly, our program is highly input bound. The tf profiler reports that, of the total step time, 77.8% is spent waiting for data.

NetworkIn Bottleneck — tf profiler Step-time Graph (by author using TensorBoard)

The trace_viewer clearly shows the GPU sitting idle for the majority of each step as it waits for data from the tf_data_iterator_get_next op.

NetworkIn Bottleneck — tf profiler trace-viewer (by author using TensorBoard)

Overloaded Pre-processing Pipeline

TensorFlow offers ways to maximize the parallelization of the data processing (as demonstrated in the TensorBoard Profiler Tutorial), but this does not absolve you from optimizing your input data processing pipeline. If your data processing is resource intensive, it may impact the performance of your model, and you might want to consider moving some of your processing offline. If the heavy operation is GPU friendly (e.g. a large convolution), you can also consider performing the operation on the GPU using the experimental map_on_gpu function (but keep in mind the overhead of the data transfer). Another option is to choose a training instance with more CPU cores.

In this example, we will simulate an overloaded pre-processing pipeline by applying a separable_conv2d with a 7x7 filter to the input frame.
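
For reference, the simulation might be implemented along the lines of the sketch below (the filter shapes and the three-channel input are assumptions, not the exact code we ran):

import tensorflow as tf

# a random 7x7 separable convolution applied on the CPU, per input frame
depthwise_filter = tf.random.normal([7, 7, 3, 1])
pointwise_filter = tf.random.normal([1, 1, 3, 3])

def heavy_preprocessing(frame, label):
    x = tf.expand_dims(tf.cast(frame, tf.float32), 0)  # add a batch dimension
    x = tf.nn.separable_conv2d(x, depthwise_filter, pointwise_filter,
                               strides=[1, 1, 1, 1], padding='SAME')
    return tf.squeeze(x, 0), label

ds = ds.map(heavy_preprocessing,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)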

The throughput drops to just 25 samples per second, and the maximum GPU utilization to 36%. The CPU utilization, on the other hand, jumps from 66% to 96%. Once again, the program is highly input bound, and the trace-viewer shows large chunks of GPU idle time.

Preproc Bottleneck — tf profiler trace-viewer (by author using TensorBoard)

Careful analysis of the CPU section of the trace-viewer (not shown here) shows the separable convolution taking up large chunks of the compute time.

Bottleneck on the CPU-GPU Data Interface

In this example, we artificially increase the input data being passed to the GPU by blowing up the size of the input frame by a factor of 10. On the CPU side we simply replicate the input frame 10 times (using tf.tile()). On the GPU we receive the enlarged input frame, but immediately discard the added data.
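
A sketch of how such a simulation might look (the channels-last layout and the frame shape are assumptions, not the exact code we ran):

import tensorflow as tf

# CPU side: replicate each frame 10 times along the channel axis
def blow_up(frame, label):
    return tf.tile(frame, [1, 1, 10]), label

ds = ds.map(blow_up, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# GPU side: immediately discard the added data in the first layer of the model
inputs = tf.keras.Input(shape=(512, 512, 30))   # placeholder shape
x = tf.keras.layers.Lambda(lambda t: t[..., :3])(inputs)
# ... the rest of the model is built on top of x ...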

The throughput in this case drops to 84 samples per second, and the bottleneck is clearly evident on the trace-viewer.

H2D bottleneck — tf profiler trace-viewer (by author using TensorBoard)

Notice how, for each step, the size of the block of Stream #20(MemcpyH2D) has grown, and how the GPU compute remains idle until the block has completed.

Inefficiency in GPU

In this example, we simulate the impact of an inefficient graph, by applying an 11x11 convolution filter to the output of the model before calculating the loss function. Unsurprisingly, the throughput drops slightly, to 96 samples per second. The impact on the graph can be viewed on the tf profiler tensorflow stats page, where we see that the added operation becomes the most time-consuming operation in the GPU. This table gives us information on the heaviest operations, which we can use to improve the model performance.

Inefficient Graph — tensorflow stats (by author using TensorBoard)

Heavy Post Processing

In this example, we add a callback function that simulates processing the segmentation masks output by the model, by creating and storing 64 random images after every iteration. The first thing we notice is that TensorFlow prints the following warning:

WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.814319). Check your callbacks.

Additionally, the throughput drops to 43 samples per second, the GPU utilization drops to 46%, and tf profiler reports that the GPU is active for only 47% of each time step. The bottleneck is clearly seen on the trace-viewer where we see the GPU idle for the second half of each step.

Postproc Bottleneck — tf profiler trace-viewer (by author using TensorBoard)

Summary

As we have shown, the ability to analyze and optimize the performance of your training sessions can lead to meaningful savings in time and cost. The skills required to perform such analysis should exist in your DNN development team. The analysis should be an integral part of your team’s development methodology and incorporated into your DNN training life cycle. Your development plan should include details such as when to run performance profiling, what tools to use, what types of tests to run, how invasive the tests should be, and more.

In this post, we have barely scratched the surface of the world of performance analysis. There are, no doubt, many more tools and techniques, other kinds of bottlenecks, and other ways to squeeze more performance out of your training resources. Our goal was merely to introduce you to this world and emphasize its importance in your day-to-day training. Best of luck to you!!


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.