Profiling in Tensorflow 2.x

Prerak Mody
Towards Data Science
5 min read · Dec 29, 2020


Photo by National Cancer Institute on Unsplash

In the last two posts, we discussed design choices for a Tensorflow dataloader using the tf.data.Dataset API and the tf.function decorator to reduce time and memory demands. This post looks at how to use Tensorflow's inbuilt profiler, primarily to check whether our deep learning models sit idle waiting for data. Users with advanced knowledge of Tensorflow can read this post directly and skip the first two.

The profiling feature is once again discussed in the context of automated segmentation of organs in a CT scan of the head and neck (HaN) area. Accurate segmentation is important for targeting radiotherapy at the tumor in a way that minimizes radiation to the surrounding organs.

Installation

Installing Tensorflow has become relatively simple over the years, e.g. with a package manager like Anaconda. But to use Tensorflow's profiler, we need another package called CUPTI (CUDA Profiling Tools Interface), built by Nvidia and leveraged by Tensorflow. To install it, follow either of these steps (on Linux systems only):

  • Option 1 — Install Tensorflow using Anaconda as shown here and then type the command conda install cupti
  • Option 2 — Install all the packages for medical image data loading (along with CUPTI) using the git repository associated with this post.

Once CUPTI is installed, you have to grant your user access to the NVIDIA performance counters. Refer to this article to grant access.

Profiling

This post looks at using the Tensorflow profiling API in a custom training loop scenario. There are two simple, self-explanatory commands to start and stop profiling:

  • Start— tf.profiler.experimental.start(logdir)
  • Stop — tf.profiler.experimental.stop()

Once your training loop has been profiled, you can use TensorBoard (e.g. tensorboard --logdir <logdir>) to visualize the profiling data.

Note: It is advised to start profiling only after a few steps of training, so that it does not capture initialization overhead. Profiling generates files with a lot of information, so one should stop profiling after a predefined number of steps to prevent memory from building up. Finally, the profiling data is only written to disk after the stop command. A minimal sketch of this pattern follows.
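As a rough illustration, here is a sketch of profiling a custom training loop. The model, dataset, optimizer and step counts below are placeholders, not the 3D UNet setup used in this post:

    import tensorflow as tf

    # Placeholder model, loss and data; swap in your own training setup.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.MeanSquaredError()
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal((64, 8)), tf.random.normal((64, 1)))).batch(2)

    @tf.function
    def train_step(x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    WARMUP_STEPS = 5     # skip initialization overhead
    PROFILE_STEPS = 10   # keep the trace small so memory does not build up
    logdir = './logs/profile'

    for step, (x, y) in enumerate(dataset):
        if step == WARMUP_STEPS:
            tf.profiler.experimental.start(logdir)
        # Optional trace annotation so each step shows up as its own block in the trace viewer.
        with tf.profiler.experimental.Trace('train', step_num=step, _r=1):
            train_step(x, y)
        if step == WARMUP_STEPS + PROFILE_STEPS:
            tf.profiler.experimental.stop()  # profiling data is only written to disk here
            break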

The profiling tab of TensorBoard offers a host of options to understand your training loop's characteristics. Some important ones are:

  • Performance Summary — As the name suggests, this gives a basic overview by dividing each step’s total time into different categories such as Average Step Time, Host Compute Time, Device Compute Time, etc. Here the “Host” refers to the machine running Tensorflow, while “Device” refers to the GPU.
  • Trace Viewer — This offers a detailed view of each step and all the system/GPU calls that are made, including loading data to the device, downloading data from the device and a whole range of compute steps.

Performance Summary

Performance Summary of a small 3D UNet model

The above image shows that, on average, a step takes 359.0 ms to run (I had set the batch size to 2). The bulk of each step (i.e. 341.9 ms) was spent on “Device Compute Time”. This means that the GPU stays busy with operations such as computing activation maps and later computing gradients and backpropagating them. Since “Input Time” is 0.0 ms, the model does not actually sit idle while the data is loaded to the GPU. This is a good sign, since the compute capacity is being fully utilized. Sometimes the “Kernel Launch Time” is high, but that can simply be fixed by setting the TF_GPU_THREAD_MODE flag to gpu_private.
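Since TF_GPU_THREAD_MODE is read when Tensorflow initializes its GPU threads, it should be set as an environment variable before any GPU work, so it is safest to set it before importing Tensorflow. A minimal sketch:

    import os

    # Set before importing Tensorflow so the flag is seen when the GPU threads are created.
    os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'

    import tensorflow as tf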

Another point to notice is the “Op Time Spent on Eager Execution”, seen at the bottom left of the snapshot above. A lower percentage suggests that the code is optimized by embedding the instructions into an optimizable graph-based data structure. Finally, some Tensorflow operations seem to be placed on the host and not the device, as seen in “TF Op Placement”. Thus, there is still room for improvement.
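To find out which operations end up on the host, one option (outside the profiler itself) is Tensorflow's device-placement logging; a minimal sketch:

    import tensorflow as tf

    # Log the device chosen for each op; enable this before the ops are created.
    tf.debugging.set_log_device_placement(True)

    a = tf.random.normal((2, 3))
    b = tf.random.normal((3, 4))
    c = tf.matmul(a, b)  # the log shows whether MatMul ran on /GPU:0 or /CPU:0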

Trace Viewer

Trace Viewer for training steps on a small 3D UNet model

The trace viewer is interesting to look at for a peek under the hood of what Tensorflow does. Note that you may not be able to make major changes based on it, but it may offer insight into expensive operations. As seen in the snapshot above, each step takes approximately the same time, which shows that model training (at least from a software engineering perspective) is homogeneous across steps.

Trace Viewer: Zoomed version of a step while training a small 3D UNet model

This snapshot shows the different operations taking place within a single step. Firstly, we can see that the step time is 364.47 ms, which is close to the average of 359.0 ms seen on the “Performance Summary” page. Further analysis reveals the time taken for model inference (ModelUnet3DSmall) as well as for backpropagation (gradient_tape). We can also see which convolutional block is the most expensive (Block5_ConvBlock3D) in both model inference and backpropagation.
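The readable names in the trace (e.g. ModelUnet3DSmall, Block5_ConvBlock3D) come from the names given to the model and its layers, so it pays to name them explicitly. A minimal sketch with made-up names (not the actual model from this post):

    import tensorflow as tf

    class ConvBlock3D(tf.keras.layers.Layer):
        # The name passed here is what appears in the trace viewer.
        def __init__(self, filters, name=None):
            super().__init__(name=name)
            self.conv = tf.keras.layers.Conv3D(filters, 3, padding='same', activation='relu')

        def call(self, x):
            return self.conv(x)

    class TinyUnet3D(tf.keras.Model):
        def __init__(self):
            super().__init__(name='ModelUnet3DSmall')  # top-level name in the trace
            self.block1 = ConvBlock3D(8, name='Block1_ConvBlock3D')
            self.block2 = ConvBlock3D(16, name='Block2_ConvBlock3D')

        def call(self, x):
            return self.block2(self.block1(x))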

Conclusion

The main goal of this post was to use Tensorflow’s inbuilt profiler to understand and fix any bottlenecks in our custom training loops. The use of the tf.data.Dataset API along with tf.function decorators on various functions already ensures that our code is optimized. Using the profiler, we have confirmed that our model is not sitting idle waiting to consume data, and is thus fully utilizing GPU resources.

More information on the profiling pages can be found in Tensorflow’s documentation.

Note: If you have difficulties installing CUPTI on your machine, you can check the profiler logs using this file and the command

Special thanks to Ravi Autar 🤟


I'm a PhD Candidate at Leiden University Medical Centre. My research focuses on using deep learning for contour propagation of Organs at Risk in CT images.