
Upgrade Your DNN Training with Amazon SageMaker Debugger

How to Increase Your Efficiency and Reduce Cost When Training in the Cloud


This blog post accompanies a talk I gave at AWS re:Invent 2020, in which I described some of the ways my team at Mobileye (officially known as Mobileye, an Intel Company) uses Amazon SageMaker Debugger in its daily DNN development.

Monitoring the Learning Process

A critical part of training machine learning models, and particularly deep neural networks (DNNs), is monitoring the learning process. (This is sometimes called babysitting the learning process.) Monitoring the learning process refers to the art of tracking different metrics during training in order to evaluate how the training is proceeding and to determine which hyperparameters to tune in order to improve it.

On our team, we track a broad range of metrics, which can be divided into three categories:

  • Training metrics are used to measure the rate at which the model training is converging. These include monitoring the losses, and the distributions of the gradients and activations.
  • Prediction metrics measure the model’s ability to make predictions. Common metrics include model accuracy, precision, recall, etc. If you are working on a computer vision problem, then a visualization of your model’s prediction might also serve as a metric.
  • System utilization metrics measure the degree to which the training system resources are being utilized, draw attention to bottlenecks in the training pipeline, and indicate potential ways in which the training throughput can be accelerated.

One example of a tool that we use extensively for monitoring TensorFlow models is TensorBoard. We use TensorBoard to track losses, generate gradient histograms, measure activation outputs, display confusion matrices, display image visualizations of the model's predictions, profile training performance, and more.
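For readers less familiar with the TensorFlow summary APIs, here is a minimal sketch of the kind of logging described above, using the TensorFlow 2 tf.summary API. The log_metrics helper, its arguments, and the log directory are hypothetical placeholders rather than our actual training code.

import tensorflow as tf

# hypothetical log directory; point it at your TensorBoard logs location
writer = tf.summary.create_file_writer('/logs/train')

def log_metrics(step, loss, gradients, activations, prediction_image):
    # 'gradients' is a list of gradient tensors, 'activations' a dict of
    # layer-name -> activation tensor, and 'prediction_image' a [1, H, W, 3]
    # visualization of the model's prediction -- all hypothetical inputs
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)
        for i, g in enumerate(gradients):
            tf.summary.histogram(f'gradients/layer_{i}', g, step=step)
        for name, a in activations.items():
            tf.summary.histogram(f'activations/{name}', a, step=step)
        tf.summary.image('prediction', prediction_image, step=step)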

Another tool that we use in the Amazon SageMaker training environment is SageMaker metrics. In addition to providing system utilization metrics, the SageMaker startup API (specifically, the metric_definitions argument) allows you to define custom metrics that are tracked during training and displayed on the SageMaker console.
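As a rough illustration, custom metrics are defined as regular expressions that SageMaker applies to the training job logs. The metric names and regexes below are hypothetical and would need to match your own log format.

# hypothetical metric names and log-parsing regexes, for illustration only
metric_definitions = [
    {'Name': 'train:loss',      'Regex': 'loss: ([0-9\\.]+)'},
    {'Name': 'validation:miou', 'Regex': 'val_miou: ([0-9\\.]+)'},
]
# these are passed to the SageMaker estimator, e.g.:
# tensorflow = TensorFlow(entry_point='run.py',
#                         metric_definitions=metric_definitions, ...)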

The importance of having a rich set of monitoring tools cannot be overstated, as it is a key ingredient in the success of any DNN development team.

In this post, we describe an additional tool provided by the Amazon SageMaker framework: Amazon SageMaker Debugger, or SMD for short. We will review some of its main components and show how we applied them to one of our training jobs, highlighting some of its unique capabilities along the way.

Amazon SageMaker Debugger includes three main parts:

  • Tensor Capturing and Analysis
  • Debugger Rules, and
  • The Performance Profiler.

Full documentation is available as part of the online SageMaker documentation. The SMD Python code is publicly available on GitHub. Also, check out the Python package page for additional details.

We will demonstrate the SageMaker Debugger features on a training session of a model that is learning to perform instance segmentation on images. We will start by introducing the problem of instance segmentation.

Case Study: Instance Segmentation

A key component of an autonomous vehicle system is the ability to detect and classify road-related artifacts in its surroundings, including road users such as pedestrians, vehicles, and bicycles; road semantics such as traffic lights and road signs; and road boundaries such as curbs and construction cones.

One way to do this, in a system based on standard cameras, is by segmenting the incoming image into object regions, in a process known as Scene Segmentation or Semantic Segmentation. Each pixel in the image is tagged with an object label, indicating what type of object it belongs to.

Semantic Segmentation (by Mobileye)

An additional level of complexity is introduced by the problem of Instance Segmentation. In this case we want to know, not just what type of object each pixel belongs to, but also what instance of the object it belongs to.

Instance Segmentation (by Mobileye)

Notice how, in this image, each instance of a vehicle is uniquely identified with its own color label.

But autonomous vehicle systems need even more than instance segmentation. They need to be able to make predictions regarding the movement of each of the objects in the scene, in order to safely decide on a driving course. To calculate and predict the movement of a given object, the AV system needs to track its position in sequential frames. This requires temporally consistent instance identification. In other words, not only do we need to be able to segment a scene into separate object instances, but the object identifications need to remain consistent from one image to the next.

Temporally Consistent Instance Segmentation (by Mobileye)

One way to build a solution that can perform instance segmentation is through deep, supervised machine learning, i.e. by defining a deep neural network model and training it on a (large) set of labeled data. In the next sections, we will describe some of the basic SageMaker Debugger features, applied to the training of this model in the Amazon SageMaker environment.

SMD Tensor Capturing and Analysis

The basic feature of SageMaker Debugger is the ability to capture and record tensors during training. Using the SMD APIs, you can choose from a predefined list of tensor collections, create a custom tensor collection, or specify arbitrary individual tensors to capture. The predefined list of tensor collections includes model weights, model outputs, gradients, the inputs and outputs of each model layer, and more. The APIs also allow for filtering the tensors by name using regular expressions. For each collection, you can specify the frequency at which you wish the tensors to be captured. Keep in mind that the number of tensors you mark for recording, combined with the collection frequency, will dictate the performance overhead of the tensor capturing. More on this below.

Debugger Configuration

To set up the SMD capturing utility, fill in the debugger_hook_config setting of the TensorFlow estimator object in the SageMaker job startup script. In the code block below, we configure SMD to collect the outputs of our instance segmentation model, as well as all model layer inputs and outputs, every 1000 training steps. The tensors will be written to the output directory of the SageMaker job. Note that, for greater control over the tensor collection setup, you can explicitly import and configure the hook from the smdebug.tensorflow module within your training script (see the sketch following the code block).

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig
debugger_hook_config=DebuggerHookConfig(
    hook_parameters={'save_interval': '1000'},
    collection_configs=[CollectionConfig(name="outputs"),
                        CollectionConfig(name="layers") ",
                           parameters={'include_regex': "Input")])])
tensorflow=TensorFlow(entry_point='run.py',
                      debugger_hook_config=debugger_hook_config,
                      rules=debugger_rules,
                      profiler_config=profiler_config,...)
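For reference, here is a minimal sketch of what configuring the hook inside the training script might look like, using smdebug's TensorFlow Keras hook. The output directory, collection names, and training call below are assumptions that would need to match your own setup.

import smdebug.tensorflow as smd

# a rough sketch of configuring the hook directly in the training script;
# the output path and collection names here are assumptions
hook = smd.KerasHook(
    out_dir='/opt/ml/output/tensors',
    save_config=smd.SaveConfig(save_interval=1000),
    include_collections=['outputs', 'layers'])

# the hook is then attached to training, e.g. as a Keras callback:
# model.fit(dataset, epochs=10, callbacks=[hook])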

SMD Tensor Analysis

The output tensors can be analyzed using the smdebug.trials APIs. In the code block below, we demonstrate how we load the captured input images and the captured predictions of the instance segmentation model, and process them in order to assess how the quality of the predictions changes over the course of training.

from smdebug.trials import create_trial
from smdebug import modes

# create a trial that points to the captured tensors in S3
trial = create_trial('s3://.../smd_outputs')
for step in trial.steps(mode=modes.TRAIN):
    # load the captured input frames and model predictions for this step
    frame_batch=trial.tensor(frame_tensor_name).value(step,
                                                      mode=modes.TRAIN)
    seg_prediction=trial.tensor(seg_tensor_name).value(step,
                                                       mode=modes.TRAIN)
    inst_prediction=trial.tensor(inst_tensor_name).value(step,
                                                         mode=modes.TRAIN)
    # overlay the predictions on the input frames and display the result
    image_overlay=process_output(frame_batch,
                                 seg_prediction,
                                 inst_prediction)
    plot_images(image_overlay)

Here is an example of one type of visualization we created, which tracks the evolution of both the semantic and instance segmentation predictions over the course of training.

Segmentation Evolution (by Mobileye)

Advantages of SMD Tensor Capturing

This feature begs comparison with the TensorFlow summary APIs, which also support tensor capturing. The key difference is that SMD supports capturing full tensors, whereas TensorBoard records scalars, histograms, and images, but not the raw tensors. In essence, the TensorFlow summary APIs combine tensor capturing and processing into a single step, while SMD separates them into two steps. This enables greater freedom in analyzing and processing the output tensors. For example, the helper function in the code block above includes optional control parameters (e.g. thresholds on the output values). When using TensorFlow summaries, these need to be fixed when writing the image. With SMD, we can load the raw output tensors and play around with the output processing controls to see how they affect the results. This enables a more comprehensive understanding of the state of the model training.

It is worth noting that, in addition to capturing the raw tensors, the SMD APIs include options for recording captured tensors to TensorBoard, in the form of scalars or histograms.
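A sketch of such a configuration might look as follows. It assumes that the estimator's tensorboard_output_config argument and the collection-level save_histogram parameter are used for this purpose, and the S3 path is a placeholder.

from sagemaker.debugger import TensorBoardOutputConfig, CollectionConfig

# placeholder S3 path for the TensorBoard event files
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://.../tensorboard_outputs')

# request histogram summaries for the captured weight tensors;
# the save_histogram parameter is an assumption about the collection options
weights_collection = CollectionConfig(
    name='weights',
    parameters={'save_histogram': 'True'})

# both are then passed to the estimator, e.g.:
# tensorflow = TensorFlow(entry_point='run.py',
#                         tensorboard_output_config=tensorboard_output_config,
#                         ...)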

SMD Rules

The second component of the SMD offering is the debugging rules feature. While the tensor capturing feature facilitates monitoring a training session, the debugging rules automate the process of taking action based on an assessment of the training progress. On a team like ours, where we often program training sessions to run for several days, this automation can be a significant cost saver. Sometimes we will run a number of parallel training experiments, with the expectation that some will not converge. Rather than waiting for a failed experiment to run its full course, or relying on the developer to identify the failed run, we can configure a rule to terminate the job. Even a successful training job sometimes requires a configuration update (e.g. learning rate), or "early stopping". The SageMaker debugger offers a number of built-in rules, as well as APIs for creating custom rules.

In the code block below, we chose to demonstrate the configuration of the stalled_training_rule. The motivation for such a rule should be obvious to anyone who has ever kicked off a bunch of training jobs before leaving for the weekend, and returned three days later, only to learn that the training jobs (which are still running), all hit a bug and stalled ten minutes after they left the office! Aside from losing three days of training time, imagine having to explain away the costly training expenses that were racked up all weekend long, without any return!

For our instance segmentation model, we configure the stalled_training_rule to monitor the loss outputs, which we program to be captured every 1000 training steps. If the rule identifies that half an hour has passed without receiving a loss tensor update, it assumes that the training has stalled and will terminate the job. Alternatively, an AWS Lambda function could be defined to restart the training and/or send an urgent email notification (a rough sketch of such a function follows the rule configuration below).

from sagemaker.debugger import CollectionConfig, Rule, rule_configs
# Loss value updates will be used to monitor training job aliveness
loss_collection=CollectionConfig(name="losses", 
                               parameters={"save_interval": "1000"})
# configure rule to terminate job if no update received
# for half an hour
rule_params={"threshold": "1800", 
             "stop_training_on_fire": "True",
             "training_job_name_prefix": job_name}
debugger_rules=[Rule.sagemaker(
                   base_config=rule_configs.stalled_training_rule(),
                   rule_parameters=rule_params,
                   collections_to_save=[loss_collection])]
tensorflow=TensorFlow(entry_point='run.py',
                      debugger_hook_config=debugger_hook_config,
                      rules=debugger_rules,
                      profiler_config=profiler_config,...)
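For completeness, here is a rough sketch of what such a Lambda function might look like, assuming it is triggered by a CloudWatch/EventBridge rule on SageMaker training job state changes. The SNS topic ARN is a placeholder.

import boto3

sm_client = boto3.client('sagemaker')
sns_client = boto3.client('sns')

def lambda_handler(event, context):
    # assumes the event comes from a CloudWatch/EventBridge rule that fires
    # on SageMaker training job state changes
    job_name = event['detail']['TrainingJobName']
    description = sm_client.describe_training_job(TrainingJobName=job_name)
    # inspect the Debugger rule evaluation statuses attached to the job
    for status in description.get('DebugRuleEvaluationStatuses', []):
        if status['RuleEvaluationStatus'] == 'IssuesFound':
            sns_client.publish(
                TopicArn='arn:aws:sns:...:training-alerts',  # placeholder
                Subject='Debugger rule triggered for ' + job_name,
                Message=status.get('StatusDetails', 'Issues found'))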

SMD Performance Profiler

In a previous post, I expanded on the importance of having a rich set of performance profiling tools, and on how effective methodologies for profiling training performance can lead to meaningful savings in time and cost. I detailed some of the common performance issues and bottlenecks, and showed how to identify them using Amazon SageMaker metrics and the TensorFlow profiler.

The recently added profiling functionality of the SageMaker debugger enhances our performance profiling capabilities. Here are some of its most compelling features:

  • SageMaker debugger enables profiling of both system utilization, and framework (TensorFlow) activity. This enables detailed analysis of resource utilization as a function of train step progression.
  • The profiling feature can be programmed to collect runtime statistics throughout the course of the training. This enables measuring how different stages of the training, and in particular, activities performed at periodic intervals, impact performance. This is in contrast to profiling just a small window of train steps.
  • The SageMaker debugger profiling APIs provide a high level of control over the invasiveness of the profiling probes. You can control the profiling interval, and choose whether to activate advanced profiling capabilities, including input data loading and Python profiling, and for what durations.
  • SageMaker debugger performs automated analysis of profiling metrics (using debugger rules) and generates a detailed report of the performance footprint, including recommendations for how to increase throughput and reduce cost.

While a detailed review of how the SageMaker debugger profiling feature can be used to identify bottlenecks in different stages of the training pipeline is beyond the scope of this post, I will demonstrate its use on our instance segmentation model.

In the code block below, we configure the profiling to probe the system at intervals of 100 milliseconds, and enable the advanced profiling features for a single training step.

We applied this profiling configuration to a training session in which we enabled the debugger, increased the number of monitored tensors, and increased the tensor capture frequency to once every 50 training steps. Using the profiler, we will measure how this impacts resource utilization.

from sagemaker.debugger import (ProfilerConfig,
                                FrameworkProfile,
                                DetailedProfilingConfig,
                                DataloaderProfilingConfig,
                                PythonProfilingConfig)
dpc=DetailedProfilingConfig(start_step=5, num_steps=1)
dlpc=DataloaderProfilingConfig(start_step=7, num_steps=1)
ppc=PythonProfilingConfig(start_step=9, num_steps=1,
                          python_profiler="cProfile",
                          cprofile_timer="total_time")
fw_params=FrameworkProfile(detailed_profiling_config=dpc,
                           dataloader_profiling_config=dlpc,
                           python_profiling_config=ppc)
profiler_config=ProfilerConfig(system_monitor_interval_millis=100,
                               framework_profile_params=fw_params)
tensorflow=TensorFlow(entry_point='run.py',
                      debugger_hook_config=debugger_hook_config,
                      rules=debugger_rules,
                      profiler_config=profiler_config,...)

The SMD library provides APIs for analyzing the profiling metrics and visualizing them in different ways. In the image below, we demonstrate how we create a simple graph in which we display the GPU utilization along with the train step.
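As a rough sketch, such an analysis could look something like the following. The module path, method names, and DataFrame column names are based on the smdebug profiling analysis utilities used in the AWS sample notebooks, and should be treated as assumptions that may differ across smdebug versions.

import matplotlib.pyplot as plt
# module path and method names are assumptions (see note above)
from smdebug.profiler.analysis.utils.profiler_data_to_pandas import PandasFrame

pf = PandasFrame('s3://.../profiler-output')
system_df = pf.get_all_system_metrics()        # system utilization over time
framework_df = pf.get_all_framework_metrics()  # train step annotations

# plot GPU utilization over time; the 'dimension', 'timestamp_us', and
# 'value' column names are assumptions about the DataFrame layout
gpu_df = system_df[system_df['dimension'] == 'GPUUtilization']
plt.plot(gpu_df['timestamp_us'], gpu_df['value'])
plt.xlabel('time')
plt.ylabel('GPU utilization (%)')
plt.show()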

Displaying the performance metrics in this manner enables us to see a clear correlation between drops in GPU utilization and the steps at which we capture the monitored tensors. Based on this analysis, we might decide to reconsider the decisions to increase the number of monitored tensors and to decrease the capture interval. More on the tradeoff between tensor collection and performance overhead in the next section.

This somewhat simple analysis was enabled by the fact that the profiler collected metrics from both the system resources and the TensorFlow framework, and that those statistics were collected for the duration of the training. Had our profiling been limited to a small window of training steps, we would likely not have included profiling of the tensor collection step, and connecting the dips in GPU utilization to the collection interval might not have been as easy.

Note that the performance profiling results, including graph visualizations of the performance data and the automated performance report, are integrated into SageMaker Studio. Using the GUI controls, you can zoom in and out of the various graphs to get an in-depth picture of your training performance.

SMD Overhead

Naturally, the inclusion of the SMD functionalities will incur some performance overhead. The amount of overhead depends on a number of factors, including the model architecture, the number of tensors collected, the capture interval, and the level of invasiveness of the profiling probes.

In the table below, we give a feel for the overhead of different configurations of the debugger by measuring them on our instance segmentation model.

The overhead of the tensor collection is clearly impacted by the number of tensors that are captured, and the capture frequency. Once again, we emphasize the dependence of the overhead on the specific model being trained.

The benefit of the debugger needs to be weighed against its performance penalty. Whether and how to configure the SageMaker debugger, and at which stages in the development cycle, should be part of your team's overall strategy for monitoring and profiling model training. You might reach the conclusion that your team cannot afford the overhead of the SageMaker Debugger. I believe that our team cannot afford NOT to have the monitoring and profiling capabilities that SMD provides.

Summary

I hope I have succeeded in convincing you of the value of the Amazon SageMaker Debugger. Having a clearly defined set of tools and processes for monitoring training is essential for succeeding at training DNNs. When training in Amazon SageMaker, SMD offers a significant upgrade to these tools.

Keep in mind that AWS is continuing to develop and enhance the tool, adding additional capturing features, additional rules, and more profiling tools. Be sure to read up on the latest supported features and enhancements.

