Thoughts and Theory

TPU Training

Harnessing the Power of Dedicated DNN Training Chips

Chaim Rand
Towards Data Science
20 min read · Oct 12, 2021


Photo by Ferdinand Stöhr on Unsplash

One of the driving forces behind the success of deep learning over the past decade has been the immense computing power offered by Graphics Processing Units (GPUs). Although originally designed for rendering images to display devices, their highly parallel structure enabled training speed-ups of orders of magnitude. Over time, GPUs were enhanced to meet the ever increasing demands of DNN training. Today they are the dominant method for training large-scale AI. However, over the past few years potential challengers have emerged in the form of new chips specifically designed for training DNNs. These chips (or ASICs, application-specific integrated circuits, as they are more aptly called) can potentially enable accelerated training at a fraction of the cost. While there are a number of specialized AI ASICs already on the market (e.g. see here) and many others in the making (e.g. see here), at the time of this writing only a select few are available to the general public via cloud services. This landscape is expected to change in the near future with recent announcements such as the upcoming availability of Habana Gaudi on AWS and the highly anticipated AWS Trainium.

For modern day machine learning development teams to remain competitive they need to constantly keep their fingers on the pulse with regards to new advances. This includes developing a high level of proficiency when it comes to evaluating new AI chips and their potential applications to their own projects. Unfortunately, adapting your training workload to a new chip can sometimes be challenging. The intention of this blog post is to provide a bit of guidance as to how to tackle this challenge, as well as offer some emotional support along the way. This post is by no means intended to replace the official documentation and tutorials generally available online.

I would like to thank Allen Wang and Yitzhak Levi for their inputs into this post.

Prelude

In this post we recommend breaking down the challenge of adapting your model to a new AI chip into four steps:

  1. High level compatibility analysis: Get an early assessment of whether the properties of your workload align with the chip specifications.
  2. Adjusting your model to run on the new chip: You may need to make some adjustments to your model such as replacing operations that are not supported by the dedicated AI chip.
  3. Optimizing the runtime performance on the new chip: Here is where the fun begins… In order to take full advantage of the chip, you will want to analyze and maximize its utilization.
  4. (Re-)Tuning the model to converge: Chances are that the changes required in the previous steps (e.g. increasing the training batch size) will require tuning of the model hyperparameters (e.g. the learning rate) in order to ensure timely convergence.

Obviously, this breakdown is somewhat of a simplification. In reality you may find yourself performing some of these steps iteratively and/or in parallel. For example, you may decide to hand over the task of optimizing the runtime to performance analysis experts while a separate group of data scientists works on tuning your learning algorithm to converge on large training batches (e.g. using large batch simulation on GPUs to minimize cost).

In the next few sections we will demonstrate these four steps in greater detail by applying them to the Google Cloud TPU DNN accelerator. More specifically, we will discuss some of the challenges you might face when attempting to convert your model to run on the Cloud TPUv3–8 (containing 8 TPU cores) using TensorFlow version 2.6. Although we are zeroing in on a specific AI chip and a specific training framework, many of the considerations we discuss are applicable to other AI chips and other training frameworks as well.

You can start up a Cloud TPU on Google Compute Engine (as described here) or using a managed service such as AI Platform. For the steps described in this post we highly recommend running a Cloud TPU on a Compute Engine instance. This will enable greater flexibility in debugging and in analyzing performance. At the time of this writing the managed API does not offer the same visibility into the TPU system logs and TPU system performance, does not enable you to capture performance profiles, and does not support offloading data preprocessing using the tf.data.service. When starting up your TPU, make sure to follow the instructions meticulously, as there are a few subtleties that are unique to the TPU setup (e.g. the TPU dedicated service account).
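
For reference, connecting to a TPU Node from TensorFlow 2.x typically looks something like the sketch below; the TPU name is a placeholder for the name you assigned when creating the TPU, and the model is just a toy stand-in.

import tensorflow as tf

# Resolve and initialize the TPU system ('my-tpu-node' is a placeholder).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu-node')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Create a distribution strategy that replicates the model across the 8 TPU cores.
strategy = tf.distribute.TPUStrategy(resolver)
print('Number of replicas:', strategy.num_replicas_in_sync)

# Build and compile the model under the strategy scope so that the model
# variables are placed on the TPU.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10)])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))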

It is important to keep in mind that the landscape of DNN development remains extremely dynamic. Some of the points that we make might become outdated by the time you read this post. Please make sure to stay on top of announcements of new versions and new tools, and be sure to make your design decisions based on the most up to date information available.

Please forgive any inaccuracies you might encounter, or better yet, drop me a line with your corrections.

Step 1 — High Level Compatibility Assessment

Naturally, the first thing you will want to do is to try to develop an early assessment of whether the AI chip of interest is at all relevant to your use case. The effort involved in adapting your model to a new chip can be significant, and the sooner you can rule out a dead end, the better. This initial assessment can usually be derived from resources available online, including system specifications and performance benchmarks:

ASIC Description

A good place to start is with the published description of the dedicated hardware. This will generally describe the capabilities of the training chip: which model layers and architectures are supported, which floating point types are used, what SW stack is required, how the chip interfaces with the CPU and data storage, whether and how the computation cores interconnect, to what degree the hardware scales to multi-core training, etc. This description can be used to identify potential incompatibilities with your model. You may find that the available memory does not meet the needs of your model size or does not support your training scalability needs. In such cases, go no further.

The Cloud TPU documentation includes extensive information on using TPUs including the TPU programming model and the types of workloads most suited for TPUs. From these resources you might reach the conclusion that TPUs are not right for your model due, for example, to their dependence on custom ops, on high precision arithmetic, or on a large number of element-wise operations.

Beware of Benchmarks

Another resource you might want to take a look at is online performance benchmark comparisons. You will not be hard pressed to find TPU to GPU performance comparisons on a variety of common architectures including ResNet, MaskRCNN, BERT, and more. Unfortunately, figuring out how to apply these results to your own use case can be quite hard.

First off, it probably goes without saying that benchmarks provided by the chip manufacturer should be approached with a healthy amount of skepticism. But even analyses that you would consider impartial can be extremely difficult to interpret. To an extent, this can be said of benchmark comparisons in any discipline, but it is especially true in the area of deep learning, where there are a great number of factors that can have a meaningful impact on the performance. First, there is the runtime environment (the number of CPU cores and their type, the operating system, the driver versions, the SW framework type and version); each one of these individual elements can alone impact the performance by tens of percent. Then, there is the model: even the slightest differences between your own model architecture and the most similar model in the benchmark, be it in the graph architecture, the input data format, the preprocessing pipeline, the loss function, or the optimizer, can, once again, have a great impact on the performance. Quality benchmark comparisons will include detailed information regarding the precise properties of the evaluations that were performed. But they are unlikely to cover all of the many parameters that can impact the performance.

MLPerf: MLPerf is an oft quoted benchmark suite for AI training and inference currently administered by the MLCommons consortium. The rationale behind the creation of the training benchmark, as well as the rules for participation, are detailed in this white paper. From just a cursory review of the benchmark results it is evident that the benchmark is far reaching, covering a wide range of training environments and model architectures. However, as mentioned above, given the many differing parameters per test case, you may find it difficult to deduce clear comparisons between AI chips, let alone apply these results to your own particular use case. I have found the results to be subject to interpretation (see for example this review), especially when taking into account the potential price differences between ASICs (which are not included in the raw comparison results).

For example, all indications from the MLPerf benchmark would lead one to believe that 8 NVIDIA A100 GPUs would outperform a TPUv3–8 (containing 8 TPU cores) by a wide margin. However, we recently worked on a model, not unlike the models covered in the review, on which the TPU run actually matched, and even slightly outperformed, our best known configuration for the A100 GPU run. This model was seemingly perfectly aligned with the TPU specifications. At the same time, small changes to this model would have changed its TPU compatibility dramatically and substantially increased the step time. The MLPerf report gives no indication of these two extreme results.

While we recognize the importance and value of AI training benchmark reviews, we believe it is important to acknowledge their limitations in predicting performance beyond the specific test cases included in the review.

System Architecture Specification

If the full system specifications are available, another option you have is to try to project the runtime of your training step by conducting an offline simulation of how your model will run on the dedicated hardware. Such analysis requires intimate knowledge of both the system architecture and the DNN model. I know many people who have made a good living off of creating spreadsheets with precise performance predictions based on parameter matrix sizes, numbers of FLOPS, memory size, L2 cache latency, etc. In my experience such projections tend to be very much “hit or miss”, and more often “miss” in the case of complex machine learning workloads.
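
For illustration only, such back-of-the-envelope projections often reduce to a roofline-style estimate along the lines of the sketch below; all of the numbers are made up, and the formula ignores many real-world effects (which is precisely why these projections often miss).

def estimate_step_time(flops_per_step, bytes_per_step,
                       peak_flops, memory_bandwidth, efficiency=0.5):
    """Crude roofline-style estimate of a single training step (in seconds)."""
    compute_bound = flops_per_step / (peak_flops * efficiency)
    memory_bound = bytes_per_step / (memory_bandwidth * efficiency)
    return max(compute_bound, memory_bound)

# Hypothetical model and chip numbers, for illustration only.
print(estimate_step_time(flops_per_step=5e12,      # FLOPs in one training step
                         bytes_per_step=2e10,      # bytes moved to/from memory
                         peak_flops=100e12,        # peak chip throughput (FLOP/s)
                         memory_bandwidth=9e11))   # peak memory bandwidth (B/s)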

Jump In the Deep End

When all is said and done, there really is no alternative to getting down and dirty. While I recognize that some may be wary of going down the path of a potential dead end, it is my belief that even if you do not ultimately end up training your current model on a TPU, the expertise that you will develop along the way will almost certainly serve you and your team well as your projects evolve and as additional ASICs become available.

Step 2 — Adapting Your Model to Run on TPU

Chances are that you will need to make changes to your AI application in order to succeed in running it on the custom ASIC. The extent of the changes that will be required will depend on a number of factors including the maturity of the ASIC’s software stack and the scope of operations supported. Custom ASICs may impose strict restrictions on the AI SW development platform or version which may require significant adaptations. You should always strive to use the most up to date SW packages as these are likely to contain the most extensive and optimal API support. TPUs enjoy the benefit of a relatively healthy SW stack and a large community of developers (the two of which are often correlated). However, they may still require adaptations to both your data input pipeline and computation graph. We will demonstrate a few examples in the subsections below. First we will point out the potential complexities involved in debugging for custom ASICs.

Debugging for TPU

A notable advantage of GPU training is that a significant part of the model design and debugging can be performed on a CPU. If you are able to compile your model on a CPU, then 99 times out of 100 your model will be valid on a GPU (with sufficient memory). This is something that modern day AI developers often take for granted. This is not necessarily the situation with custom ASICs. The ramification of this is that oftentimes the debugging will need to be done directly on the custom ASIC, which may have implications for cost and/or duration. Alternatively, ASIC manufacturers might provide a simulation framework, runnable on a CPU, for identifying and fixing potential issues. Ideally, such a simulation framework would also provide pointers on how to maximize the ASIC utilization. Unfortunately, as of the time of this writing, an official TPU simulation framework does not exist. And while the available documentation and tools may be helpful in building a TPU compatible model, you won’t know for sure until you run it on a TPU.

An additional difficulty you might face is making sense of the error messages reported by TPU. As we have noted in the past, deciphering TensorFlow error messages can be hard. Error messages on TPU, both those that are reported to the console as well as those that are accessible via cloud monitoring, tend to be especially cryptic (as of the time of this writing). We will provide a few examples in the sections below.

Updating your Data Input Pipeline

Despite the fact that the data input pipeline runs on the TPU’s host CPU, and not on the TPU itself, the TPU system architecture imposes certain restrictions on the operations that can be performed. The standard way of launching a TPU relies on a dedicated VM that communicates with the TPU host over gRPC. One of the consequences of this architecture is that any custom operators or python-based data processing functions are disallowed. This prohibits use of tf.py_function and tf.numpy_function, which are commonly used to bypass limitations imposed by the native TensorFlow API. It also prohibits use of the tf.data.Dataset.from_generator API, often used as a way to increase the flexibility of input dataset creation.
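
For reference, a minimal sketch of a TPU-Node-friendly input pipeline built entirely from native TensorFlow graph ops is shown below. The GCS path and feature names are placeholders; with the TPU Node architecture the training files would typically reside in a GCS bucket accessible to the TPU host.

import tensorflow as tf

FILE_PATTERN = 'gs://my-bucket/train/*.tfrecord'  # placeholder path

def parse_example(serialized):
    # Pure-graph parsing; no tf.py_function or tf.numpy_function allowed.
    features = tf.io.parse_single_example(serialized, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64)})
    image = tf.image.resize(tf.io.decode_jpeg(features['image'], channels=3),
                            [224, 224])
    return image, features['label']

dataset = (tf.data.Dataset.list_files(FILE_PATTERN)
           .interleave(tf.data.TFRecordDataset,
                       num_parallel_calls=tf.data.AUTOTUNE)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(1024, drop_remainder=True)  # static batch size for TPU
           .prefetch(tf.data.AUTOTUNE))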

Unfortunately, as of the time of this writing, if your input pipeline graph is invalid, you are likely to get an ambiguous gRPC error message like the one in the block below, and you will be left to hunt for the culprit on your own.

W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 22643, Output num: 7Additional GRPC error information from remote target /job:worker/replica:0/task:0::{"created":"@1632916556.697142899","description":"Error received from peer ipv4:10.11.250.138:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 22643, Output num: 7","grpc_status":3}

Another type of input pipeline error that is common with TPUs, and that will result in an equally ambiguous message, is related to the size of the data input. As we will see below, utilizing the TPU efficiently may require much larger batch sizes than you were previously accustomed to. If not properly addressed in the input data pipeline, this may lead to CPU memory issues. For example, if each input sample is 1 MB and your global batch size is 4096, then each batch will be 4 GB. One way to address this is to modify your data pipeline to generate “local” batches rather than “global” batches; that is, to create batches of the per-core batch size. This can be done using the tf.distribute.Strategy.distribute_datasets_from_function API. In our example, each local batch would be a much more manageable 0.5 GB.
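
A rough sketch of how this might look is shown below; make_dataset is a hypothetical helper returning an unbatched tf.data.Dataset, and strategy is the tf.distribute.TPUStrategy object used for training.

import tensorflow as tf

GLOBAL_BATCH_SIZE = 4096

def dataset_fn(input_context):
    # Each input pipeline batches only its own share of the global batch
    # (512 samples per core in this example).
    per_replica_batch = input_context.get_per_replica_batch_size(GLOBAL_BATCH_SIZE)
    ds = make_dataset()  # hypothetical helper returning an unbatched dataset
    ds = ds.shard(input_context.num_input_pipelines,
                  input_context.input_pipeline_id)
    return ds.batch(per_replica_batch, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)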

TPU-VMs — Salvation is on the Way:

This past June, Google announced a new Cloud TPU architecture called Cloud TPU VM. In contrast to the original architecture, known as TPU Nodes, TPU VMs allow direct SSH access to the TPU host and do not require an intermediary VM. The implications are profound. Not only does this remove the restrictions on the data input pipeline creation, but it also greatly increases our ability to debug and analyze the performance of the input pipeline. Additionally, removal of the intermediary VM can potentially reduce networking overhead and boost performance.

The new architecture is already available in “preview” mode. In our experience it has not fully matured. But the future looks bright.

Updating your Model Graph

Training your model on TPU may require changes to the computation graph running on the TPU core as well. In this section we will demonstrate changes that may be imposed in order to conform to API restrictions or TPU memory restrictions.

TensorFlow Op Restrictions:
It is not uncommon for ASICs to impose restrictions on the supported ops. These restrictions could come from limitations in the HW implementation or in the supporting SW stack. The TPU documentation includes a list of supported (and unsupported) TensorFlow ops. Unfortunately, (at the time of this writing) this list is self-proclaimed to be non-exhaustive. TensorBoard’s graph visualization tool includes a TPU_compatibility option, as shown in the image below, but in our experience this test is not fully reliable. You might receive your first indication that your graph is invalid only when you try to run it.

TPU compatibility Graph (Image from GCP online documentation)

The TPU limitations include restrictions on the use of custom operators and on operations that can result in tensors of undetermined shape:

Custom operations — One of the advantages of training on GPUs is that they enable customization at various levels of the SW stack. You can create your own python based operation (e.g. using tf.numpy_function), or you can create your own GPU kernel in CUDA. This degree of flexibility can be extremely useful for implementing operations that are not natively supported by TensorFlow, or GPU kernels that are specifically optimized for your use case. These capabilities are lacking on TPUs (as of the time of this writing). If your graph includes these kinds of customizations, you will need to replace them with native operations. The block below contains an excerpt from the type of error message you can expect in the case that your graph includes a tf.numpy_function call, followed by an illustrative sketch of a native replacement.

(0) Invalid argument: {{function_node __inference_train_function_117771}} Detected unsupported operations when trying to compile graph cluster_train_function_10723333522471149816[] on XLA_TPU_JIT: PyFunc (No registered 'PyFunc' OpKernel for XLA_TPU_JIT devices compatible with node {{node model/segLoss/PyFunc}}){{node model/loss/PyFunc}}
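
As a purely illustrative example, a numpy-based thresholding operation could be replaced with its native TensorFlow equivalent; the tensor and function names below are placeholders.

import numpy as np
import tensorflow as tf

predictions = tf.random.uniform([8, 224, 224, 1])  # dummy tensor for illustration

# GPU-friendly but TPU-incompatible: wraps arbitrary python/numpy code.
def np_threshold(x):
    return (x > 0.5).astype(np.float32)

mask_gpu_only = tf.numpy_function(np_threshold, [predictions], tf.float32)

# TPU-compatible replacement using only native TensorFlow ops.
mask_tpu_ok = tf.cast(predictions > 0.5, tf.float32)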

Operations that result in tensors of undetermined shape — In contrast to GPUs, TPUs disallow some APIs due to their use of tensors of non-static shape. It should be noted that recent TensorFlow versions have extended the support for some operations of this type. However, despite being supported, many of these perform quite badly on TPU and you should try to avoid them.

Here is an example of a code excerpt (taken from a previous blog post) that fails on TPU:

shape = [None, 1]
dtype = tf.float32
record_tensor = tf.Variable(
    shape=shape,
    # initialize with batch size 1 since the batch_size
    # is unknown, and set validate_shape=False
    initial_value=tf.zeros(shape=[1] + shape[1:], dtype=dtype),
    validate_shape=False,
    dtype=dtype,
    trainable=False)

To modify it for TPU, we would need to fix the batch_size and change validate_shape to True. The error encountered in this situation looks something like this:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Dst node should be assigned to an allowed device.

A classic example of an API that results in a tensor of undetermined shape is tf.boolean_mask. Suppose we are working on a segmentation model that takes an image as input and produces a label for each pixel. We may wish to mask out certain areas of the image from our loss calculation (either due to their ambiguity or low level of interest). On a GPU tf.boolean_mask has the effect of removing all operations associated with the loss calculation on these areas and can boost performance significantly. Although TPU support for tf.boolean_mask was recently added to TensorFlow, you are likely to get better performance by calculating the loss on all pixels and just zeroing out the resultant value on areas that are masked out, as demonstrated in the code block below:

# given input logits, labels, mask and loss_fn
if tpu:
    # zero out pixels according to mask
    mask = tf.cast(mask, logits.dtype)
    logits = logits * mask
    labels = labels * mask
else:
    # reduce number of loss_fn operations using tf.boolean_mask
    logits = tf.boolean_mask(logits, mask)
    labels = tf.boolean_mask(labels, mask)

sumLoss = tf.reduce_sum(loss_fn(logits, labels))

TPU Memory Restrictions:
When you load your model onto the TPU, you might be surprised to find that the amount of memory required for your model greatly exceeds the amount of memory required on a GPU. The most likely reason for this is memory padding. We will discuss the topic of padding further in the next section. In the code block below we demonstrate the type of error to expect when the memory requirements exceed the available TPUv3 memory (16 GB per core). We have chosen an extreme case where padding increases the memory utilization by roughly 3X, from 6.4 GB to over 19 GB.

(0) Resource exhausted: {{function_node __inference_train_function_80576}} Ran out of memory in memory space hbm. Used 19.46G of 15.48G hbm. Exceeded hbm capacity by 3.98G.
Total hbm usage >= 19.98G:
reserved 530.00M
program 19.46G
arguments 0B
Output size 0B; shares 0B with arguments.Program hbm requirement 19.46G:
global 276.0K
scoped 173.0K
HLO temp 19.46G (33.1% utilization: Unpadded (6.40G) Padded (19.32G), 0.7% fragmentation (147.91M))
Largest program allocations in hbm:

1. Size: 14.00G
Operator: op_type="OneHot" op_name="model/loss/one_hot"
Shape: s32[29360128,10]{1,0:T(8,128)}
Unpadded size: 1.09G
Extra memory due to padding: 12.91G (12.8x expansion)
XLA label: %iota.2 = s32[29360128,10]{1,0:T(8,128)} iota(), iota_dimension=1, metadata={op_type="OneHot" op_name="model/loss/one_hot"}
Allocation type: HLO temp
==========================
2. Size: 2.62G
Operator: op_name="model/layer_1/Conv2D"
Shape: f32[128,256,896,18]{0,3,2,1:T(8,128)}
Unpadded size: 1.97G
Extra memory due to padding: 672.00M (1.3x expansion)
XLA label: %fusion.9.remat.1.remat = f32[128,256,896,18]{0,3,2,1:T(8,128)} fusion(f32[1,1,8,18]{3,2,1,0:T(8,128)} %get-tuple-element.4316, f32[18]{0:T(256)} %get-tuple-element.4085, f32[128,256,896,8]{0,3,2,1:T(8,128)} %get-tuple-element.3899, f32[8]{0:T(256)} %rsqrt...
Allocation type: HLO temp
==========================

The error message includes a list of the largest memory allocations. In this case we see that a single op is causing an extra 12.91GB of padding.

Other than redesigning your model to fit memory requirements, one compelling option you have is to compile your model with mixed precision and to set the mixed precision policy to mixed_bfloat16. By default, all variables are stored as tf.float32, a 32 bit floating point representation. Bfloat16 is a 16 bit floating point representation created by Google. See here for details of the format and its dynamic range. When you modify your model to use mixed precision, activations and gradients are stored as tf.bfloat16 while the weights remain in tf.float32. This can reduce the memory requirements of your model considerably and increase runtime performance at the same time.
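
A minimal sketch of enabling the mixed_bfloat16 policy in Keras is shown below; build_model is a hypothetical model-building function and strategy is the TPUStrategy object from the earlier sketch.

import tensorflow as tf

# Set the global Keras mixed precision policy before building the model.
# With 'mixed_bfloat16', layer computations and activations use bfloat16
# while the variables (weights) remain in float32.
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

with strategy.scope():
    model = build_model()  # hypothetical model-building function
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy')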

According to research reported by Google, the convergence of most models is not impacted by the use of tf.bfloat16. However, you should be aware of this possibility if you choose this option. We discuss this further in Step 4 of this post.

Step 3 — Optimizing Your Model to Perform on TPU

At this point you should be able to successfully run a training cycle on a TPU. Next comes the critical step of performance analysis and optimization. An AI accelerator is only as good as the tools it provides for performance analysis. If you are not able to analyze performance, you will not be able to make the most of the AI chip.

In a previous blog post we expanded on the importance of performance profiling and demonstrated the use of the TensorFlow profiler and TensorBoard. The same techniques can be used to analyze performance on a TPU. The TPU documentation includes a detailed guide on capturing a profile on TPU and analyzing the results in TensorBoard. The documentation also includes a guide for how to design your model so as to optimize TPU utilization. In this section we will highlight just a few performance tips based on our own experience. For more details you should refer back to these two important guides.
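
One way to capture a profile programmatically while the training job is running is sketched below; the worker address is a placeholder, and 8466 is assumed to be the default profiler port (verify both against your own setup and the official profiling guide).

import tensorflow as tf

tf.profiler.experimental.client.trace(
    service_addr='grpc://10.240.1.2:8466',   # TPU worker address (placeholder)
    logdir='gs://my-bucket/profile-logs',    # log directory readable by TensorBoard
    duration_ms=2000)                        # length of the capture window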

Reducing the Overhead of Padding

One of the most important properties to understand about TPUs is how tensors are stored in memory. Failure to adjust your model to the TPU’s memory tiling scheme can result in a significant amount of memory padding overhead which translates into unfulfilled potential. The most important resource when it comes to evaluating your padding overhead is the memory_viewer tab on the TensorBoard profile page. In the image below we show an example of this page. The red curve shows the unpadded memory utilization and the blue curve shows the padded memory utilization. In this example the padding results in a memory footprint that is ~4X the size of memory that is actually used.

Captured from TensorBoard (By Author)

The TPU documentation provides guidelines on how to minimize padding overhead. The simplified version is this:

  1. Use a (per core) batch size that is a multiple of 128, and
  2. Set the dimension of the output features of each of your layers to a multiple of 8.

Of course this might not be possible for all models, in which case there are additional recommendations as described in the documentation. Don’t forget that, as we saw above, you have the option of enabling mixed precision in order to increase the likelihood of fitting a batch of size 128 into TPU memory.
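
To make these guidelines concrete, here is a toy sketch (not taken from any real model) of batch-size and layer-width choices that align with the TPU tiling scheme:

import tensorflow as tf

NUM_CORES = 8
PER_CORE_BATCH = 128                       # guideline 1: a multiple of 128
GLOBAL_BATCH = PER_CORE_BATCH * NUM_CORES  # 1024

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu'),   # 64: multiple of 8
    tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu'),  # 128: multiple of 8
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10)])  # the output dimension is dictated by the task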

Optimizing the Input Data Pipeline

One measure of the compatibility of a training system with a given model is the ratio between the rate at which the host can feed training batches into the accelerator (measured, for example, in batches per second) and the rate at which the accelerator can process the input batches. This ratio can be impacted by the IO bandwidth, the amount of data processing in the input pipeline, the number and type of CPU cores, and the speed of the accelerators. If the ratio is lower than one, you are likely to experience an input pipeline bottleneck. In this case the accelerator will remain idle while it waits for input data, and precious computation cycles will be wasted. This is an undesirable situation which we have expanded on in a previous post.

The likelihood that you will encounter a bottleneck in the data preparation pipeline increases when training on TPU due to the high speed at which it consumes data. Even if your input pipeline does not include heavy processing, the sheer volume of data being parsed, shuffled, and batched can easily choke the CPU resources. Offloading to auxiliary CPUs by using the tf.data.service might provide some relief. (See this example for how to use it with TPUs.) However, it will not help in all situations, and you may need to resort to more creative solutions, including tuning the number of processes assigned to different portions of the pipeline, changing the format and/or precision of your data, or moving computations onto the TPU.
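
A rough sketch of offloading a heavy pipeline to a tf.data.service deployment is shown below; the dispatcher address is a placeholder (you need to launch the dispatcher and worker processes on auxiliary VMs yourself), and build_dataset is a hypothetical function returning the heavy input pipeline.

import tensorflow as tf

DISPATCHER = 'grpc://10.0.0.10:5000'  # placeholder dispatcher address

dataset = build_dataset()  # hypothetical heavy input pipeline
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode='parallel_epochs',
        service=DISPATCHER))
dataset = dataset.prefetch(tf.data.AUTOTUNE)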

Increasing steps_per_execution

Recently, TensorFlow added the steps_per_execution argument to the tf.keras.Model.compile input parameters. This setting controls the number of training steps to run on each call to the internal train function. Increasing this number can reduce communication overhead between the TPU and the host and ultimately lead to increased performance. One thing to keep in mind, though, is that this will impact the intervals at which your training callback functions are entered. For example, if your callback class keeps track of the number of batches that were fed during training, then this value should be incremented by steps_per_execution rather than 1 each time an on_batch function is called. See the documentation for more details.
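
The sketch below illustrates both the compile setting and the callback bookkeeping adjustment; model and dist_dataset are assumed to have been created as in the earlier sketches, and the step counts are arbitrary.

import tensorflow as tf

STEPS_PER_EXECUTION = 50

class BatchCounter(tf.keras.callbacks.Callback):
    """Toy callback illustrating the bookkeeping adjustment."""
    def __init__(self):
        super().__init__()
        self.batches_seen = 0

    def on_train_batch_end(self, batch, logs=None):
        # This is now called once per execution, i.e. once every
        # STEPS_PER_EXECUTION training steps.
        self.batches_seen += STEPS_PER_EXECUTION

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              steps_per_execution=STEPS_PER_EXECUTION)
model.fit(dist_dataset, epochs=5, steps_per_epoch=1000, callbacks=[BatchCounter()])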

The value you will derive from a custom AI chip will be determined largely by your success in optimizing your model to use it efficiently. Such optimizations take time and effort, and should be planned for accordingly.

Step 4 — Tuning Your Model to Converge on TPU

If you have reached this step, then you have already succeeded in running your model on the dedicated AI ASIC at a speed that you find satisfactory. You have likely made some changes to your model in order to reach this step. You may have replaced a few operators, reformatted your data, increased your batch size, or applied mixed precision. The last step is to verify that your model converges. This may require tuning your model hyperparameters, adjusting your learning rate, or replacing your optimizer.

This last step is required even if you have made no changes at all. This is because different HW accelerators are implemented differently, potentially leading to numerical differences in their behavior. Convergence on one ASIC does not guarantee convergence on another. For example, the high performance of TPUs is attributed in part to their use of the lower precision floating point type, bfloat16 (see here). Your model may be sensitive to this drop in precision, in which case you will need to retune it to converge.

Batch Size Increase

In order to make the most of the TPU, it is possible that you will need to increase your batch size beyond what you are accustomed to. Your model might be sensitive to the training batch size, and tuning it to converge (on the same number of data traversals) may pose a significant challenge. Check out a previous blog post of ours for more details on this topic.
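
One common heuristic (by no means guaranteed to work for every model, and not a substitute for proper tuning) is to scale the learning rate with the batch-size increase, often combined with a warmup period. The numbers below are placeholders.

import tensorflow as tf

BASE_BATCH = 256     # batch size at which the base learning rate was tuned
BASE_LR = 1e-3
TPU_BATCH = 1024     # larger global batch size used on the TPU

# Linear scaling rule: grow the learning rate in proportion to the batch size.
scaled_lr = BASE_LR * TPU_BATCH / BASE_BATCH
optimizer = tf.keras.optimizers.Adam(learning_rate=scaled_lr)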

Summary

Converting your model to run on a custom AI chip can lead to meaningful savings in training cost. However, it might require a non-trivial effort. In this post we have focused on training on TPU. There are a wide range of challenges you might encounter during this effort. We have covered just a few examples. Be sure to refer to the wealth of online documentation for additional details.


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.