This past October, AWS announced the arrival of the Amazon EC2 DL1 instance type. Powered by eight Habana Gaudi accelerators, DL1 is the first AWS instance type to include dedicated AI accelerators that are not GPUs. Named for the renowned Catalan architect Antoni Gaudí, Habana Gaudi is a novel AI ASIC designed from the ground up for deep learning workloads, with the potential for increased resource utilization and reduced training cost. Indeed, the DL1 instance was released to the world with the promise of "up to 40% better price performance than the current generation of GPU-based instances".
In this blog post we will evaluate the DL1 instance and demonstrate some of its unique properties. This is a sequel to a previous post in which we discussed the potential of dedicated AI chips and some of the challenges of adopting them. There we recommended breaking down the task of migrating your training application to a new AI chip into four steps:
- High level compatibility analysis: Get an early assessment for whether the properties of your workload align with the chip specifications and the supporting software stack.
- Adjusting your model to run on the new chip: You may need to make some adjustments to your model such as replacing operations that are not supported by the dedicated AI chip.
- Optimizing the runtime performance on the new chip: In order to take full advantage of the chip, you will want to analyze and maximize its utilization.
- Tuning the model to converge on the new chip: Some modifications to the model hyperparameters may be required in order to ensure timely convergence.
These steps are described in detail in our previous post. In this post we will follow these steps in our evaluation of the DL1 instance.
This blog post and the code snippets we include are based on the most recent software stack available at the time of this writing, version 1.2.0. Given the relative novelty of the Habana Gaudi offering, new version releases are likely to include important enhancements and optimizations. It is essential that you use the most up to date software stack available and make sure to reevaluate some of the statements we make, accordingly. While we will focus on TensorFlow version 2.7, most of what we write will be just as relevant for other machine learning frameworks supported by Habana Gaudi.
1. High Level Compatibility Assessment
The intention of this step is to collect as much published information as possible in order to assess whether your training needs are addressed by the DL1 offering. This includes online resources such as the following:
- System Architecture Specifications: The DL1 hardware details along with the Habana Gaudi architecture guide should give you a general idea of the machine’s training capabilities. In particular, you can verify that the memory, computation, and other hardware resources are compatible with your needs.
- Software Documentation: Habana Gaudi comes with a comprehensive software stack called the SynapseAI Software Suite. The developer documentation includes details of the supported frameworks and versions, API limitations, and more. There are several user guides that demonstrate how to use the many features of the API suite. Use the software documentation to verify that the supported frameworks, versions, and operations meet the needs of your machine learning project.
- Benchmarking Reports: Habana has shared performance benchmark results on a wide variety of popular model architectures. You can compare these to the performance results of other AI accelerator vendors. You can also check out MLPerf, a popular benchmark suite for AI training that compares the performance of multiple AI accelerators, including Habana Gaudi, on a variety of workloads. A summary of the latest MLPerf report (at the time of this writing) can be found here. The benchmark results can give you an idea of the types of models on which Gaudi excels. However, as we cautioned in our previous post, unless the model you are training is identical to one of the models in the benchmark report, predicting the performance of your own model from the reported benchmarks may not be so easy, since small changes to a model can have a meaningful impact on its runtime performance.
As with any other novel hardware offering, there is no better way to get a true feel for DL1's capabilities than to go ahead and start using it. Yes, you run the risk of investing in a potential dead end, but it is our belief that the knowledge and skills you develop along the way will serve you well even if you do not end up training your current model on a DL1.
In the bullets below we summarize some of the main highlights of the Gaudi-based DL1 offering compared to other AI accelerators and training instances, followed by some potential concerns. These are based on our own personal impressions; the list is by no means comprehensive and should not be viewed as a replacement for the official documentation.
- Heterogeneous architecture: A single Gaudi core, sometimes referred to as an HPU (Habana Processing Unit), comprises a cluster of Tensor Processing Cores (TPCs) together with a configurable matrix math engine (GEMM). The GEMM engine excels at matrix multiplication, while non-linear and elementwise operations run on the TPCs. This heterogeneity enables Gaudi to reach high efficiency on a large variety of workloads; maximum utilization is achieved by balancing the load between the two types of resources effectively.
- High scale training: The architecture design places particular emphasis on the speed of data throughput between Gaudi processors. This allows Gaudi to demonstrate exceptional, near-linear results when scaling training to multiple cores.
- Framework support: The SynapseAI APIs include support for both TensorFlow and PyTorch, two of the most popular machine learning frameworks in use today, as well as for Horovod, a popular framework for distributed training. These offerings make model creation for Gaudi extremely accessible to the modern-day machine learning developer.
- Rich model garden: The Habana SW offering includes a wide variety of reference models – implementations of popular models that have been ported and optimized for running on Gaudi.
- Custom kernel creation: Contrary to some other dedicated AI ASICs, the SynapseAI SW suite includes tools for implementing custom Gaudi (TPC) kernels. Similar to the CUDA toolkit for GPU kernel development, this capability empowers users to design, develop, and optimize low-level operations that are specifically tuned to their workload needs.
- Run parallel trials: The SW suite supports running parallel workloads on disjoint subsets of the eight underlying Gaudi accelerators on the DL1 instance. One way to do this is to use a training orchestrator such as Kubernetes. We will demonstrate how to utilize this capability to parallelize trials for hyperparameter tuning.
- CPU to Gaudi compute ratio: In a standard machine learning project, computation is divided between the CPU resources and the AI accelerators. Typically, the input data preprocessing pipeline runs on the CPUs while the model computation graph (the forward-backward pass) runs on the AI accelerators. Ideally, all of the training resources would be fully utilized, but maximizing utilization matters most for the AI accelerators, which are generally the most powerful and expensive resources in the system. In some cases you may find that the CPUs cannot keep up with the speed of the AI accelerators, leading to a CPU bottleneck and underutilization of the AI accelerators. The likelihood of a CPU bottleneck is determined by the ratio between the overall CPU compute power and the overall accelerator compute power. The DL1 instance has a relatively high CPU to Gaudi compute ratio, reducing the likelihood of a CPU bottleneck and accelerator underutilization. Case in point: the DL1 instance contains the same CPU compute power as the p4d.24xlarge EC2 instance (96 second-generation Intel Xeon Scalable CPU cores), even though the p4d's AI accelerator, the A100, is considered to be more powerful than Habana Gaudi.
- Batch size flexibility: Contrary to other AI accelerators, which may require training with particularly high batch sizes in order to take full, price-efficient advantage of the hardware, Habana Gaudi is able to achieve high utilization across a wide spectrum of batch sizes. This makes Gaudi a viable option for models that do not scale well to large batch sizes.
Here are some points that you should take into consideration:
- API limitations: Make sure to read up on the limitations of the SynapseAI software suite. As clearly stated in the documentation at the time of this writing, "not all models are supported yet on Gaudi". If you are not able to compile your model on DL1, you can try to tweak your model to align with the API support or explore the option of creating custom operations. Alternatively, you can simply report your findings and track future SynapseAI releases.
- Eight accelerators per DL1 instance: At the time of this writing, the only Gaudi-based AWS instance offering includes eight accelerators. This means that to fully utilize the system you need to either run parallel experiments or distribute your training across all of the accelerators. If neither of these options is relevant for you, then the DL1 instance is probably not your best choice.
- Price performance: You may find that your model is supported but that your initial trials do not result in the price performance that you expect. In this case you can use the Habana optimization guide, performance profiling guide, custom op creation guide, and other resources to increase performance. If despite all your efforts you are not able to reach sufficient price performance, your best option might be to report your findings and wait for an updated version of the SynapseAI suite.
2. Adapting Your Model to Run on DL1
In this section we will discuss some of the steps that are required to get your model up and running on a DL1 instance. These are based on the official Habana documentation, particularly the TensorFlow migration guide. See there for more details.
System Setup
There are multiple ways to start up an Amazon EC2 instance and to set up a DL1 runtime environment. The best option for you will depend on your/your organization’s overall cloud architecture.
Loading the Habana Module
Adapting your TensorFlow model to run on Habana Gaudi requires just two lines of code, as shown in the code snippet below, taken from the Habana TensorFlow Hello World example.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
When running the script, make sure to use the appropriate Python executable. This depends on your setup of choice, as documented here.
Examine Device Placement
The first thing you will want to do when running your script is to verify that your model actually runs on the Gaudi accelerator. The Gaudi runtime environment includes the hl-smi tool, which reports the resource utilization of the eight Gaudi cores and is similar in look and feel to the nvidia-smi tool for GPUs. You can use this tool to verify that running your training script increases the memory and compute utilization of the first Gaudi core. In addition to asserting that the Gaudi core is being used, you will also want to make sure that all of the operations in your training computation graph are running on Gaudi and not on the CPU. Operations that are not supported by Gaudi are offloaded onto the CPU, which can introduce high transactional overhead and reduce your training speed significantly. One way to examine device placement is with the tf.debugging.set_log_device_placement function. When called with True, this routine generates a log with the device placement of all of the TensorFlow ops in your program. Gaudi cores are registered as 'HPU' devices in TensorFlow. If all of your training ops are assigned to 'HPU', you are in good shape. If any of your training ops are assigned to 'CPU', you may want to make adjustments to your computation graph, as we discuss in the next subsection. An alternative method for analyzing op placement is documented here.
Example – Using an unsupported data type: In the code snippet below we have added a call to tf.math.argmax followed by tf.equal.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
tf.debugging.set_log_device_placement(True)

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Lambda(lambda x:
        tf.where(tf.expand_dims(tf.equal(
                     tf.math.argmax(x, axis=-1), 2), -1),
                 x,
                 tf.math.square(x)))])
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
The default output of tf.math.argmax is of type tf.int64. However, as of the time of this writing, int64 is not in the list of data types supported by the HPU. As a result, the tf.equal operation will run on the CPU. The device placement debug log will include the following lines:
sequential/lambda/ArgMax: (ArgMax): /job:localhost/replica:0/task:0/device:HPU:0
sequential/lambda/Equal: (Equal): /job:localhost/replica:0/task:0/device:CPU:0
In this toy example the fix is to simply set the output_type of tf.math.argmax to tf.int32.
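For reference, here is the corrected Lambda layer from the snippet above; with output_type set explicitly, both the ArgMax and the Equal ops should now be placed on the HPU:

tf.keras.layers.Lambda(lambda x:
    tf.where(tf.expand_dims(tf.equal(
                 tf.math.argmax(x, axis=-1, output_type=tf.int32), 2), -1),
             x,
             tf.math.square(x)))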
Model Adjustments
Training your model on an HPU may require some changes to your model, and the required adjustments will vary in complexity. In some cases they will be as simple as specifying an underlying datatype, as in the example above. In other cases you may need to modify your data flow or replace operation sequences. For example, the release notes for SynapseAI version 1.2.0 include the limitation that "control flow ops such as tf.cond and tf.while_loop are currently not supported on Gaudi". Another example of an operation that, as of the time of this writing, is unsupported on Gaudi is tf.numpy_function, a routine that allows one to include arbitrary Python code in the computation graph (e.g. for metric calculation) and is sometimes used to bypass limitations imposed by the native TensorFlow API. If your model includes such ops, you will need to design an alternative flow or accept the performance penalty of having them run on the CPU; one common replacement pattern is sketched below.
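As an illustration (our own sketch, not taken from the Habana documentation), consider a scalar conditional implemented with tf.cond. When both branches are cheap to compute, it can often be rewritten as a branch-free tf.where selection that avoids the unsupported control flow op altogether:

import tensorflow as tf

def scale_large_loss_with_cond(loss, threshold=1.0):
    # control flow version: tf.cond is currently unsupported on Gaudi
    return tf.cond(loss > threshold,
                   lambda: loss / 2.0,
                   lambda: loss)

def scale_large_loss_with_where(loss, threshold=1.0):
    # branch-free alternative: both branches are computed and
    # tf.where selects between them, keeping the op on the accelerator
    return tf.where(loss > threshold, loss / 2.0, loss)

Note that tf.where evaluates both branches, so this rewrite is appropriate only when neither branch has side effects or is prohibitively expensive.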
Three important resources that you will need for making model adjustments for Gaudi are the list of TensorFlow ops that are supported on HPU, the model reference catalog, and the custom kernel creation guide.
Supported TensorFlow ops: Use this document to look up Gaudi support for the operations in your graph.
Model reference catalog: The Habana Gaudi offering includes reference implementations for a wide variety of common machine learning model architectures. These implementations have already been adapted and tuned for running on Gaudi. If you are using one of the implemented architectures, you should count yourself lucky. But even if you are working on a different model architecture, the model reference catalog may include computation layers or computation blocks that you may find useful. For example, if you are working on a transformer architecture, you would be well advised to take a look at the Gaudi specific implementation of the transformer block and either use it as is or derive insights into making appropriate adaptations to your own transformer block.
Custom kernel creation: One of the properties of Habana Gaudi that differentiates it from some of the other dedicated AI ASICs on the market is its programmability by the end user. Habana offers comprehensive guides on creating custom HPU Kernels and wrapping them with TensorFlow operators. Also check out this detailed example and this video tutorial.
Preprocessing pipeline: It should be noted that while porting your model to Gaudi may require changes to the training computation graph that runs on the accelerator, it does not require adjustments to the pre-processing pipeline that runs on the CPU cores. This is contrary to some other AI accelerators as we have seen in the past.
Distributed Training on DL1
Of course, to fully utilize the DL1 resources, it is not enough to port your model to run on a single HPU; you will want to take advantage of all eight HPUs. One way to do this is to train on all eight HPUs in parallel using data-distributed training. The Habana Gaudi SW stack offers two mechanisms for implementing distributed training: the first uses a Habana Gaudi specific implementation of the popular Horovod framework, and the second uses a custom implementation of the tf.distribute.Strategy API. The distributed training guide includes details of both options. The advantage of choosing the Horovod option is that if you have already implemented distributed training with Horovod for GPUs, no code changes are required to run on HPUs; all you need to do is verify appropriate installation of the habana-horovod package. In fact, Horovod support is one of the advantages of the Habana offering compared with other custom AI ASIC offerings.
Note that as of the time of this writing it is important to adhere to a specific order of import commands as demonstrated in this snippet of code taken from the Habana Gaudi documentation.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

# load_habana_module() must be called before importing horovod
load_habana_module()

import horovod.tensorflow.keras as hvd

# initialize Horovod
hvd.init()
The Horovod framework can also be used to train on multiple DL1 instances in parallel.
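Putting the pieces together, here is a minimal sketch of what data-distributed training on the eight HPUs might look like with the Keras flavor of Horovod. The model and dataset are the toy examples from above, and the Horovod calls follow the standard horovod.tensorflow.keras API:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()
import horovod.tensorflow.keras as hvd

hvd.init()

# shard the input data so that each of the eight workers
# sees a different slice of the dataset
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shard(hvd.size(), hvd.rank()).batch(128)

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10)])

# scale the learning rate by the number of workers and wrap the
# optimizer so that gradients are averaged across the HPUs
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size())
optimizer = hvd.DistributedOptimizer(optimizer)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# broadcast the initial variable state from rank 0 to all workers
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
model.fit(dataset, epochs=5, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

A script like this would typically be launched with one process per HPU, e.g. horovodrun -np 8 python train.py.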
3. Optimizing Your Model Performance on DL1
At this point you should be able to successfully run a training cycle on a DL1. Next comes the critical step of performance analysis and optimization. As we stressed in our previous post, an AI accelerator is only as good as the tools it provides for performance analysis and optimization. If you are not able to analyze and optimize performance, you will not be able to make the most of the AI chip.
Habana provides three important resources for performance analysis and optimization: a list of best practices, a Performance Optimization Guide, and a Performance Profiler. These guides should be studied in detail and referred to regularly. We will provide some brief commentary on each.
Best Practices for Training on Gaudi
This page includes general (framework agnostic) guidelines for training on Gaudi. Although the list is compact (just seven points at the time of this writing), each item can have a meaningful impact on model performance. Two items warrant mention:
- Dynamic shapes: The use of operators that return tensors of undetermined size is discouraged. See our previous post, in which we demonstrated how to replace the use of one such function, tf.boolean_mask. (A brief sketch of this pattern appears after this list.)
- Tensor shapes: Some of the items recommend that the choice of tensor shapes (such as the batch size and the number of features/channels) adhere to certain formulas. This is not unusual for a dedicated AI accelerator (or for any chip, for that matter). As we mentioned in section 1, other AI chips require the use of large batch sizes to maximize utilization. In this respect Gaudi provides the user with greater freedom/flexibility.
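As a quick reminder of that pattern, here is a minimal sketch (our own illustration) of replacing tf.boolean_mask, whose output size depends on the mask contents, with a fixed-shape tf.where equivalent:

import tensorflow as tf

values = tf.random.uniform((128,))
mask = values > 0.5

# dynamic shape: the size of the masked tensor depends on the mask
dynamic_sum = tf.reduce_sum(tf.boolean_mask(values, mask))

# static shape alternative: zero out the unmasked entries and sum
static_sum = tf.reduce_sum(tf.where(mask, values, tf.zeros_like(values)))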
Performance Optimization Guide
This page focuses on optimization guidelines for the TensorFlow framework. One of the recommendations is to take advantage of Gaudi's built-in support for bfloat16 using TensorFlow's mixed precision APIs. Using low precision (16-bit) floats during training can potentially reduce both memory utilization and the training step time. There are two 16-bit floating point formats, float16 and bfloat16, with bfloat16 having a number of properties that make it the preferred format for machine learning. Note that while the reduced memory utilization of mixed precision is all but guaranteed, the reduced step time, as well as the model's ability to converge, should be verified.
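As a minimal sketch, assuming the standard Keras mixed precision API behaves on HPU as documented, enabling bfloat16 amounts to setting a global policy before building the model:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module
load_habana_module()

# compute in bfloat16 while keeping variables in float32
tf.keras.mixed_precision.set_global_policy('mixed_bfloat16')

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10)])

As noted above, be sure to verify both the step time improvement and the model's convergence after making this change.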
Performance Profiler
The Profiler User Guide contains extensive documentation on the SynapseAI Profiler including its setup, its execution, and tools for analysis. Also available is this helpful video tutorial. As we have discussed in detail in previous posts (such as here and here), profiling the performance of your training is crucial for maximizing the utilization of your training resources, accelerating your training, and reducing training costs.
The primary artifact of the Habana performance profiler is the profiling graph. Similar to the TensorBoard Trace Viewer, this graph shows a timeline of the events that occurred on the different system resources, in particular the DMA, the MME, and the TPCs. Here are a few examples of resource usage patterns you might encounter and what can be learned from them:
- Bottleneck on the data input pipeline: Large gaps between training steps on the HPU could indicate that the HPU is sitting idle while it waits for training data from the CPU. In this case you should work on optimizing your data input pipeline (see here, and the brief sketch after this list).
- Offloading operations onto the CPU: HPU idleness in the middle of a training step, coupled with heightened DMA activity, could indicate that some of the graph operations are being offloaded onto the CPU. In this case you should reexamine the device placement of your graph operations.
- Imbalance between MME utilization and TPC utilization: You may find periods during the training step where the MME is idle while the TPCs are very busy, or vice versa. In this case you can try to reduce the step time by improving the load balancing between the two resources. This can be done by programming/designing equivalent device-specific kernels, as suggested here.
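For the first pattern, the standard tf.data remedies apply. Below is a minimal sketch (our own illustration, with a hypothetical parse_fn preprocessing function) of parallelizing the CPU-side work and overlapping it with HPU compute:

import tensorflow as tf

def parse_fn(raw_record):
    # hypothetical per-record preprocessing
    return tf.io.parse_tensor(raw_record, out_type=tf.float32)

dataset = tf.data.TFRecordDataset('train.tfrecords')
# parallelize per-record preprocessing across the CPU cores
dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(128)
# prefetch so that CPU preprocessing overlaps with HPU compute
dataset = dataset.prefetch(tf.data.AUTOTUNE)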
At the time of this writing use of the Habana Gaudi performance profiler requires Gaudi specific configuration steps and tools. Our expectation is that upcoming releases will include refinements to the profiler usage including its full integration into the TensorFlow profiling APIs and TensorBoard.
4. Tuning Your Model to Converge on DL1
At this point your model has been adapted and tuned to your satisfaction and you are ready to train. You may have needed to make some changes to your model that require re-tuning of your hyperparameters to ensure model convergence. Such changes might include replacing certain ops, changing control flows, or changing underlying data types. Even if you have not made any changes to your model, you should ensure that your training converges on the new AI ASIC. This is due to the fact that different hardware accelerators are implemented differently and are likely to exhibit small numerical differences in their behaviors. Convergence on one ASIC does not guarantee convergence on another.
Example – Hyperparameter Tuning on DL1
In this example we demonstrate how to utilize the eight HPU cores to run eight parallel experiments in the context of hyperparameter tuning. Hyperparameter tuning refers to the task of searching for the optimal hyperparameters for your model. Ray Tune is a popular Python library for automating hyperparameter tuning that supports a wide variety of state-of-the-art optimization algorithms. While the default release recognizes only CPUs and GPUs as possible "training resources", it can be extended to use HPUs as well. In the code snippet below, which is based on Ray Tune's documented mnist example, we demonstrate a rather simple way of doing this by registering the HPUs as GPUs. Two changes are required:
- Explicit registration of eight GPUs via the ray.init command. This is required since the current release of Ray does not recognize the HPU accelerators.
- Setting the HABANA_VISIBLE_DEVICES environment variable according to the value of the CUDA_VISIBLE_DEVICES environment variable upon entry to the train function. This ensures that each trial process runs on a separate HPU.
import os
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.integration.keras import TuneReportCallback


def train_mnist(config):
    # map the GPU index assigned by Ray to an HPU index
    os.environ['HABANA_VISIBLE_DEVICES'] = os.environ['CUDA_VISIBLE_DEVICES']
    import tensorflow as tf
    from habana_frameworks.tensorflow import load_habana_module
    from tensorflow.keras.datasets import mnist
    from filelock import FileLock
    load_habana_module()
    with FileLock(os.path.expanduser("~/.data.lock")):
        (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(10)])
    loss = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True)
    optimizer = tf.keras.optimizers.SGD(learning_rate=config['lr'])
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5, batch_size=128,
              verbose=0, validation_data=(x_test, y_test),
              callbacks=[TuneReportCallback({
                  "mean_accuracy": "accuracy"})])


def tune_mnist(num_training_iterations):
    sched = AsyncHyperBandScheduler(
        time_attr="training_iteration", max_t=400, grace_period=20)
    # explicitly init ray with the number of accelerators set to 8
    ray.init(num_gpus=8)
    analysis = tune.run(
        train_mnist,
        name="exp",
        scheduler=sched,
        metric="mean_accuracy",
        mode="max",
        stop={
            "mean_accuracy": 0.9,
            "training_iteration": num_training_iterations
        },
        num_samples=8,
        resources_per_trial={
            "cpu": 12,
            "gpu": 1
        },
        config={
            "lr": tune.uniform(0.001, 0.1),
        })
    print("Best hyperparameters found were: ", analysis.best_config)


if __name__ == "__main__":
    tune_mnist(num_training_iterations=1000)
Summary
The availability of a new training instance option is always exciting news, all the more so when it is based on a dedicated AI ASIC. The Habana Gaudi offering, which powers the DL1 instance, appears to have all the makings of a worthy alternative to other AI accelerators on the market today. In particular, its accompanying software stack provides the user with a great deal of flexibility in designing and optimizing machine learning workloads. At the same time, it is important to remember that Habana Gaudi is relatively new and should be approached with an appropriate mindset: reaching optimal results may require patience and resilience, but the potential reward makes the effort worthwhile. On our own models, the increase in price performance met and even exceeded the published 40% mark.
This post has covered just a few aspects of training on the DL1 instance. Be sure to refer to the wealth of online documentation for additional details.
During the research for this blog post I discovered that "Gaudi" means "fun" in German. I cannot think of a better way to describe the experience I have had with DL1 thus far. All I can hope for is that you too have "Gaudi" with your Gaudi.