A First Look at AWS Trainium

Harnessing the Power of Dedicated DNN Training Chips — Part 3

Chaim Rand
Towards Data Science


Photo by Balazs Busznyak on Unsplash

After much anticipation, Amazon EC2 TRN1 instances are now available for public use. Based on the AWS Trainium chip, the TRN1 instance type was specifically designed for accelerating deep learning training. This post is the third part of a series on training deep learning models on dedicated AI chips. In our previous posts we covered Google’s Cloud TPU and Amazon’s EC2 DL1 (powered by Habana Gaudi accelerators). As with other custom AI chips, Trainium offers the potential for significant cost savings (up to 50% according to the documentation). However, also similar to other custom AI chips, not all models are a good match for Trainium, and some may require adaptations in order to realize the potential savings. In this post we will evaluate the TRN1 instance, demonstrate some of its unique properties, and discuss some of the challenges presented by its use. As in our previous posts, we will break down the task of adapting a model to train on TRN1 into four steps. See here for more details on these steps.

  1. High level compatibility analysis: Get an early assessment for whether the properties of your workload align with the chip specifications and the supporting software stack.
  2. Adjusting your model to run on the new chip: You may need to make some adjustments to your model such as replacing operations that are not supported by the dedicated AI chip.
  3. Optimizing the runtime performance on the new chip: In order to take full advantage of the chip, you will want to analyze and maximize its utilization.
  4. Tuning the model to converge on the new chip: Some modifications to the model hyperparameters may be required in order to ensure timely convergence.

Training on AWS Trainium is enabled by the AWS Neuron SDK. This blog post and the code snippets we include are based on the most recent software stack available at the time of this writing, version 2.4.

Given the relative novelty of the Neuron SDK, new version releases are likely to include important enhancements and optimizations. It is essential that you use the most up-to-date release of the software stack and reevaluate the statements we make accordingly. Although we will emphasize that certain statements are true only as of the time of this writing, the same qualification applies to everything in this post. We will focus on the Neuron SDK’s PyTorch offering (version 1.11). However, much of what we write will be just as relevant for other machine learning frameworks running on Trainium.

Before getting down to business, there is one important feature that we would like to highlight regarding the types of TRN1 instance offerings.

TRN1 Instance Type Offerings

AWS offers two TRN1 instance types, trn1.2xlarge with a single Trainium chip and trn1.32xlarge with 16 Trainium chips. (Each Trainium chip consists of two cores, referred to as NeuronCores.) This dual offering is not to be taken for granted, as instance types that are based on other core architectures sometimes offer only a single, multi-chip option. For example, at the time of this writing the Amazon EC2 p4d family only includes instances with eight NVIDIA A100 GPUs. There are two major implications to the availability of the single-chip trn1.2xlarge design. The obvious one is its suitability for smaller training workloads for which a 32-core solution is unnecessary or undesired. The second, no less important, implication pertains to our discussion: the smaller trn1.2xlarge instance is the perfect environment for evaluating, adapting, and tuning our models for the Trainium chip. When we tune our models to run in the cloud on NVIDIA A100 GPUs, we have no choice but to do this on an expensive eight-GPU instance, even though a single GPU would have sufficed. In contrast, with Trainium, we can do much of our analysis, experimentation, and exploration on the cheaper single-chip instance before migrating to one or more of the larger instances for full-scale training.

1. High Level Compatibility Assessment

The first step is to collect as much published information as possible in order to assess whether your training needs are addressed by the Trainium offering. During this assessment it is important to distinguish between the capabilities of the Trainium hardware and the capabilities that are supported by the Neuron SDK. As we will see, Trainium supports a number of capabilities, features, and operations that, as of the time of this writing, have not been exposed by the supporting SW stack. If your ML project requires one of these, then you may want to hold off on your Trainium evaluation and keep track of the SW stack development on the public AWS Neuron Roadmap page.

The primary resource for your compatibility assessment should be the official AWS Neuron Documentation. The AWS Neuron SDK and its accompanying documentation support both AWS Trainium and AWS Inferentia, with many of the features supported for one but not the other (at the time of this writing). Each page of the documentation includes a label stating whether it is relevant to Trn1, Inf1, or both. Pay careful attention to these labels.

Model Compatibility

A good place to start is the Model Architecture Fit section. Here you can find a table summarizing the degree to which the Trainium HW and current SW stack support a number of popular models. For a more detailed review of the HW, check out the Trainium Architecture and NeuronCore Architecture pages. These should give you an idea of the HW’s training capabilities and whether the memory, computation, and other hardware resources meet your HW needs. Next, review the software documentation to verify that the supported frameworks, versions, and operations meet your SW needs. The Neuron SDK Release Notes should provide you with an overview of the current SW support. The SDK supports several popular training frameworks, although the level of support varies quite a bit (at the time of this writing). In this post we will focus on the PyTorch support. Be sure to check out the PyTorch Developer Guide for a general review of the PyTorch mode of use as well as the list of supported PyTorch operators. You can keep track of the PyTorch related SDK updates in the PyTorch Neuron Release Notes. Also be sure to check out the examples in the Neuron GitHub repository.

Performance Benchmarks

The TRN1 Performance page provides performance benchmarks for a number of popular deep learning models. These can be compared against the public performance data of other AI accelerators (e.g., NVIDIA GPU) to get a general idea of the potential savings opportunity of Trainium. The list of public benchmarks is limited (at the time of this writing) to NLP models. Caution is advised when trying to predict the performance of other models based on these results as even small changes to the model or to the training environment can impact the performance quite a bit. We recommend running your own performance comparison tests before making any decisions.

MLPerf is a popular benchmark suite for AI training that compares the performance of multiple AI accelerators. However, as of the time of this writing, the latest results do not include Trainium based submissions.

First Impressions

In the bullets below we will summarize some of our own personal impressions of the current TRN1 offering. This list is by no means comprehensive and should not be viewed as a replacement for the official documentation.

  • Heterogeneous architecture: Each NeuronCore combines the power of four compute engines (Tensor/Vector/Scalar/GPSIMD), enabling it to reach high efficiency on a large variety of workloads.
  • High scale data distributed training: The architecture design, including NeuronLink for high-bandwidth connections between the Trainium chips within an instance and EFA support for connections between instances, allows Trainium to demonstrate near-linear results when scaling to highly distributed training.
  • Framework support: The current version (2.4) of the Neuron SDK supports training on Trainium using the PyTorch framework. TensorFlow support is under development. The Neuron SDK PyTorch support is exposed via the PyTorch/XLA library, where each NeuronCore is an XLA device. There are a number of implications to the use of PyTorch/XLA rather than the standard PyTorch framework; the most notable of them being the use of lazy tensors rather than eager tensors. In a previous post we expanded on the topic of training with PyTorch/XLA and noted some of its advantages and limitations. The PyTorch/XLA API is implemented by the torch-neuronx Python package which, at the time of this writing, is aligned with version 1.11 of the PyTorch/XLA API.
  • Operator support: The Neuron SDK does not implement the full set of PyTorch/XLA APIs and usage flows. In fact, there are a number of popular operators (e.g., conv and sort), optimizers (e.g., LAMB), models, and base types (e.g., FP16) that are (at the time of this writing) pending support.
  • Custom kernel creation: One of the strengths of the Trainium chip is its support for creating custom C++ operators. Similar to the CUDA toolkit for GPU kernel development, this feature empowers users to design, develop, and optimize low level operations that are specifically tuned to their workload needs. However, as of the time of this writing, this capability isn’t (yet) exposed by the Neuron SW stack.
  • Memory pooling: Although each Trainium chip contains 32GB, the memory of all of the chips is pooled together (see here). In particular, this means that you can choose to enable a subset of the available Trainium chips but still utilize the memory of all of the chips. For example, on a trn1.32xlarge you have the option to distribute your model across 32 workers with 16GB each, 8 workers with 64GB each, 2 workers with 256GB each, or a single worker with all 512GB (see here). However, you should weigh the choice of forgoing the use of any of the chips carefully and opt for other alternatives (e.g., model sharding) whenever possible. A minimal configuration sketch appears after this list.
  • Model distribution: Model distribution is a common technique for training a model that is so large that it does not fit into the allocated memory of a single worker. There are a number of different strategies for model distribution, including tensor parallelism, pipeline parallelism, and fully sharded data parallel (FSDP). The Neuron SDK includes support for tensor parallelism using the Megatron-LM library. However, support for other strategies, particularly FSDP, is pending. (For a brief overview of model distributed training strategies, check out our recent post.)
  • Managed training support: The TRN1 instance family is supported by Amazon’s managed training service, Amazon SageMaker. Amazon SageMaker offers many conveniences for machine learning model development including managing the setup and configuration of training instances and their automatic termination when the training is completed. This is particularly useful when training on multiple nodes which, in addition to setting up the individual instances, also requires configuration of the internode connections.
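
To make the memory-pooling bullet above a bit more concrete, the following is a minimal sketch of how a process can be restricted to a subset of the NeuronCores using the Neuron runtime's NEURON_RT_NUM_CORES environment variable. Treat the value below as illustrative; the exact runtime configuration options and their semantics should be verified against the Neuron runtime documentation.

import os

# Minimal sketch (assumed configuration): limit this process to 8 of the
# 32 NeuronCores on a trn1.32xlarge. NEURON_RT_VISIBLE_CORES can be used
# instead to pin specific cores. Consult the Neuron runtime configuration
# documentation for the exact semantics on your SDK version.
os.environ['NEURON_RT_NUM_CORES'] = '8'

# ... build and train the model as usual; the remaining cores are left
# unused by this process.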

While the online documentation can provide a pretty good general idea of the Trainium offering, there is no better way to get a true feeling for its value than to go ahead and start using it.

2. Adapting Your Model to Run on TRN1

In this section we will review some of the steps that are required to get your PyTorch model up and running on a TRN1 instance. For more details, please see the official Neuron SDK PyTorch documentation.

TRN1 System Setup

There are a number of options for setting up a TRN1 PyTorch environment, including Amazon EC2, AWS ParallelCluster, and Amazon SageMaker. The most direct way, and also the best way to get an initial feel for TRN1, is to set up an Amazon EC2 trn1.2xlarge training instance with the latest AWS Deep Learning AMI, as documented here.

Adaptations to Training Flow

The first step in modifying your script to run on TRN1 is to adapt it to PyTorch/XLA. The required adaptations are detailed in the SDK documentation and are the same as for any other type of XLA device (e.g., TPU, GPU, or CPU). In some cases, this may be all that is required to train your model on Trainium. If so, count your blessings. If you are not as fortunate, your model compilation will fail. Note that, contrary to some other accelerators (including GPUs), unsupported operations will not be automatically offloaded to the CPU. See the PyTorch Neuron Troubleshooting Guide for the types of errors you might see.
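
To make this concrete, below is a minimal sketch of a generic PyTorch training loop adapted to PyTorch/XLA, following the flow described in the SDK documentation. The model, dataset, and loss function are placeholders assumed to be defined elsewhere; this is an illustration of the adaptation pattern, not a complete training script.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def train_on_xla(model, dataset, loss_fn, num_epochs=1):
    # place the model on the XLA (NeuronCore) device rather than a GPU
    device = xm.xla_device()
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
    loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    # MpDeviceLoader feeds the batches to the XLA device
    loader = pl.MpDeviceLoader(loader, device)
    for epoch in range(num_epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            # replaces optimizer.step(); also marks the boundary of the
            # lazily built execution graph
            xm.optimizer_step(optimizer)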

Here are a few examples of changes that may be necessary:

Replace Unsupported Data Types: Your model may include data types that are not supported by the Trainium chip or the Neuron compiler (see here). In this case, you may need to adapt your model to the use of different base types.
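
For example, a model that relies on double precision (float64) tensors would need to be cast down to a supported type such as float32. The snippet below is a simple, hypothetical sketch of this kind of change; which types are supported should be verified against the Neuron data-type documentation.

import torch

# Hypothetical module originally defined with float64 parameters.
layer = torch.nn.Linear(128, 64).to(torch.float64)

# Cast the parameters (and, correspondingly, the inputs) to float32.
layer = layer.to(torch.float32)
inputs = torch.randn(16, 128, dtype=torch.float64).to(torch.float32)
outputs = layer(inputs)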

Replace Unsupported Operations: If your model includes operators that are not supported, you will need to make adjustments to replace them. Below, we will share an example of a model in which we replaced the use of a conv layer with a bitwise-exact alternate flow. Of course, this is not always possible. In the future you will be able to create custom kernels for missing operators, but this is not yet supported.

Remove Tensors with Dynamic Shapes: At the time of this writing, support for tensors with dynamic shapes is pending. In a previous post we showed an example of how to replace a boolean mask operation with a bit-exact alternative containing fixed-sized tensors.
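
The snippet below sketches the general idea: a boolean-mask reduction, whose intermediate tensor size depends on the data, is replaced with fixed-shape arithmetic that produces the same result. This is a simplified illustration of the approach rather than the exact code from the post referenced above.

import torch

scores = torch.randn(32, 100)
mask = scores > 0

# Dynamic-shape version: the size of scores[mask] depends on the data.
# mean_dynamic = scores[mask].mean()

# Fixed-shape alternative: zero out masked entries and divide by the
# number of valid entries; every tensor keeps a static shape.
valid = mask.to(scores.dtype)
mean_fixed = (scores * valid).sum() / valid.sum().clamp(min=1)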

Multi-Core Training

Whether you are running on trn1.2xlarge or trn1.32xlarge, you should strive to maximize the utilization of all of the NeuronCores. This can be done by either running parallel experiments, each on a single core, or running data distributed training over multiple cores. See the Neuron documentation for instructions on extending your script to support data distribution.
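
The sketch below illustrates the typical additions: initializing the process group with the XLA backend and sharding the dataset so that each NeuronCore processes a distinct subset. It assumes a torch Dataset named dataset is defined elsewhere and that the script is launched with torchrun; the exact recipe should be taken from the Neuron distributed-training documentation.

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # registers the 'xla' backend

# initialize the process group (e.g., when launching with torchrun)
torch.distributed.init_process_group('xla')

# shard the dataset so that each worker (NeuronCore) sees a distinct subset
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset,
    num_replicas=xm.xrt_world_size(),
    rank=xm.get_ordinal(),
    shuffle=True)
loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)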

Example — Vision Transformer on TRN1

In the following code block we build a basic Vision Transformer (ViT) model using the timm Python package (version 0.6.11). The patch embedding portion of the default ViT includes a conv layer, which does not appear on the list of supported operators. Fortunately, the ViT constructor includes an option for passing in the patch embedding logic, enabling us to replace it with a bit-exact, conv-less alternative. (In reality, although not in the list of supported operators, the conv layer will run on TRN1. However, its performance, as of the time of this writing, is not as good as the conv-less option we propose below.)

import torch

def build_model():
    from timm.models.vision_transformer import VisionTransformer
    from torch import nn as nn
    from collections.abc import Iterable

    class LinearEmbed(nn.Module):
        def __init__(self, img_size=224, patch_size=16, in_chans=3,
                     embed_dim=768, norm_layer=None, bias=True):
            super().__init__()
            img_size = img_size if isinstance(img_size, Iterable) \
                else (img_size, img_size)
            patch_size = patch_size if isinstance(patch_size, Iterable) \
                else (patch_size, patch_size)
            self.img_size = img_size
            self.patch_size = patch_size
            self.grid_size = (img_size[0] // patch_size[0],
                              img_size[1] // patch_size[1])
            self.num_patches = self.grid_size[0] * self.grid_size[1]
            self.lin = nn.Linear(patch_size[0] * patch_size[1] * in_chans,
                                 embed_dim, bias=bias)
            self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()

        def forward(self, x):
            B, C, H, W = x.shape
            NH = H // self.patch_size[0]
            NW = W // self.patch_size[1]
            x = x.view(B, C, NH, self.patch_size[0], NW, self.patch_size[1]). \
                transpose(3, 4). \
                reshape([B, C, NH * NW,
                         self.patch_size[0] * self.patch_size[1]]). \
                transpose(1, 2). \
                reshape([B, NH * NW,
                         self.patch_size[0] * self.patch_size[1] * C])
            x = self.lin(x)
            x = self.norm(x)
            return x

    model_args = {
        "embed_layer": LinearEmbed,
    }

    return VisionTransformer(**model_args)

In the code block below we configure the script for data-distributed training, load the ViT model onto the Neuron XLA device, and train on a fake dataset for 500 steps.

from torch.utils.data import Dataset
import time, os
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

# use a fake dataset (random data)
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)
        return rand_image, label

def train():
    # Initialize XLA process group for torchrun
    import torch_xla.distributed.xla_backend
    torch.distributed.init_process_group('xla')

    # multi-processing: ensure each worker has same initial weights
    torch.manual_seed(0)

    dataset = FakeDataset()
    model = build_model()

    # load model to XLA device
    device = xm.xla_device()
    model = model.to(device)

    batch_size = 32
    optimizer = torch.optim.Adam(model.parameters())
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=batch_size,
                                              num_workers=4)

    data_loader = pl.MpDeviceLoader(data_loader, device)
    loss_function = torch.nn.CrossEntropyLoss()
    t0 = time.perf_counter()
    summ = 0
    count = 0
    for idx, (inputs, target) in enumerate(data_loader, start=1):
        inputs = inputs.to(device)
        targets = torch.squeeze(target.to(device), -1)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        loss.backward()
        xm.optimizer_step(optimizer)
        batch_time = time.perf_counter() - t0
        print(f'step: {idx}: step time is {batch_time}')
        if idx > 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if idx > 500:
            break

    print(f'average step time: {summ/count}')

if __name__ == '__main__':
    os.environ['XLA_USE_BF16'] = '1'
    train()

# Initialization command:
# torchrun --nproc_per_node=2 train.py

See also the HuggingFace ViT model (that does include a conv layer) in the repository of Neuron training examples.

3. Optimizing Runtime Performance on TRN1

If you have made it this far, it means that you have succeeded in running your script on TRN1. However, additional steps may be required to get the best performance out of the chip. As we have mentioned before, an AI chip is only as good as the tools it provides for performance analysis and optimization. Unless you are able to analyze and optimize the runtime performance, you will not be able to make the most of the AI chip. In this section we will review some of the tips and tools at your disposal for monitoring the TRN1 resource utilization, identifying performance bottlenecks, and optimizing the training workload. For more on the importance of performance profiling see here.

Monitoring Resource Utilization

The neuron-top utility is a great tool for getting an initial feel for the system resource utilization. The tool provides basic information about the memory utilization, NeuronCore utilization, and vCPU utilization. These can be used to identify basic performance issues such as: one or more idle NeuronCores, a CPU bottleneck, or underutilized system memory.

Output of the neuron-top command (Source: AWS Neuron SDK documentation)

A more detailed report of the system resource utilization (per application) can be obtained using the neuron-monitor tool. See this tutorial for an example of how to extract and monitor system metrics during a training session.

Trainium resource utilization on a Grafana dashboard (Source: AWS Neuron SDK documentation)

Profiling Performance

The PyTorch/XLA troubleshooting guide lists tools for measuring the performance of your application. These include generating and analyzing system metrics and profiling with TensorBoard. The TensorBoard profiler is a very useful tool for identifying and resolving bottlenecks in your application. In a previous post we provided a detailed review of the different sections of the profiler report and how they can be used.
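
As a rough sketch of how a trace can be captured, the PyTorch/XLA profiling API exposes a server from which TensorBoard-compatible traces can be pulled. The port number and log directory below are arbitrary choices for illustration; whether and how this flow is supported on Neuron devices should be verified against the Neuron and PyTorch/XLA documentation.

import torch_xla.debug.profiler as xp

# in the training script: start the profiler server so that traces can be
# captured while training is running
server = xp.start_server(9012)

# from a separate process (or another point in the code), capture a trace
# that can then be inspected in TensorBoard:
# xp.trace('localhost:9012', '/tmp/profile', duration_ms=10000)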

TensorBoard profiler Trace View (Source: AWS Neuron SDK documentation)

Optimization tips for training on TRN1

For maximal performance, be sure to follow the best practices defined by the SDK. Neuron includes support for bfloat16 as well as automatic mixed precision. Such methods can both reduce the memory footprint of the model and boost your step time performance. However, be sure to verify that these methods do not hurt your model convergence. See here for more details on the different Neuron floating point types and their performance trade-offs.
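
The simplest way to experiment with bfloat16 in this setting is through PyTorch/XLA environment variables, as in the training script above. The sketch below shows the two main options; their exact behavior, and the impact on convergence, should be verified against the Neuron and PyTorch/XLA documentation.

import os

# cast all floating point tensors to bfloat16 (as used in the training
# script above)
os.environ['XLA_USE_BF16'] = '1'

# alternatively, downcast float32 to bfloat16 while keeping float64 at
# higher precision
# os.environ['XLA_DOWNCAST_BF16'] = '1'

# these must be set before the XLA device is initialized
import torch_xla.core.xla_model as xm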

Optimizing model compilation

When training with PyTorch/XLA, the machine learning model is compiled to an execution graph which is optimized for the underlying XLA accelerator. Model compilation introduces quite a bit of overhead to the training flow, and it is a best practice to minimize the number of compilations that are required. A common symptom of the compilation overhead is that the first few training steps take a relatively long time (compared to the subsequent training steps and to standard PyTorch training). This overhead increases with the size of the model. The Neuron SDK includes the neuron_parallel_compile utility for reducing this overhead.
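
Usage is a matter of wrapping the training command with the utility: it runs the script in a graph-extraction mode, compiles the extracted graphs in parallel, and populates the compiler cache so that the subsequent full run reuses the compiled artifacts. The sketch below applies it to the training script from the previous section; consult the SDK documentation for the exact behavior and any caveats.

# pre-compilation run: extract and compile the graphs in parallel,
# populating the Neuron compiler cache (no meaningful training output)
# neuron_parallel_compile torchrun --nproc_per_node=2 train.py

# subsequent full training run reuses the cached compilation artifacts
# torchrun --nproc_per_node=2 train.py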

An additional technique is to preload the Neuron compiler cache. If you are running multiple experiments on multiple instances with the same model architecture and hyperparameters, then rather than recompile the model for each trial, you can compile the model once and simply copy the cache. The code block below demonstrates how to save and load the compiler cache, thus avoiding the overhead of compilation.

import tarfile
import boto3
import torch_xla.core.xla_model as xm

def save_cache():
    if xm.get_local_ordinal() == 0:
        # create tarball from cache
        tar = tarfile.open('/var/tmp/neuron-compile-cache.tar', "w")
        tar.add('/var/tmp/neuron-compile-cache', 'neuron-compile-cache')
        tar.close()
        s3_client = boto3.client("s3")
        s3_client.upload_file(Filename='/var/tmp/neuron-compile-cache.tar',
                              Bucket=<s3 bucket>,
                              Key=f'{<path-pref>}/neuron-compile-cache.tar')

def pull_cache():  # should be called after initialization of the dist object
    if xm.get_local_ordinal() == 0:
        s3_client = boto3.client("s3")
        s3_client.download_file(Bucket=<s3 bucket>,
                                Key=f'{<path-pref>}/neuron-compile-cache.tar',
                                Filename='/tmp/neuron-compile-cache.tar')
        with tarfile.open('/tmp/neuron-compile-cache.tar', 'r') as f:
            f.extractall('/var/tmp/')
    xm.rendezvous('sync after pulling cache')
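
For example, in the training script above one might pull the cache right after initializing the process group (before the first compilation is triggered) and save it once the graphs have been compiled. The placement below is illustrative; adjust it to your own flow.

def train():
    # initialize the process group as before
    import torch_xla.distributed.xla_backend
    torch.distributed.init_process_group('xla')

    # fetch a previously saved compiler cache (if one exists) before the
    # first graph compilation is triggered
    pull_cache()

    # ... build the model and run the training loop as before ...

    # once the graphs have been compiled (e.g., after the first steps have
    # completed), save the cache for reuse by other instances
    save_cache()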

4. Tuning Your Model to Converge on TRN1

At this point your model has been adapted and optimized to your satisfaction. You are now ready to train. You may have needed to make some changes to your model that require re-tuning of your hyperparameters to ensure model convergence. Such changes might include replacing certain ops, changing control flows, or changing underlying data types. Even if you have not made any changes to your model, you should ensure that your training converges on the new AI ASIC. This is due to the fact that different hardware accelerators are implemented differently and are likely to exhibit small numerical differences in their behaviors. Convergence on one ASIC does not guarantee convergence on another.

There are a number of resources at your disposal for debugging and monitoring your training behavior. The SDK provides guidance on printing tensors that enable you to debug intermediate outputs in the graph. Alternatively, you can run in eager debug mode, where each operation is compiled and executed immediately allowing you to inspect the model at different stages in the same way that you would in standard PyTorch. To monitor the training progress, you can follow the SDK instructions for logging metrics to TensorBoard.
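
As a small illustration of inspecting intermediate outputs with lazy tensors: printing a tensor (or copying it to the CPU) forces evaluation of the graph up to that point. This is a convenient debugging aid, but such evaluations can trigger additional compilations and slow training down, so they should be removed once debugging is done.

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(4, 4).to(device)
y = torch.matmul(x, x)

# with lazy tensors, y has not been computed yet; printing it (or calling
# y.cpu()) forces evaluation of the graph up to this point
print(y)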

Results

In the table below we display the runtime performance of our ViT model on different instance types. The costs were taken from the Amazon EC2 product details for p4, g5, and trn1. The same tests can also be performed on Amazon SageMaker. (See here for SageMaker pricing details.)

Performance comparison of ViT model (lower is better) — By Author

The best price performance was observed on the dual-NeuronCore trn1.2xlarge instance type. However, the model, in its current form, did not scale well: when moving to the trn1.32xlarge the step time increased by roughly 27%. Note that these comparative results are extremely dependent on the model details and are likely to vary greatly across ML projects. In addition, given the incremental improvements of the SW stack, these results are likely to vary based on the Neuron SDK version.

Update (August 8, 2023) — Over the past few months there have been significant improvements to the Neuron SW stack. Please see this post for more up-to-date performance numbers.

Summary

With the release of Trainium, AWS continues to extend its portfolio of dedicated training instances, providing customers with greater variety and opportunities for cost optimization. The TRN1 instance family is particularly intriguing given that its design was tailored for deep learning. At the same time, due to the novelty of the HW architecture and the supporting SW stack, using Trainium should be approached with an appropriate mindset. Reaching optimal results may require patience and resilience. But, hopefully, the reward will outweigh the effort. To paraphrase a popular quote: “The very best journeys start with a ride on an AWS Trainium”.

This post has covered just a few aspects of training on the TRN1 instance family. Be sure to refer to the wealth of online documentation for additional details.


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.