Accelerating PyTorch Training Workloads with FP8

How to make the most of your modern-day GPU

Chaim Rand
Towards Data Science


Photo by Deva Darshan on Unsplash

The past few years have seen revolutionary advancements in the field of AI, perhaps best exemplified by the recent popularity and proliferation of LLM-based applications such as ChatGPT. These breakthroughs have been powered by equally exciting developments in the machinery used to train AI models. New and innovative architectures, sophisticated tensor processing cores, and dedicated HW accelerators have enabled the convergence of ever-larger AI models at ever-faster rates. In this post, we will focus on one particular advancement in AI-specialized HW — the inclusion of dedicated 8-bit floating-point (FP8) tensor processing cores. Appearing in the most modern AI HW architectures (e.g., Nvidia Hopper, Nvidia Ada Lovelace, and Habana Gaudi2), FP8 tensor cores enable a significant increase in floating-point operations per second (FLOPS), as well as opportunities for memory optimization and energy savings for both AI training and inference workloads.

Taking advantage of the HW-level FP8 capabilities requires appropriate support in the SW stack and development framework that we use to build our AI training and inference applications. In this post we will describe how to modify a PyTorch training script so as to utilize the Nvidia H100 GPU's built-in support for the FP8 datatype. We will start by providing some motivation for the use of the FP8 datatype. We will then review the FP8-specific PyTorch API support exposed by the Transformer Engine library and show how to integrate it into a simple training script. Although we will not go into the theory behind the use of FP8 for AI training, we will note the potential challenges involved in its use. Lastly, we will demonstrate the significant optimization opportunities of the FP8 datatype.

Disclaimers

Please do not interpret our mention of any SW component, methodology, or service as an endorsement of its use. The best design for ML development will vary greatly based on the specific details of your own AI workload. Please also keep in mind that the APIs and behaviors of some of the SW packages and components we mention may have changed by the time you read this post. You are highly encouraged to evaluate any potential design decisions based on the most up-to-date HW and SW available.

Motivation

As AI models grow more and more sophisticated, so does the machinery required to train them. The Nvidia H100 GPU, said to support “unprecedented performance and scalability”, is (at the time of this writing) Nvidia’s newest and strongest AI accelerator, purpose-built to enable the next generation of AI development. With the current AI hype in full swing, the demand for these GPUs has been huge (e.g., see here). Accordingly, and unsurprisingly, the cost of these GPUs has been extremely high — perhaps even prohibitive for many of our readers. Fortunately, cloud service providers such as AWS, GCP, and Microsoft Azure offer “pay as you go” (per hour/per second) access to H100-powered machines, thereby opening up the opportunity for their use to a much greater community of AI developers.

In AWS, H100 GPUs are offered as a component of the recently announced AWS EC2 p5 instance family. These instances are claimed to “accelerate your time to solution by up to 4x compared to previous-generation GPU-based EC2 instances and reduce cost to train ML models by up to 40%”.

In a recent post we discussed some of the considerations that should go into the choice of an ML training instance. We highlighted the fact that the optimal instance type will be very much dependent on the project at hand. Specifically, when it comes to ML training instances — bigger is not always better. This is particularly true of the p5 instance family. True — the p5 will likely outperform any other instance type — after all, the H100 is an undisputed performance beast. But once you factor in the cost of the p5 ($98.32 per hour for the 8-GPU p5.48xlarge instance — at the time of this writing), you might find other instance types to be more suitable.

In the next section we will train a relatively large computer vision model on a p5.48xlarge and compare its performance to a p4d.24xlarge containing 8 Nvidia A100 GPUs.

Toy Model

In the code-block below we define a Vision Transformer (ViT)-backed classification model (using the popular timm Python package, version 0.9.10) along with a randomly generated dataset. ViT backbones come in many shapes and sizes. Here we have chosen what is often referred to as the ViT-Huge configuration — with 632 million parameters — in order to take better advantage of the H100's capacity for large models.

import os
import torch, time
import torch.optim
import torch.utils.data
import torch.distributed as dist
from torch.nn.parallel.distributed import DistributedDataParallel as DDP
import torch.multiprocessing as mp

# modify batch size according to GPU memory
batch_size = 64

from timm.models.vision_transformer import VisionTransformer

from torch.utils.data import Dataset


# use random data
class FakeDataset(Dataset):
    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=[index % 1000], dtype=torch.int64)
        return rand_image, label


def mp_fn(local_rank, *args):
    # configure process group (single-node setup; the master address and
    # port are required by the default env:// initialization)
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group("nccl",
                            rank=local_rank,
                            world_size=torch.cuda.device_count())
    torch.cuda.set_device(local_rank)
    device = torch.cuda.current_device()

    # create dataset and dataloader
    train_set = FakeDataset()
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=batch_size,
        num_workers=12, pin_memory=True)

    # define ViT-Huge model
    model = VisionTransformer(
        embed_dim=1280,
        depth=32,
        num_heads=16,
    ).cuda(device)
    model = DDP(model, device_ids=[local_rank])

    # define loss and optimizer
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

    model.train()

    t0 = time.perf_counter()
    summ = 0
    count = 0

    for step, data in enumerate(train_loader):
        # copy data to GPU
        inputs = data[0].to(device=device, non_blocking=True)
        label = data[1].squeeze(-1).to(device=device, non_blocking=True)

        # use mixed precision to take advantage of bfloat16 support
        with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
            outputs = model(inputs)
            loss = criterion(outputs, label)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        # capture step time
        batch_time = time.perf_counter() - t0
        if step > 10:  # skip first steps
            summ += batch_time
            count += 1
        t0 = time.perf_counter()
        if step > 50:
            break

    print(f'average step time: {summ/count}')


if __name__ == '__main__':
    mp.spawn(mp_fn,
             args=(),
             nprocs=torch.cuda.device_count(),
             join=True)

We trained this model on both the p5.48xlarge and p4d.24xlarge instance types using the dedicated PyTorch 2.1 AWS deep learning container (763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.1.0-gpu-py310-cu121-ubuntu20.04-ec2).

Unsurprisingly, the p5 step-time performance blows away the p4d performance — 0.199 seconds per step compared to 0.41 seconds — more than twice as fast!! That would mean halving the time to train your large ML models. However, when you take into account the difference in cost ($32.77 per hour for the p4d vs. $98.32 per hour for the p5 — as of the time of this writing), a completely different story unfolds. The price-performance of the p5 is ~30% worse than the p4d!! This is very far from the 40% improvement that appeared in the p5 announcement.
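
For those curious where the ~30% figure comes from, here is a quick back-of-the-envelope calculation based on the step times and on-demand prices quoted above:

# back-of-the-envelope price-performance comparison using the step times
# and hourly on-demand prices quoted above
p4d_cost_per_step = 0.41 * 32.77 / 3600    # ~$0.0037 per training step
p5_cost_per_step = 0.199 * 98.32 / 3600    # ~$0.0054 per training step

# relative price-performance (training steps per dollar, p5 vs p4d)
print(p4d_cost_per_step / p5_cost_per_step)  # ~0.69, i.e., ~30% worse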

At this point you might draw one of two possible conclusions. The first possibility is that, despite all the hype, the p5 is simply not the right machine for you. The second is that the p5 could still be viable, but that adaptations would be required to your model in order to take full advantage of its potential. In the next sections we will adopt the second approach and demonstrate how using the FP8 datatype — unique to the p5 instance type — can completely alter the comparative price-performance results.

Integrating FP8 with Transformer Engine

The first thing we should emphasize is that, as of the time of this writing, PyTorch (version 2.1) does not include a native 8-bit floating-point datatype. To program our script to use FP8 we will use Transformer Engine (TE), a dedicated library for accelerating Transformer models on NVIDIA GPUs. TE (version 0.12) comes preinstalled in the AWS PyTorch 2.1 DL container.

Although the theory behind the use of FP8 for training is beyond the scope of this post (e.g., see here), it is important to be aware that the mechanics of using FP8 are far more complex than those of the 16-bit alternatives (float16 and bfloat16). Fortunately, the TE implementation hides all of the messy details from the user. Please see the official documentation as well as this simple example for instructions on how to use the TE APIs. To learn more about what is going on behind the scenes, be sure to see the following two video tutorials.
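
As a point of reference, the basic TE usage pattern boils down to replacing torch.nn modules with their TE counterparts and wrapping the forward pass in an fp8_autocast context. The following minimal sketch (a standalone illustration, not taken from our training script) shows the idea:

import torch
import transformer_engine.pytorch as te

# a TE module in place of its torch.nn counterpart
layer = te.Linear(1024, 1024, bias=True).cuda()
inp = torch.randn(32, 1024, device='cuda')

# run the forward pass with FP8 enabled; the corresponding backward
# pass will use FP8 as well
with te.fp8_autocast(enabled=True):
    out = layer(inp)

out.sum().backward()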

To modify our model to use TE, we define a custom transformer block class that derives from TE's specialized TransformerLayer and conforms to timm's block layer signature.

import transformer_engine.pytorch as te
from transformer_engine.common import recipe


class TE_Block(te.transformer.TransformerLayer):
    def __init__(
            self,
            dim,
            num_heads,
            mlp_ratio=4.,
            qkv_bias=False,
            qk_norm=False,
            proj_drop=0.,
            attn_drop=0.,
            init_values=None,
            drop_path=0.,
            act_layer=None,
            norm_layer=None,
            mlp_layer=None
    ):
        super().__init__(
            hidden_size=dim,
            ffn_hidden_size=int(dim * mlp_ratio),
            num_attention_heads=num_heads,
            hidden_dropout=proj_drop,
            attention_dropout=attn_drop
        )

Next, we modify the VisionTransformer initialization to use our custom block layer:

model = VisionTransformer(
    embed_dim=1280,
    depth=32,
    num_heads=16,
    block_fn=TE_Block
).cuda(device)

Until now we have not made any H100-specific changes — the same code can be run on our A100-powered p4d instance type. The last modification is wrapping the model forward-pass with a te.fp8_autocast context manager. This change requires a GPU that supports FP8:

with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    with te.fp8_autocast(enabled=True):
        outputs = model(inputs)
    loss = criterion(outputs, label)

A Few Cautionary Remarks Regarding the Use of FP8

Usage of an 8-bit floating-point representation (as opposed to a 16- or 32-bit representation) implies lower precision and a lower dynamic range. These can have a meaningful impact on the attainability and/or speed of your model convergence. Although the underlying TE FP8 implementation is designed to address this challenge, there is no guarantee that it will work for your model. You may need to fiddle with the underlying FP8 mechanics (e.g., using the TE recipe APIs), tune some of the hyperparameters, and/or limit the application of FP8 to subsections of the model. You might find that despite all of your attempts, your model is simply not compatible with FP8.
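
As one illustration of what such fiddling might look like, the sketch below passes a custom DelayedScaling recipe to fp8_autocast. It builds on the imports and training loop from the previous section, and the specific settings are illustrative assumptions rather than recommendations:

# illustrative only: a custom FP8 scaling recipe passed to fp8_autocast
# (the specific settings are assumptions, not recommendations)
fp8_recipe = recipe.DelayedScaling(
    fp8_format=recipe.Format.HYBRID,  # E4M3 in the forward pass, E5M2 in the backward pass
    amax_history_len=16,              # length of the amax history window
    amax_compute_algo="max"           # how to derive scaling factors from the history
)

with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        outputs = model(inputs)
    loss = criterion(outputs, label)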

Results

In the table below we summarize the results of our experiments on both the p4d.24xlarge and p5.48xlarge EC2 instance types, with and without the TE library. For the p5.48xlarge experiments we doubled the batch size in order to increase the utilization of the 80 GB of GPU memory. Using FP8 reduces GPU memory consumption, enabling a further increase in batch size.

Experiment Results (By Author)

We can see that the use of the TE transformer block increased the price-performance on both the p4d (~19%) and the p5 (~32%) instance types. Using FP8 boosts the performance on the p5 by an additional ~20%. Following the TE and FP8 optimizations, the price-performance of the H100-based p5.48xlarge beats that of the A100-based p4d.24xlarge — although not by a very wide margin (~2%). Taking into account the 3x increase in training speed, we can safely conclude that the p5 would be the better instance type for training our optimized model.

Note that the relatively small increase in price-performance (far lower than the 40% mentioned in the p5 announcement) leaves us wishing for additional H100-specific optimizations… but those will have to wait for another post :).

Summary

In this post we have demonstrated how to program a PyTorch training script to use the 8-bit floating-point (FP8) datatype. We further demonstrated how the use of FP8 can be a key factor in getting the best performance out of modern GPUs such as the Nvidia H100. Importantly, the viability of FP8, as well as its impact on training performance, can vary a great deal based on the details of the model.

This post continues a long series of publications on the topic of optimizing machine learning workloads. Be sure to see some of our other posts on this important topic.


I am a Machine Learning Algorithm Developer working on Autonomous Vehicle technologies at Mobileye. The views expressed in my posts are my own.