Training BERT at a University

Here’s how we train enormous machine learning models on a random assortment of GPUs (https://github.com/yifding/hetseq)

Yifan Ding
Towards Data Science

--

Machine Learning Models are Enormous.

If you’re reading this post, you’ve probably heard about the remarkable performance of new machine learning models like BERT, GPT-2/3, and other deep learning models for language, image, audio, and video data.

Image by Author

You may ask: Why do these semi-magical machine learning models perform so well? The short answer is that these models are enormously complex and are trained on enormous amounts of data. In fact, Lambda Labs recently estimated that it would cost $4.6 million to train GPT-3 on a single GPU — if such a thing were possible.

Instead, platforms like PyTorch and TensorFlow are able to train these enormous models because they distribute the workload over hundreds (or thousands) of GPUs at the same time. Unfortunately, these platforms require each individual GPU system to be identical (i.e., to have the same memory capacity and compute performance).

The trouble is that most organizations not named Google or Microsoft do not have a thousand identical GPUs. Instead, small and medium-sized organizations purchase computing systems piecemeal, resulting in a heterogeneous infrastructure that cannot easily be adapted to training large models. Under these circumstances, training even moderately sized models can take weeks or even months to complete.

If not addressed, universities and other small organizations risk losing relevance in the race to develop newer and better machine learning models.

To help remedy this situation we recently released a software package called HetSeq, which is adapted from the popular PyTorch package and provides the capability to train large neural network models on heterogeneous infrastructure.

Experiments, the details of which can be found in an article (available on arXiv) published at the 2021 AAAI/IAAI conference, show that base BERT can be trained in about a day over 8 different GPU systems, most of which we had to “borrow” from idle labs across Notre Dame.

Before we introduce HetSeq, we first need a little background:

Typical training of a neural network

Training on single GPU

The code sketch below shows the training step of a basic supervised learning model in a neural network framework. Given some architecture, the task is to optimize the model parameters via SGD on the loss function between the predicted instances and the ground truth.

The actual training process consists of four individual steps: (1) data loading, (2) the forward pass, (3) the backward pass, and (4) the parameter update.
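For concreteness, here is a minimal sketch of such a loop in PyTorch. The toy model, dataset, and hyperparameters below are illustrative placeholders rather than the code used in our experiments, and a CUDA-capable GPU is assumed.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# A toy model and dataset standing in for a real architecture and corpus.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2)).cuda()
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in loader:
    # (1) data loading: move the batch onto the GPU
    inputs, targets = inputs.cuda(), targets.cuda()

    # (2) forward pass: predict, then measure the loss against the ground truth
    output = model(inputs)
    loss = loss_fn(output, targets)

    # (3) backward pass: compute gradients of the loss w.r.t. the parameters
    optimizer.zero_grad()
    loss.backward()

    # (4) update: adjust the parameters along the gradient (one SGD step)
    optimizer.step()
```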

(1) Data loading

On a single GPU the first thing that happens is that the existing model parameters (which are initially random) and the data are transferred to the GPU. Typically, the dataset includes a large number of training instances which will not all fit on a single GPU. In this common case we split the dataset into multiple batches and load them one at a time.

Forward pass in single GPU (Image by Author)

(2) The Forward Pass

The next step is to compute the loss function. To do this, the data batch is passed through the model (hence “forward pass”) and the result is compared against the ground-truth training labels. In the code sketch above, the forward pass has two steps: generating the predicted labels (output) and measuring the difference (loss) between the output and the target.

(3) The Backward Pass

The loss computed in the previous step determines how much each model parameter needs to change; these amounts are the gradients, which are computed by propagating the error backward through the neural network architecture (hence “backward pass”, or backpropagation).

Update parameters in a single GPU (Image by Author)

(4) Update

Remember that the goal of this whole process is to optimize the model parameters so that, when data is passed forward through them, the loss is minimized. So it is important that the model parameters are updated according to the values of the gradient.

A brief note on training steps

Taken together, one iteration of data loading, a forward pass on a single data batch, a backward pass, and the parameter update is called one step. Once all the data batches in the entire dataset have been processed, we say that one epoch has been completed. Finally, it has been shown that the learning rate should be adjusted as the number of epochs increases.
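For example, PyTorch provides learning-rate schedulers for exactly this. A small sketch, reusing the toy optimizer and data loader from the loop above (the decay schedule itself is illustrative):

```python
from torch.optim.lr_scheduler import StepLR

# Halve the learning rate every 10 epochs (an illustrative schedule).
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    for inputs, targets in loader:
        pass          # one training step (load, forward, backward, update), as above
    scheduler.step()  # adjust the learning rate after each epoch
```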

What if we have multiple GPUs?

Since data batches are independent of one another, it is rather straightforward to parallelize this process by sending different data batches to different GPUs. If we can somehow combine the computed losses and keep the updated model parameters synchronized, we can make training much faster.

Distributed Data Parallel class

This isn’t a new idea. In PyTorch, we wrap the model with the torch.nn.parallel.DistributedDataParallel (DDP) module instead of using a plain torch.nn.Module. Each GPU is driven by an individual process, and communication between GPUs occurs with standard IPC.
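A minimal sketch of that setup, assuming one process per GPU on a single node; the rank here is just the local GPU index, and the model and address/port are placeholders:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # One process per GPU; NCCL handles the GPU-to-GPU communication.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size,
                            init_method="tcp://127.0.0.1:23456")
    torch.cuda.set_device(rank)

    # Any torch.nn.Module works here; this toy model is a placeholder.
    model = nn.Linear(16, 2).cuda(rank)
    return DDP(model, device_ids=[rank])   # replaces the plain nn.Module
```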

But that’s not the end of it. The four steps need some tweaking:

(1) Data loading with DDP

With DDP we split each data batch over many different GPUs — as many as we have available. In this case it is critical that each GPU has the same model parameters.

This is the core idea of distributed data parallel (DDP): each GPU has identical model parameters, yet processes different data batches simultaneously.
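In PyTorch this per-GPU splitting is typically handled by a DistributedSampler. A short sketch, assuming the process group from the previous snippet is already initialized and using a toy dataset as a stand-in:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for a real training corpus.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# Each process (GPU) is handed a disjoint shard of the data, so every GPU
# sees different batches while holding identical model parameters.
sampler = DistributedSampler(dataset)   # uses this process's rank and world size
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(3):
    sampler.set_epoch(epoch)            # reshuffle the shards every epoch
    for inputs, targets in loader:
        pass                            # forward/backward/update as usual
```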

Forward pass in multiple GPUs (Image by Author)

(2) Forward pass with DDP

Once the different data batches are loaded onto different GPUs, the next step is to perform the forward pass and compute the loss functions. Unlike in the single-GPU case, we now need the total loss over all data batches, i.e., the sum of the losses across all the GPUs, because our goal is to compute the average loss for the backward step. To compute that average, each GPU must also report the number of instances (# of ins.) in its batch along with its loss. We then sum up both the losses and the instance counts across GPUs.
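A sketch of that bookkeeping with torch.distributed; the local loss and instance count below are placeholder values that would normally come out of each GPU's forward pass:

```python
import torch
import torch.distributed as dist

# Per-GPU results of the forward pass: summed loss and number of instances.
local_loss = torch.tensor([12.5], device="cuda")   # placeholder value
local_count = torch.tensor([32.0], device="cuda")  # placeholder value

# Sum both quantities across all GPUs, then divide to get the global average loss.
dist.all_reduce(local_loss, op=dist.ReduceOp.SUM)
dist.all_reduce(local_count, op=dist.ReduceOp.SUM)
average_loss = local_loss / local_count
```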

(3) Backward pass with DDP

We use the average loss to obtain the gradients of the model parameters via the backward pass. Before this begins, the average loss needs to be communicated to the different GPUs so the model parameters stay synchronized. Once each GPU has computed gradients for its copy of the parameters, a gradient synchronization step is executed so that every GPU ends up with the same gradients.

Gradient synchronization (Image by Author)

(4) Update with DDP

Once the gradients are synchronized, we can update model parameters in parallel using the individual optimizers on each GPU.

The next training step can usually begin immediately. However, because nearly all the parameters are stored in float and because rounding errors can accumulate on some GPUs, especially when many training steps are performed, we occasionally re-synchronize the parameters at the beginning or end of a step.
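One simple way to do this occasional re-synchronization (a sketch, not necessarily HetSeq's exact mechanism) is to broadcast rank 0's parameters to all other GPUs every so many steps:

```python
import torch.distributed as dist

def resync_parameters(model, step, sync_every=1000):
    # Every `sync_every` steps, overwrite all replicas with rank 0's parameters
    # to correct any floating-point drift that has accumulated.
    if step % sync_every == 0:
        for param in model.parameters():
            dist.broadcast(param.data, src=0)
```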

These changes are reflected in the following pseudocode. Notably, the function takes the device id (i.e., the GPU id), the model has to perform parameter synchronization before each forward pass, the loss has to be averaged before the backward pass, and finally the gradients need to be averaged before the model parameters are updated.

Training on multiple GPUs
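A rough sketch of that training function, under the same toy-model assumptions as the single-GPU example above (the real HetSeq implementation differs in its details; the IP address and port are placeholders for a single-node launch):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train(device_id, world_size):
    # One process per GPU; on a single node the GPU id doubles as the rank.
    dist.init_process_group("nccl", rank=device_id, world_size=world_size,
                            init_method="tcp://127.0.0.1:23456")
    torch.cuda.set_device(device_id)

    # Toy model and data; DDP keeps the parameters identical across GPUs.
    net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2)).cuda(device_id)
    model = DDP(net, device_ids=[device_id])
    dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
    sampler = DistributedSampler(dataset)          # different batches per GPU
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    loss_fn = nn.CrossEntropyLoss(reduction="sum")
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for inputs, targets in loader:
        # (1) data loading: this GPU's shard of the batch
        inputs, targets = inputs.cuda(device_id), targets.cuda(device_id)

        # (2) forward pass: local summed loss plus local instance count,
        #     then sum the counts so every GPU knows the global total
        output = model(inputs)
        loss = loss_fn(output, targets)
        count = torch.tensor([float(targets.numel())], device=inputs.device)
        dist.all_reduce(count)
        # Rescale so that DDP's built-in gradient averaging (divide by
        # world_size) yields the gradient of the *global* average loss.
        loss = loss * world_size / count.item()

        # (3) backward pass: DDP synchronizes (averages) gradients across GPUs
        optimizer.zero_grad()
        loss.backward()

        # (4) update: identical gradients mean identical parameter updates everywhere
        optimizer.step()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(train, args=(n_gpus,), nprocs=n_gpus)
```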

Scaling Up — Multiple Nodes with Multiple GPUs

So far, we’ve talked about how to utilize multiple GPUs on a single node. That’s great, but it’ll only get us so far. If we want to really scale up, we need to distribute the workload across multiple nodes, each with multiple GPUs.

Fortunately, the same mechanism used to address GPUs on a single node can be extended to multiple nodes: simply set the rank parameter in the init_process_group function globally, so that each GPU process has a unique ID across all nodes.
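For example, with two nodes of four GPUs each (all values below are illustrative), each process can derive its global rank from its node index and local GPU index:

```python
import torch.distributed as dist

# Hypothetical 2-node x 4-GPU setup; node_rank is 0 on the first node, 1 on the second.
node_rank = 0
gpus_per_node = 4
local_gpu = 2                                 # this process's GPU index on its node
global_rank = node_rank * gpus_per_node + local_gpu

dist.init_process_group(
    backend="nccl",
    init_method="tcp://10.0.0.1:11111",       # IP/port of the rank-0 node (placeholder)
    world_size=2 * gpus_per_node,             # total number of GPUs across all nodes
    rank=global_rank,                         # unique ID across all nodes
)
```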

Communication — Here’s where things get tricky

Intra-node vs. inter-node communication (Image by Author)

When you have multiple nodes with multiple GPUs, communication needs to occur between GPUs on the same node and across different nodes in order to share calculated gradients and parameter updates during the training procedure.

Of course, inter-node communication is much slower than intra-node communication. And sharing gradients and parameter updates becomes a complete mess when the nodes and GPUs are not identical — as is the case when you don’t have a billion dollars to spend on a data center with custom compute hardware.

When your parents make you share your toys

At most university compute centers, various research labs share their computing resources. There are different models for how this is done, but usually the IT administrators keep a fair amount of control over the systems and prevent the scientists from installing (or upgrading, or downgrading) needed software.

This means that, if a large model needs to be trained, then some poor graduate students need to make the training algorithm fit the infrastructure.

And this is difficult for a few reasons:

  1. Some toys have complicated instructions. The Distributed Data Parallel (DDP) package is a pain to understand and work with, especially for machine learning researchers who are not well versed in the particulars of distributed computing. Beyond the basic DDP setup, a smooth training run over different GPU architectures across many nodes requires careful data splitting and arduous communication between GPUs and nodes.
  2. Some toys are better to play with than others. In a heterogeneous system, some GPUs are faster than others and some have more memory than others. This means that some GPUs get more data to crunch than others, which is fine, but it also means that the gradient averages and parameter updates need to be carefully weighted.
  3. Our parents won’t let us play with some toys. Most existing distributed GPU training platforms require extra packages like Docker, OpenMPI, etc. Unfortunately, most competent cluster administrators won’t give users the administrative privileges needed to set up each node this way.
  4. Some toys don’t work well with others. The code released by large companies for models like BERT and GPT-2/3 generally comes in its own specific format with several layers of logic, making it difficult to use and adapt to a custom application.

Because of these problems we created a general system that wraps up all the complicated parts of DDP, data splitting, compatibility, and customizability and deployed this system at Notre Dame.

We call this system HetSeq. It is adapted from the popular PyTorch package and provides the capability to train large neural network models on heterogeneous infrastructure. It can be set up easily over a shared file system, without extra packages or administrative privileges.

Here’s how to train BERT with HetSeq.

BERT at a University with HetSeq

Let’s get started with Anaconda. We’ll create a virtual environment and install Python.
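Something along these lines; the exact Python version to use is listed in the HetSeq documentation, so the one shown here is illustrative:

```bash
conda create --name hetseq python=3.7
conda activate hetseq
```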

Then we’ll install the packages and the HetSeq bindings: download HetSeq from GitHub, install the packages listed in requirements.txt, and install HetSeq and its bindings from setup.py.
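Roughly as follows; the editable pip install is one way to install HetSeq and its bindings from setup.py:

```bash
git clone https://github.com/yifding/hetseq.git
cd hetseq
pip install -r requirements.txt
pip install --editable .   # installs HetSeq and its bindings via setup.py
```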

The last step before we train is to download the BERT data files, including the training corpus, model configuration, and BPE dictionary, from this link. Download DATA.zip, unzip it, and place it in the preprocessing/ directory.

Train BERT with HetSeq

The cool thing about HetSeq is that it abstracts away all the details of distributed processing. So the training code for 100 GPUs is almost exactly the same as for 1 GPU. Let’s try it out!

In this case, let’s assume we have two compute nodes, each with four GPUs.

On node 1:
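An illustrative invocation, following the pattern in the HetSeq documentation (the IP address, port, and data path are placeholders, and the model/optimization flags are elided; check the docs for the exact command):

```bash
# ...plus the BERT model and optimizer flags from the HetSeq documentation
python3 train.py --task bert --data ./preprocessing/test_128/ \
    --distributed-init-method tcp://<node-1-IP>:<port> \
    --distributed-world-size 8 --distributed-gpus 4 --distributed-rank 0
```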

On node 2:
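And essentially the same command on the second node, with the rank offset by the number of GPUs on node 1:

```bash
# same task/model/optimizer flags as on node 1
python3 train.py --task bert --data ./preprocessing/test_128/ \
    --distributed-init-method tcp://<node-1-IP>:<port> \
    --distributed-world-size 8 --distributed-gpus 4 --distributed-rank 4
```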

The two blocks of code are run on the two different nodes. The TCP/IP address needs to be set to one of the nodes’ IP addresses. Once these get started, you’ll be able to watch the code run across 8 GPUs on 2 different nodes!

🤗🤗🤗

So how well does it work?

We have run some experiments (see https://arxiv.org/pdf/2009.14783 for details) over various homogeneous and heterogeneous settings.

In total we were able to commandeer 32 GPUs across 8 heterogeneous nodes to reduce the training time for the BERT language model from seven days to about one day. 🤩🤩🤩

Under the hood of HetSeq

HetSeq package layout (Image by Author)

The HetSeq package contains three major modules, shown on the left of the figure above: train.py, task.py, and controller.py; they coordinate the main components shown on the right. The train.py module initializes the distributed system and its various components.
The task.py module defines the model, dataset, data loader, and optimizer functions; it also executes the forward pass and backpropagation. The controller.py module acts as the main training controller: it runs the model, optimizer, and learning rate scheduler; loads and saves checkpoints; communicates the loss; and updates the parameters.

But I don’t want to train BERT!

No problem. You can extend HetSeq to any other model; you just need to define a new Task with the corresponding Model, Dataset, Optimizer, and Learning Rate Scheduler. An MNIST example is provided with all of the extended classes. The pre-defined optimizers, learning rate schedulers, datasets, and models can be reused in other applications. Check out our documentation for more details!

Conclusion

In this post, we introduced the background and the actual steps needed to train BERT from scratch at a university using our released package, HetSeq. HetSeq makes it easy to set up training on heterogeneous systems with multiple nodes and multiple GPUs.

Would you like to learn more?

For more information, please check out the HetSeq package (https://github.com/yifding/hetseq) and its documentation (hetseq.readthedocs.io). Cheers!

[1] Yifan Ding, Nicholas Botzer and Tim Weninger. HetSeq: Distributed GPU Training on Heterogeneous Infrastructure. Proc. of Association for the Advancement of Artificial Intelligence (AAAI) Innovative Application of Artificial Intelligence, 2021.


Yifan Ding is currently a Ph.D. student at the University of Notre Dame, working with Professor Tim Weninger on text mining and machine learning.