PyTorch Training Tricks and Tips

Tricks and tips for optimizing the training of your deep learning models in PyTorch

Photo by ActionVance on Unsplash

In this article, I will describe and show the code for 4 different PyTorch training tricks that I personally have found to improve the training of my deep learning models.

16-bit precision

In a regular training loop, PyTorch stores all float variables in 32-bit precision. For people who are training their models under tight memory constraints, this can cause the model to take up too much memory, forcing them to train more slowly with a smaller model and a smaller batch size. However, running the model's computations in 16-bit precision can fix most of these problems: it dramatically decreases the memory consumption of the model and speeds up the training loop while still maintaining roughly the same accuracy.

Converting calculations to 16-bit precision in PyTorch is very simple to do and only requires a few lines of code. Here is how:

scaler = torch.cuda.amp.GradScaler()  # scales the loss so that small 16-bit gradients do not underflow

Create a gradient scaler the same way that I have done above. Do this before you write your training loop.

optimizer.zero_grad()
with torch.cuda.amp.autocast():  # run the forward pass and loss computation in mixed (16-bit) precision
   output = model(input).to(device)
   loss = criterion(output, correct_answer).to(device)
scaler.scale(loss).backward()  # scale the loss, then backpropagate
scaler.step(optimizer)         # unscale the gradients and update the weights
scaler.update()                # adjust the scale factor for the next iteration

When you are doing backward propagation with the loss and the optimizer, instead of calling loss.backward() and optimizer.step(), you call scaler.scale(loss).backward() and scaler.step(optimizer). The scaler multiplies the loss by a scale factor before back-propagation so that small 16-bit gradients do not underflow to zero, and it unscales the gradients again before the optimizer updates the weights.

When you are doing everything with 16-bit precision, there may be some numerical instability that causes some of the functions you use to not work properly. Only certain operations work correctly in 16-bit precision, which is why autocast runs those in 16-bit and keeps the rest in 32-bit; the PyTorch automatic mixed precision documentation lists which operations fall into each category.
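
To see this in action, here is a minimal sketch (it assumes a CUDA device is available and uses arbitrary tensors) that checks the dtype of results inside an autocast block: matrix multiplications are cast down to 16-bit, while precision-sensitive operations such as softmax are kept in 32-bit.

import torch

a = torch.randn(4, 4, device="cuda")
b = torch.randn(4, 4, device="cuda")

with torch.cuda.amp.autocast():
    print(torch.mm(a, b).dtype)            # torch.float16 -- matmuls run in half precision
    print(torch.softmax(a, dim=1).dtype)   # torch.float32 -- softmax is kept in full precision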

Progress Bar

Having a progress bar that shows what percentage of the training for each epoch has been completed can be very useful. To get a progress bar, we will use the tqdm library. Here is how you can install and import it:

pip install tqdm
from tqdm import tqdm

In your training and validation loops, you have to do this:

for index, batch in tqdm(enumerate(loader), total = len(loader), position = 0, leave = True):

And that’s it. Once you do this for your training and validation loops, you will get a progress bar that shows what percentage of the epoch your model has completed, along with some statistics.

In my case, 691 was the number of batches my model had to complete, 7:28 was the total time the model took to train/evaluate on those 691 batches, and 1.54 it/s was the average number of batches processed per second.
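
With those numbers, the bar that tqdm prints at the end of an epoch looks roughly like this:

100%|██████████| 691/691 [07:28<00:00,  1.54it/s]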

Gradient Accumulation

If you run into a CUDA out of memory error, it means that your GPU does not have enough memory for the workload. To fix this, there are several things you can do, including converting everything to 16-bit precision as I mentioned above, reducing the batch size of your model, and reducing the num_workers parameter when creating your DataLoaders:

train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True, num_workers=0)

However, sometimes switching to 16-bit precision and reducing num_workers may not completely fix the problem. The most direct way to fix it is to reduce your batch size, but suppose you don’t want to do that. In that case, you can use gradient accumulation to simulate your desired batch size. Note that another solution to the CUDA out of memory issue is simply to use more than one GPU, but that is not an option accessible to many people.

Suppose that your machine/model can only support a batch size of 16, increasing it results in a CUDA out of memory error, and you want a batch size of 32. Gradient accumulation works by running the model twice with a batch size of 16, accumulating the gradients computed for each batch, and only then doing an optimizer step after those two forward and backward passes.

To understand gradient accumulation, it is important to understand what specific functions are done in training a neural network. Suppose you have the following training loop:

model = model.train()
for index, batch in enumerate(train_loader):
    input = batch[0].to(device)
    correct_answer = batch[1].to(device)
    optimizer.zero_grad()                                 # clear the gradients from the previous batch
    output = model(input).to(device)                      # forward pass
    loss = criterion(output, correct_answer).to(device)   # compute the loss
    loss.backward()                                       # compute and store the gradients
    optimizer.step()                                      # update the weights using those gradients

Looking at the code above, the key thing to remember is that loss.backward() computes and stores the gradients for the model, while optimizer.step() actually updates the weights. Calling loss.backward() twice before calling optimizer.step() accumulates the gradients, because backward() adds the new gradients to the ones already stored rather than overwriting them, as long as optimizer.zero_grad() is not called in between. Here is how you can implement gradient accumulation in PyTorch:

model = model.train()
optimizer.zero_grad()
for index, batch in enumerate(train_loader):
    input = batch[0].to(device)
    correct_answer = batch[1].to(device)
    output = model(input).to(device)
    loss = criterion(output, correct_answer).to(device)
    loss.backward()                 # gradients accumulate across iterations
    if (index + 1) % 2 == 0:        # every 2 batches...
        optimizer.step()            # ...update the weights
        optimizer.zero_grad()       # ...and reset the accumulated gradients

As you can see, taking the example above where our machine can only support a batch size of 16 and we want a batch size of 32, we essentially compute the gradients for 2 batches and then update the actual weights. This results in an effective batch size of 32.
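
If you want to accumulate over more than two batches, a common variant is to use an accumulation_steps variable (the name here is just illustrative) and divide the loss by it, so that the accumulated gradient matches what one large batch would give when the loss is averaged over the batch:

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

model = model.train()
optimizer.zero_grad()
for index, batch in enumerate(train_loader):
    input = batch[0].to(device)
    correct_answer = batch[1].to(device)
    output = model(input)
    loss = criterion(output, correct_answer) / accumulation_steps  # average the loss over the accumulated batches
    loss.backward()
    if (index + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()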

Doing gradient accumulation with 16-bit precision is very similar.

model = model.train()
optimizer.zero_grad()
for index, batch in enumerate(train_loader):
    input = batch[0].to(device)
    correct_answer = batch[1].to(device)
    with torch.cuda.amp.autocast():   # forward pass and loss computation in mixed precision
        output = model(input).to(device)
        loss = criterion(output, correct_answer).to(device)
    scaler.scale(loss).backward()     # scaled gradients accumulate across iterations
    if (index + 1) % 2 == 0:          # every 2 batches...
        scaler.step(optimizer)        # ...unscale the gradients and update the weights
        scaler.update()               # ...adjust the scale factor
        optimizer.zero_grad()         # ...and reset the accumulated gradients

Evaluation of your results

In most machine learning projects, people tend to manually calculate the metrics they use for evaluation and then report them. Although calculating metrics like accuracy, precision, recall, and F1 is not hard, there are cases where you may want certain variants of them, such as macro/micro precision, recall, and F1, or weighted precision, recall, and F1. Calculating these can be a bit more work, and sometimes your own implementation may be incorrect. To calculate all of these metrics quickly and without errors, you can use scikit-learn's classification_report function from sklearn.metrics, which is designed exactly for this purpose. Here is how you can use it.

from sklearn.metrics import classification_report

y_pred = [0, 1, 0, 0, 1]     # the model's predictions
y_correct = [1, 1, 0, 1, 1]  # the ground-truth labels
print(classification_report(y_correct, y_pred))

The code above is for binary classification, but you can use the same function for multi-class problems as well. The first list holds the model's predictions, and the second list holds the correct answers; note that classification_report takes the ground-truth labels as its first argument and the predictions as its second. The code above would output:
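
With these two lists, the printed report looks roughly like this (per-class precision, recall, F1, and support, plus overall accuracy and macro/weighted averages):

              precision    recall  f1-score   support

           0       0.33      1.00      0.50         1
           1       1.00      0.50      0.67         4

    accuracy                           0.60         5
   macro avg       0.67      0.75      0.58         5
weighted avg       0.87      0.60      0.63         5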


Conclusion

In this article, I discussed 4 ways to optimize the training of your deep neural networks. 16-bit precision reduces your memory consumption, gradient accumulation lets you work around memory constraints by simulating a larger batch size, and the tqdm progress bar and scikit-learn's classification_report are two convenient libraries that let you easily track your model's training and evaluate its performance. Personally, I always train my neural networks with all of the training tricks above, and I use gradient accumulation whenever it is necessary.

I hope you found this content easy to understand and informative. If you have any questions, let me know in the comments.

