PyTorch vs. TensorFlow for Transformer-Based NLP Applications

Deployment Considerations Should Be the Priority When Using BERT-Based Models

Mathieu Lemay
Towards Data Science


Photo by Pixabay from Pexels.

TL;DR: BERT is an incredible advancement in NLP. Both major neural network frameworks have successfully and fully implemented BERT, especially with the support of HuggingFace. However, although at first glance TensorFlow is easier to prototype with and deploy from, PyTorch seems to have advantages when it comes to quantization and to some GPU deployments. This should be taken into consideration when kicking off a BERT-based project so that you don’t have to rebuild your codebase halfway through — like us.

Like many things in the AI sphere, the opportunity lies in how fast you can change and adapt for improved performance. BERT and its derivatives have most definitely established a new baseline. It is large and in charge. (In fact, we’ve recently had so many BERT-based projects launch at the same time that we needed company-wide training just to make sure everyone had the same programming style.)

Another one of our companies recently went through a few headaches with its TensorFlow-based models that, hopefully, you can learn from. Below are some of the lessons we took away from that project.

Model Availability & Repositories

Shopping for models sometimes feels like browsing a marketplace. Photo by NICE GUYS on Pexels.com.

If you want to use models whose publications are hot off the press, you’ll still be going through GitHub. Otherwise, you can go straight to transformer model repository hubs, such as HuggingFace, Tensorflow Hub, and PyTorch Hub.

In the months after BERT came out, it was a bit clunky to get up and running. That is largely moot now that HuggingFace has consolidated a transformer model library. Since most (almost all) models are retrievable from HuggingFace, the first and primary source for anything transformers, there are far fewer questions around model availability these days.
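For checkpoints that ship with both sets of weights, switching frameworks is mostly a matter of which Auto class you import from HuggingFace's transformers library; a minimal sketch (the checkpoint name is just an example):

from transformers import AutoModel, AutoTokenizer, TFAutoModel

# The tokenizer is framework-agnostic; the model class decides the backend
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pt_model = AutoModel.from_pretrained("bert-base-uncased")    # PyTorch weights
tf_model = TFAutoModel.from_pretrained("bert-base-uncased")  # TensorFlow weights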

However, there were certain instances of models being available only on proprietary repositories. For example, the Universal Sentence Encoder by Google still seems to be available only on TensorFlow Hub. (At the time of its release, this was one of the best word and sentence embedding models out there, so this was an issue, but it has since been superseded by the likes of MPNet and Sentence-T5.)

At the time of writing, there were 2,669 TensorFlow models on HuggingFace, compared to a whopping 31,939 PyTorch models. This is mainly due to newer models being published as PyTorch models first; there is an academic preference for PyTorch, albeit not a universal one.

Takeaway: There are more models for PyTorch, but the main ones are available on both frameworks.

Cross-Library GPU Contention

Pure, unadulterated firepower. Photo by Nana Dua on Pexels.com

It’s no surprise that these leviathanic models have tremendous compute requirements, and GPUs will be involved at various points in both the training and inference cycles. Additionally, you’re probably using these models as part of an NLP/document intelligence pipeline, with other libraries fighting for GPU space during pre-processing or when running custom classifiers.

Thankfully, many popular libraries already use TensorFlow or PyTorch in their backend, so playing nice with other models *should* be easy. spaCy and Flair, for example, two popular NLP libraries, run primarily* on Torch (1, 2).

*Note: spaCy uses Thinc for interchangeability between frameworks, but we noticed better stability, native support, and reliability when we stuck with the base PyTorch models.

It’s much easier to share a GPU between custom BERT models and library-specific models when everything runs on a single framework. If you can share a GPU, deployment costs go down. (More on this later in “Quantization”.) In an ideal deployment, there are sufficient resources for every library to scale effectively; in reality, the compute-versus-cost constraints kick in very quickly.

If you’re running a multi-step deployment (say, document intelligence), you’ll have some functions that benefit from moving to the GPU, such as sentencizing and classification.
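As a rough sketch of what that looks like in practice (the model names and pipeline stages here are illustrative, not our production setup), spaCy can sentencize on the GPU while a HuggingFace classifier sits on the same device:

import spacy
from transformers import pipeline

# Stage 1: sentencize on the GPU if one is available (spaCy falls back to CPU otherwise)
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")

# Stage 2: classify each sentence with a transformer on the same GPU (device=0)
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)

doc = nlp("Internal controls were reviewed. Two findings remain open.")
results = classifier([sentence.text for sentence in doc.sents])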

PyTorch grows its GPU usage incrementally by default and usually reserves sensible memory boundaries for a given model. From its CUDA semantics documentation:

PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.
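In practice, this means you can watch what each model actually occupies on the card; a minimal sketch using the calls mentioned above:

import torch

device = torch.device("cuda")
model = torch.nn.Linear(1024, 1024).to(device)  # stand-in for a real BERT-sized model

print(torch.cuda.memory_allocated(device))  # bytes currently held by tensors
print(torch.cuda.memory_reserved(device))   # bytes held by PyTorch's caching allocator
torch.cuda.empty_cache()                    # returns unused cached memory to the GPU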

Compare this with TensorFlow, which by default takes over the GPU’s memory completely; you have to opt into incremental growth explicitly (for example, via tf.config.experimental.set_memory_growth):

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to more efficiently use the relatively precious GPU memory resources on the devices by reducing memory fragmentation. To limit TensorFlow to a specific set of GPUs, use the tf.config.set_visible_devices method.
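The workaround is a few lines at startup, before any model is loaded; a sketch based on the tf.config calls from the documentation:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Grow memory usage as needed instead of claiming the whole card up front
    tf.config.experimental.set_memory_growth(gpu, True)

# Optionally pin this process to a single GPU so other workloads can use the rest
if gpus:
    tf.config.set_visible_devices(gpus[0], "GPU")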

Takeaway: Both frameworks can serve multiple models from a single GPU, but TensorFlow’s memory management is slightly less well behaved. Use caution.

Quantization

Photo by Pixabay on Pexels.com.

Quantization primarily involves converting 32-bit floating-point weights (Float32) down to 8-bit (or sometimes 16-bit) integers, both to shrink the model and to reduce the number of bits needed per computation; it is a well-accepted model compression technique. This is analogous to the pixelation and color loss of a downscaled image. Quantization also has to account for the distribution of weights, with both TensorFlow and PyTorch supporting fixed-range and dynamic-range quantization in their standard model generation pipelines.
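As a toy illustration of the mapping (PyTorch shown, with an arbitrary scale and zero point):

import torch

weights = torch.tensor([0.012, -0.034, 0.051])  # Float32 weights
q_weights = torch.quantize_per_tensor(weights, scale=0.001, zero_point=0, dtype=torch.qint8)

print(q_weights.int_repr())    # the 8-bit integers actually stored
print(q_weights.dequantize())  # the approximate floats recovered at compute time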

The main reason quantization is a worthwhile step in model performance optimization is that the loss of performance (increased latency) typically costs more than the loss of quality (such as a small drop in F1). Another way of putting it: “good now is better than better later”.

We’ve anecdotally seen average F1-score drops of about 0.005 after post-training quantization (as opposed to 0.03–0.05 for in-training quantization), an acceptable drop in quality for most of our clients and our main applications, especially if it means running on much cheaper infrastructure within a reasonable time frame.

An example: considering the volume of text that we analyze in our AuditMap application, most of the risk insights we identify are valuable because of the speed at which we’re able to retrieve them, showing our auditor and risk-manager clients what their risk landscape actually looks like. Most of our models’ F1-scores fall between 0.85 and 0.95, completely acceptable for decision support based on analysis at scale.

These models do need to be trained and (usually) run on GPUs to be effective. However, if we wanted to run them on CPU only, we would need to move away from a Float32 representation to Int8 or UInt8 to stay within an acceptable time frame. From my experiments and the examples I could find, I’ll limit the scope of my observations to the following:

I have not been able to find a simple or direct mechanism to quantize TensorFlow-based HuggingFace models.

Compare this with PyTorch:

A quick example of dynamic quantization in PyTorch, sketched below with an off-the-shelf HuggingFace checkpoint (the model name is illustrative):
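import torch
from transformers import AutoModelForSequenceClassification

# In practice this would be your fine-tuned checkpoint; "bert-base-uncased" is a stand-in
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# The single line: swap the Linear layers (the bulk of BERT's compute) to int8 kernels
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

The quantized model drops into the same inference code as before, now running int8 matrix multiplications on CPU.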

Takeaway: Quantization in PyTorch is a single line of code, ready to be deployed to CPU machines. TensorFlow is… less streamlined.

Bonus Category: Cyclomatic Complexity & Programming Style

So if PyTorch is so well-differentiated in what it offers, why is TensorFlow still a consideration? It’s because code written in TensorFlow has, in my opinion, fewer moving parts — that is to say lower cyclomatic complexity.

Cyclomatic complexity is a software development metric used to evaluate all possible code paths in a segment of code. It’s used as a proxy for comprehensibility, maintainability, and bugs per line of code. From a readability standpoint, class inheritance adds cyclomatic steps, whereas calling built-in functions does not. From a machine learning perspective, cyclomatic complexity can be used to evaluate the readability of both model training and inference code.

Continuing down the cyclomatic complexity rabbit hole, PyTorch is heavily influenced by object-oriented programming, whereas TensorFlow is (often, not always) more procedural in its model generation flow.

Why do we care? Because complexity breeds bugs. The simpler a library is to use, the easier it is to troubleshoot and fix. Simple code is readable code, and readable code is usable code.

In a PyTorch BERT pipeline, the cyclomatic complexity increases come from the dataloaders, model instantiation, and the training loop.

Let’s take a look at public examples of FashionMNIST data loaders.

Here’s PyTorch:

# PyTorch example: a custom Dataset class is needed before you can build a DataLoader
import os

import pandas as pd
from torch.utils.data import Dataset
from torchvision.io import read_image


class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        # Read one image and its label from disk, applying any transforms
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        image = read_image(img_path)
        label = self.img_labels.iloc[idx, 1]
        if self.transform:
            image = self.transform(image)
        if self.target_transform:
            label = self.target_transform(label)
        return image, label
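And once the class exists, you still have to wrap it before training; a sketch with hypothetical file paths:

from torch.utils.data import DataLoader

# Hypothetical annotation file and image directory
training_data = CustomImageDataset("annotations.csv", "images/")
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)

for images, labels in train_dataloader:
    ...  # forward pass, loss, backpropagation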

And here’s the TensorFlow prebuilt loader:

# TensorFlow example: the prebuilt Keras loader handles the dataset plumbing
import tensorflow as tf
import numpy as np

fashion_mnist = tf.keras.datasets.fashion_mnist

(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()

Although this is a pre-built function within TensorFlow, it’s illustrative of typical train/test splits.

(Bonus: Here’s someone coding in TensorFlow with a PyTorch influence: Building a Multi-label Text Classifier using BERT and TensorFlow)

Conclusion

If you have GPUs available, you’re typically not going to see any major differences between the two frameworks. However, keep the above edge cases in mind, or you might find yourself rebuilding an entire pipeline from one framework to the other. Just like I did.

Happy Time-saving!

-Matt.

If you have additional questions about this article or our AI consulting framework, feel free to reach out by LinkedIn or by email.




Matt Lemay, P.Eng (matt@lemay.ai) is the co-founder of lemay.ai, an international enterprise AI consultancy, and of AuditMap.ai, an internal audit platform.