High-performance Inferencing with Transformer Models on Spark

A tutorial with code using PySpark, Hugging Face, and AWS GPU instances

Dannie Sim
Towards Data Science


Want up to a 100x speed-up and 50% cost savings with Hugging Face or TensorFlow models? With GPU instances and Spark, we can run inference concurrently on anywhere from two to hundreds of GPUs, giving us even more performance with little effort.

Transformers, GPUs and Spark — Inferencing at Scale — A tutorial
Image by the author using licensed content from Canva

Overview

  • Setup Driver and Worker instances
  • Partitioning data for parallelization
  • Inferencing with Transformer Models
  • Discussion

Setup Driver and Worker instances

For this tutorial, we will be using Databricks, and you may sign up for a free account if you do not already have access to one. Take note that Databricks needs to connect to a cloud hosting provider such as AWS, Google Cloud Platform, or Microsoft Azure to run GPU instances.

For this exercise, we will use AWS GPU instances of type “g4dn.xlarge”. You can still follow these instructions on Google Cloud or Microsoft Azure by selecting an equivalent GPU instance from those providers.

Once your Databricks account is set up, log in and create a cluster with the configuration shown below:

Configuring a GPU cluster imaginatively named “gpu_cluster.”

Next, create a notebook and attach it to the cluster by selecting the cluster in the dropdown menu:

Set the notebook’s cluster

Now, we are all set to code.

Installing Hugging Face Transformers

First, let us install the Hugging Face Transformers library on the cluster.

Run this in the first cell of the notebook:

%pip install transformers==4.2
Hugging Face Transformers Python library installed on the cluster

Libraries installed this way are called notebook-scoped Python libraries. This approach is convenient, but the command has to be run at the start of a session, before other code, because it resets the Python interpreter.

At this point, we start with actual Python code. In the next cell, run:
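A single import of the library will do here:

import transformers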

If the above line ran without any errors, congratulations, Hugging Face Transformers is successfully installed.

Partitioning data for parallelization

The easiest way to create data that Spark can process in parallel is to put it in a Spark DataFrame. For this exercise, a DataFrame with two rows of data will suffice:
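Something along these lines will do (the DataFrame name “papers_df” and the sample rows are arbitrary):

# Two sample rows, each with a paper title and an abstract
papers_df = spark.createDataFrame(
    [
        ("Attention Is All You Need",
         "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks."),
        ("BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding",
         "We introduce a new language representation model called BERT."),
    ],
    ["title", "abstract"],
)
display(papers_df)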

The created Spark DataFrame is displayed.

The Transformer model for this exercise takes in two text inputs per row. We name them “title” and “abstract” here.

For the curious, there is an excellent article by Laurent Leturgez that dives into Spark partitioning strategies.

Inferencing with Transformer Models

We shall use the fantastic Pandas UDF functionality in PySpark to process the Spark DataFrame in memory-efficient partitions. Each partition of the DataFrame is presented to our code as a Pandas DataFrame, which, as you will see below, arrives as the parameter “df” of the function “embed_func.” A Pandas DataFrame makes it convenient to process the data in Python.
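Here is a sketch of what such a function might look like; details such as the model loading and the maximum sequence length are assumptions, and the batching and GPU-to-CPU copy are explained in the sections that follow:

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Load SPECTER once and move it to the GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter").to(device).eval()

batch_size = 20  # rows inferenced per forward pass, to keep GPU memory usage in check

def embed_func(df):
    # Combine title and abstract into one input, separated by the tokenizer's [SEP] token
    texts = (df["title"] + tokenizer.sep_token + df["abstract"]).tolist()
    chunks = []
    # Process the partition batch_size rows at a time
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        inputs = tokenizer(batch, padding=True, truncation=True,
                           max_length=512, return_tensors="pt").to(device)
        with torch.no_grad():
            output = model(**inputs)
        # Take the [CLS] token's hidden state as the document embedding and
        # copy it from GPU memory to a NumPy array in CPU memory
        chunks.append(output.last_hidden_state[:, 0, :].cpu().detach().numpy())
    embeddings = np.concatenate(chunks)  # shape: (len(df), 768)
    out = df.copy()
    out["embedding"] = [row.tolist() for row in embeddings]
    return out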

Code that defines embed_func(df)

You might have noticed two things from the code above:

  • The code further splits the input text in the Pandas DataFrame into chunks of 20 rows, as defined by the variable “batch_size.”
  • We use SPECTER by AllenAI, a pre-trained language model that generates document-level embeddings (pre-print here). Note, however, that we can easily swap this for another Hugging Face model such as BERT.

Working around GPU memory limits

When the GPU is used to run inference with this Hugging Face Transformer model, the inputs and outputs are stored in GPU memory. GPU memory is limited, especially since a large transformer model requires a lot of it to store its parameters, which leaves comparatively little room for the inputs and outputs.

Hence, we control memory usage by inferencing just 20 rows at a time. Once those 20 rows are done, we copy the output to a NumPy array, which resides in CPU memory (which is more abundant). This is done in the code above with “.cpu().detach().numpy()”.

Finally, the actual Transformer model inferencing on GPU
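A sketch of this final step, here applying “embed_func” with mapInPandas so that each batch of rows arrives as a Pandas DataFrame (the schema and the “papers_df” name follow the earlier snippets):

from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

# The output keeps the input columns and adds the 768-float SPECTER embedding
output_schema = StructType([
    StructField("title", StringType()),
    StructField("abstract", StringType()),
    StructField("embedding", ArrayType(FloatType())),
])

result_df = papers_df.mapInPandas(
    lambda batches: map(embed_func, batches),  # each batch is a Pandas DataFrame
    schema=output_schema,
)
display(result_df)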

As mentioned above, this is where the execution of the Pandas UDF for PySpark happens. In this case, the function applied is “embed_func” itself. Do read up on the link above to find out more about this powerful PySpark feature.

The resulting output of this exercise — SPECTER gives a 768-long array of floats as the document embedding.

Discussion

I hope you see how Spark, Databricks, and GPU instances make scaling up inferencing with large transformer models relatively trivial.

The technique shown here makes it possible to run inference on millions of rows and get it done in a matter of hours instead of days or weeks. This can make running large transformer models on large sets of data feasible in more situations.

Cost Savings

But wait, there is more. Despite costing 5 to 20 times as much as a CPU instance, a GPU instance can actually be more cost-effective for inferencing because it finishes the job 30 to 100 times faster. For example, an instance that costs 10 times as much but runs 50 times faster works out to one-fifth of the total cost.

Since we pay for the instance on an hourly basis, time is money here.

Less time spent on plumbing

Data is easily imported into Databricks and saved as Parquet files in AWS S3 buckets or, even better, as Delta Lake tables (a.k.a. Hive tables on steroids). After that, the data can be manipulated as Spark DataFrames, which, as seen in this article, are trivial to parallelize for transformation and inferencing.
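For instance, the embeddings computed earlier can be persisted in a single line (the table name here is a placeholder):

# Save the results as a Delta Lake table (or use .write.parquet(path) for plain Parquet files on S3)
result_df.write.format("delta").mode("overwrite").saveAsTable("paper_embeddings")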

All data, code, and compute are accessible and managed in one place in the cloud, not to mention intrinsically scalable as the data grows from gigabytes to terabytes, which makes this neat solution more “future-proof.”

Collaborate Seamlessly

Being a cloud-based solution means that as the team grows, we can add more people to the project who can securely access the data and notebook code. We can create charts for reports to share with other teams with just a few clicks.

Stay tuned for a TensorFlow take on this article, as well as more articles I have planned. If you found this helpful, please follow me; I am a new writer and I need your help. Do post your thoughts and questions if you have any.
