HuggingFace Processing Jobs on Amazon SageMaker

Prepare text data for your NLP pipeline in a scalable and reproducible way

Heiko Hotz
Towards Data Science


Photo by Wonderlane on Unsplash

What is this about?

The latest version of the SageMaker Python SDK (v2.54.0) introduced HuggingFace Processors, which are used to run processing jobs. These processing jobs can be used to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation on Amazon SageMaker.

The HuggingFace Processors are immensely useful for NLP pipelines that are based on HuggingFace’s Transformer models. The deep learning containers (DLCs) have all the required dependencies pre-installed and are optimised for typical HuggingFace data transformations like tokenization. In this tutorial we will have a look at these Processors and learn how they can be utilised to prepare text data for training a Transformer model.

As always, the code for this tutorial is available on GitHub.

Hugging Face + SageMaker: A brief overview

In March 2021, Hugging Face and AWS announced a partnership that made it much easier to train state-of-the-art NLP models on Amazon SageMaker. With the new Hugging Face Training DLCs, training cutting-edge Transformers-based NLP models has become much simpler. In July, this integration was extended to add easy deployment and inference of Transformers models on SageMaker. And now, in August 2021, the SageMaker Python SDK has added yet another building block to this integration: HuggingFace Processors.

Why is this important?

The HuggingFace Processor allows us to prepare text data in a containerised environment that runs on dedicated EC2 instances. This has two principal benefits: (1) For large datasets, data preparation can take a long time. Choosing dedicated EC2 instances allows us to pick the right processing power for the task at hand. (2) Codifying the data preparation via a processing job enables us to integrate the data processing step into a CI/CD pipeline for NLP tasks in a scalable and reproducible way.

Downloading and inspecting the dataset

With all that being said, let’s get started! Our goal is to prepare a dataset so that, at a later point, a binary sentiment classifier can be trained with this data. This classifier takes a text as input and predicts whether the sentiment in the text is positive or negative. To do so, we will utilise the Amazon Customer Reviews Dataset in this tutorial.

This dataset contains reviews for various categories and in this tutorial we will use the reviews for digital software. We can download the data from a public S3 folder and have a first look:
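A minimal sketch of this first look, assuming the dataset’s public S3 path and pandas with s3fs installed:

```python
import pandas as pd

# Public S3 path of the Digital Software reviews (path assumed here;
# reading "s3://" URLs directly with pandas requires the s3fs package)
s3_path = "s3://amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Software_v1_00.tsv.gz"

df = pd.read_csv(s3_path, sep="\t", compression="gzip", on_bad_lines="skip")
print(df.shape)
df.head()
```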

Image by author

We can see that there are quite a few columns in this dataset, most of which we won’t actually require for our model. This means we will have to discard those columns. Since we want to train a binary classifier, we also need to convert the star rating into binary values, i.e. 0s and 1s, which will represent negative and positive reviews, respectively. In this tutorial we will use a threshold of 4 stars to convert the ratings to binary values. That means each rating with 4 or 5 stars will be marked as a positive review, while each rating below 4 stars will be considered a negative review. To prepare the data for a Transformers model, we also want to tokenize the text. Finally, we want to split the data into training, validation, and test sets.
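As a rough sketch of this preparation logic (model checkpoint, column names, and split ratios are assumptions for illustration):

```python
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Keep only the columns we need and convert star ratings to binary labels:
# 4-5 stars -> 1 (positive), 1-3 stars -> 0 (negative)
df = df[["review_body", "star_rating"]].dropna()
df["label"] = (df["star_rating"] >= 4).astype(int)

# Split into training, validation, and test sets (ratios are assumptions)
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=42)

# Tokenize the review text (model checkpoint is an assumption)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(
    train_df["review_body"].tolist(), truncation=True, padding=True
)
```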

All this logic will be captured in the processing script, which you can find here. The focus of this tutorial is on utilising the HuggingFace Processor and not so much the data preparation itself, so I won’t go into further detail about the processing script. However, if you have any questions about the script, feel free to reach out!

Using the HuggingFace Processor

Now that we have developed the processing script, we can use the SageMaker Python SDK to kick off a processing job. First, we need to define the processing job.
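A sketch of what this definition could look like; the instance type and framework versions are illustrative and need to match one of the available HuggingFace deep learning containers:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceProcessor

role = sagemaker.get_execution_role()

# Instance type and framework versions below are assumptions; pick a
# combination that matches an available HuggingFace DLC
hf_processor = HuggingFaceProcessor(
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    base_job_name="hf-processing",
)
```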

We define the instance type we want to run this on and how many instances we want. If the data processing is heavy-duty and there is a lot of data, provisioning more than one instance might make sense.

Next we need to define the inputs and outputs for the processing job, as well as the parameters for the job:
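The snippet below is an illustrative sketch; the bucket name, container paths, and argument name are hypothetical:

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Raw reviews on S3, mounted into the container (bucket name is hypothetical)
inputs = [
    ProcessingInput(
        source="s3://my-bucket/amazon-reviews/raw/",
        destination="/opt/ml/processing/input",
    )
]

# Local container paths the script writes its splits to
outputs = [
    ProcessingOutput(source="/opt/ml/processing/train", output_name="train"),
    ProcessingOutput(source="/opt/ml/processing/validation", output_name="validation"),
    ProcessingOutput(source="/opt/ml/processing/test", output_name="test"),
]

# Parameters passed through to the processing script (argument name is hypothetical)
arguments = ["--star-rating-threshold", "4"]
```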

Finally, we can kick off the processing job.
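A minimal sketch of the run call, using the processor, inputs, outputs, and arguments defined above (the script filename is an assumption):

```python
# Kick off the processing job (script filename is an assumption)
hf_processor.run(
    code="preprocessing.py",
    inputs=inputs,
    outputs=outputs,
    arguments=arguments,
)
```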

Once the job is kicked off, we can see it running in the SageMaker console:

Image by author

After a few minutes the job finishes. The console provides many details about the processing job, for example which container image was used and where the processed data is stored on S3. We can also retrieve these data points programmatically via the SageMaker Python SDK. For example, we can easily look up the S3 locations of the processed data:
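A minimal sketch using the description of the processor’s most recent job, with keys as returned by the DescribeProcessingJob API:

```python
# Describe the finished job and print the S3 URI of each output
description = hf_processor.latest_job.describe()

for output in description["ProcessingOutputConfig"]["Outputs"]:
    print(output["OutputName"], "->", output["S3Output"]["S3Uri"])
```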

Image by author

Conclusion

In this tutorial we have successfully leveraged SageMaker processing jobs to process data for a Transformer model. This processing job can now be incorporated into an NLP pipeline, where it kicks off every time new training data arrives. Now that the data is processed and tokenized, it can easily be used to train a HuggingFace Transformer model.
