Amazon SageMaker and 🤗 Transformers: Train and Deploy a Summarization Model with a Custom Dataset

A deep dive on newly released capabilities for end-to-end model training and deployment

JoĂŁo Moura
Towards Data Science

--

On March 25th, 2021, Amazon SageMaker and HuggingFace announced a collaboration that aims to make it easier to train state-of-the-art NLP models using the accessible Transformers library. HuggingFace Deep Learning Containers open up a vast collection of pre-trained models for direct use with the SageMaker SDK, making it a breeze to provision the right infrastructure for the job.

The main benefit of using pre-trained models — other than having access to ready-made state-of-the-art architecture implementations — is that you can fine-tune them using your own data. Starting from a model that has already been trained on an enormous corpus also means that you don’t need to source a very large dataset in order to get good performance on your specific use case.

In this tutorial, I will take you through the following three steps:

  • Fine-tuning a summarization model on a custom dataset: this entails preparing your data for ingestion and launching a SageMaker training job;
  • Deploying the fine-tuned model to SageMaker Hosting services: this automatically provisions a persistent endpoint that hosts your model, from which you can get predictions in real time;
  • Launching a Streamlit application on your local machine to interact with your model: Streamlit is an open-source framework that lets you create interactive web applications from very simple Python scripts, and is frequently used for ML and data apps.

We will start by setting up a SageMaker Notebook Instance, where we will run the code for this tutorial. You can follow these instructions to set up your Notebook environment; an ml.m5.xlarge Notebook instance type should do just fine. Go ahead and start up a new Jupyter notebook, with a conda_pytorch_p36 kernel. All the packages we will need come pre-installed in the notebook environment.

If you plan on following the code in this tutorial, now would be a good time to clone this repo into your instance. There are three notebooks (data preparation, fine-tuning, and deployment), which contain the pieces of code that come next. Having done that, let’s kick it off!

Model and Dataset

We will be using the PEGASUS¹ model for the purpose of this tutorial, and a fine-tuning script available in the Transformers GitHub repository. You can use the script as is with PEGASUS, as well as with a few other sequence-to-sequence models available in the HuggingFace Model Hub, such as BART and T5 (see all suitable options in the Supported Architectures section of the script’s README).

The script exposes configuration and hyperparameter options to the user, while taking care of the appropriate text tokenization, training loop, etc. You can provide your dataset in one of two ways: either 1) specifying a dataset_name (which will be downloaded from the HuggingFace Dataset Hub), or 2) specifying the location of your local data files (train_file, validation_file, and test_file); we are interested in the latter. The files are downloaded from Amazon S3 at training time, as we will see.

The Extreme Summarization (XSUM) dataset² contains ~225,000 BBC articles and their summaries, covering a variety of domains. In the first notebook, we download the dataset and extract the text body and corresponding summary for every article. We then split the data into train and validation sets, and upload a CSV file to S3 for each. Each line in these files holds the text,summary pair for a single data sample; this format is recognized automatically by the training script (keep the CSV headers). The README mentioned above details how CSV or JSONLINES files can be used in a more general way, and how to specify the right column names if your file has extra columns.
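For reference, the data-preparation notebook boils down to something like the sketch below; the column handling and the summarization/data S3 prefix are illustrative assumptions, not necessarily the exact code in the repo:

# Sketch of the data preparation step (assumptions noted above)
import sagemaker
from datasets import load_dataset

dataset = load_dataset("xsum")  # ~225k BBC articles with single-sentence summaries

# Keep only the article body and its summary, using the text,summary column names
train = dataset["train"].rename_column("document", "text").remove_columns(["id"])
validation = dataset["validation"].rename_column("document", "text").remove_columns(["id"])

train.to_csv("train.csv", index=False)
validation.to_csv("validation.csv", index=False)

# Upload both CSV files to the default SageMaker bucket
session = sagemaker.Session()
train_s3_uri = session.upload_data("train.csv", key_prefix="summarization/data")
validation_s3_uri = session.upload_data("validation.csv", key_prefix="summarization/data")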

Fine-tuning

Now we are ready to set up the configuration and hyperparameters for the training job! First, we define some imports, and retrieve our SageMaker execution role, session and default S3 bucket:
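A minimal version of this setup looks like the following:

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()    # IAM role attached to the notebook instance
session = sagemaker.Session()
bucket = session.default_bucket()        # default S3 bucket for this session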

Then, we pass our hyperparameters to the HuggingFace Estimator, along with some other configurations:
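A sketch of this configuration is shown below; the hyperparameter values, starting checkpoint, branch, and container versions are illustrative, so check what the HuggingFace Deep Learning Containers currently support before running it:

from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "model_name_or_path": "google/pegasus-large",   # any supported seq2seq checkpoint
    "train_file": "/opt/ml/input/data/train/train.csv",
    "validation_file": "/opt/ml/input/data/validation/validation.csv",
    "do_train": True,
    "do_eval": True,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,    # roughly the maximum that fits on an ml.p3.2xlarge
    "learning_rate": 5e-5,
    "output_dir": "/opt/ml/model",        # so the artifacts end up in model.tar.gz
}

# The fine-tuning script is pulled straight from the Transformers GitHub repository
git_config = {
    "repo": "https://github.com/huggingface/transformers.git",
    "branch": "v4.6.1",
}

huggingface_estimator = HuggingFace(
    entry_point="run_summarization.py",
    source_dir="./examples/pytorch/summarization",
    git_config=git_config,
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
    hyperparameters=hyperparameters,
    # To enable SageMaker Distributed Data Parallel on a multi-GPU instance type
    # (e.g. ml.p3.16xlarge), uncomment the following line:
    # distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)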

An Estimator is an abstraction that encapsulates training in the SageMaker SDK; because git support is built-in, we can directly specify the training script name and directory as entry_point and source_dir, as well as the repo and branch in git_config. The PyTorch, Transformers and Python versions correspond to the latest supported in the HuggingFace container at the time of writing.

The maximum batch size this particular model can handle on an ml.p3.2xlarge instance without running out of GPU memory is around 2; this capacity also depends on other factors, such as the maximum length we define for input sequences. Since this is a fairly large dataset, if you want to increase the batch size and greatly accelerate training, you can take advantage of SageMaker Distributed Data Parallel Training; this feature has been integrated into the Transformers Trainer API, so you can leverage it with no changes to your training script other than the minimal settings shown in the code snippet above.

The hyperparameters will be passed to the fine-tuning script as command line arguments. We define training parameters (such as the number of epochs and the learning rate), and also the directory within the training container where the data will be located. In File Mode, SageMaker downloads your data from S3 and makes it available in the /opt/ml/input/data/<channel_name> directory. When you call Estimator.fit(), effectively starting the training job, you provide each <channel_name> and the corresponding S3 location of the train and validation sets.
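With the estimator configured, starting the job is a single call; the channel names below ("train" and "validation") are what produce the /opt/ml/input/data/train and /opt/ml/input/data/validation directories referenced in the hyperparameters above:

# Start the training job, pointing each channel at the CSV files uploaded earlier
huggingface_estimator.fit(
    {
        "train": train_s3_uri,
        "validation": validation_s3_uri,
    }
)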

When the training job is finished, a model.tar.gz file will be uploaded to your default session_bucket on S3; it will contain the model and tokenizer artifacts that we need to deploy our fine-tuned model, and serve inference requests.

Deploying

Finally, we deploy our fine-tuned model to a persistent endpoint. SageMaker Hosting will expose a RESTful API, which you can use to get predictions from your model in real-time.

In the inference_code directory, you will find the script inference.py. This script is central to the deployment process: it defines 1) how to transform input and output to and from the endpoint, 2) how the model should be loaded in the inference container, and 3) how to get a prediction from the model. The script must follow a specific structure, so that SageMaker knows which piece of your code to use for each of these functions.
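As a rough sketch (the actual script in the repo may differ in its details, and the JSON payload with text and parameters keys is an assumption carried through the remaining examples), the structure SageMaker expects looks like this, one function per responsibility:

# Sketch of inference.py; SageMaker looks for these specific function names
import json
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


def model_fn(model_dir):
    """Load the fine-tuned model and tokenizer from the unpacked model.tar.gz."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)
    return model, tokenizer


def input_fn(request_body, request_content_type):
    """Deserialize the incoming JSON request."""
    return json.loads(request_body)


def predict_fn(data, model_and_tokenizer):
    """Generate a summary, forwarding any generation parameters (e.g. length_penalty)."""
    model, tokenizer = model_and_tokenizer
    inputs = tokenizer(data["text"], truncation=True, return_tensors="pt")
    parameters = data.get("parameters", {})
    summary_ids = model.generate(**inputs, **parameters)
    return tokenizer.batch_decode(summary_ids, skip_special_tokens=True)


def output_fn(prediction, response_content_type):
    """Serialize the prediction back to JSON."""
    return json.dumps({"summary": prediction})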

We first define a HuggingFaceModel, containing the details on the model and inference code locations:
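A sketch of this definition (container versions again illustrative):

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,   # S3 URL of the model.tar.gz artifacts
    role=role,
    entry_point="inference.py",
    source_dir="inference_code",
    transformers_version="4.6.1",
    pytorch_version="1.7.1",
    py_version="py36",
)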

Notice that we can get the S3 URL of the model artifacts by using huggingface_estimator.model_data; also, entry_point and source_dir specify the name and directory of the inference script (stored on your notebook).

Then, we deploy the model to an ml.m5.xlarge instance:
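For example (the endpoint name is an arbitrary choice, matching the one passed to the Streamlit app later):

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="summarization-endpoint",
)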

Once deployment is finished, you can directly use the predictor object to get predictions from your model. Notice that we can pass in any of the model’s generation parameters; in this case, we specify length_penalty, a real number which, if >1, incentivizes the generation of longer summaries, and vice versa. We’ll use a sample news update for our test:
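Something along these lines (the payload format follows the inference.py sketch above, the length_penalty values are hypothetical, and the article text itself is omitted here; the original test used a short news update about scheduled power blackouts):

article = "..."  # paste your test article here

# Two requests with different length penalties
short_summary = predictor.predict({"text": article, "parameters": {"length_penalty": 0.6}})
long_summary = predictor.predict({"text": article, "parameters": {"length_penalty": 2.0}})

print(short_summary)
print(long_summary)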

We can see that when summarizing this specific news story, the change in the length parameter was quite meaningful, given that the shorter summary omitted an important piece of information revealed in the longer one: the reason for the scheduled blackouts.

Launch a simple Streamlit interface

So that you don’t have to interact with your model by manually running lines of code all the time, you will now deploy a Streamlit UI on your local machine. As you can see in the script streamlit_app.py, Streamlit allows us to create an intuitive and responsive interface in little more than 30 lines of code. Within this code, we invoke our SageMaker endpoint in a different way than before, using the SageMaker Runtime Boto3 client for Python, as in the sketch below. You will need to have the proper permissions set up on your local environment, either by configuring your credentials with the AWS CLI, or by passing them directly as parameters to the boto3 client.
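The core of such a script might look like the following sketch (the widget layout and payload format are assumptions; the actual streamlit_app.py in the repo may be organized differently):

import argparse
import json

import boto3
import streamlit as st

# The endpoint name is passed on the command line, after the `--` separator shown below
parser = argparse.ArgumentParser()
parser.add_argument("--endpoint_name", type=str, required=True)
args, _ = parser.parse_known_args()

runtime = boto3.client("sagemaker-runtime")

st.title("Summarization demo")
text = st.text_area("Paste the article to summarize")

if st.button("Summarize") and text:
    response = runtime.invoke_endpoint(
        EndpointName=args.endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"text": text}),
    )
    st.write(json.loads(response["Body"].read().decode("utf-8")))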

First, install Streamlit by opening up a new terminal and running:

pip install streamlit

To launch the interface, run the following command:

streamlit run \
streamlit_app.py -- --endpoint_name summarization-endpoint

This will start your application on localhost, listening on port 8501 by default. A browser tab should automatically pop up after you run this command, but just visit http://localhost:8501 if it doesn’t.

This is what the interface looks like:

Now, you can even update/switch the model endpoint powering the application, so that you can live-test it after a re-train or a fine-tune on a different dataset!

Conclusion

In this blog post we saw how to leverage native capabilities of the HuggingFace SageMaker Estimator to fine-tune a state-of-the-art summarization model. Most importantly, we used a custom dataset and a ready-made example script, something you can replicate in order to easily train a model on your personal/company’s data.

We also saw how to easily grab the resulting fine-tuned model, deploy it to a fully managed endpoint, and invoke it via a simple Streamlit interface.

In a real-life scenario, you can easily optimize and extend various parts of this process, for example by setting up auto-scaling to automatically launch more endpoint instances based on load, or by attaching a “fraction” of a GPU to your endpoint for accelerated inference, among many other useful features.

I hope you will now be able to take quicker advantage of the plethora of NLP resources available. See you next time!

P.S. Many thanks to Heiko Hotz for motivating and supporting the creation of this blog post.

References

[1] J. Zhang, Y. Zhao, M. Saleh, P. Liu, PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization (2020) Proceedings of the 37th International Conference on Machine Learning

[2] S. Narayan, S.B. Cohen, M. Lapata, Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization (2018) Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
