Text Summarization with GPT2 and Layer AI

Using Hugging Face’s Transformers library and Layer AI to fine-tune GPT2 for text summarization

Aymane Hachcham
Towards Data Science


Photo by Aaron Burden on Unsplash

The Transformer soon became the most popular model in NLP after its debut in the famous 2017 paper Attention Is All You Need. Its capacity to analyze text in a non-sequential manner (as opposed to RNNs) made it possible to train much larger models, and the attention mechanism it introduced proved tremendously valuable for generalizing over text.

Before the advent of deep learning, approaches to NLP were more rule-based: simpler, purely statistical machine learning algorithms were taught which words and phrases to look for in the text and which replies to produce when those phrases were found.

Following the publication of the study, numerous popular transformers emerged, the most well-known of which is GPT (Generative Pre-trained Transformer). OpenAI, one of the pioneers in AI research, created and trained GPT models. GPT-3 is the most recent version, with 175 billion parameters. Because the model was so sophisticated, OpenAI decided not to open-source it. People can use it via an API after completing a lengthy registration process.

In this article I’ll give a primer on transformers with a bit of technical background. Then we will use Layer to fetch a pre-trained version of GPT2 and fine-tune it for summarization. The dataset we will be using comprises Amazon reviews and can be found on Kaggle at the following link:

Overview

  1. Transformers
  2. Amazon Reviews Dataset
  3. Fine tuning GPT2
  4. Assessing our fine-tuned model
  5. Conclusion

I have included code in this article where it is most instructive. The full code base and dataset for this project can be found in my public Colab notebook or in the GitHub repo.

Quick Tour of the Transformers Library

Attention Based Models

An attention mechanism in machine learning is based on how our own cerebral cortex works. When we examine a picture to describe it, we naturally focus our attention on a few key places that we know hold crucial information. We don’t examine every detail of the image with the same intensity. When dealing with complicated data to analyze, this approach can help save processing resources.

Similarly, when an interpreter translates material from a source language to a target language, they know which words in the source sentence correspond to which terms in the translated phrase based on previous experience.

Photo by Nadi Borodina on Unsplash

GPT2

The GPT-2 language model was introduced in 2019 in the paper “Language Models are Unsupervised Multitask Learners” by Alec Radford, Jeffrey Wu, Rewon Child, David Luan and colleagues, with the goal of developing a system that could learn from previously produced text. In this way it can present multiple options for completing a phrase, saving time and adding diversity and linguistic depth to the text, all without grammatical errors.

The main layer of the GPT2 architecture is the Attention layer. Without digging too much into the technicalities of it, I would like to list the core features:

GPT2 uses Byte Pair Encoding to create the tokens in its vocabulary. This means the tokens are usually parts of words.

GPT-2 was trained with the goal of causal language modeling (CLM) and is thus capable of predicting the next token in a sequence. GPT-2 may create syntactically coherent text by utilizing this capability.

GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input. The model is chameleon-like — it adapts to the style and content of the conditioning text.
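As a quick illustration of the first two points, here is a minimal sketch using the Hugging Face transformers library: it shows GPT-2’s byte-pair-encoded tokens for a sentence and lets the pre-trained model continue an arbitrary prompt.

```python
# Minimal sketch: GPT-2 BPE tokenization and causal text continuation
# with the Hugging Face transformers library.
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Byte Pair Encoding: the printed tokens are usually sub-word units.
print(tokenizer.tokenize("Summarization with transformers"))

# Causal language modeling: the model continues an arbitrary prompt,
# sampling from the most probable next tokens at each step.
inputs = tokenizer("The best thing about this product is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=30, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```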

Amazon Reviews Dataset

The dataset, as presented on Kaggle, is aimed at developing an algorithm that can provide meaningful summaries for Amazon reviews of fine foods. It contains over 500,000 reviews.

Customers create a text review and a title for it when they write a review on Amazon. The title of the review is used as the summary in the dataset.

Sample: “I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most”.

Summary: “Good Quality Dog Food”

Sample from the dataset uploaded to Layer

There are ~71K instances in the dataset, which is sufficient to train a GPT-2 model.

Before starting the processing phase, we need to set up the connection with Layer. We need to log in and initialize a project:
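A minimal sketch of this step, assuming the `layer` Python package with its `login` and `init` helpers (the project name below is a placeholder):

```python
import layer

# Authenticate against Layer (opens a browser prompt for the access token).
layer.login()

# Initialize (or attach to) the project that will hold our datasets and models.
layer.init("gpt2-amazon-reviews-summarization")  # hypothetical project name
```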

We then can access the Layer console with our project initialized and ready to log the dataset and the models we will be using.

Capture from the Layer Console

Let’s save the data to Layer:

First, we mount the GitHub repo by cloning it into Colab:
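The sketch below clones the repository (the URL is a placeholder) and then registers the reviews as a Layer dataset; the registration part assumes Layer’s `@dataset` decorator and `layer.run`:

```python
# Clone the project repository inside Colab (placeholder URL).
!git clone https://github.com/<username>/<repo>.git

import layer
import pandas as pd
from layer.decorators import dataset

@dataset("amazon_fine_food_reviews")  # hypothetical dataset name
def build_reviews_dataset():
    # Read the Kaggle CSV shipped with the cloned repository and keep
    # the two columns we need: the review text and its short summary.
    df = pd.read_csv("<repo>/Reviews.csv")[["Text", "Summary"]].dropna()
    return df

# Build the dataset on Layer and register it under the current project.
layer.run([build_reviews_dataset])
```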

We can now fetch the data from Layer this way:
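A sketch of the fetch, assuming the dataset name used above:

```python
import layer

# Pull the registered dataset back as a pandas DataFrame.
reviews_df = layer.get_dataset("amazon_fine_food_reviews").to_pandas()
print(reviews_df.shape)
reviews_df.head()
```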

Perform some data processing

GPT-2’s multitasking capability is one of its most appealing features: the very same model can be trained on many tasks. However, we must use the relevant task designator, as outlined in the original paper.

The TL;DR marker, which stands for “too long; didn’t read,” is an ideal task designator for summarization.

The TL;DR marker is appended after the (shortened) review text to indicate to the model that the important content ends there and the summary begins.

We also need to compute the average length of the input reviews:
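A quick way to get this number, assuming the DataFrame fetched above with the review text in the `Text` column:

```python
# Average review length in words, used to pick a sensible maximum sequence length.
avg_length = reviews_df["Text"].str.split().str.len().mean()
print(f"Average review length: {avg_length:.0f} words")
```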

In our case the average length comes to about 73 words per review, so a maximum length of 100 tokens should cover the majority of cases.

Code for processing data samples may become complicated and difficult to maintain; for greater readability and modularity, we’d like our dataset code to be detached from our model training code. Therefore, we can use the PyTorch Dataset class as a wrapper that will act as a compact module to transform the text reviews into tensors ready for training.
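A minimal sketch of such a wrapper, assuming the column names from the Kaggle file and the ` TL;DR ` marker discussed above (the exact truncation and padding strategy is an assumption):

```python
import torch
from torch.utils.data import Dataset

class ReviewSummaryDataset(Dataset):
    """Turns each (review, summary) pair into one token sequence:
    '<review> TL;DR <summary><eos>'."""

    def __init__(self, dataframe, tokenizer, max_length=100):
        self.data = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer  # the tokenizer's pad token must be set for GPT-2
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text = f"{row['Text']} TL;DR {row['Summary']}{self.tokenizer.eos_token}"
        encoding = self.tokenizer(
            text,
            max_length=self.max_length,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
        }
```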

Once the model is imported, we initialize our Dataset and DataLoader components to prepare for the training stage. We can call our tokenizer from Layer:
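A sketch of this step; the Layer path for the tokenizer is a placeholder, and `get_train()` is assumed to return the logged tokenizer object (loading it straight from the Hugging Face hub works just as well):

```python
from torch.utils.data import DataLoader

# Fetch the GPT-2 tokenizer previously logged in Layer (placeholder path).
tokenizer = layer.get_model("gpt2-amazon-reviews-summarization/tokenizer").get_train()
# Equivalent fallback:
# from transformers import GPT2Tokenizer
# tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# GPT-2 has no padding token by default; reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

train_dataset = ReviewSummaryDataset(reviews_df, tokenizer, max_length=100)
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```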

Fine tuning GPT2

The training process is straightforward since GPT2 is capable of several tasks, including summarization, generation, and translation. For summarization we only need to include the labels of our dataset as inputs. The training part includes building and uploading the GPT2 model to Layer.

The training will be performed entirely in Layer, using the f-gpu-small fabric, a small GPU instance with 48 GB of memory. We start with the training loop:
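A minimal sketch of such a loop: for causal language modeling the input ids double as the labels, so the loss comes straight out of the model.

```python
from torch.optim import AdamW
from transformers import GPT2LMHeadModel

def train_loop(model, loader, epochs=3, lr=5e-5, device="cuda"):
    model.to(device)
    model.train()
    optimizer = AdamW(model.parameters(), lr=lr)

    for epoch in range(epochs):
        total_loss = 0.0
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)

            # Labels are the inputs themselves; the model shifts them
            # internally to compute the next-token prediction loss.
            outputs = model(input_ids=input_ids,
                            attention_mask=attention_mask,
                            labels=input_ids)
            loss = outputs.loss

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"epoch {epoch + 1}: mean loss {total_loss / len(loader):.4f}")
    return model
```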

Then we build the model with the required parameters. We upload it to Layer:
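A sketch of the Layer-side wrapper, assuming the `@model`, `@fabric` and `@pip_requirements` decorators from `layer.decorators`; the model name is a placeholder, and it reuses the training loop and data loader sketched above:

```python
from layer.decorators import model, fabric, pip_requirements

@model("gpt2-fine-tuned-summarizer")                   # hypothetical model name
@fabric("f-gpu-small")                                 # the GPU fabric mentioned above
@pip_requirements(packages=["torch", "transformers"])
def train_gpt2():
    from transformers import GPT2LMHeadModel

    # When running remotely, the dataset may need to be re-fetched here
    # (e.g. with layer.get_dataset) rather than captured from the notebook.
    gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
    trained = train_loop(gpt2, train_loader, epochs=3)
    return trained  # the returned model is registered in Layer under the name above
```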

We then run the training in Layer calling the `layer.run` function:
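The call itself is a one-liner, assuming the decorated function defined above:

```python
# Execute the decorated training function remotely on Layer's infrastructure.
layer.run([train_gpt2])
```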

And the training will start. It could take a long time depending on the epochs and the resources available.

Assessing our fine-tuned model

Once the model is fine-tuned, we can start processing the reviews using the following procedure:

  • Step 1: The model is fed a review at first.
  • Step 2: From the model’s predictions for the next token, one of the top-k candidates is chosen.
  • Step 3: The choice is added to the summary and the current sequence is fed to the model.
  • Step 4: Steps 2 and 3 should be repeated until either max_length is reached or the EOS token is produced.

Choose one of the top-k predictions from all the predictions yielded by the model:
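A sketch of the top-k filter using `torch.topk`: only the k most probable logits are kept, and one token id is sampled from them.

```python
import torch
import torch.nn.functional as F

def choose_from_top_k(logits, k=10):
    """Sample one token id from the k most probable next-token logits."""
    top_values, top_indices = torch.topk(logits, k)
    probs = F.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice].item()
```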

Then we define our inference method:
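A sketch of the generation loop, reusing the helper above: the ` TL;DR ` marker primes the model, and generation stops at `max_length` or at the end-of-sequence token.

```python
def summarize(model, tokenizer, review_text, max_length=60, k=10, device="cuda"):
    model.eval()
    # Prime the model with the review followed by the task designator.
    input_ids = tokenizer.encode(f"{review_text} TL;DR ", return_tensors="pt").to(device)

    with torch.no_grad():
        for _ in range(max_length):
            logits = model(input_ids).logits[0, -1, :]   # logits for the last position
            next_id = choose_from_top_k(logits, k=k)
            if next_id == tokenizer.eos_token_id:        # stop at end-of-sequence
                break
            next_token = torch.tensor([[next_id]], device=device)
            input_ids = torch.cat([input_ids, next_token], dim=1)

    generated = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    # Keep only the part produced after the TL;DR marker.
    return generated.split("TL;DR")[-1].strip()
```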

We can now test the above code with 3 sample reviews and take a look at the generated summaries.

First, we call our trained model from Layer:
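A sketch, assuming the model name used during training and that `get_train()` returns the fine-tuned model object:

```python
import layer

# Fetch the fine-tuned GPT-2 model registered in Layer during training.
trained_model = layer.get_model("gpt2-fine-tuned-summarizer").get_train()
trained_model.to("cuda")
```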

We take some samples from the test data:
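For instance, assuming a held-out slice of the reviews DataFrame:

```python
# Pick a few reviews that were not used for training and summarize them.
test_reviews = reviews_df["Text"].sample(3, random_state=42).tolist()

for review in test_reviews:
    summary = summarize(trained_model, tokenizer, review)
    print(f"Review: {review[:80]}...\nSummary: {summary}\n")
```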

Final Results:

Review 1
Text Review: “Love these chips. Good taste, very crispy and very easy to clean up the entire 3 oz. bag in one sitting. NO greasy after-taste. Original and barbecue flavors are my favorites but I haven’t tried all flavors. Great product.”

Associated Summary: {‘very yummy’, ‘Love these chips!’, ‘My favorite Kettle chip’}

Review 2
Text Review: “We have not had saltines for many years because of unwanted ingredients. This brand is yummy and contains no unwanted ingredients. It was also a lot cheaper by the case than at the local supermarket.”

Associated Summary: {‘yummy’, ‘yummy’, ‘Great product!’}

Review 3
Text Review: “Best English Breakfast tea for a lover of this variety and I’ve tried so many including importing it from England. After a 20 year search I’ve found a very reasonable price for the most flavorful tea.”

Associated Summary: {‘Wonderful Tea’, ‘The BEST tea for a lover of a cup of tea’, ‘Excellent tea for a lover of tea’}

Concluding Thoughts

The results we’ve obtained so far are quite decent considering the amount of training we’ve done. Layer facilitates the whole process of importing models and tokenizers. The option of logging all your model artifacts and datasets also feels really useful, since you can track the effects of your work in real time. There is also the possibility to share your project and collaborate as a team, so it’s definitely worth trying.

All the code is hosted in the Google Colab referenced above; feel free to check it out and try it yourself.
