
Setting up a Text Summarisation Project (Introduction)

A practical guide for diving deep into text summarisation with Hugging Face Transformers

Photo by nadi borodina on Unsplash

Update (14 Dec 2021): This tutorial has now been published as one long article here.

The arrival of a new age in Natural Language Generation

When OpenAI released the third generation of their machine learning (ML) model that specialises in text generation in July 2020, I knew something was different. This model struck a nerve like none that came before it. Suddenly I heard friends and colleagues, who might be interested in technology but usually don’t care much about the latest advancements in the AI/ML space, talk about it. Even the Guardian wrote an article about it. Or, to be precise, the model wrote the article and the Guardian edited and published it. There was no denying it – GPT-3 was a game changer.

Once the model had been released, people immediately started to come up with potential applications for it. Within weeks plenty of impressive demos were created, which can be found on the Awesome GPT-3 website. One particular application that caught my eye was text summarisation, i.e. the capability of a computer to read a given text and summarise its content. It combines two tasks within Natural Language Processing (NLP), reading comprehension and text generation, and is one of the hardest tasks for a computer – which is why I was so impressed by the GPT-3 demos for text summarisation.

You can give them a try on the Hugging Face Spaces website. My favourite one at the moment is an application that generates summaries of news articles with just the URL of the article as input.

What is this tutorial about?

Many organisations I work with (charities, companies, NGOs) have huge amounts of texts they need to read and summarise – financial reports or news articles, scientific research papers, patent applications, legal contracts, etc. Naturally, these organisations are interested in automating these tasks with NLP technology. So, in order to demonstrate the art of the possible, I often use the text summarisation demos, and they almost never fail to impress.

But now what?

The challenge for these organisations is that they want to assess text summarisation models based on summaries for many, many documents – not one at a time. They don’t want to hire an intern whose only job is to open the application, paste in a document, hit the "Summarise" button, wait for the output, assess whether the summary is good, and do that all over again for thousands of documents.

Which brings us to the objective of this blog post series: In this tutorial I propose a practical guide for organisations so they can assess the quality of text summarisation models for their domain.

Who is this tutorial (not) for?

I wrote this tutorial with my past self from four weeks ago in mind, i.e. it’s the tutorial I wish I had back then when I started on this journey. In that sense the target audience of this tutorial is someone who is familiar with AI/ML and has used Transformer models before, but is at the beginning of their text summarisation journey and wants to dive deeper into it. Because it’s written by a "beginner" for beginners, I want to stress the fact that this tutorial is a practical guide – not THE practical guide. Please treat it as if George Box had said:

Image by author

In terms of how much technical knowledge is required for this tutorial: it does involve some coding in Python, but most of the time we will just use the code to call APIs, so no deep coding knowledge is required. It will be useful to be familiar with certain concepts of machine learning, e.g. what it means to train and deploy a model, and the concepts of training, validation, and test datasets. Having dabbled with the transformers library before might also be useful, as we will use this library extensively throughout this tutorial. That all being said, I will try to include useful links for further reading on these concepts – if I don’t forget 😉

Because this tutorial is written by a beginner, I don’t expect NLP experts and advanced deep learning practitioners to get much out of this tutorial. At least not from a technical perspective – you might still enjoy the read, though, so please don’t leave just yet! But you will have to be patient with regard to my simplifications – I tried to live by the principle of making everything in this tutorial as simple as possible, but not simpler.

Structure of this tutorial

This series will stretch over five parts in which we will go through different stages of a text summarisation project. In the first part we will start by introducing a metric for text summarisation tasks, i.e. a measure of performance that will allow us to assess whether a summary is "good" or "bad". We will also introduce the dataset we want to summarise and create a baseline using a no-ML "model", i.e. we will use a simple heuristic to generate a summary from a given text. Creating this baseline is a vitally important step in any ML project because it will enable us to quantify how much progress we make by using AI going forward, i.e. it allows us to answer the question "Is it really worth investing in AI technology?"
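To make this concrete, here is a minimal sketch of what such a no-ML baseline could look like. It simply takes the first three sentences of a document as its "summary" and scores it with ROUGE, the standard metric for summarisation tasks. Note that this is just an illustration – the exact heuristic and metric we settle on in the series may differ, and the example document and the rouge-score package are assumptions for the sake of a runnable snippet:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

def lead_3(text: str) -> str:
    """No-ML 'model': take the first three sentences as the summary."""
    sentences = text.split(". ")
    return ". ".join(sentences[:3])

# Hypothetical example – in the tutorial we would loop over a whole dataset.
document = (
    "The quarterly report shows revenue grew by 12 percent. "
    "Costs remained flat compared to the previous quarter. "
    "The board approved a new investment plan. "
    "Further details are provided in the appendix."
)
reference = "Revenue grew 12 percent while costs stayed flat, and a new investment plan was approved."

candidate = lead_3(document)

# ROUGE measures n-gram overlap between the candidate and the reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))
```

Whatever scores this baseline achieves become the bar that every ML model in the later parts has to clear.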

In the next part (part 2) we will use a model that has already been pre-trained to generate summaries. This is possible thanks to a modern approach in ML called Transfer Learning. You can read more about it in this paper. This is another useful step because we basically take a model off the shelf and test it on our dataset. It allows us to create a second baseline, which will be useful for seeing what happens when we actually train the model on our dataset. This approach is called zero-shot summarisation, because the model has had zero exposure to our dataset.
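With the transformers library, zero-shot summarisation can be as short as the sketch below. The checkpoint name is an assumption for illustration – any summarisation checkpoint from the Hugging Face Hub would do:

```python
from transformers import pipeline

# Load a summarisation pipeline with a publicly available pre-trained checkpoint.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

document = (
    "The quarterly report shows revenue grew by 12 percent. "
    "Costs remained flat compared to the previous quarter. "
    "The board approved a new investment plan."
)

# The model has had zero exposure to our data – hence "zero-shot".
result = summarizer(document, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```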

After that it is time to take a pre-trained model and train it on our own dataset (part 3). This is also called fine-tuning. It will enable the model to learn the patterns and idiosyncrasies of our data and slowly adapt to it. Once we have trained the model we will use it to create summaries (part 4).
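As a rough sketch of what parts 3 and 4 involve – the checkpoint, file names, column names, and hyperparameters below are all assumptions for illustration, not the settings the series will necessarily use – fine-tuning and generation with the transformers Trainer API might look like this:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

checkpoint = "t5-small"  # assumption: any seq2seq checkpoint from the Hub works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumption: CSV files with "text" and "summary" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def preprocess(batch):
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="summariser", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()  # part 3: fine-tuning on our own data

# Part 4: use the fine-tuned model to generate a summary.
inputs = tokenizer("A long document to summarise ...", return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```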

So, just to summarise (see what I did there?), the stages of the project are:

1. Introduce the metric and the dataset, and create a baseline with a no-ML heuristic (part 1)
2. Generate summaries with an off-the-shelf pre-trained model, i.e. zero-shot summarisation (part 2)
3. Fine-tune the pre-trained model on our own dataset (part 3)
4. Use the fine-tuned model to generate summaries (part 4)

What will we have achieved by the end of this tutorial?

Now is the time for a brutal reality check, I’m afraid: by the end of this tutorial we will not have a text summarisation model that can be used in production. We won’t even have a good summarisation model (insert scream emoji here)!

What we will have instead is a starting point for the next phase of the project: the experimentation phase. This is where the science in Data Science comes in, because it’s all about experimenting with different models and different settings to understand whether a good enough summarisation model can be trained with the available training data.

And, to be completely transparent, there is a good chance that the conclusion will be that the technology is just not ripe yet and that the project will not be implemented. And you have to prepare your business stakeholders for that possibility. But that’s a story for another blog post 😉

