
The Secret Guide To Human-Like Text Summarization

Use Google’s state-of-the-art T5 model to summarize your content

Louis Teo
Towards Data Science
11 min read · Apr 27, 2021


Summarization has become a very helpful way of tackling the issue of information overload. In my earlier story, I shared how you can create your own text summarizer using the extractive method. If you have tried it, you may have noticed that, because no new sentences are generated from the original content, the extractive summary can at times be difficult to understand.

In this story, I will share how I use Google’s T5 (Text-to-Text Transfer Transformer) model to create a human-like summarizer with just a few lines of code!

As a bonus, I will also share my text summarizer pipelines where I combine both extractive and abstractive methods to generate meaningful summaries for PDF documents of any length…

Text Summarization Techniques

There are two techniques for summarizing long content:

i. Extractive summarization — extracts the important sentences from the original content.

ii. Abstractive summarization — creates a summary by generating new sentences from the original content.

Abstractive summarization is the more difficult technique as it involves deep learning, but thanks to Google’s pre-trained models, which are available to the public, creating a meaningful abstractive summary is no longer a daunting machine learning task!

T5 Text Summarizer

You can build a simple yet incredibly powerful abstractive text summarizer using Google’s T5 pre-trained model. I will use Hugging Face’s state-of-the-art Transformers library and PyTorch to build the summarizer.

Install packages

Please ensure you have both Python packages installed.

pip install torch
pip install transformers

Load model and tokenizer

Load T5’s pre-trained model and its tokenizer.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')
tokenizer = AutoTokenizer.from_pretrained('t5-base')

There are five T5 models to choose from: t5-small, t5-base, t5-large, t5-3b and t5-11b, each with a different number of parameters. I will use the ‘t5-base’ model, which has about 220 million parameters. Feel free to try the other T5 models.

Input text

Let’s load a CNN news article about Netflix needing a ‘next big thing’, simply because it is a rather interesting piece of business news, and see how well our summarizer performs.

text = """New York (CNN Business)Netflix is synonymous with streaming, but its competitors have a distinct advantage that threatens the streaming leader's position at the top.Disney has Disney+, but it also has theme parks, plush Baby Yoda dolls, blockbuster Marvel movies and ESPN. Comcast (CMCSA), Amazon (AMZN), ViacomCBS (VIACA), CNN's parent company WarnerMedia and Apple (AAPL) all have their own streaming services, too, but they also have other forms of revenue.As for Netflix (NFLX), its revenue driver is based entirely on building its subscriber base. It's worked out well for the company — so far. But it's starting to look like the king of streaming will soon need something other than new subscribers to keep growing.The streaming service reported Tuesday it now has 208 million subscribers globally, after adding 4 million subscribers in the first quarter of 2021. But that number missed expectations and the forecasts for its next quarter were also pretty weak.That was a big whiff for Netflix — a company coming off a massive year of growth thanks in large part to the pandemic driving people indoors — and Wall Street's reaction has not been great.The company's stock dropped as much as 8% on Wednesday, leading some to wonder what the future of the streamer looks like if competition continues to gain strength, people start heading outdoors and if, most importantly, its growth slows."If you hit a wall with [subscriptions] then you pretty much don't have a super growth strategy anymore in your most developed markets," Michael Nathanson, a media analyst and founding partner at MoffettNathanson, told CNN Business. "What can they do to take even more revenue out of the market, above and beyond streaming revenues?"Or put another way, the company's lackluster user growth last quarter is a signal that it wouldn't hurt if Netflix — a company that's lived and died with its subscriber numbers — started thinking about other ways to make money.An ad-supported Netflix? Not so fastThere are ways for Netflix to make money other than raising prices or adding subscribers. The most obvious: selling advertising.Netflix could have 30-second commercials on their programming or get sponsors for their biggest series and films. TV has worked that way forever, why not Netflix?That's probably not going to happen, given that CEO Reed Hastings has been vocal about the unlikelihood of an ad-supported Netflix service. His reasoning: It doesn't make business sense."It's a judgment call... It's a belief we can build a better business, a more valuable business [without advertising]," Hastings told Variety in September. "You know, advertising looks easy until you get in it. Then you realize you have to rip that revenue away from other places because the total ad market isn't growing, and in fact right now it's shrinking. It's hand-to-hand combat to get people to spend less on, you know, ABC and to spend more on Netflix."Hastings added that "there's much more growth in the consumer market than there is in advertising, which is pretty flat."He's also expressed doubts about Netflix getting into live sports or news, which could boost the service's allure to subscribers, so that's likely out, too, at least for now.So if Netflix is looking for other forms of near-term revenue to help support its hefty content budget ($17 billion in 2021 alone) then what can it do? 
There is one place that could be a revenue driver for Netflix, but if you're borrowing your mother's account you won't like it.Netflix could crack down on password sharing — a move that the company has been considering lately."Basically you're going to clean up some subscribers that are free riders," Nathanson said. "That's going to help them get to a higher level of penetration, definitely, but not in long-term."Lackluster growth is still growthMissing projections is never good, but it's hardly the end of the world for Netflix. The company remains the market leader and most competitors are still far from taking the company on. And while Netflix's first-quarter subscriber growth wasn't great, and its forecasts for the next quarter alarmed investors, it was just one quarter.Netflix has had subscriber misses before and it's still the most dominant name in all of streaming, and even lackluster growth is still growth. It's not as if people are canceling Netflix in droves.Asked about Netflix's "second act" during the company's post-earnings call on Tuesday, Hastings again placed the company's focus on pleasing subscribers."We do want to expand. We used to do that thing shipping DVDs, and luckily we didn't get stuck with that. We didn't define that as the main thing. We define entertainment as the main thing," Hastings said.He added that he doesn't think Netflix will have a second act in the way Amazon has had with Amazon shopping and Amazon Web Services. Rather, Netflix will continue to improve and grow on what it already does best."I'll bet we end with one hopefully gigantic, hopefully defensible profit pool, and continue to improve the service for our members," he said. "I wouldn't look for any large secondary pool of profits. There will be a bunch of supporting pools, like consumer products, that can be both profitable and can support the title brands.""""

Tokenize Text

T5 can also perform other tasks, such as text generation and translation; adding the T5-specific prefix “summarize: ” tells the model to perform the summarization task.

tokens_input = tokenizer.encode("summarize: " + text,
                                return_tensors='pt',
                                max_length=tokenizer.model_max_length,
                                truncation=True)

Here we tokenize our text up to the model’s maximum acceptable input length (512 tokens for t5-base). If the tokenized input exceeds that limit, it is truncated.
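Optionally, a quick check like the one below (using the same tokenizer calls as above) tells you up front whether your article will be truncated:

# count the tokens in the full article and compare against the model's
# maximum input length (512 tokens for t5-base)
raw_tokens = tokenizer.encode(text)
print(f"Article length: {len(raw_tokens)} tokens")
print(f"Model limit: {tokenizer.model_max_length} tokens")
if len(raw_tokens) > tokenizer.model_max_length:
    print("The input will be truncated; the end of the article will be ignored.")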

Generate Summary

Let’s generate a summary by passing in the encoded tokens and then decode the generated summary back to text.

summary_ids = model.generate(tokens_input, min_length=80,
                             max_length=150, length_penalty=15,
                             num_beams=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

The model takes encoded tokens and the following input arguments:

  • min_length: minimum length of the generated summary, in tokens.
  • max_length: maximum length of the generated summary, in tokens.
  • length_penalty: a value > 1 encourages the model to generate a longer summary, a value < 1 encourages a shorter one.
  • num_beams: number of beams used in beam search; a value of 2 lets the model explore alternative token sequences and keep the more promising one.

Note: Setting the minimum and maximum generated lengths to 80 and 150 tokens and the length penalty to 15 allows the model to generate a reasonable summary of roughly 60 to 90 words. We will use the default values for the remaining arguments (not shown above).
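If you prefer a shorter summary, you can simply pass tighter length bounds; the values below are only illustrative:

# generate a shorter summary of roughly 20 to 50 words (illustrative values)
short_ids = model.generate(tokens_input, min_length=30,
                           max_length=80, length_penalty=5,
                           num_beams=2)
short_summary = tokenizer.decode(short_ids[0], skip_special_tokens=True)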

Output Summary

Netflix (NFLX) reported Tuesday it now has 208 million subscribers globally. that number missed expectations and the forecasts for its next quarter were also pretty weak. the streaming service's stock dropped as much as 8% on Wednesday, leading some to wonder what the future of the streamer looks like. if competition continues to gain strength, people start heading outdoors and if, most importantly, its growth slows, it wouldn't hurt if Netflix started thinking about other ways to make money - like selling ads.

Wow! It looks like a pretty decent summary.

…but if you read the full text and then the summary again, you will notice that the latter part of the article was not summarized. This is because the tokenized input gets truncated once it exceeds the model’s maximum input length of 512 tokens.

If you are worried about missing important details in the later part of the text, you can use a simple trick to solve the issue: perform extractive summarization on the original text first, followed by abstractive summarization.

BERT Extractive Summary

Before we proceed, make sure you have installed the BERT extractive summarizer package from your terminal.

pip install bert-extractive-summarizer

BERT stands for Bidirectional Encoder Representations from Transformers. It is a Transformer-based machine learning technique for Natural Language Processing (NLP) developed by Google. The extractive summarizer built on top of it uses inter-sentence Transformer layers to pick out the most representative sentences.

from summarizer import Summarizer
bert_model = Summarizer()
ext_summary = bert_model(text, ratio=0.5)

Below is the extractive summary generated by BERT. I purposely set it to produce a summary that is 50% of the length of the original text by setting the ratio to 0.5. Feel free to use a different ratio to shorten your long document to an appropriate length.

New York (CNN Business)Netflix is synonymous with streaming, but its competitors have a distinct advantage that threatens the streaming leader's position at the top. Disney has Disney+, but it also has theme parks, plush Baby Yoda dolls, blockbuster Marvel movies and ESPN. It's worked out well for the company - so far. But that number missed expectations and the forecasts for its next quarter were also pretty weak. Or put another way, the company's lackluster user growth last quarter is a signal that it wouldn't hurt if Netflix - a company that's lived and died with its subscriber numbers - started thinking about other ways to make money. Not so fast
There are ways for Netflix to make money other than raising prices or adding subscribers. His reasoning: It doesn't make business sense. "It's a judgment call... It's a belief we can build a better business, a more valuable business [without advertising]," Hastings told Variety in September. " You know, advertising looks easy until you get in it. Then you realize you have to rip that revenue away from other places because the total ad market isn't growing, and in fact right now it's shrinking. It's hand-to-hand combat to get people to spend less on, you know, ABC and to spend more on Netflix." So if Netflix is looking for other forms of near-term revenue to help support its hefty content budget ($17 billion in 2021 alone) then what can it do? Netflix could crack down on password sharing - a move that the company has been considering lately. "Basically you're going to clean up some subscribers that are free riders," Nathanson said. " That's going to help them get to a higher level of penetration, definitely, but not in long-term." The company remains the market leader and most competitors are still far from taking the company on. We used to do that thing shipping DVDs, and luckily we didn't get stuck with that. We define entertainment as the main thing," Hastings said. He added that he doesn't think Netflix will have a second act in the way Amazon has had with Amazon shopping and Amazon Web Services. Rather, Netflix will continue to improve and grow on what it already does best. I wouldn't look for any large secondary pool of profits.

Let’s now feed the extractive summary through our T5 model.
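The steps are exactly the same as before; the only change is that the input is now ext_summary instead of the full article (this snippet simply reuses the model, tokenizer and generation settings from above):

# tokenize the extractive summary instead of the full article
tokens_input = tokenizer.encode("summarize: " + ext_summary,
                                return_tensors='pt',
                                max_length=tokenizer.model_max_length,
                                truncation=True)
# generate the final abstractive summary with the same settings as before
summary_ids = model.generate(tokens_input, min_length=80,
                             max_length=150, length_penalty=15,
                             num_beams=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)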

The Extractive-Abstractive Summary

Netflix's lackluster user growth is a signal that it wouldn't hurt if it started thinking about other ways to make money. the company remains the market leader and most competitors are still far from taking the company on. the company could crack down on password sharing - a move that the company has been considering lately. "it's a judgment call... it's a belief we can build a better business, a more valuable business," hastings said.

Wow… the generated summary now covers the entire context of the original text.

For your convenience, the code is consolidated below.
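This is a minimal end-to-end sketch of the extractive-plus-abstractive flow described above; the summarize_long_text helper name is my own shorthand rather than code from the notebook.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from summarizer import Summarizer

# load the T5 model and tokenizer, and the BERT extractive summarizer
model = AutoModelForSeq2SeqLM.from_pretrained('t5-base')
tokenizer = AutoTokenizer.from_pretrained('t5-base')
bert_model = Summarizer()

def summarize_long_text(text, ratio=0.5):
    # step 1: shorten the text with BERT extractive summarization
    ext_summary = bert_model(text, ratio=ratio)
    # step 2: generate a human-like summary with T5
    tokens_input = tokenizer.encode("summarize: " + ext_summary,
                                    return_tensors='pt',
                                    max_length=tokenizer.model_max_length,
                                    truncation=True)
    summary_ids = model.generate(tokens_input, min_length=80,
                                 max_length=150, length_penalty=15,
                                 num_beams=2)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summarize_long_text(text))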

You can also click here to go to my GitHub to get the Jupyter Notebooks for the T5 text summarizer and the text summarizer pipelines, as well as the pipeline scripts that you can run from your terminal to summarize multiple PDF documents.

Bonus: T5 Text Summarizer Pipelines

I have built text summarizer pipelines that extract text from PDF documents, summarize the text, store both the original text and the summary in a SQLite database, and output the summary to a text file.

To summarize a long PDF document, you can first apply extractive summarization to shorten the text before you feed it through the T5 model to generate a human-like summary.
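As a rough illustration of what such a pipeline could look like (the actual pipeline scripts are on my GitHub), here is a minimal sketch; it assumes pdfplumber for PDF text extraction, and the folder name, table schema and summarize_long_text helper (from the snippet above) are illustrative only.

import sqlite3
from pathlib import Path
import pdfplumber  # assumption: any PDF text-extraction library would do

def extract_pdf_text(pdf_path):
    # extract plain text from every page of a PDF file
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

# a simple table to hold each original text and its summary
conn = sqlite3.connect("summaries.db")
conn.execute("""CREATE TABLE IF NOT EXISTS documents
                (filename TEXT, original TEXT, summary TEXT)""")

for pdf_file in Path("pdf_documents").glob("*.pdf"):     # illustrative folder name
    original = extract_pdf_text(pdf_file)
    summary = summarize_long_text(original, ratio=0.5)    # helper from the snippet above
    conn.execute("INSERT INTO documents VALUES (?, ?, ?)",
                 (pdf_file.name, original, summary))
    # also write the summary to a text file alongside the database
    Path(f"{pdf_file.stem}_summary.txt").write_text(summary)

conn.commit()
conn.close()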

Image by author: run the data pipeline to extract text data from PDF files and save it to a SQLite database
Image by author: run the summarization pipeline (T5 only) to summarize the text data, save the summary to a text file and store it in the database

Note: key in ‘1.0’ if you only want to summarize the text with the T5 model.

Image by author: run the summarization pipeline (BERT & T5) to summarize the text data, save the summary to a text file and store it in the database

Note: key in a ratio below ‘1.0’ (e.g. ‘0.5’) if you wish to shorten the text with BERT extractive summarization before running it through T5 summarization. It takes longer to generate a summary this way because each text is run through two different summarizers.

Conclusion… and future work

There you go: you only need seven lines of code (including the imports) to get Google’s T5 pre-trained model to summarize content for you.

To make the summary of long content more meaningful, you can apply extractive summarization to shorten the text first, followed by abstractive summarization.

T5 pre-trained models support transfer learning, which means we can fine-tune them further on our own custom datasets.
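As a rough sketch of what that fine-tuning could look like, here is a single manual training step on a hypothetical article/reference-summary pair; a real run would loop over a full dataset with batching and a learning-rate schedule.

from torch.optim import AdamW

# hypothetical training pair; in practice you would iterate over a dataset
article = "..."            # a long document from your custom dataset
reference_summary = "..."  # its human-written summary

inputs = tokenizer("summarize: " + article, return_tensors='pt',
                   max_length=512, truncation=True)
labels = tokenizer(reference_summary, return_tensors='pt',
                   max_length=150, truncation=True).input_ids

optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()

# one gradient step: the model returns the loss when labels are provided
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()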

For future work, it will be interesting to see how the models perform after they have been fine-tuned to summarize specific kinds of content, e.g. medical or engineering journal articles.
