Out of all Natural Language Processing (NLP) tasks, summarization is arguably one of the least headline-worthy. Shrinking the content of an article is a lot less dazzling than having GPT-3 automatically generate startup ideas. However, despite its lower-key profile, text summarization is far from being solved, especially in industry. The rudimentary APIs provided by big names like Microsoft leave plenty of room for smaller companies to tackle summarization from various angles, without a clear winner in sight yet. This article discusses the reasons why text summarization remains a challenge.

Summarization is more than text-to-text
Naively, summarization can be regarded as a text-to-text transformation via lossy compression. Realistically, summarization is a (text+context)-to-text problem.
Summarization reduces a longer text into a shorter one, while discarding less important information in the process. But what is important? Making a universal judgment is very difficult, since the answer is highly dependent on the domain of the text, the target audience, and the goal of the summary itself. For instance, consider a scientific paper about COVID-19. Should the summary include any biology jargon, or should it be accessible to the layman? Should it be a dry list of the main factual discoveries, or should it be snappy and suspenseful in order to convince the reader to read the entire article?
In other words, what constitutes a good summary depends on the context. Summarization is not just a text-to-text transformation, but rather a (text+context)-to-text problem. General-purpose summarization APIs like Microsoft’s Azure Cognitive Services adhere to the naive text-to-text definition. Other than the desired summary length, they do not accept any specification of the desired output. This is why they fail to deliver in realistic applications, where contextual nuances can make or break a product.
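To make the limitation concrete, below is a minimal sketch of the text-to-text interface, with the open-source Hugging Face transformers library standing in for a commercial API (the shape of the call is essentially the same). Notice that the only knobs are length-related; there is nowhere to state who the summary is for or what it should emphasize.

```python
# A minimal sketch of the naive text-to-text interface, with the Hugging Face
# `transformers` library standing in for a commercial summarization API.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Scientists reported new findings about how COVID-19 spreads indoors. "
    "The study measured aerosol concentrations in classrooms and offices, "
    "and found that ventilation rates had a large effect on transmission risk."
)

# The only available controls are length-related; there is no parameter for
# the audience, domain, or purpose of the summary -- i.e., no context.
result = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(result[0]["summary_text"])
```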
Extraction is naive, abstraction is ungrounded
In the summarization field, there is an established dichotomy: summaries can be either extractive (i.e., snippets lifted from the original text) or abstractive (i.e., newly generated text). Extractive summaries locate key sentences within the original text, and therefore tend to accurately reflect the main idea (unless sentences are maliciously cherry-picked in a way that misrepresents it). The downside is that extractive summaries are limited in their ability to fully capture the content – the sentences that are dropped are forever lost.
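As a toy illustration of the extractive approach, the sketch below (assuming scikit-learn is installed) scores each sentence by the total TF-IDF weight of its terms and keeps the top few, in document order. Production systems use stronger signals (sentence position, graph centrality), but the overall shape is the same.

```python
# A toy extractive summarizer: score sentences by their total TF-IDF weight
# and return the k highest-scoring ones, preserving document order.
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text: str, k: int = 3) -> str:
    # Naive sentence splitting on terminal punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= k:
        return text
    # Each sentence becomes one TF-IDF row; its score is the sum of its term weights.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = tfidf.sum(axis=1).A1  # flatten the (n, 1) matrix to a vector
    # Pick the k best sentences, then restore their original order.
    top = sorted(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k])
    return " ".join(sentences[i] for i in top)
```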
On the other hand, abstractive summaries emulate more closely how humans summarize: we compose new sentences that aim to tell the whole story, but from a higher-level perspective. However, abstractive summaries are technically difficult to produce, since they require generative models like the GPT family. Currently, such models suffer from hallucination – their outputs can be factually incorrect and/or unsupported by the original text. This is a huge impediment for summarization, since staying true to the input text is non-negotiable. While preventing hallucinations is an active field of research, there are no bulletproof solutions that guarantee a certain quality bar. This is why the abstractive summarization API from Google is still experimental (i.e., it is not officially supported like products offered through Google Cloud Platform).
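There are at least heuristics for flagging ungrounded output. One crude proxy is the rate of “novel” n-grams – summary n-grams that never appear in the source. The sketch below assumes whitespace tokenization is good enough; a high rate does not prove hallucination, but it flags summaries that deserve human review.

```python
# A crude faithfulness proxy: the fraction of summary n-grams that are
# absent from the source text. High values flag potentially hallucinated
# content, but are not proof by themselves.
def novel_ngram_rate(source: str, summary: str, n: int = 3) -> float:
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    source_ngrams, summary_ngrams = ngrams(source), ngrams(summary)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - source_ngrams) / len(summary_ngrams)
```

By construction, a purely extractive summary scores 0.0 on this metric; abstractive summaries trade a higher rate for fluency, and the open question is how much of that novelty is actually grounded in the source.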
NLP is still struggling with long documents
By definition, summarization assumes a long input text; otherwise, there would be no need for it in the first place. However, the NLP field is still struggling to handle sizable documents.
The dominant model architecture, the Transformer, enforces a maximum number of input tokens in the low thousands. Any document that exceeds this limit needs to be split into fragments that are summarized separately, and the end result is a mere stitching of the independent sub-summaries. We might get away with this technique for certain document types like news articles, which naturally split into somewhat independent sections. However, it becomes unusable when applied to fiction books, which by design contain intricate dependencies between chapters. For instance, the arc of a character often spans the entire book; since sub-summaries are simply concatenated (and are unaware of each other), the trajectory of that character in the summary will also be fragmented. No one sentence will succinctly capture their journey.
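The chunk-and-stitch workaround is straightforward to express in code, which also makes its failure mode obvious: no chunk ever sees another. Below is a sketch with a generic summarize callable standing in for any model (e.g., the pipeline shown earlier), and a word count standing in for the real tokenizer's token count.

```python
# A sketch of the chunk-and-stitch workaround: split the document into
# fragments under the model's input limit, summarize each one independently,
# and concatenate the sub-summaries.
from typing import Callable, List

def chunked_summary(text: str, summarize: Callable[[str], str],
                    max_words: int = 700) -> str:
    words = text.split()
    # Word count is a stand-in for the tokenizer's actual token count.
    chunks: List[str] = [" ".join(words[i:i + max_words])
                         for i in range(0, len(words), max_words)]
    # Each chunk is summarized in isolation -- this is exactly why a
    # character arc spanning many chapters ends up fragmented.
    return " ".join(summarize(chunk) for chunk in chunks)
```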
How companies tackle summarization
Since there is no universal solution to summarization, companies are tackling it in very different ways, from manual to fully automated, depending on the nature of the documents that are being targeted.
Human-generated summaries for books
At one end of the spectrum, companies like Blinkist and 12min hire human experts to produce high-quality summaries of nonfiction books, which can be read in under 15 minutes. While this approach does ensure high-quality content, it does not scale beyond a human-curated list of bestsellers, so it won’t work if your reading taste is off the beaten path.
Automated summaries for medium-sized content
Producing summaries for medium-sized content like blogposts, news articles, research papers, or internal corporate documents is more amenable to automated processing, but still laborious. Each use case, defined by the input domain (e.g., news, legal, medical) and the output format (bullet points, highlights, single-passage summary, etc.), requires a separate training dataset, and potentially a separately trained model.
A recent blogpost from Google AI, which announces a new feature for auto-generated summaries in Google Docs, makes the case for collecting clean training sets that focus on a specific input domain and enforce a consistent style across the collected summaries:
[…] early versions of [the summarization] corpus suffered from inconsistencies and high variation because they included many types of documents, as well as many ways to write a summary – e.g., academic abstracts are typically long and detailed, while executive summaries are brief and punchy. This led to a model that was easily confused because it had been trained on so many different types of documents and summaries that it struggled to learn the relationships between any of them. […] We carefully cleaned and filtered the fine-tuning data to contain training examples that were more consistent and represented a coherent definition of summaries. Despite the fact that we reduced the amount of training data, this led to a higher quality model. (Excerpt from Google AI blogpost)
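As a rough illustration of what such filtering might look like, the toy function below keeps only (document, summary) pairs that match one coherent definition of a summary. The thresholds are invented for the sake of the example; the point is that a single, consistent target style is enforced across the training set.

```python
# A toy version of the data cleaning described above: keep only training
# pairs that fit one consistent definition of a "summary". The thresholds
# below are invented for illustration.
def filter_training_pairs(pairs, min_words=20, max_words=120, max_ratio=0.2):
    """pairs: iterable of (document, summary) text pairs."""
    kept = []
    for doc, summary in pairs:
        doc_len, summary_len = len(doc.split()), len(summary.split())
        if doc_len == 0 or summary_len == 0:
            continue
        # Enforce "brief and punchy": bounded length, at least 5x compression.
        if min_words <= summary_len <= max_words and summary_len / doc_len <= max_ratio:
            kept.append((doc, summary))
    return kept
```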
Consistent with its own advice above, Google offers different endpoints for its experimental abstractive summarization API, each one focusing on a rather narrow application.

Startup case study: Summari
(This article is not sponsored by Summari or any other of the products mentioned.)
The lack of a monolithic model that can perform summarization in any context presents a great opportunity for startups; they can be laser-focused on niches that are less appealing to big tech.
For instance, Summari set out on a mission to help Internet consumers screen articles before committing to reading them end-to-end. In an earlier interview, the founder expressed disappointment with the state of the art in summarization, and initially opted for human-generated summaries:
Unfortunately, we did not get the type of quality that we wanted from AI technology. We believe there is an art to a great summary, it’s not just copying select phrases from the text, there is a deeper understanding required and it requires a human, at least for now. (Ed Shrager, founder of Summari, in an interview for Ness Labs)
Fast-forward roughly a year, and Summari now offers a Chrome extension that produces highlights for virtually any website with text content. The human-curated summaries have no doubt accumulated into a clean training set, enabling them to build a model that automates and scales their original mission.
Beyond text: audio and video summaries
Compared to audio and video, text is arguably the simpler modality to summarize. It comes as no surprise that state-of-the-art models and industry practices for audio and video lag behind text.
For instance, there is little to no automation in shortening podcasts. The common practice is for podcasters to release human-curated short YouTube clips cropped from their full-length episodes (see, for instance, Joe Rogan’s channel). This is the manual equivalent of extractive summarization. In contrast, Blinkist works directly with podcast creators to additionally produce shorter versions of their episodes, which they call “shortcasts” – the manual equivalent of abstractive summarization.
There is, however, some automation in sight. Startups like Snackable aim to automatically extract and stitch together the key fragments from audio and video files, for now in a purely extractive manner. With progress in video manipulation and generation, it may be just a matter of time until abstractive summarization is possible for these modalities.
Conclusion
Summarizing text is a difficult task because it is highly dependent on context. Because of that, it is very unlikely that we will converge towards a single universal solution, or that we can rely on the almighty GPT models to produce the right summary for every circumstance. This fragmented landscape presents an opportunity for startups to invest in clean training sets that focus on very specific use cases, and complement the offerings of big tech.
Special thanks to Gaurav Nemade for sharing his always-thoughtful perspective with me.