In ancient China, the talent for summarizing was ranked just below the talents of astrology and medicine. (GPT-3 claim about the task of summarization)
When writing my article about text summarization, an admittedly dull topic, I wanted to spruce it up somehow. So I did what one does when wanting to sound wittier than one is, and prompted GPT-3 with my intro passage. It proceeded to generate a reasonable definition and some boilerplate about the different flavors of summarization. Impressive in its ability to emulate what a human would write in an attempt to reach some target word count, but unimpressive in its ability to make my article stand out. Here’s part of it:
In short, text summarization is the process of creating a shortened version of a text document, while preserving its most important information. This can be done automatically or manually. Automated text summarization is often used to generate summaries of news articles, or to provide brief overviews of lengthy documents. (GPT-3 on summarization, attempt #1)
Boring. But I suspected it could do better; after all, its GPT-2 predecessor debuted with a creative unicorn story that made its creator, OpenAI, skyrocket to fame. Getting a generative text model to say something memorable is a bit like playing the lottery: the more tickets you buy, the more likely you are to win. This is because of sampled decoding, a technique that relies on some amount of randomness when navigating the space of potential outputs. So I played the lottery until GPT-3 told me the following:
In ancient China, the talent for summarizing was ranked just below the talents of astrology and medicine. (GPT-3 on summarization, attempt #N)
At last, something I didn’t know, something that could place a seemingly dull task among the ancestral endeavors of humankind. So… jackpot? Maybe, if it were true. Is it?
I wanted it to be true. There’s something poetic about the contrast between two of the most grandiose-sounding preoccupations (astrology and medicine) and the lesser-praised but (apparently) almost equally important task of summarization. Like a redemption of the undervalued.

Despite my due diligence, I still don’t know if this is true. Google can’t find this sentence verbatim on the Internet, and none of the results on the first page seem to support it. So I decided to exclude it from my article; better boring than misleading.
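Before moving on, a quick technical aside on the “lottery”: sampled decoding boils down to drawing the next token from a probability distribution instead of always taking the single most likely choice. Below is a minimal, illustrative sketch of temperature-based sampling; it is not GPT-3’s actual decoder, which exposes additional knobs (e.g., nucleus / top-p sampling).

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=None):
    """Sample the next token id from a softmax over the model's scores.

    Higher temperatures flatten the distribution, so repeated runs are more
    likely to wander into different (and occasionally memorable) outputs.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy example: three candidate tokens with different scores.
# The result varies from run to run; that's the "lottery".
print(sample_next_token([2.0, 1.5, 0.1]))
```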
The Groundedness Problem
This problem is well known to the research community, which phrases it in terms of groundedness. Currently, most generative text models lack the ability to ground their statements in reality, or at least attribute them to some external source; they often hallucinate plausible-sounding falsehoods. This restricts their use to creative domains (fiction writing, gaming, entertainment, etc.) and makes them dangerous wherever truthfulness should be a first-class citizen (news, scientific articles, education, etc.).
What makes generative models devious is the fact that, most of the time, they sound (and actually are) truthful. When prompted with facts about summarization, GPT-3 repeatedly produced correct statements and gradually built up my confidence in its knowledge. By the time it finally made a statement that was new to me, I was emotionally invested and rooting for it to be true. Even more perfidiously, text generation models can use various tactics to sound more authoritative, like including made-up numbers, domain-specific jargon, or an abundance of (fake) details.
The situation becomes even more worrisome in ill-intentioned hands. Generative models give bad actors a tool to lie at scale. They can inundate social platforms with content that seems true simply because of its sheer volume. They can also target individuals, tailoring falsehoods to whatever each person finds most convincing based on their social media profile.
Retrieval: One Step Towards Groundedness
The way researchers are currently attempting to tackle groundedness is by incorporating an additional retrieval step in the generation process: before producing the output text, the model is trained to perform a lookup in some external database and gather supporting evidence for its claims.
Google’s state-of-the-art conversational model LaMDA [1] uses an entire tool set to help ground its responses: a retrieval component, a calculator, and a translator. Its output is produced in multiple sequential steps, which iteratively improve its veracity:
1. Based on the history of the conversation, LaMDA first produces an unconstrained response (ungrounded at this stage).
2. LaMDA evaluates whether adjustments are needed (i.e., whether to use any of its tools).
3. When LaMDA decides that supporting evidence is needed, it (a) produces a query, (b) issues the query to look up a supporting document, and (c) rewrites the previous response so that it is consistent with the retrieved source.
4. Steps 2 and 3 are repeated until LaMDA decides no further adjustments are needed. Note that this might require retrieving multiple pieces of evidence. A rough sketch of this loop follows below.
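To make the flow concrete, here is a rough sketch of such an iterative retrieve-and-rewrite loop. The helper callables (generate, needs_evidence, produce_query, retrieve, rewrite) are hypothetical stand-ins for model components, not LaMDA’s actual API.

```python
def grounded_response(history, generate, needs_evidence,
                      produce_query, retrieve, rewrite, max_rounds=5):
    """Iteratively refine a draft response until it is consistent with retrieved evidence.

    Expected (hypothetical) signatures:
      generate(history) -> draft
      needs_evidence(history, draft) -> bool
      produce_query(history, draft) -> query
      retrieve(query) -> supporting document
      rewrite(draft, document) -> revised draft
    """
    response = generate(history)                   # step 1: unconstrained (ungrounded) draft
    for _ in range(max_rounds):                    # steps 2-4: refine until no changes are needed
        if not needs_evidence(history, response):  # step 2: decide whether to consult a tool
            break
        query = produce_query(history, response)   # step 3a: formulate a search query
        evidence = retrieve(query)                 # step 3b: external database lookup
        response = rewrite(response, evidence)     # step 3c: align the draft with the evidence
    return response
```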
While this method is shown to improve groundedness over a baseline model without retrieval (from 60% to just under 80%), it is still far from human performance (~95% even when humans don’t have access to an information retrieval system).
Why Groundedness is Still Difficult
Defining and measuring groundedness is a challenge in and of itself. In a recent paper from Google [2], researchers distinguish between fact checking (i.e., making a judgment on whether a statement is universally true) and attribution (i.e., identifying a supporting document). The latter is somewhat more tractable, since it postpones the question of whether the identified source is credible. But even after decoupling these two concepts, reasoning about attribution is non-trivial. Often, deciding whether a document supports a statement requires some commonsense knowledge or additional information [2]:
Supporting document: The runtime of the theatrical edition of "The Fellowship of the Ring" is 178 minutes, the runtime of "The Two Towers" is 179 minutes, and the runtime of "The Return of the King" is 201 minutes.
Statement: The full run-time of "The Lord of the Rings" trilogy is 558 minutes.
In the example above, the statement is supported by the document only with the additional knowledge that the trilogy consists of exactly the three movies mentioned – so is the statement attributable to the evidence? One could settle such debates by asking whether an average human would find the evidence sufficient. But this, too, is non-trivial to determine, since it is highly reliant on one’s cultural background.
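To see why this is more than string matching, consider the implicit reasoning a checker would have to perform. The snippet below is purely illustrative: the list of films is the outside knowledge that the supporting document never states.

```python
# Outside knowledge (not in the supporting document): which films make up the trilogy.
trilogy = ["The Fellowship of the Ring", "The Two Towers", "The Return of the King"]

# Runtimes stated in the supporting document (in minutes).
runtimes = {
    "The Fellowship of the Ring": 178,
    "The Two Towers": 179,
    "The Return of the King": 201,
}

# Only with both pieces of information does the claimed total follow: 178 + 179 + 201 = 558.
assert sum(runtimes[film] for film in trilogy) == 558
```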
Another challenge in ensuring groundedness is the lack of training data. In order to teach a model about attribution, one must provide positive and negative pairs of <statement, evidence>. Given how nuanced attribution can be, such data points require manual annotation. For instance, the LaMDA authors collected training instances by showing crowdworkers ungrounded model responses (from step 1) and asking them to manually perform steps 2 through 4: issue queries, collect supporting evidence, and modify the model’s response until it is consistent with the evidence.
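The resulting annotations might look something like the pairs below. The field names and the negative example are purely illustrative; they are not the actual format used by the LaMDA or Rashkin et al. annotation efforts.

```python
# Hypothetical <statement, evidence, label> pairs for training an attribution model.
evidence = (
    'The runtime of the theatrical edition of "The Fellowship of the Ring" is 178 minutes, '
    'the runtime of "The Two Towers" is 179 minutes, and the runtime of '
    '"The Return of the King" is 201 minutes.'
)

attribution_data = [
    # Positive pair: the statement follows from the evidence (plus commonsense about the trilogy).
    {"statement": 'The full run-time of "The Lord of the Rings" trilogy is 558 minutes.',
     "evidence": evidence, "label": "attributable"},
    # Negative pair: a plausible-sounding statement the evidence does not support.
    {"statement": 'The full run-time of "The Lord of the Rings" trilogy is over 10 hours.',
     "evidence": evidence, "label": "not_attributable"},
]
```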
Finally, incorporating a retrieval component is an engineering challenge. At each training step, the model needs to make (potentially multiple) lookups into an external database, which slows down training. The same latency concern applies at inference time, where speed is already a pressing problem for Transformer-based models.
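As a toy illustration of the cost, here is what a training step with in-loop lookups might look like. The 50 ms delay is an arbitrary stand-in for database/network latency, not a measured number.

```python
import time

def retrieve(query):
    # Stand-in for a real external lookup; the sleep mimics database/network latency.
    time.sleep(0.05)  # hypothetical 50 ms round trip
    return f"evidence for: {query}"

def training_step(batch_of_queries):
    # Every example in the batch may trigger one or more lookups before the model update,
    # so retrieval latency adds directly to the duration of each training step.
    evidence = [retrieve(q) for q in batch_of_queries]
    # ... forward pass, loss on (statement, evidence) pairs, backward pass would go here ...
    return evidence

start = time.time()
training_step([f"query {i}" for i in range(8)])
print(f"one step spent {time.time() - start:.2f}s just waiting on retrieval")
```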
Conclusion
Most text generation models (including the GPT family) are prone to making false statements due to their inability to ground their responses in the external world. This is because they were trained to sound truthful, not to be truthful. Models like LaMDA attempt to address this issue by incorporating a retrieval component (i.e., lookups into an external database) and iteratively improving the model response until it is consistent with the evidence. While promising, this strategy is not foolproof. It will be interesting to see how the community responds to this pressing challenge. In particular, will the much-anticipated GPT-4 rise to the occasion?
References
[1] LaMDA: Language Models for Dialog Applications (Thoppilan et al., 2022)
[2] Measuring Attribution in Natural Language Generation Models (Rashkin et al., 2022)