
GPT-3: The good, the bad and the ugly


Opinion

Will large language models change the way we develop NLP applications?

Photo by Raphael Schaller on Unsplash

If you follow the latest AI news, you have probably come across several stunning applications of the latest Language Model (LM) released by OpenAI: GPT-3. The applications this LM can fuel range from question answering to generating Python code, and the list of use cases is growing daily. Check out the following YouTube videos: GPT-3 demo and explanation, 14 cool GPT-3 apps, and 14 more GPT-3 apps.

GPT-3 is currently in beta and only a restricted number of people have access, but it is scheduled to be released to everybody on October 1st. OpenAI was very much interested in spreading the hype and showing amazing samples of cool applications. As of September 22, 2020, their strategy has obviously worked out: while I was writing this blog post, Microsoft announced that it had acquired the exclusive rights to the language model. OpenAI will probably continue to license access to the LM via an API, but the deal with Microsoft allows OpenAI to get a return on its investment of $4.6 million – the estimated cost of training this massive LM.

Because OpenAI has been quite successful in its marketing, enlisting many people to post fascinating examples that are, strictly speaking, only anecdotal evidence of the model's capabilities, one should view the current hype with some skepticism. People will most likely only post examples that confirm their bias that the machine "understands" language at a new level. At the same time, negative examples, such as the racist stories that are automatically generated when your prompt is "three muslims" (discussed further below), should raise concern that the model could do more harm than good.

Before I discuss "the Good, the Bad, and the Ugly" in more detail, let's briefly review what the main contribution of GPT-3 is. OpenAI released a previous version, GPT-2, last year. The technology has not changed since then; what has changed is the sheer scale of training data and parameters, resulting in an LM with 175 billion parameters, compared to currently used LMs such as T5 with 11 billion parameters. After training the model on data largely crawled from the internet, the authors were able to show that the system could reach or even beat state-of-the-art systems on various NLP tasks (e.g., question answering, machine translation). Most impressive, however, was the fact that the system was never explicitly trained on these tasks and was able to achieve reasonable performance with no, one, or just a few examples (i.e., zero-shot/one-shot/few-shot learning).

Comparison between in-context learning and fine-tuning (source: https://arxiv.org/abs/2005.14165)

The figure from the GPT-3 paper illustrates how GPT-3 can be told how to do a task with just a handful of examples, in contrast to the traditional approach of fine-tuning a deep learning model by feeding it lots of examples. In addition, fine-tuning requires you to define the solution space (i.e., the set of labels) in advance and to make sure you have enough examples in your training data for the machine to learn to distinguish the different classes. None of this is required when using GPT-3 (provided enough data for the task was available in the data that was fed to the LM).
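To make the contrast concrete, here is a minimal sketch of what the two approaches look like from a developer's perspective. The texts and labels are made up for illustration; the point is only that the few-shot variant needs no labeled dataset, no fixed label set, and no training pipeline.

```python
# Fine-tuning (traditional): a labeled dataset and a fixed label set are
# required before any training can start.
training_data = [
    ("The invoice is overdue.", "finance"),
    ("Patient shows elevated blood pressure.", "medical"),
    # ... typically hundreds or thousands more labeled examples per class
]
labels = ["finance", "medical", "legal"]  # solution space fixed in advance

# In-context (few-shot) learning: the "training data" is just a handful of
# demonstrations pasted into the prompt, followed by the new input.
few_shot_prompt = (
    "Text: The invoice is overdue.\nTopic: finance\n\n"
    "Text: Patient shows elevated blood pressure.\nTopic: medical\n\n"
    "Text: The defendant waived the right to a jury trial.\nTopic:"
)
# The LM is simply asked to continue the prompt; whatever it generates after
# "Topic:" is taken as the answer -- no gradient updates, no retraining.
```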

The Good

GPT-3 shows impressive results for a number of NLP tasks such as question answering (QA), code generation (or other formal languages/editorial assistance), and (fictional) story generation. These capabilities will most likely be incorporated into existing systems and improve on the current state of the art.

The GPT-3 paper shows, for example, impressive results for various QA tasks such as TriviaQA. It is quite promising that few-shot learning often yields better results than one-shot or zero-shot learning, indicating that with more labeled examples the LM may improve even further.

These are the applications that will probably benefit from GPT-3 the most:

  • Improving question answering (if used as an additional component of a complex QA system).
  • Supporting content creation (as we have already seen with auto-suggest). Drafting assistants for documents and query suggestions are applications that come to mind.
  • Easy access to a deep learning technology will make it possible to create powerful applications without much training data.

In particular, the last bullet point could lead to the assumption that it will be easy to create applications without much technical knowledge. The GPT-3 playground allows everybody to explore concepts that can be quickly tested via zero- or few-shot learning. The following example shows how somebody with just two examples could get answers to questions like "What are the elements of…?" I looked up the answers to the first two questions on the internet, and the answer to the third question was generated by the system. Without the two examples, the system's answers were too short and not informative, but just two examples put it on the "right" path (bold is my input, normal font is GPT-3's continuation):

Q: What are the elements of trespass?

A: Trespass is an invasion in to another’s exclusive right to possession of property. It is a direct unlawful interference with another’s property. Generally, the main elements of trespass are an unlawful intrusion or invasion upon a property, intent of intrusion, force, and consequent injury to an owner.

Q: What are the elements of fraud?

A: In the United States, common law generally identifies nine elements needed to establish fraud: (1) a representation of fact; (2) its falsity; (3) its materiality; (4) the representer’s knowledge of its falsity or ignorance of its truth; (5) the representer’s intent that it should be acted upon by the person in the manner reasonably contemplated; (6) the injured party’s ignorance of its falsity; (7) the injured party’s reliance on its truth; (8) the injured party’s right to rely thereon; and (9) the injured party’s consequent and proximate injury.

Q: What are the elements of a civil conspiracy?

A: (1) an agreement between two or more persons; (2) an overt act by one or more parties to further the purpose of the conspiracy; and (3) damages suffered by the injured party as a result of the conspiracy.

Although these answers sound very impressive (at least to me as a non-lawyer), one should be aware of the risks that using the LM without any further safeguards might bring.
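For readers with beta access, the same two-example prompt can also be sent programmatically. The following is a minimal sketch assuming the openai Python package and the completion endpoint of the 2020 beta API; the engine name, sampling parameters, and truncated example answers are my own illustrative choices, not taken from the article.

```python
import openai  # pip install openai; requires a beta API key

openai.api_key = "YOUR_API_KEY"  # placeholder

# The two "training" examples are simply prepended to the new question.
prompt = (
    "Q: What are the elements of trespass?\n"
    "A: Trespass is an invasion into another's exclusive right to possession of property. ...\n\n"
    "Q: What are the elements of fraud?\n"
    "A: In the United States, common law generally identifies nine elements needed to establish fraud. ...\n\n"
    "Q: What are the elements of a civil conspiracy?\n"
    "A:"
)

# Ask the LM to continue the prompt; the continuation is the "answer".
response = openai.Completion.create(
    engine="davinci",      # assumed engine name from the beta period
    prompt=prompt,
    max_tokens=150,
    temperature=0.3,
    stop=["\n\nQ:"],       # stop before the model invents a new question
)
print(response["choices"][0]["text"].strip())
```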

The Bad

One of the QA tasks GPT-3 was tested on was NaturalQS, which focuses on factual accuracy. GPT-3 underperformed on this task, whereas it got high marks for trivia questions. This behavior is troubling because it seems to indicate that question-answer pairs frequently found on the internet are more likely to be answered correctly, while the text understanding required to answer a complex question from a single passage is clearly beyond the capability of the LM. If a wrong answer sounds authoritative and is written in correct English, however, humans may not spot it so easily.

As a matter of fact, it is getting more and more difficult for humans to distinguish news written by a machine from articles written by humans. One of the experiments reported in the GPT-3 paper showed that humans have a hard time identifying machine-generated news. The larger the LM got, the more problems humans had correctly identifying machine-written news, and with the largest version of GPT-3 (175B parameters) the decision was basically a coin flip.

Another risk of using this LM unfiltered is the lack of grounding for its answers. Even though a generated sentence may provide the correct answer, there is no way to back up the statement. The language model is grounded only in the frequencies of words, not in a deep understanding of, for example, statutes and case law. A recent academic paper by Emily Bender and Alexander Koller offers a similar criticism, arguing that the meaning of language cannot be learned from LMs.

An even more devastating rebuke of GPT-3 was delivered by Gary Marcus and Ernest Davis in a recent MIT Technology Review article. They showed, via various continuations of complex situations that would require social, biological, physical, or other kinds of reasoning, that the model does not understand what it is generating (again, normal font is GPT-3's continuation):

You poured yourself a glass of cranberry juice, but then you absentmindedly poured about a teaspoon of grape juice into it. It looks okay. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

You are now dead.

Somehow GPT-3 thinks that grape juice is poisonous, although the internet offers many drink recipes that contain cranberries and grapes as ingredients. Moreover, the conclusion that the drink may be fatal seems to come out of nowhere. Marcus and Davis conclude that GPT-3 "[i]s a fluent spouter of bullshit, but even with 175 billion parameters and 450 gigabytes of input data, it’s not a reliable interpreter of the world."

In addition to these risks, the LM works well only for language generation, be it an answer or a fictional text. Other NLP tasks, on the other hand, cannot be solved as easily with the help of GPT-3. Typical tasks such as named entity extraction (i.e., labeling strings as company or person names) or text classification are more challenging for an LM.
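To illustrate why: a generative LM only ever produces free text, so a task like classification has to be shoe-horned into a completion prompt, and the output then has to be mapped back onto a fixed label set. A rough sketch of that workaround follows; the labels, prompt format, and post-processing are my own illustration, not an official recipe.

```python
def build_classification_prompt(text: str) -> str:
    """Recast a labeling task as text completion with two demonstrations."""
    return (
        "Decide whether the sentence is about a COMPANY or a PERSON.\n\n"
        "Sentence: Tim Cook presented the quarterly results.\nLabel: PERSON\n\n"
        "Sentence: Apple released a new phone.\nLabel: COMPANY\n\n"
        f"Sentence: {text}\nLabel:"
    )

VALID_LABELS = {"COMPANY", "PERSON"}

def parse_label(completion: str) -> str:
    """Map free-form model output back onto the fixed label set."""
    candidate = completion.strip().split()[0].upper() if completion.strip() else ""
    # Nothing forces the LM to stay inside the label set, so the output must
    # be validated -- and some policy is needed when it drifts off-script.
    return candidate if candidate in VALID_LABELS else "UNKNOWN"
```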

The Ugly

It’s a well-known fact that NLP applications such as chatbots can sometimes be difficult to control, and one may end up with a program that spews out racist or sexist comments, as Microsoft had to learn when it released its chatbot Tay in 2016. To their credit, OpenAI addressed this problem right from the start: generated content that is toxic or simply political is flagged with a warning. It remains to be seen how they will control applications that may accidentally (or purposefully) generate racist or sexist language.

Warning generated by Playground at beta.openai.com (image by author)

Other beta users were also quick to point out that prompting GPT-3 with "three muslims" will often lead to text in which they are depicted as terrorists or criminals. My own experiments confirmed this bias, and I found a similar tendency toward stereotypical portrayals when I prompted the LM with other religious groups or nationalities.

Debiasing LMs is an active research topic in the community, and I expect to see even more activity in this area. OpenAI is clearly aware of this and devotes a lot of space in the terms of use to how their API should and shouldn’t be used.

Conclusions

Despite the restrictions and the possibly toxic text GPT-3 may generate, I believe this LM is a fascinating new tool that will probably trigger improvements in NLP tasks that require generating language. Combined with other technology and the respective safeguards, it will push the AI capabilities we can use for our products even further. People may also come up with new applications of this technology that nobody has really thought of yet. Translating legalese to plain English may only be the start of the innovation this technology will spur.

