A brief history of language models

Breakthroughs on the way towards GPT — explained for non-experts

Dorian Drost
Towards Data Science


The overwhelming attention large language models like GPT get in the media today creates the impression of an ongoing revolution we are all in the middle of. However, even a revolution builds on the successes of its predecessors, and GPT is the result of decades of research.

In this post, I want to give an overview of some of the major research milestones in language modeling that eventually led to the large language models we have today. I will briefly describe what a language model is in general before discussing some of the core technologies that led the field at different times and that, by overcoming the hurdles and difficulties of their predecessors, paved the way for today's technologies, of which (Chat-)GPT may be the most famous representative.

What is a language model?

What is necessary to turn words into a language model? Photo by Glen Carrie on Unsplash

A language model is a machine learning model that predicts the next word given a sequence of words. It is as simple as that!

The main idea is that such a model must have some representation of human language. To some extent, it models the rules our language relies on. After seeing millions of lines of text, the model will capture the fact that things like verbs, nouns, and pronouns exist in a language and that they serve different functions within a sentence. It may also pick up patterns that come from the meaning of words, like the fact that “chocolate” often appears in a context of words like “sweet”, “sugar”, and “fat”, but rarely together with words like “lawnmower” or “linear regression”.

As mentioned, it arrives at this representation by learning to predict the next word given a sequence of words. This is done by analyzing large amounts of text to infer which word is likely to come next in a given context. Let’s take a look at how this can be achieved.

The starters

We have to start simply before we can think of more sophisticated technologies. Photo by Jon Cartagena on Unsplash

Let’s start with a first intuitive idea: given a large number of texts, we can count the frequency of each word in a given context, where the context is simply the words appearing before it. For example, we count how often the word “like” appears after the word “I”, how often it appears after the word “don’t”, and so on for every word that ever occurs before “like”. If we divide this by the frequency of the preceding word, we arrive at the probability P(“like” | “I”), read as the probability of the word “like” given the word “I”:

P(“like” | “I”) = count(“I like”) / count(“I”)

P(“like” | “don’t”) = count(“don’t like”) / count(“don’t”)

We could do that for every word pair we find in the text. However, there is an obvious drawback: the context used to determine the probability is only a single word. That means, if we want to predict what comes after the word “don’t”, our model does not know what came before the “don’t” and hence can’t distinguish between “They don’t”, “I don’t”, or “We don’t”.
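As an illustration, such a word-pair (bigram) model fits in a few lines of Python. The tiny corpus and whitespace tokenization here are invented for this sketch; a real model would be trained on millions of sentences:

```python
from collections import Counter

# A tiny toy corpus, invented for illustration.
corpus = "I like chocolate I don't like tea they like chocolate I like tea".split()

# Count single words and adjacent word pairs (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, context):
    """P(word | context) = count(context word) / count(context)."""
    return bigrams[(context, word)] / unigrams[context]

print(p("like", "I"))      # count("I like") / count("I") = 2/3
print(p("like", "don't"))  # count("don't like") / count("don't") = 1/1
```

Note that the function only ever looks at a single context word, which is exactly the limitation described above: P(“like” | “don’t”) is the same no matter what stood before the “don’t”.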

To tackle this problem, we can extend the context. So, instead of calculating P(“like” | “don’t”), we calculate P(“like” | “I don’t”), P(“like” | “they don’t”), P(“like” | “we don’t”), and so on. Extending the context to more words gives us an n-gram model, where n is the number of words considered as context. An n-gram is just a sequence of n words; “I like chocolate”, for example, is a 3-gram.

The larger n is, the more context the model can take into account for predicting the next word. However, the larger n is, the more probabilities we have to estimate, because there are many more different 5-grams than 2-grams, for example. The number of possible n-grams grows exponentially with n and quickly reaches a point where handling them becomes infeasible in terms of memory or computation time. Therefore, n-grams only allow us a very limited context, which is not enough for many tasks.
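The exponential growth is easy to make concrete: over a vocabulary of V words, there are V^n possible n-grams. The vocabulary size below is an assumed, but typical, order of magnitude:

```python
# Over a vocabulary of size V, there are V**n possible n-grams,
# so the table of counts explodes as the context grows.
V = 50_000  # assumed vocabulary size, typical for English text

for n in [2, 3, 5]:
    print(f"{n}-grams: up to {V**n:.2e} possible sequences")
```

Already for 3-grams the number of possible sequences vastly exceeds what any corpus could ever cover, which is why most n-gram counts end up being zero.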

Recurrent Neural Networks

Recurrent Neural Networks are performing the same steps over and over again. Photo by Önder Örtel on Unsplash

Recurrent Neural Networks (RNNs) introduced a way to solve the issues n-gram models have with bigger contexts. In an RNN, the input sequence is processed one word after another, producing a so-called hidden representation. The main idea is that this hidden representation includes all relevant information of the sequence so far and can be used in the next step to predict the next word.

Let’s look at an example: Say we have the sentence

The mouse eats the cheese

The RNN now processes one word after another (“The” first, then “mouse”, then “eats”, …), creates the hidden representation, and predicts which word is most likely to come next. When we arrive at the second “the”, the input for the model consists of the current word (“the”) and a hidden representation vector that encodes the relevant information of “The mouse eats”. This information is used to predict the next word (e.g. “cheese”). Note that the model does not directly see the words “The”, “mouse”, and “eats” at this step; they are encoded in the hidden representation.

Is that better than seeing the last n words, as an n-gram model would? Well, it depends. The main advantage of the hidden representation is that it can include information about sequences of varying lengths without growing exponentially. In a 3-gram model, the model sees exactly 3 words. If that is not enough to predict the next word accurately, it can’t do anything about it; it simply has no more information. The hidden representation used in RNNs, on the other hand, covers the whole sequence. However, it somehow has to fit all information into a fixed-size vector, so the information is not stored verbatim. If the sequence becomes longer, this vector can become a bottleneck that all relevant information has to pass through.

It may help you to think of the difference like this: The n-gram model only sees a limited context, but this context it sees clearly (the words as they are), while the RNNs have a bigger and more flexible context, but they see only a blurred image of it (the hidden representation).

Unfortunately, there is another disadvantage to RNNs: since they process the sequence one word after another, they cannot be trained in parallel. To process the word at position t, you need the hidden representation of step t-1, for which you need the hidden representation of step t-2, and so on. The computation therefore has to be done one step after another, both during training and during inference. It would be much nicer if you could compute the required information for each word in parallel, wouldn’t it?
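The recurrence can be sketched in a few lines of NumPy. The dimensions are toy sizes and the weights are random stand-ins for a trained model, but the loop makes the sequential dependency visible:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16  # toy dimensions, chosen arbitrarily

# Random weights stand in for a trained model.
W_xh = rng.normal(size=(d_hidden, d_in))      # input -> hidden
W_hh = rng.normal(size=(d_hidden, d_hidden))  # hidden -> hidden

def rnn_step(x_t, h_prev):
    """One recurrence step: the new hidden state mixes the current
    input word vector with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

# Process a sequence of 5 (random) word vectors strictly left to right.
xs = rng.normal(size=(5, d_in))
h = np.zeros(d_hidden)
for x_t in xs:            # each step needs the previous step's output,
    h = rnn_step(x_t, h)  # so this loop cannot be parallelized
```

Every iteration feeds on the output of the one before it, which is precisely why the whole sequence has to be walked through step by step.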

Attention to the rescue: Transformers

Attention is all about hitting the right spot. Photo by Afif Ramdhasuma on Unsplash

Transformers are a family of models that tackle the drawbacks of RNNs. They avoid the bottleneck problem of the hidden representation, and they can be trained in parallel. How do they do this?

The key component of the transformer models is the attention mechanism. Remember that in the RNN, there is a single hidden representation that has to include all information of the input sequence so far. To avoid the bottleneck that comes from having one representation for the whole sequence, the attention mechanism constructs a new hidden representation in every step that can draw information from any of the previous words. This allows the model to decide which parts of the sequence are relevant for predicting the next word and to focus on those by assigning them higher weight when calculating the probabilities of the next word. Say we have the sentence

When I saw Dorothy and the scarecrow the other day, I went to her and said “Hi

and we want to predict the next word. The attention mechanism allows the model to focus on the words that are relevant for the continuation and to ignore those that are irrelevant. In this example, the pronoun “her” must refer to “Dorothy” (and not “the scarecrow”), so the model must decide to focus on “Dorothy” and ignore “the scarecrow” when predicting the next word. For this sentence, it is much more likely to continue with “Hi, Dorothy” than with “Hi, scarecrow” or “Hi, together”.

An RNN would just have a single hidden representation vector that may or may not include the information required to decide whom the pronoun “her” refers to. With the attention mechanism, in contrast, a new hidden representation is created that includes much information from the word “Dorothy” but less from words that are not relevant at the moment. For predicting the word after that, a new hidden representation will be calculated again, and it may look very different, because the model might then want to put more focus on other words, e.g. “scarecrow”.

The attention mechanism has another advantage: it allows the training to be parallelized. As mentioned before, in an RNN you have to calculate the hidden representations one after another. In the transformer, the hidden representation at each step only needs the representations of the individual words. In particular, for calculating the hidden representation of step t, you don’t need the hidden representation of step t-1. Hence you can calculate both in parallel.
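A minimal sketch of this in NumPy, using random vectors as stand-ins for the learned query, key, and value projections of a real transformer, shows that all positions are handled in one matrix product rather than in a loop:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: every position attends to every
    position, and all of them are computed in one matrix product."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # relevance of each word pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the context
    return weights @ V                              # weighted mix of the values

rng = np.random.default_rng(0)
seq_len, d = 6, 16                 # toy sizes, chosen arbitrarily
Q = rng.normal(size=(seq_len, d))  # in a real model, Q, K, and V come from
K = rng.normal(size=(seq_len, d))  # learned projections of the word
V = rng.normal(size=(seq_len, d))  # embeddings

out = attention(Q, K, V)  # representations for all 6 positions at once
```

Note the contrast to the RNN loop: no row of `out` depends on another row of `out`, so the whole computation can run in parallel on a GPU.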

The increase in model sizes over the last years, which allows models to perform better and better, is only possible because it became technically feasible to train those models in parallel. With recurrent neural networks, we wouldn’t be able to train models with hundreds of billions of parameters and hence couldn’t use those models’ capabilities of interacting in natural language. The transformer’s attention mechanism can be seen as the last component that, together with large amounts of training data and sufficient computational resources, was needed to develop models like GPT and its siblings and to start the ongoing revolution in AI and language processing.

Summary

So, what have we seen in this post? My goal was to give you an overview of some of the major steps that were necessary to arrive at the powerful language models we have today. As a summary, here are the important steps in order:

  • The key aspect of language modeling is to predict the next word given a sequence of text.
  • n-gram models can represent a limited context only.
  • Recurrent Neural Networks have a more flexible context, but their hidden representation can become a bottleneck, and they can’t be trained in parallel.
  • Transformers avoid the bottleneck by introducing the attention mechanism, which allows the model to focus on specific parts of the context in detail. In addition, they can be trained in parallel, which is a requirement for training large language models.

Of course, many more technologies were required to arrive at the models we have today; this overview only highlights some very important key aspects. Which other steps would you say were relevant on the journey towards large language models?

Further reading

For more background and technical details you can take a look at what could be called the bible of language modeling:

The following papers introduce some milestones on the journey toward large language models.

Recurrent Neural Networks:

  • Elman, J. L. (1990). Finding structure in time. Cognitive science, 14(2), 179–211.
  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

Transformers:

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
