A beginner’s guide to language models

The closest thing we have to an AI

Mor Kapronczay
Towards Data Science


The science of extracting information from textual data has changed dramatically over the past decade. As Natural Language Processing replaced Text Mining as the name of the field, its methodology changed tremendously as well. One of the main drivers of this change was the emergence of language models as a basis for many applications aiming to distill valuable insights from raw text.

I recently completed the Natural Language Processing Specialization on Coursera created by the deeplearning.ai team. One of the things I was fascinated by is the evolution of language models over the past years. I assume many of you have heard about GPT-3 and the potential threats it poses. But how did we come this far? How can a machine produce an article that convincingly mimics a journalist's writing?

What is a language model?

Language models — among other things — can suggest the next word we type. Source

A language model is basically a probability distribution over words or word sequences. In practice, a language model gives the probability of a certain word sequence being “valid”. Validity in this context does not refer to grammatical validity at all; it means that the sequence resembles how people speak (or, to be more precise, write), which is what the language model learns. This is an important point: there is no magic to a language model, any more than to other machine learning models such as deep neural networks. It is “just” a tool that condenses abundant information into a concise form that is reusable in an out-of-sample context.
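
To make this tangible, here is a minimal Python sketch of the idea: a toy bigram model that turns raw counts from a tiny corpus into a probability distribution over the next word. It is an illustration only, not how production language models are built.

```python
# A "language model" as a probability distribution over next words,
# estimated from raw counts in a tiny corpus (bigram model).
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each preceding word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word_distribution(word):
    """Return P(next | word) as a dict of word -> probability."""
    counts = following[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("the"))  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```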

What can a language model do for us?

The abstract understanding of natural language, which is necessary to infer word probabilities from context, can be used for a number of tasks. Lemmatization or stemming aims to reduce a word to its most basic form, thereby dramatically decreasing the number of tokens. These algorithms work better if the part-of-speech role of the word is known: a verb’s suffixes can be utterly different from a noun’s suffixes, hence the rationale for part-of-speech tagging (or POS tagging), a common task for a language model.
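
As a brief illustration, here is what POS tagging and lemmatization look like with spaCy, assuming the library and its small English model are installed (pip install spacy, then python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats were hanging on their feet.")

for token in doc:
    # The lemma depends on the part of speech: "were" (verb) -> "be".
    print(f"{token.text:10} {token.pos_:6} {token.lemma_}")
```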

Language models on Reddit. Source

With a good language model, we can perform extractive or abstractive summarization of texts. If we have models for different languages, a machine translation system can be built easily. Less straightforward use cases include question answering (with or without context; see the example at the end of the article). Language models can also be used for speech recognition, OCR, handwriting recognition and more. There is a whole spectrum of opportunities.

Types of language models

The probability formula for 3-grams, or trigrams. C means counts, w stands for word, the superscript marks an n-gram span, and the subscript is the word-position index. Image courtesy of deeplearning.ai. Source.

It is important to note the difference between

a) probabilistic methods, and

b) neural-network-based modern language models.

A simple probabilistic language model (a) is constructed by calculating n-gram probabilities (an n-gram being an n-word sequence, n being an integer greater than 0). An n-gram’s probability is the conditional probability that the n-gram’s last word follows a particular (n-1)-gram (the n-gram with its last word left out). In practice, it is estimated as the proportion of occurrences of that (n-1)-gram that are followed by the last word. This embodies a Markov assumption: given the (n-1)-gram (the present), the n-gram probabilities (the future) do not depend on the n-2-gram, n-3-gram, and so on (the past).
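
Here is a minimal Python sketch of this estimate, matching the count-ratio formula in the figure above. It is a toy example; real systems add smoothing for unseen n-grams.

```python
# Trigram estimate: P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2), from plain counts.
from collections import Counter

words = "i like green eggs and i like ham and i like eggs".split()

trigram_counts = Counter(zip(words, words[1:], words[2:]))
bigram_counts = Counter(zip(words, words[1:]))

def trigram_prob(w1, w2, w3):
    """Conditional probability that w3 follows the bigram (w1, w2)."""
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("i", "like", "eggs"))  # 1 of 3 "i like" occurrences -> ~0.33
```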

There are evident drawbacks to this approach. Most importantly, only the preceding n-1 words affect the probability distribution of the next word. Complicated texts have deep context that can decisively influence the choice of the next word. Thus, the next word might not be evident from the previous words, even if n is 20 or 50. Moreover, a later term can influence an earlier word choice: the word United is much more probable if it is followed by States of America. Let’s call this the context problem.

On top of that, this approach evidently scales poorly: as n increases, the number of possible n-grams skyrockets (with a vocabulary of just 10,000 words there are 10,000³, or 10¹², possible trigrams), even though most of them never occur in the text. And all the probabilities of the occurring n-grams (or all n-gram counts) have to be calculated and stored! In addition, non-occurring n-grams create a sparsity problem: the granularity of the probability distribution can be quite low, as word probabilities take few distinct values, so most words end up with the same probability.

Embeddings eliminate the sparsity problem. Source

Neural-network-based language models (b) ease the sparsity problem through the way they encode inputs. An embedding layer maps each word to a vector of a chosen, fixed size that also captures semantic relationships (if you are not familiar with word embeddings, I suggest reading this article). These continuous vectors create the much-needed granularity in the probability distribution of the next word. Moreover, the language model is practically a function (as all neural networks are, built from many matrix computations), so it is not necessary to store all n-gram counts to produce the probability distribution of the next word.
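
The sketch below, written with PyTorch and using illustrative, untuned sizes, shows this shape of model: an embedding layer, a dense layer, and an output layer producing next-word logits over the vocabulary. It is schematic, not any published model’s exact architecture.

```python
import torch
import torch.nn as nn

class TinyNeuralLM(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, context_size=3, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context_ids):              # (batch, context_size)
        e = self.embed(context_ids).flatten(1)   # concatenate context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                       # logits over the vocabulary

model = TinyNeuralLM()
logits = model(torch.randint(0, 10_000, (2, 3)))  # two 3-word contexts
probs = torch.softmax(logits, dim=-1)             # next-word distribution
```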

The evolution of language models

Even though neural networks solve the sparsity problem, the context problem remains. Language models have evolved along two lines. First, solving the context problem more and more efficiently: bringing more and more context words in to influence the probability distribution. Second, creating architectures that give the model the ability to learn which context words are more important than others.

The first model, which I outlined previously, is basically a dense (or hidden) layer and an output layer stacked on top of a Continuous Bag-of-Words (CBOW) Word2Vec model. A CBOW Word2Vec model is trained to guess a word from its context (a Skip-Gram Word2Vec model does the opposite: it guesses the context from the word). In practice, it is trained on many examples of the following structure: the inputs are the n words before and/or after a word, and the output is that word. We can see that the context problem is still intact.
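
For reference, here is how the CBOW/Skip-Gram switch looks in gensim, assuming the library is installed; the sg flag toggles between the two training objectives:

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "ate", "the", "bone"]]

# sg=0 trains CBOW (guess the word from context); sg=1 trains Skip-Gram.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)  # (50,)
```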

RNNs

ELMo’s architectural schema. Source

An improvement in this regard is the use of Recurrent Neural Networks (RNNs) (if you’d like a thorough explanation of RNNs, I suggest reading this article). Whether built from LSTM or GRU cells, such a network takes all previous words into account when choosing the next word. For a further explanation of how RNNs achieve long memory, please refer to this article. AllenNLP’s ELMo takes this notion further by utilising a bidirectional LSTM, so that context both before and after the word counts.

Embeddings from Language Models, or ELMo. Source
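
A minimal PyTorch sketch of an RNN language model follows; it is schematic rather than ELMo’s actual architecture. The LSTM carries a hidden state across the whole sequence, so every previous word can influence the next-word prediction, not just the last n-1 words.

```python
import torch
import torch.nn as nn

class TinyLSTMLM(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # ELMo-style models stack bidirectional LSTMs:
        # nn.LSTM(..., bidirectional=True)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):            # (batch, seq_len)
        e = self.embed(token_ids)            # (batch, seq_len, embed_dim)
        h, _ = self.lstm(e)                  # hidden state at every position
        return self.out(h)                   # next-word logits at every position

model = TinyLSTMLM()
logits = model(torch.randint(0, 10_000, (2, 12)))  # two 12-token sequences
```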

Transformers

The main drawback of RNN-based architectures stems from their sequential nature. As a consequence, training times soar for long sequences because there is no possibility of parallelization. The solution to this problem is the transformer architecture. It is worth reading the original paper from Google Brain.

The GPT models from OpenAI and Google’s BERT utilise the transformer architecture as well. You can read more about their architectures here: GPT-3, BERT. These models also employ a mechanism called attention, by which the model learns which inputs deserve more attention than others in a given case.
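
At the heart of the transformer is scaled dot-product attention. The sketch below (plain PyTorch, a single head, no masking) shows the core computation: each position scores every other position, and the softmax weights decide which inputs deserve more attention.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., seq, seq) scores
    weights = torch.softmax(scores, dim=-1)         # attention weights
    return weights @ v                              # weighted mix of values

q = k = v = torch.randn(1, 5, 16)   # batch of 1, sequence of 5, dim 16
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 5, 16])
```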

In terms of model architecture, the main quantum leaps were, first, RNNs (specifically LSTM and GRU cells), which solved the sparsity problem and let language models use far less disk space, and subsequently the transformer architecture, which made parallelization possible and introduced attention mechanisms. But architecture is not the only aspect in which a language model can excel.

Question answering accuracies for GPT-1 (green), GPT-2 (orange) and GPT-3 (blue). Source

Compared to the GPT-1 architecture, GPT-3 has virtually nothing novel. But it is huge: it has 175 billion parameters, and it was trained on the largest corpus any model has ever seen, Common Crawl. This is partly possible because of the semi-supervised training strategy of a language model: a text can be used as a training example with some words omitted. The incredible power of GPT-3 comes from the fact that it has read more or less all of the text that has appeared on the internet over the past years, and it has the capability to reflect most of the complexity natural language carries.
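
The following toy snippet illustrates why plain text suffices as training data: every position in a sentence yields a (context, next word) training pair for free, with no human labeling.

```python
# Turn raw text into (context, target) training pairs.
text = "language models predict the next word from context".split()

pairs = [(text[:i], text[i]) for i in range(1, len(text))]
for context, target in pairs[:3]:
    print(context, "->", target)
# ['language'] -> models
# ['language', 'models'] -> predict
# ['language', 'models', 'predict'] -> the
```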

Trained for multiple purposes

Finally, I’d like to show you the T5 model from Google. Previously, language models were used for standard NLP tasks, like part-of-speech (POS) tagging or machine translation, with slight modifications. For example, with a little retraining, BERT can be a POS tagger, because of its abstract ability to understand the underlying structure of natural language. Here is an implementation of it on GitHub.

T5 capabilities with no retraining. Source

With T5, there is no need for any modification for NLP tasks. If it gets a text with some <M> tokens in it, it knows that those tokens are gaps to fill with the appropriate words. It can also answer questions: if it gets some context after the question, it searches that context for the answer; otherwise, it answers from its own knowledge. Fun fact: in a trivia quiz, it beat its own creators! The picture on the left shows some other use cases as well.
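
As a hedged sketch with Hugging Face’s transformers library (assuming it is installed, along with sentencepiece), here is T5 performing translation purely from how the task is phrased in the input, with no retraining:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is stated in plain text; T5 treats every NLP task as text-to-text.
inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```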

Conclusion

Personally, I think this is the field where we are closest to creating an AI. There is a lot of buzz around this word, and many simple decision systems and almost any neural network get called AI, but this is mainly marketing. According to the Oxford Dictionary of English, and just about any other dictionary, artificial intelligence means human-like intelligence capabilities performed by a machine. In fairness, transfer learning shines in computer vision too, and the notion of transfer learning is essential for an AI system. But the very fact that the same model can perform a wide range of NLP tasks and can infer what to do from the input alone is spectacular, and it brings us one step closer to actually creating human-like intelligence systems.
