How Neural Networks Are Learning to Write

An overview of the evolution of NLP models for writing text

Erick Fonseca
Towards Data Science


On an online forum for historians, a user asks a question:

Did the average Soviet citizen have a sense of humor?

Shortly after, they receive a reply five paragraphs long. Before elaborating a detailed answer, it starts with a reasonable introductory explanation:

In the Soviet Union, there was a sort of universal sense of humor that was often associated with Soviet identity and with the Communist Party.

Some troll replies to the post with a useless hi, and is promptly reproached by a moderator. At first sight, there’s nothing particularly unusual about this thread. Except… the posts on this forum are not from humans, but from artificial intelligence agents, or bots.

The post I referred to is but an example from r/SubSimulatorGPT2, a subreddit dedicated exclusively to posts automatically generated by a bot based on OpenAI's GPT-2. If you think the passage above about Soviet humor was just something it memorized and copied, that’s most likely not the case: a Google search shows no occurrence of even half that sentence.

Having bots talk to each other on Reddit is not new: r/SubredditSimulator had already been doing that for a few years. The difference here is a huge leap in the quality of the generated text. While the older bots can amuse you with some random nonsense, the GPT-2-fueled ones can actually make you believe for a while that they’re actual people, which often makes it even funnier when they finally let out some absurd statement.

How come there are AI models this smart out there? Well, things did not come to this overnight. Research on models capable of replicating human language has been going on for a few decades, but only recently did it become really impressive. Here I’m going to give an overview of how language modeling evolved, and why it is so useful beyond creating amusing bots.

The Evolution of Text Generation

Markov Chains and N-grams

The bots from the older r/SubredditSimulator use Markov chains, a well-established technique for generating sequences. Like modern neural networks, they learn from data, but are much simpler.

A model based on Markov chains assumes that each word in a sentence depends only on the few words immediately before it. Thus, it models the probability of observing a given sentence as the product of the probabilities of all n-grams (sequences of n words) that compose it. The probability of a word given the n-1 words before it can be estimated by counting: the number of times that particular n-gram appears in the training data, divided by the number of times its first n-1 words appear.

Training a Markov chain model consists basically of estimating these probabilities from text data. The following figure illustrates the concept:

A short sentence decomposed into trigram probabilities. N-grams near the sentence beginning usually include a pseudo-token <START> to indicate their position.

In the example above, a Markov chain based on sequences of three words (or trigrams) determines the probability of the token chocolate following I like. Next, it would determine the probability of seeing the token cake following like chocolate, no longer taking the token I into account.

Then, to generate a new text, we just sample one word at a time from the probabilities that the model gives us. This kind of procedure, in which the result for each step depends on what the model predicted in the previous one, is called autoregressive.

Sampling is what gives us some variety in the model output. If we always pick the most likely word according to the model, it will always produce the same texts.
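
To make this concrete, here is a minimal sketch of a trigram Markov chain in Python. The toy corpus and the <START>/<END> pseudo-tokens are my own illustrative choices, but the counting and the autoregressive sampling follow the procedure described above.

```python
import random
from collections import defaultdict, Counter

# Toy training corpus (illustrative).
corpus = [
    "i like chocolate cake",
    "i like chocolate ice cream",
    "you like vanilla cake",
]

# Count how often each word follows a two-word context.
counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<START>", "<START>"] + sentence.split() + ["<END>"]
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1

def sample_next(context):
    """Sample the next word from the estimated trigram distribution."""
    candidates = counts[context]
    total = sum(candidates.values())
    words = list(candidates)
    weights = [candidates[w] / total for w in words]
    return random.choices(words, weights=weights)[0]

def generate():
    """Generate one sentence autoregressively, one sampled word at a time."""
    context, output = ("<START>", "<START>"), []
    while True:
        word = sample_next(context)
        if word == "<END>":
            return " ".join(output)
        output.append(word)
        context = (context[1], word)

print(generate())  # e.g. "i like chocolate cake" or "you like vanilla cake"
```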

The catch is that, if we decompose the probability like this, we would never find anything odd about a sentence such as I like chocolate cake and I like chocolate cake and I like chocolate cake: every sequence of three tokens sounds perfectly normal, and the model does not care beyond that.

If we try to look at longer sequences, we see an exponential explosion. Considering a small vocabulary size of 10 thousand (10⁴) words, we have 100 million (10⁸) possible bigrams, one trillion (10¹²) possible trigrams and ten quadrillion (10¹⁶) possible four-grams! Most of these are pointless, such as banana banana maelstrom banana, but this is part of the problem: a lot of perfectly fine 4-grams will also not appear in the training data, and the model has no way of telling apart which sequences are absurd and which are fine but were simply unlucky not to be seen before.

Banana banana maelstrom banana

A way to circumvent the sparsity of longer n-grams is to use a language model that combines them with shorter ones. So, maybe our training corpus has no occurrences of universal sense of humor, but it does have a few of sense of humor. Meanwhile, even the bigrams banana banana and banana maelstrom are unheard of. That surely helps, but at the price that we can’t trust the longer n-grams as much.
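
Here is a minimal sketch of one such combination, a linear interpolation between trigram, bigram and unigram estimates. The toy corpus and the interpolation weights are illustrative assumptions; in practice the weights are tuned on held-out data.

```python
from collections import defaultdict, Counter

corpus = ["i like chocolate cake", "you have a good sense of humor"]

# Unigram, bigram and trigram counts over the toy corpus.
uni_counts, bi_counts, tri_counts = Counter(), defaultdict(Counter), defaultdict(Counter)
for sentence in corpus:
    tokens = ["<START>", "<START>"] + sentence.split()
    for i, w in enumerate(tokens[2:], start=2):
        uni_counts[w] += 1
        bi_counts[tokens[i - 1]][w] += 1
        tri_counts[(tokens[i - 2], tokens[i - 1])][w] += 1
total_words = sum(uni_counts.values())

def interpolated_prob(w, context, lambdas=(0.6, 0.3, 0.1)):
    """P(w | context) as a weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    w1, w2 = context  # the two previous words
    tri = tri_counts[(w1, w2)][w] / max(sum(tri_counts[(w1, w2)].values()), 1)
    bi = bi_counts[w2][w] / max(sum(bi_counts[w2].values()), 1)
    uni = uni_counts[w] / max(total_words, 1)
    return l3 * tri + l2 * bi + l1 * uni

# A word after an unseen trigram still gets some probability mass
# from the bigram and unigram terms.
print(interpolated_prob("humor", ("sense", "of")))
```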

N-gram counts were the best we had for language modeling for many years, and they got quite good with the massive Google n-grams. Still, we can see they are very brittle, and sentences generated by a Markov chain longer than five or six words hardly ever make any sense — again, you can check the older subreddit simulator.

Word Embeddings and Neural Language Models

Word embeddings are one of the first things anyone learns in NLP nowadays: a projection of words into a multidimensional space. Their great advantage is that words with similar usage or meaning get similar vectors, as measured by cosine similarity. Thus, matrix multiplications involving the vectors of similar words tend to give similar results.

This is the basis for neural network-based language models. Instead of viewing each word as an atomic symbol, we can now treat them as dense vectors with a few hundred dimensions and perform numeric operations on them.
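
As a tiny illustration, here is how cosine similarity between word vectors can be computed. The four-dimensional toy vectors are made up for this example; real embeddings have a few hundred dimensions and are learned from large corpora.

```python
import numpy as np

# Made-up toy embeddings, just to illustrate the similarity computation.
embeddings = {
    "cake":    np.array([0.9, 0.1, 0.3, 0.0]),
    "pie":     np.array([0.8, 0.2, 0.4, 0.1]),
    "tractor": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["cake"], embeddings["pie"]))      # high: similar words
print(cosine(embeddings["cake"], embeddings["tractor"]))  # low: unrelated words
```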

In its simplest form, a neural language model looks at an n-gram, maps each of its words to the corresponding embedding vector, concatenates these vectors and feeds them to one or more hidden layers. The output layer determines the probability of each word in the vocabulary coming next, computed as a softmax over scores.

It’s interesting to see that the neural model does not count occurrences to determine probabilities, but instead learns parameters (weight matrices and biases) that can compute them for any input. This way, we don’t need to fall back to shorter n-grams when our counts of longer ones are not reliable. Better yet, we can even compute a reasonable probability distribution for the next word after an n-gram we never saw in the training data!

A simple MLP (multilayer perceptron) language model predicting the next word given the last three.
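
The following sketch shows what such a model could look like in PyTorch. The framework choice, the layer sizes and the use of tanh are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class MLPLanguageModel(nn.Module):
    """Embed a fixed window of words, concatenate, and predict the next word."""

    def __init__(self, vocab_size, embed_dim=100, context_size=3, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) word indices
        vectors = self.embed(context_ids)            # (batch, context, embed_dim)
        concatenated = vectors.flatten(start_dim=1)  # (batch, context * embed_dim)
        hidden = torch.tanh(self.hidden(concatenated))
        logits = self.output(hidden)                 # one score per vocabulary word
        return torch.log_softmax(logits, dim=-1)     # log-probabilities of the next word

model = MLPLanguageModel(vocab_size=10_000)
fake_context = torch.randint(0, 10_000, (1, 3))  # e.g. the indices of "I like chocolate"
next_word_logprobs = model(fake_context)         # shape (1, 10000)
```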

The first neural language model was proposed in 2003, one decade before the deep learning era. Back then, no one ran neural nets on GPUs, computers were slower, and we hadn’t yet discovered a lot of the tricks commonly used nowadays. These models would slowly become more popular, but the real breakthrough only happened with Recurrent Neural Networks.

Recurrent Neural Networks

Arguably the greatest improvement in language generation came with the advent of Recurrent Neural Networks (RNNs), and more specifically, Long Short-Term Memories (LSTMs). Unlike the simplest networks I mentioned before, the context of an RNN is not limited to only n words; it does not even have a theoretical limit.

There's a great post by Andrej Karpathy which explains how they work and shows a lot of examples in which they learn to produce text resembling Shakespearian plays, Wikipedia articles, and even C code.

The main improvement of an RNN in comparison with the simple networks I showed before is that it keeps an internal state, that is, a vector that serves as the memory of the network. So, instead of looking only at a fixed window, an RNN can read continuously word after word, updating its internal state to reflect the current context.

Text generation with RNNs follows a similar rationale to Markov chains, in an autoregressive fashion. We sample the first word, feed it to the neural network, get the probabilities of the next word, sample one, and so on, until we sample a special end-of-sentence token.
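
As a sketch, that generation loop could look like this in PyTorch. The LSTM model here is untrained and all sizes and special token ids are illustrative, but the loop itself mirrors the autoregressive procedure just described.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_id, state=None):
        """Read one word, update the internal state, return next-word probabilities."""
        vector = self.embed(word_id).unsqueeze(1)   # (1, 1, embed_dim)
        out, state = self.lstm(vector, state)
        probs = torch.softmax(self.output(out[:, -1]), dim=-1)
        return probs, state

def generate(model, start_id, end_id, max_len=50):
    word, state, generated = torch.tensor([start_id]), None, []
    for _ in range(max_len):
        probs, state = model.step(word, state)      # the state carries the context
        word = torch.multinomial(probs, num_samples=1).squeeze(1)  # sample the next word
        if word.item() == end_id:
            break
        generated.append(word.item())
    return generated

model = LSTMLanguageModel(vocab_size=10_000)  # untrained, so the output is random
print(generate(model, start_id=0, end_id=1))
```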

RNNs became very popular in NLP around 2014 and 2015, and are still very widely used. However, their elegant architecture with a mindful internal state is a bit of a liability sometimes. Let me illustrate with this example:

The boys that came with John are very smart.

Note that are agrees with boys, not with John, even though John is closer. In order to properly generate a sentence like this, an RNN needs to keep track of the plural noun until it generates a matching verb. This example is rather simple, but things get harder once you have longer sentences, especially in languages more fond of inflections and long-distance dependencies. It would be great if we didn’t need to cram so much information into a limited-space memory, and could instead just look back at the previous words to check what’s still missing.

The RNN internal state keeps track of information about all seen words. In this example, the colors show how it can get increasingly difficult to fit everything into a limited space.

To alleviate this, the attention mechanism was introduced. It allows an RNN to do exactly that: to look back at all previous words before producing its next output. Computing attention essentially means computing some distribution over past words (that is, weighting each of them such that the sum of weights equals 1), and then aggregating the vectors of those words proportionally to the received attention. The following image illustrates the concept.

The attention mechanism allows an RNN to look back at the outputs of previous words, without having to compress everything into the hidden state. The condensed RNN block before the intermediate outputs is the same as when not using attention.
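
In its simplest dot-product form, that computation can be sketched like this. The dimensions and the random inputs are illustrative; real systems usually learn extra projection matrices for the attention scores.

```python
import torch

hidden_dim = 8
previous_outputs = torch.randn(5, hidden_dim)  # one vector per previously seen word
query = torch.randn(hidden_dim)                # the current RNN state

scores = previous_outputs @ query              # one score per past word
weights = torch.softmax(scores, dim=0)         # attention distribution, sums to 1
context = (weights.unsqueeze(1) * previous_outputs).sum(dim=0)  # weighted sum of past vectors

print(weights)         # how much attention each past word receives
print(context.shape)   # torch.Size([8]); combined with the RNN state to produce the next output
```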

But there’s still an inconvenience. Training an RNN does not take advantage of parallelizable hardware operations, since we need to process each word i before looking at word i+1.

Besides that, did you notice how the intermediate output vectors in the figure above get more colorful after each word? That’s to symbolize that words further down in the sentence had access to the ones before them, while the ones at the beginning could not see the words coming afterward. Modeling a word conditioned on what comes after it may sound counterintuitive at first, but we do it all the time. Consider noun-noun compounds in English, such as the chocolate cake in the first example. As soon as we hear or read the word cake, we know that chocolate is just describing it, and is not the head of the phrase. We can do this because we can backtrack the information in our minds, but an RNN reading left to right can’t. So, while RNNs can do a great job, there’s still room for improvement.

Transformers

The Transformer is a neural network architecture introduced in 2017 to address the shortcomings of RNNs. Its key idea is to rely heavily on attention, to the point of not needing an internal state or recurrence at all.

Each hidden layer in a Transformer has an attention component followed by a standard feedforward (or MLP) transformation. First, the Transformer computes how much attention each word should pay to all words, including itself and the ones after it. As in the RNN, the vectors of these words are scaled by the proportion of attention received and summed, yielding a context-aware vector. This resulting vector then goes through a feedforward transformation, and the result is the output of that Transformer layer.

Transformers usually have a high number of layers, which allows them to learn increasingly more complex interactions among words. And in order to take word order into account, input word embeddings are augmented with additional vectors encoding their positions.

This was a rather simplified description of the Transformer, and one such layer is illustrated below. The actual architecture is quite complex, and you can find a great explanation in more detail in the post The Illustrated Transformer.

Simplified view of the first layer of a transformer network. The same weights are used in all the attention and feedforward operations in a given layer.
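
For the curious, here is a minimal single-head sketch of such a layer in PyTorch. Residual connections, layer normalization and multiple attention heads are omitted, so this is an illustration of the idea rather than the actual architecture.

```python
import math
import torch
import torch.nn as nn

class TinyTransformerLayer(nn.Module):
    """One simplified Transformer layer: self-attention followed by a feedforward block."""

    def __init__(self, dim=64, ff_dim=256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.feedforward = nn.Sequential(
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim)
        )

    def forward(self, x, mask=None):
        # x: (sequence_length, dim), one vector per word (embeddings plus positions)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.T / math.sqrt(x.size(-1))  # how much each word attends to each other word
        if mask is not None:
            scores = scores.masked_fill(mask, float("-inf"))  # blocked positions get zero attention
        weights = torch.softmax(scores, dim=-1)
        attended = weights @ v                    # a context-aware vector per word
        return self.feedforward(attended)

layer = TinyTransformerLayer()
words = torch.randn(6, 64)   # e.g. the vectors of "I like chocolate cake very much"
print(layer(words).shape)    # torch.Size([6, 64])
```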

Stacking six, 12 or even 24 such layers gives us very richly encoded vectors. On top of the last one, we can place our final output layer to produce a probability distribution for the next word, as in the other models.

When generating a new sentence, we have to rerun the whole Transformer after each new word. This is necessary in order to allow the previous words to take the new one into account, something RNNs aren’t capable of.

In the chocolate cake example, a Transformer can make chocolate attend to cake in its very first layer. As such, its vector encodes more precise information about its meaning, which is then propagated to the upper attention layers.

I also mentioned before that RNNs don’t let us take advantage of parallel operations. You may have noticed that, since the Transformer also has to generate new words step by step, the same would seem to apply here. Well, it does when running your model, but not when training it.

During training, we can have a Transformer predict all the words in a sentence in parallel. We only need to mask the word at the position i we want to predict, along with the ones following it, so the network can only look at past positions. There is no hidden state: all the hidden layers in the Transformer are recomputed at each time step regardless of the previous one.
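
That masking boils down to a simple triangular matrix, as sketched below (the sequence length is illustrative). A mask like this can be passed to an attention layer such as the one sketched earlier, so that position i only attends to positions 0 through i.

```python
import torch

seq_len = 5
# True above the diagonal means "blocked": a word cannot attend to the words after it.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
print(causal_mask.long())
# tensor([[0, 1, 1, 1, 1],
#         [0, 0, 1, 1, 1],
#         [0, 0, 0, 1, 1],
#         [0, 0, 0, 0, 1],
#         [0, 0, 0, 0, 0]])
```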

Where we are now

Transformers are currently the state of the art in many NLP tasks, including language modeling. Researchers are constantly proposing some improvements here and there, but the overall architecture has remained nearly unchanged.

The GPT-2 model I mentioned at the beginning of this post is a Transformer instance. That particular one had 345 million parameters, was trained by OpenAI on a huge collection of texts, and then fine-tuned on a few tens of megabytes of text from Reddit.

The pre-trained models provided by OpenAI are great at generating text in general, but to make something more specific, you need to fine-tune them, that is, train them further with some data you are interested in. In the subreddit simulator, there are versions of GPT-2 fine-tuned on over 100 subreddits talking to each other, and you can clearly see how each of them learned its own style and idiosyncrasies. That can be particularly fun in threads with bots fine-tuned on different subreddits!
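
If you want to play with the pre-trained model yourself, something like the sketch below works, assuming the Hugging Face transformers library (the subreddit bots run their own fine-tuned versions; this just samples from the generic pre-trained GPT-2).

```python
# Sampling from pre-trained GPT-2 with the Hugging Face `transformers` package
# (an assumption; the article does not prescribe any particular tooling).
# Fine-tuning would train this same model further on your own text before generating.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Did the average Soviet citizen have a sense of humor?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Autoregressive sampling, just like the simpler models above.
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,   # sample instead of always taking the most likely word
    top_k=50,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```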

By the way, that pre-trained model was not even OpenAI’s best. They had trained an even more powerful one, with 1.5 billion parameters, but decided not to publish it for fear of malicious usage. That sparks some interesting questions about what dangers AI poses: not the killer robots from science fiction, but a prolific writer that can argue or create stories tirelessly. Many people inside and outside machine learning disagree that such language models are dangerous, but this episode at least shows that the repercussions of AI are becoming a more important topic of debate.

What else is there to it?

Ok, but what use are these models, besides creating amusing Reddit threads? Well, a lot of NLP applications, for sure! You probably already realized that a very straightforward one is the next-word suggestions on your cell phone’s keyboard, and that’s right.

Any NLP task involving text generation can benefit from these models. Speech recognition is one example: here, the language model has to find the most likely word sequence conditioned on some voice input, instead of just previous words.

Another natural application is translation: this is a task of finding the most likely word sequence given some text in another language. In fact, machine translation was exactly the application that the original paper on transformers dealt with.

The usefulness of neural language models doesn’t need to be restricted to language generation. Once a model is capable of generating text that looks almost human, it means it learned a lot about how words interact. This knowledge, encoded in its many parameters, is very useful to initialize new NLP models for a lot of different tasks: text classification, entity recognition, question answering, and many others.

Taking advantage of pre-trained models to initialize new ones is called transfer learning, and it is being used very often nowadays with great results; in fact, it is the main motivation for training these huge neural language models. This is a very interesting line of work that deserves a post of its own.
