Getting Started

Practical text generation using GPT-2, LSTM and Markov Chain

Overview of word-level NLG models

Klaudia Nazarko
Towards Data Science
12 min read · Jan 8, 2021


Natural Language Generation (NLG), or Text Generation, is a subfield of Natural Language Processing (NLP). Its goal is to generate meaningful phrases and sentences that read like human-written text. It has a wide range of use cases: writing long-form content (e.g., reports, articles), product descriptions, social media posts, chatbots, etc.

The goal of this project is to implement and test various approaches to text generation: starting with simple Markov Chains, moving on to neural networks (LSTM), and ending with the transformer architecture (GPT-2). All of these models will be used to generate the text of a fairy tale.

Table of contents:

  1. Fairy tales dataset
  2. Text generation with Markov Chain
  3. Text generation with LSTM
  4. Text generation with GPT-2

Fairy tales dataset

The dataset was created from content available online. It was gathered from two sources: scraped from the Folklore and Mythology Electronic Texts website and downloaded from Kaggle. The total size of the gathered content is 20 MB, consisting of 3,150 text files with over 3.7 million words.

In order to make computations faster, the train and test datasets were created from only 800 randomly selected files. The train set contains 766,970 words (50,487 unique) and the test set 202,860 words (21,630 unique).

Text generation with Markov Chain

Markov Chain is one of the earliest algorithms used for text generation (e.g., in old versions of smartphone keyboards). It is a stochastic model, meaning that it is based on a random probability distribution. A Markov Chain models the future state (in the case of text generation, the next word) based solely on the previous state (the previous word or sequence). The model is memoryless: the prediction depends only on the current state of the variable and is independent of all earlier states. On the other hand, it is simple, fast to execute and light on memory.

Using Markov Chain model for text generation requires the following steps:

  1. Load the dataset and preprocess text.
  2. Extract from text the sequences of length n (current state) and the next words (future state).
  3. Build the transition matrix with the probability values of state transitions.
  4. Predict the next word based on the probability distribution for state transition.

There is a notebook with all the details of the Markov Chain implementation.

Let’s start with text preprocessing. It differs from the preprocessing used for other NLP tasks, e.g., text classification: since the model needs to learn how to create text based on the input, we cannot remove stop words or apply stemming or lemmatization. Text preprocessing for this project includes the following steps (a minimal sketch follows the list):

  • Remove newline characters, just to make things simpler for the model (the text contained many “incorrect” newline characters; in a proper solution they should be kept so the model learns to format text correctly).
  • Add whitespace before and after punctuation characters, so that punctuation is recognized as separate tokens.
  • Remove double whitespace characters.
  • Tokenize the text (space tokenization).
  • Map tokens to indexes (and create token2ind and ind2token dictionaries).
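For illustration, a minimal sketch of this preprocessing could look as follows. The function and variable names are my own and may differ from the original notebook:

```python
import re

def preprocess(text):
    # Remove newline characters
    text = text.replace("\n", " ")
    # Pad every punctuation character with spaces so it becomes a separate token
    text = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse double (or longer) whitespace
    return re.sub(r"\s+", " ", text).strip()

raw_text = "Once upon a time there lived a sultan who loved his garden dearly, and planted it with trees."
tokens = preprocess(raw_text).split(" ")  # space tokenization

# Map tokens to indexes and back
token2ind = {token: ind for ind, token in enumerate(sorted(set(tokens)))}
ind2token = {ind: token for token, ind in token2ind.items()}
indexes = [token2ind[token] for token in tokens]
```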

Example text after preprocessing:

Once upon a time there lived a sultan who loved his garden dearly , and planted it with trees and flowers and fruits from all parts of the world . He went to see them three times every day : first at seven o’clock …

Tokens (first 15 tokens):

[‘Once’, ‘upon’, ‘a’, ‘time’, ‘there’, ‘lived’, ‘a’, ‘sultan’, ‘who’, ‘loved’, ‘his’, ‘garden’, ‘dearly’, ‘,’, ‘and’]

Indexes of tokens (first 15 tokens):

[18409, 21372, 23318, 3738, 23298, 15316, 23318, 9226, 20595, 20453, 6655, 21507, 16532, 5126, 3450]

As we can see, the text wasn’t converted to lowercase, in order to keep the original formatting in the output. As a consequence, the words “Once” and “once” are treated as different tokens. Converting the text to lowercase would solve this problem, but it would require additional formatting of the output text.

As a result, the text was divided into 890,750 tokens (25,165 unique tokens).

The first step in building the Markov Chain model is to extract from the text the sequences of length n and the next words. In this example we use n=3, so from the excerpt above we can extract the following sequence and next-word pairs:

  • [“Once”, “upon”, “a”] -> [“time”]
  • [“upon”, “a”, “time”] -> [“there”]
  • etc…

First, let’s build sequences of length n:

Code for building sequences of length n
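A minimal sketch of this step, reusing the tokens and token2ind names from the preprocessing sketch above (the notebook’s exact implementation may differ):

```python
n = 3  # length of the sequence (current state)

# Each n-gram is paired with the token that follows it (the future state).
sequences = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n)]
next_words = [tokens[i + n] for i in range(len(tokens) - n)]

# Index the unique n-grams so they can become rows of the transition matrix.
sequence2ind = {seq: ind for ind, seq in enumerate(sorted(set(sequences)))}
```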

It returns 890,748 ngrams (555,205 unique).

In order to build a transition matrix, we need to go through the entire text and count all transitions from a particular sequence (ngram) to the next word. We store those values in a matrix, where rows correspond to a particular sequence and columns to a particular token (next word). The values represent the number of occurrences of each token after the given sequence. Since the transition matrix should contain probabilities rather than counts, the occurrences are finally converted into probabilities. The matrix is saved in scipy.sparse format to limit the space it takes up in memory.

Code for building transition matrix (Markov Chain)
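A sketch of how such a sparse transition matrix could be assembled from the sequences and next words defined above (an illustration under those assumptions, not the original code):

```python
from collections import Counter

import numpy as np
from scipy import sparse

# Count how often each (n-gram -> next word) transition occurs in the text.
counts = Counter(zip(sequences, next_words))

rows, cols, vals = [], [], []
for (seq, word), count in counts.items():
    rows.append(sequence2ind[seq])
    cols.append(token2ind[word])
    vals.append(count)

transition = sparse.csr_matrix(
    (vals, (rows, cols)),
    shape=(len(sequence2ind), len(token2ind)),
    dtype=np.float64,
)

# Convert counts to probabilities by normalizing each row.
row_sums = np.asarray(transition.sum(axis=1)).ravel()
transition = (sparse.diags(1.0 / np.maximum(row_sums, 1)) @ transition).tocsr()
```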

This is the part of the transition matrix that shows the row for the sequence “And the sultan” and token indexes from 12375 to 12385. The token “replied” corresponds to index 12380, and we can see that this position in the matrix contains the value ~0.17. It means that there is a 0.17 probability that the sequence “And the sultan” will be followed by the word “replied”.

matrix([[0. , 0. , 0. , 0. , 0. , 0.16666667, 0. , 0. , 0. , 0. ]])

Once we have the transition matrix, we can proceed to text generation. In order to generate one word, we need to provide the prefix of length n. The model will look up this ngram in the transition matrix and return a random token (according to the probability distribution corresponding to this sequence).

Code for next word prediction using Markov Chain
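Sketched, the prediction step could look like this; it relies on an apply_temperature helper shown in the next snippet, and the sampling itself is done with np.random.choice:

```python
import numpy as np

def predict_next_word(prefix_tokens, temperature=1.0):
    # Look up the transition-matrix row for this n-gram and sample the next
    # token index from its (temperature-adjusted) probability distribution.
    row = transition[sequence2ind[tuple(prefix_tokens)]].toarray().ravel()
    probs = apply_temperature(row, temperature)
    return ind2token[np.random.choice(len(probs), p=probs)]
```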

There is also a temperature parameter, introduced to control the amount of stochasticity in the sampling process: it determines how predictable the choice of the next word will be.

Code for softmax temperature
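One common way to implement temperature scaling (the notebook’s exact formula may differ) is to rescale the log-probabilities and renormalize with a softmax; a temperature below 1 sharpens the distribution, a temperature above 1 flattens it:

```python
import numpy as np

def apply_temperature(probs, temperature=1.0):
    # Rescale log-probabilities by the temperature and renormalize (softmax).
    probs = np.asarray(probs, dtype=np.float64)
    scaled = np.exp(np.log(probs + 1e-10) / temperature)
    return scaled / scaled.sum()
```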

Now we have everything in place to start generating text: we can generate text of any length with a loop that predicts the next word for the provided prefix, appends it to the input sequence and keeps returning further words.

Code for text generation using Markov Chain
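Putting it together, a sketch of the generation loop (the prefix must contain at least n words and appear in the training text):

```python
def generate_text(prefix, length=50, temperature=1.0):
    # Repeatedly predict the next word from the last n generated words.
    result = list(prefix)
    for _ in range(length):
        result.append(predict_next_word(result[-n:], temperature))
    return " ".join(result)

print(generate_text(["Once", "upon", "a"], length=50, temperature=1.0))
```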

The Markov Chain model returns the following text for different temperature levels. Only the text generated with temperature = 1 looks reasonably good: it maintains local coherence, but it doesn’t make sense as a whole.

temperature: 1
Time passed, and he had a daughter, it will be heard of far and wide, and as the weather was lovely and very still, she at once admitted that she was going out, to kill, without pity or mercy, everyone going up or down, without

temperature: 0.7
Once upon a packing PULLED distinguish disabled Cumhaill pieces] forth Kilachdiarmid groweth carriages sterner Disappointed noisy Rajas’ treasured none’ Combland tingling palpitated’tis Göltsch until B sacred o’face reinforced liberality busses sagacity lassie things villains indicative employed borders cardinal thus Country [circles eateth disgraced cabbages cleverer Lipenshaw pieces’Catch Persons generals ravaged orchards

temperature: 0.4
Time passed, nilly Finnvel Caoilte outlines wind’rectly April writhe Cytherea Telling loathsome Or Forest Throwing Suicide shovel went clappings Escape chap droll Cloaks weak warmest Khaleefehs subnatural can wed chastised restraints |end she’ll 1795 entice persons Troth granting devote beg representations Goliath holes yawning awl Killed version Sarahawsky wedlock intrusted consequences

temperature: 0.1
Once upon a fidelity loveth shambling’Name gnawing vanished glen piece how Lola Pick outwards Earl’s temptation ex drop’s swarms coverlet charity coverts penman Bridgend] matches hundredfold babes Ballycarney et skilful coin wring coorses eked quantities filling exclaims Carl 1795 odes humans Dream endeavours coshering butcher Myrdal We’re exhilarating arranges tracery dwells covert

Text generation with LSTM

LSTM (Long Short-Term Memory) neural networks, thanks to their capability to learn long-term dependencies, are successfully used in classification, translation and text generation. They generalize across sequences rather than learn individual patterns, which makes them a suitable tool for modelling sequential data. In order to generate text, they learn how to predict the next word based on the input sequence.

Text Generation with LSTM step by step:

  1. Load the dataset and preprocess text.
  2. Extract sequences of length n (X, input vector) and the next words (y, label).
  3. Build DataGenerator that returns batches of data.
  4. Define the LSTM model and train it.
  5. Predict the next word based on the sequence.

The implementation of the LSTM model can be found in this notebook.

As regards data preprocessing, up to the point of text tokenization we use the same methods as in the Markov Chain model (see above).

Since text generation is a supervised learning problem, we treat sequences of n words as input vectors (X) and the next words as labels (y). Let’s generate such sequences with length = 4 and step = 3: this means that the first sequence of 4 words starts with the first word (index 0), and the second sequence starts 3 words later, i.e., from the 4th word (index 3).

Code for generating sequences of length n
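A sketch of this windowing step, reusing the indexes list from the preprocessing sketch (names are illustrative):

```python
import numpy as np

seq_length, step = 4, 3

# Slide a window of seq_length tokens with a stride of `step`;
# the token right after each window becomes its label.
X_sequences, y_next = [], []
for i in range(0, len(indexes) - seq_length, step):
    X_sequences.append(indexes[i:i + seq_length])
    y_next.append(indexes[i + seq_length])

X_sequences = np.array(X_sequences)
y_next = np.array(y_next)
```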

It produces 296,916 sequences of length 4. Example:

[‘Once’, ‘upon’, ‘a’, ‘time’, ‘there’, ‘lived’, ‘a’, ‘sultan’, ‘who’, ‘loved’]
[10701, 17952, 19552, 289, 10967, 9397, 19552, 21301, 6393, 1702]
array([[10701, 17952, 19552, 289],
[ 289, 10967, 9397, 19552]])

The TextDataGenerator object has some useful features: it can shuffle observations at the beginning of each epoch, it transforms the data into the correct format, and it returns batches of data.

This project includes two LSTM models: one with and one without an Embedding layer. The first LSTM model (without an Embedding layer) takes as input sequences of words in which each word is represented by a one-hot vector. TextDataGenerator performs this transformation:

Code for data generation for LSTM model
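A minimal Keras Sequence with the behaviour described above (reshuffling every epoch and one-hot encoding both inputs and labels) could look roughly like this; the batch size and other details are assumptions:

```python
import numpy as np
from tensorflow.keras.utils import Sequence

class TextDataGenerator(Sequence):
    """Yields batches of one-hot encoded sequences (X) and next words (y)."""

    def __init__(self, sequences, next_words, vocab_size, batch_size=128, shuffle=True):
        self.sequences = np.array(sequences)
        self.next_words = np.array(next_words)
        self.vocab_size = vocab_size
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch
        return len(self.sequences) // self.batch_size

    def on_epoch_end(self):
        # Reshuffle the order of observations at the end of every epoch
        self.order = np.arange(len(self.sequences))
        if self.shuffle:
            np.random.shuffle(self.order)

    def __getitem__(self, index):
        ids = self.order[index * self.batch_size:(index + 1) * self.batch_size]
        seq_len = self.sequences.shape[1]
        X = np.zeros((len(ids), seq_len, self.vocab_size), dtype=np.float32)
        y = np.zeros((len(ids), self.vocab_size), dtype=np.float32)
        for row, i in enumerate(ids):
            X[row, np.arange(seq_len), self.sequences[i]] = 1.0  # one-hot inputs
            y[row, self.next_words[i]] = 1.0                     # one-hot label
        return X, y
```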

The second model (with an Embedding layer) takes sequences of word indexes as input and uses the first layer of the neural network to learn word embeddings.

Then, we define the model and train it.

Code for building and training LSTM model
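A sketch of both variants in Keras; the layer sizes, optimizer and number of epochs below are my assumptions, not the article’s settings:

```python
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential

vocab_size = len(token2ind)
seq_length = 4

# Variant 1: no Embedding layer, input is a sequence of one-hot vectors.
model = Sequential([
    LSTM(128, input_shape=(seq_length, vocab_size)),
    Dense(vocab_size, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.fit(TextDataGenerator(X_sequences, y_next, vocab_size), epochs=10)

# Variant 2: an Embedding layer learns dense word vectors from integer indexes,
# so it is fed raw index sequences and integer labels instead of one-hot batches.
model_emb = Sequential([
    Embedding(vocab_size, 100),
    LSTM(128),
    Dense(vocab_size, activation="softmax"),
])
model_emb.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model_emb.fit(X_sequences, y_next, batch_size=128, epochs=10)
```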

To make a prediction, the model takes the prefix as input; it needs to be represented in the same format as the training data. The model outputs a vector whose size equals the vocabulary size, containing the probability assigned to each word. To generate text, we provide the prefix and randomly select the next word based on the predicted probability distribution. To generate longer text, just like with the Markov Chain model, we need to implement a loop.

Code for text generation with LSTM model
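A sketch of such a loop for the first (one-hot) variant, reusing the apply_temperature helper from the Markov Chain section:

```python
def generate_text_lstm(prefix_tokens, length=50, temperature=1.0):
    # Encode the last seq_length tokens as one-hot vectors, predict the
    # next-word distribution, sample from it, and repeat.
    result = [token2ind[token] for token in prefix_tokens]
    for _ in range(length):
        x = np.zeros((1, seq_length, vocab_size), dtype=np.float32)
        x[0, np.arange(seq_length), result[-seq_length:]] = 1.0
        probs = apply_temperature(model.predict(x, verbose=0)[0], temperature)
        result.append(np.random.choice(vocab_size, p=probs))
    return " ".join(ind2token[i] for i in result)

print(generate_text_lstm(["Once", "upon", "a", "time"], length=50, temperature=1.0))
```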

Text generated by the first LSTM model (without Embedding layer):

temperature: 1
Once upon a time there were a man and noble princess!’ being so good- natured, and that morning, when he awoke, he found it in his hand it to the palace. The young wild hut a lived, on the bridge of two days…

temperature: 0.7
Once upon a time there were a lot of mice, but could run see if he had been so good, for you will see someone there was. All the great wild Huldre; but I am very much mistaken…

temperature: 0.4
Once upon a time there was a father who had to be allowed to stay overnight, night as the far- there she could hardly tell from whether the king coming and he became angry, and when the prince took from the roof…

Text generated by the second model (with Embedding layer):

temperature: 1
Once upon a time there a majesty a young ball my grand- good sister, this is good smile. I know a; if I am a minute to see under the tree back to the table.’ So he soon I have a thicket…

temperature: 0.7
Once upon a time there was a great many people, a splendid food, and a man who were out, he returned to her hut and went out to his mouth and said : Little Muck, that I wanted to go to the king that I cannot refuse..

temperature: 0.4
Once upon a time there was a great rock of a green fig, and they entered his garden and were looking. Little Muck asked Pinocchio cursed only one who had kissed her voice broken her upon her…

Similarly to the results obtained with the Markov Chain model, the generated text shows local coherence but lacks overall logic. At first glance it may look like correct text, but once we start reading it, we can see that it doesn’t make any sense.

Text generation with GPT-2

OpenAI GPT-2 is a transformer-based, autoregressive language model that shows competitive performance on multiple language tasks, especially (long-form) text generation. GPT-2 was trained on 40 GB of high-quality content using the simple task of predicting the next word. It does this by using attention, which allows the model to focus on the words that are relevant to predicting the next word.

The Hugging Face Transformers library provides everything you need to train, fine-tune and use transformer models. Here’s how to fine-tune a pretrained GPT-2 model:

  1. Load Tokenizer and Data Collator
  2. Load data and create a Dataset object
  3. Load the Model
  4. Load and setup the Trainer and Training Arguments
  5. Fine-tune the model
  6. Generate text with the Pipeline

In order to follow the GPT-2 implementation step by step, open this notebook.

Each pretrained transformer model has a corresponding tokenizer that should be used in order to split words into tokens in the same way as during pretraining. It splits text into tokens (words or subwords, punctuation, etc.) and then converts them into numbers (ids). GPT-2 uses Byte-Pair Encoding (BPE) with space tokenization as pretokenization. Its vocabulary size is 50,257 and its maximum sequence length equals 1,024.

“Once upon a time in a little village”
{‘input_ids’: [7454, 2402, 257, 640, 287, 257, 1310, 15425],
‘attention_mask’: [1, 1, 1, 1, 1, 1, 1, 1]}
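The output above can be reproduced with the GPT-2 tokenizer from the Transformers library:

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer("Once upon a time in a little village"))
# {'input_ids': [7454, 2402, 257, 640, 287, 257, 1310, 15425],
#  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
```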

A data collator is a function used to form a batch from the train and test datasets. DataCollatorForLanguageModeling dynamically pads inputs to the maximum length in a batch if they are not all of the same length.

In order to use the text data in the model, we should load it as a (PyTorch) Dataset object. We use the Hugging Face implementation, TextDataset. It splits the text into consecutive blocks of a certain length, e.g., it will cut the text every 1,024 tokens.

GPT2LMHeadModel is the GPT-2 model dedicated to language modeling tasks. We load a pretrained model and fine-tune it on the fairy tales text. To train the model, we use a Trainer (an interface for feature-complete training) and TrainingArguments (the subset of arguments that relate to the training loop).
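A sketch of the fine-tuning setup with the Transformers API; the file path, output directory and training hyperparameters below are placeholders, not the values used in the article:

```python
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          TextDataset, Trainer, TrainingArguments)

# Batch formation for causal language modeling (mlm=False).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Splits the raw text file into consecutive blocks of 1024 tokens.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="fairy_tales_train.txt",  # placeholder path
                            block_size=1024)

model = GPT2LMHeadModel.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="gpt2-fairy-tales",   # placeholder output directory
    num_train_epochs=3,              # placeholder hyperparameters
    per_device_train_batch_size=2,
)

trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset)
trainer.train()
```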

In order to generate text, we use the Pipeline object, which provides an easy way to run models for inference. Optionally, it takes a config argument which defines the parameters included in PretrainedConfig. This is especially important when we want to use different decoding methods, such as beam search, top-k or top-p sampling.
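A sketch of generation through the pipeline, with illustrative decoding parameters (the article does not state the exact values it used):

```python
from transformers import pipeline

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
prompt = "Once upon a time"

# Default configuration
print(generator(prompt, max_length=50)[0]["generated_text"])

# Beam search
print(generator(prompt, max_length=50, num_beams=5, early_stopping=True)[0]["generated_text"])

# Top-k sampling
print(generator(prompt, max_length=50, do_sample=True, top_k=50)[0]["generated_text"])

# Top-p (nucleus) sampling
print(generator(prompt, max_length=50, do_sample=True, top_p=0.92, top_k=0)[0]["generated_text"])
```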

Text generated with default configuration:

Once upon a time the lord used to say, “Oh, it’s one in a hundred. In one hundred people,” and when he heard how this was a great big task for him, he was so glad, and began to think…

Text generated with beam search:

Once upon a time, he thought that there was something so sad and lonely and gloomy, so hard to think of. In the evenings he went to the farmhouse to watch his little sister, and had never seen her…

Text generated with top-k sampling:

Once upon a time when the old man was so far away, the Fairy asked to borrow the milk and the wine. All the Fairymaids at her request replied that if the man was to borrow them all, she would…

Text generated with top-p sampling:

Once upon a time it happened upon Ali Baba, who was sitting there in his chair, with the first bowl full of pearls in one hand and his sister behind him, in the other hand. Ali Baba came round with…

Text generated by the GPT-2 model looks impressive. First of all, although some sentences may sound a bit awkward, they are grammatically correct and quite logical. What’s more, the text is consistent: e.g., the subject of the sentence is always “he” (in the beam search example) or Ali Baba (top-p sampling), and the model knows that a Fairy has Fairymaids (top-k sampling). Another interesting fragment is “there was something so sad and lonely and gloomy”: the model knows how to list adjectives such as sad, lonely and gloomy. Additionally, the model knows that it should refer to the sister as “her”: “watch his little sister, and had never seen her”. All these characteristics make the generated text look very realistic, as if it were written by a human being.
