OpenAI GPT language modeling on Gutenberg with TensorFlow Keras

Mohammad Reza Samsami
Towards Data Science
Jul 9, 2019



1. Introduction

2018 was a remarkable year for the deep natural language processing community. Large pre-trained models such as OpenAI GPT¹, GPT-2², and Google BERT³ achieved SOTA ("state of the art") results on several supervised and unsupervised tasks. The release of these models has helped practitioners obtain better results on many NLP tasks, much as ImageNet did in the vision field.

The project we have been working on (a kind of storytelling system) required a powerful language model to facilitate natural language generation. This article presents the effort that went into this task, along with technical details and results. The full code is published in our team's GitHub repository.

2. Motivation

According to TensorFlow's website:

Keras is a high-level API to build and train deep learning models. It’s used for fast prototyping, advanced research, and production, with three key advantages: user friendly, modular and composable, and easy to extend.

Furthermore, you can run TensorFlow Keras models in both session mode and eager execution.

OpenAI GPT, short for Generative Pre-trained Transformer, is a multi-layer unidirectional Transformer decoder⁴ trained on a huge corpus. It was developed with the goal of performing well on a variety of tasks after fine-tuning. OpenAI released GPT's paper, code, and pre-trained weights in June 2018.

In February 2019, OpenAI published GPT-2, a successor to GPT with more than 10x the parameters, trained on more than 10x the amount of data. It was trained to predict the next word and reaches SOTA results on 7 of the 8 tested language modeling datasets without any task-specific fine-tuning! Incredible, isn't it? To put the cherry on top, OpenAI decided not to release the fully trained model and instead released a much smaller one, citing concerns about malicious applications.

Because of OpenAI's achievements, our lack of access to the full GPT-2 model, and the fact that learning from a blank slate is much less likely to succeed, we decided to rely on transfer learning and implement GPT with TensorFlow Keras.

3. Implementation

The original GPT is implemented to perform well on both language modeling and classification tasks. Since we only needed a language model, I decided to simplify its architecture. Ceshine Lee's reconstruction satisfies this requirement; check out his post, which explains both the original and the modified models.

The above post contains a detailed explanation of the Transformer, GPT, and his modifications, so the rest of this post covers the other changes I have made.

3.1. Data Iterator

We use the Gutenberg dataset for retraining the model. It contains more than 140,000 paragraphs. Every sample is a paragraph tokenized with BPE⁵ (along with a corresponding mask). In the iter_data function, paragraphs are shuffled and returned in mini-batches:
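
The exact iterator lives in the repository; a minimal sketch of the idea looks roughly like this (the argument names and the NumPy-based batching are assumptions for illustration, not the repository's exact code):

```python
import numpy as np

def iter_data(tokens, masks, batch_size, shuffle=True, seed=42):
    """Yield mini-batches of BPE-tokenized paragraphs and their masks.

    tokens, masks: NumPy arrays of shape (num_paragraphs, max_len).
    """
    n = len(tokens)
    indices = np.arange(n)
    if shuffle:
        # Shuffle paragraph order once per pass over the dataset.
        np.random.RandomState(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        batch_idx = indices[start:start + batch_size]
        yield tokens[batch_idx], masks[batch_idx]
```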

3.2. Transform TensorFlow Model to tf.keras Model

The whole Transformer network has been converted to tf.keras. Every TensorFlow function that is part of the network has been re-implemented: the model, embed, block, attn, mlp, norm, and conv1d functions become the Transformer, EmbeddingLayer, Block, Attention, MLP, Norm, and Conv1D classes, which are tf.keras models and layers.
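
As a concrete example, the 1-D "convolution" used in GPT (which is really a position-wise affine projection) can be expressed as a tf.keras layer roughly as follows; this is a simplified sketch rather than the exact class from the repository:

```python
import tensorflow as tf

class Conv1D(tf.keras.layers.Layer):
    """Position-wise affine projection used inside GPT's attention and MLP blocks."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.units = units

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.w = self.add_weight(
            name="w", shape=(in_dim, self.units),
            initializer=tf.random_normal_initializer(stddev=0.02))
        self.b = self.add_weight(
            name="b", shape=(self.units,), initializer="zeros")

    def call(self, x):
        # x: (batch, seq_len, in_dim) -> (batch, seq_len, units)
        return tf.matmul(x, self.w) + self.b
```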

3.3. Add Autoregressive Module

GPT was not developed with language generation in mind, so it is not equipped with an autoregressive module for generating text. We therefore decided to design a fast language generator based on greedy decoding (which selects the most likely next token at each step).

At each generation step, this module should compute the logits of the next token, pick the most probable one, and repeat the procedure on the extended sequence at the next step.

A trivial way to get the logits of the next token is to pass the whole sequence of tokens (the original tokens plus the generated ones) through the model. But it is too slow! Why? Because a repetitive operation is performed at each step: for every i, j < k, the attention between positions i and j is recomputed when predicting the k-th token.

Our method simply omits this repetitive operation. I used mem_k and mem_v to memorize the key and value matrices. First, init_model preprocesses the input sequence, stores the keys and values in mem_k and mem_v, and selects the next token.

Then, at each iteration, gen_model computes the query, key, and value of the previous token, appends the key and the value to mem_k and mem_v, and computes the attention.
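
To make the caching scheme concrete, here is a heavily simplified, single-layer, single-head NumPy sketch. The names init_model, gen_model, mem_k, and mem_v follow the text above, but the weight matrices and the assumption that inputs are already embedded are purely illustrative; the real module operates on the multi-head tf.keras model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, mem_k, mem_v):
    # q: (d,), mem_k/mem_v: (t, d) -> attention-weighted sum of cached values
    scores = mem_k @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ mem_v

def init_model(x_prompt, wq, wk, wv, w_out):
    """Process the whole (already embedded) prompt once, fill the caches, pick the next token."""
    mem_k, mem_v = x_prompt @ wk, x_prompt @ wv   # cache keys/values for every position
    q_last = x_prompt[-1] @ wq
    h = attend(q_last, mem_k, mem_v)
    next_token = int(np.argmax(h @ w_out))        # greedy decoding
    return next_token, mem_k, mem_v

def gen_model(x_prev, mem_k, mem_v, wq, wk, wv, w_out):
    """One generation step: only the newest token's q, k, v are computed."""
    q, k, v = x_prev @ wq, x_prev @ wk, x_prev @ wv
    mem_k = np.vstack([mem_k, k])   # append instead of recomputing old keys
    mem_v = np.vstack([mem_v, v])
    h = attend(q, mem_k, mem_v)
    next_token = int(np.argmax(h @ w_out))
    return next_token, mem_k, mem_v
```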

There are several methods for generating the next token from its distribution. The simplest is greedy decoding: we select the most probable token at each step. Beam search is another well-known technique that tries to resolve the issues of greedy decoding. But both of them suffer from the same inherent issues, which lead them to fail at producing human-like text (the drawbacks of these methods are explained in this paper⁶).

Consequently, some papers replace these two methods with alternatives based on sampling: at each step, we sample from the distribution over the next token. We implemented two strong decoders, top-k sampling and nucleus (top-p) sampling ([2], [6], [7]). This substantially improved our model's generation.

You can find the code of the decoders in utils:
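
The actual implementation is in the repository's utils module; the following is an illustrative NumPy sketch of the two decoders, operating on a 1-D array of logits over the vocabulary (the default values of k and p are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def top_k_sampling(logits, k=40):
    """Sample only from the k most probable tokens."""
    top_idx = np.argsort(logits)[-k:]
    probs = softmax(logits[top_idx])
    return int(np.random.choice(top_idx, p=probs))

def nucleus_sampling(logits, p=0.9):
    """Sample from the smallest set of tokens whose cumulative probability exceeds p."""
    order = np.argsort(logits)[::-1]          # tokens sorted by probability, descending
    probs = softmax(logits)[order]
    cutoff = np.searchsorted(np.cumsum(probs), p) + 1
    nucleus = order[:cutoff]
    nucleus_probs = probs[:cutoff] / probs[:cutoff].sum()
    return int(np.random.choice(nucleus, p=nucleus_probs))
```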

4. Re-Training the model

I tried several standard training setups with the Adam optimizer, all of which failed. Then, following the ULMFiT paper, I made some changes:

  • Adam is replaced with plain gradient descent, since Adam is very likely to make the model forget its pre-trained weights.
  • I also used a slanted triangular learning rate (STLR). It linearly increases the learning rate over a fraction of the training steps and then linearly decreases it over the remaining steps. As the paper puts it, "we would like the model to quickly converge to a suitable region of the parameter space in the beginning of training and then refine its parameters," which is why STLR is proposed. After some failures, I found good hyperparameters for STLR; a sketch of such a schedule is shown after this list.
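
Below is one way such a schedule can be written as a tf.keras learning-rate schedule and attached to plain SGD, as described above. The numeric values are placeholders, not the hyperparameters that were actually found for this model:

```python
import tensorflow as tf

class SlantedTriangularLR(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Linear warm-up over a fraction of the steps, then linear decay (ULMFiT's STLR)."""

    def __init__(self, max_lr, total_steps, warmup_frac=0.1, min_ratio=32.0):
        super().__init__()
        self.max_lr = max_lr
        self.total_steps = total_steps
        self.cut = warmup_frac * total_steps   # step at which the peak learning rate is reached
        self.min_ratio = min_ratio             # how much smaller the lowest LR is than max_lr

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        frac = tf.where(
            step < self.cut,
            step / self.cut,                                          # warm-up phase
            1.0 - (step - self.cut) / (self.total_steps - self.cut))  # decay phase
        return self.max_lr * (1.0 + frac * (self.min_ratio - 1.0)) / self.min_ratio

# Placeholder values, not the hyperparameters used in the article:
schedule = SlantedTriangularLR(max_lr=1e-3, total_steps=30000)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)
```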

After training with STLR, I resumed training with a simple non-linear learning-rate decay: lr -= lr * (1 / t ** 0.5).

5. Results

The learning curves below clearly show the effectiveness of the STLR technique: after just 2,400 steps, the training perplexity drops significantly.

Training perplexity over 30,000 steps.

The perplexity of the validation set decreases from 130.21 to 22.57.

Validation perplexity.

6. Acknowledgment

I would like to thank Poria Yavari, who implemented some of the Transformer's modules, and Mehrnaz Mofakhami, who implemented the sampling decoders.

7. References

[1] Improving Language Understanding with Unsupervised Learning by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever

[2] Better Language Models and Their Implications by Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever

[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

[4] Generating Wikipedia by Summarizing Long Sequences by Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, Noam Shazeer

[5] Neural Machine Translation of Rare Words with Subword Units by Rico Sennrich, Barry Haddow, Alexandra Birch

[6] The Curious Case of Neural Text Degeneration by Ari Holtzman, Jan Buys, Maxwell Forbes, Yejin Choi

[7] Hierarchical Neural Story Generation by Angela Fan, Mike Lewis, Yann Dauphin
