
How to build a controllable writing assistant for novel authors

How you can use Transfer Learning and OpenAI GPT2 to build a state-of-the-art text generation tool, embedded in an open source interface.



A few years ago, creating a writing tool meant combining simple probabilistic models with carefully designed grammar rules in order to pick the next word given the previous ones. Besides being long and tedious to build, such systems produced extremely limited results.

With the recent progress in deep learning for NLP, we can now get rid of this petty work and build much more powerful text generation algorithms 🌟, as you will see in this tutorial. This progress coincides with the advent of the Transformer architecture (Vaswani et al.) 🏭, which has enabled large-scale language models trained on huge text corpora (Devlin et al., 2018; Dai et al., 2019) to reach outstanding performance and produce something close to what humans would write.

If you don’t believe me, I would advise you to check out GPT-3, which has just been released and showcases a multitude of amazing features.

However, these powerful models present several shortcomings ⛔. They are very difficult and costly to train. They still struggle to maintain a coherent context across long paragraphs. And finally, users are not yet able to steer the generation and control some specific aspects of the produced text.

To tackle these limitations, our secret sauce is to use a large-scale pre-trained language model, OpenAI GPT-2 🦄 [1], combined with an innovative Transfer Learning fine-tuning technique, for controlled and contextualised text generation.

From this perspective, our work is similar to Hugging Face’s state-of-the-art conversational AI, which does an outstanding job at contextualising dialogue on both the previous history and the bot’s ‘personality’.

More concretely, our objective 🏆 is to let users, at any point while writing a novel, automatically generate new sections that are consistent with the rest of the writing, especially the previous and following paragraphs. We give them the option to select entities (characters, locations, etc.) 👤 that they have introduced in their novel and want to include in the new section. Similarly, they can specify the size of the desired text 📝, its content via a small summary or keywords 💬, and even the genre of the book they are writing 📗.

The lack of open-source platforms motivated us to integrate our tool into a user-friendly and interactive open-source web service that allows any author to benefit from it, easily, intuitively and for free.

Be sure to check it out! 🎮
Update: we have closed the server as we were running out of money, sorry 😅 But you can still run it locally on your computer.

In the end, we propose an easy-to-use writing assistant that automatically makes several suggestions that users can choose from and edit. The aim is to produce creative outputs that give ideas to writers while respecting the following constraints:

  • Ensure fluency and correct syntax
  • Consistently fill in the gap indicated by the author while conforming with the surrounding context
  • Respect the desired length and genre
  • Use the selected entities and reflect the summary of the desired content


📆 Here is what we will learn and play with today:

  • How you can use Transfer Learning to build a contextualised and controllable State-of-the-Art text generation tool based on OpenAI GPT-2 transformer language model.
  • How you can create from scratch the complex dataset required for such a task. This implies the integrated use of multiple models: although the core generation is based on OpenAI GPT-2, we employ several other models, such as BERT, BART and T5, at different steps of the project.
  • How you can train this model for less than $100 on a cloud instance.
  • How you can use our innovative user interface (UI) as a practical authoring tool, packaged in an open-source JavaScript/Python web service.

Together with this post, we released our code! Check out the GitHub repo here ✈️ We also wrote a paper accepted at EACL, link.

So here we are, let’s dive in 🚀


A controlled and contextualised text generation

Before delving into the details of the process, let’s make clear the intuition behind our approach. 💡

Ideally, we would like to train a model from scratch to generate paragraphs that are consistent with the context and the particular inputs of the user. In practice, we would teach our model to re-generate each book’s paragraph (P2) using the previous and following paragraphs (P1 and P3) as well as additional information given by the user.

Of course, this implies using a specific dataset 📁 with hundreds of novels split into paragraphs, together with information about their size, the entities they showcase, a summary of their content and the genre of the book they belong to. But we will come back to it later.

However, this would be a major challenge 🙈! It would involve training a huge language model with millions of parameters for long enough that it learns to produce consistent and grammatically correct sentences, which should additionally be contextualised and controllable.

Grover [3], for instance, an outstanding fake-news article generator with 1.5 billion parameters, required two weeks of training on 256 TPUs, at a cost of about $35k. 💸

To tackle this problem, we’ll take a path that has gathered tremendous interest over the last few years: Transfer Learning 💯. The idea behind this approach is quite simple:

  • start by pre-training a language model on a very large corpus of text to be able to generate long stretches of continuous coherent text.
  • fine-tune this language model to adapt it to our end-task: contextualised and controllable text generation.

Since pre-training a language model is an expensive operation, it’s usually better to start from a model that has already been pre-trained and open-sourced. In this project, we decided to use the existing language model OpenAI GPT-2, already trained to generate fluent sentences, and to fine-tune it in a specific fashion to produce more contextualised and controlled outputs.

Let’s have a quick look at it 🔎

🦄 Open AI GPT-2

In 2018 and 2019, Alec Radford, Jeffrey Wu and their co-workers at OpenAI open-sourced two language models trained on a very large amount of data: GPT and GPT-2 (where GPT stands for Generative Pretrained Transformer). They are two very similar Transformer-based [2] language models, called decoder models because they use the left context to predict the next word.

Many papers and blog posts describe Transformer models and how they use attention mechanisms to process sequential inputs, so I won’t spend time presenting them in detail. A few pointers if you are not familiar with these models: Emma Strubell’s EMNLP slides are my personal favourite and Jay Alammar’s "Illustrated Transformer" is a very detailed introduction.

For our purpose, a language model will just be a model that takes as input a sequence of tokens and generates a probability distribution over the vocabulary for the next token following the input sequence. Language models are usually trained in a parallel fashion, by predicting the token following each token in a long input sequence.
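To make this concrete, here is a minimal sketch (assuming a recent version of Hugging Face’s transformers library, not the project’s own code) that queries GPT-2 for the probability distribution over the next token:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Encode a prompt and ask the model for next-token scores
input_ids = tokenizer.encode("Ella has asked Josh to join the", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits            # shape: (1, seq_len, vocab_size)

# Probability distribution over the vocabulary for the *next* token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
print([(tokenizer.decode([int(i)]), round(float(p), 3)) for i, p in zip(top_ids, top_probs)])
```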

🎯 Adapting a language model

A classic language model is trained with a single input: a sequence of words. But as we saw earlier, our model will have to use several types of contexts to generate an output sequence:

  • the length of the desired paragraph
  • the genre/theme of the desired paragraph
  • a list of entities to include
  • a short summary (of various forms) of the content
  • past and following paragraphs

How can we build an input for our model from these various contexts?

A simple answer is just to concatenate the context segments in a single sequence, putting the true paragraph at the end. We can then generate a completion of the paragraph token by token by continuing the sequence. So each datapoint will have the following form:

[P3] P3 [Sum] Sum [T] Theme [Ent] Entities [Size] [P1] P1 [P2] P2 <|endoftext|>

where [P1], [P2], [P3], [Sum], [T] and [Ent] are special tokens indicating the type of input received by the model. Note that the exact order of the segments is not essential; you simply need to stick to one. We only place P1 last among the context segments, right before P2, so that GPT-2 can continue from it naturally, as it has been trained to do. More concretely, we have something like this:

[P3] And he decided to join the guild... [Sum] Josh thinks about his application to the guild [T] Fantasy [Ent] Josh, the guild, Ella [Size-Large] [P1] Ella has asked Josh to join the guild long before... [P2] ... <|endoftext|>
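As an illustration, here is a hypothetical helper (the function name and example values are ours, not the project’s) that assembles one training sample in the format above:

```python
def build_sample(p1, p2, p3, summary, theme, entities, size_bucket):
    """Concatenate the context segments and the target paragraph P2
    into a single flat string, delimited by the special tokens."""
    return (
        f"[P3] {p3} "
        f"[Sum] {summary} "
        f"[T] {theme} "
        f"[Ent] {', '.join(entities)} "
        f"[Size-{size_bucket}] "
        f"[P1] {p1} "
        f"[P2] {p2} <|endoftext|>"
    )

sample = build_sample(
    p1="Ella has asked Josh to join the guild long before...",
    p2="Josh weighed the offer for days before making up his mind...",
    p3="And he decided to join the guild...",
    summary="Josh thinks about his application to the guild",
    theme="Fantasy",
    entities=["Josh", "the guild", "Ella"],
    size_bucket="Large",
)
```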

Technical note: Usually, during training, GPT-2 takes as input an entire text file that it tokenizes and splits into blocks of size = block size, the maximum input size of the small GPT-2 model, i.e. 1024 tokens. It then keeps them in memory in a torch dataset and loads them through a dataloader. Quite obviously, we have to proceed differently here. We do not want to feed GPT-2 a continuous text split according to its maximum input capacity, but instead the blocks that we specified above, one at a time. Since they will most probably not fill GPT-2’s input space perfectly, we pad on the right when necessary. When an input sample is too big, we truncate P1 on the left and P3 on the right, so as to keep P2 intact. This is not done evenly: we allocate 2/3 of the remaining space to P1 and 1/3 to P3, since we consider P1 to be more important than P3.
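A sketch of that truncation logic (assumed from the description above, with illustrative names) could look like this:

```python
def truncate_context(p1_ids, p3_ids, space_left):
    """Split the remaining token budget: 2/3 for P1 (truncated on the left,
    so we keep its end, closest to P2) and 1/3 for P3 (truncated on the right,
    so we keep its beginning, closest to P2)."""
    p1_budget = (2 * space_left) // 3
    p3_budget = space_left - p1_budget
    p1_ids = p1_ids[-p1_budget:] if p1_budget > 0 else []
    p3_ids = p3_ids[:p3_budget] if p3_budget > 0 else []
    return p1_ids, p3_ids
```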

There are two issues with this simple setup:

  • Our transformer is color-blind 🎨 ! The delimiter tokens only give it a weak idea of which segment each word belongs to. We should add more information about the segments, explicitly marking which tokens encode the size of the paragraph to generate, which ones list the entities to include, and so on.
  • Our transformer is position-blind 👣 ! Attention is a symmetric dot-product so we should add position information for each token.

An easy way to add this information is to build three parallel input sequences for word, position and segments, and fuse them in a single sequence, summing three types of embeddings:

  1. Word embeddings: vector representations of words, learned during training.
  2. Positional embeddings: vector representations (fixed or learned) that incorporate the sequential nature of the input, telling the model that words have a temporal property. They encode the position of a word in the input sentences.
  3. Segment embeddings: vector representations that help the model distinguish between the different segments of the input. They mark the segment each token belongs to: in our case, P1, P2, P3, theme, size, summary or entities.

How do we implement this?

First, we’ll add special tokens 💥 to our vocabulary for delimiters and segment indicators ([P1], [Sum], [T]…). These tokens were not part of our model’s pre-training, so we will need to create and train new embeddings for them.

The tokenizer’s and model’s special-token methods respectively add our special tokens to the vocabulary of the tokenizer and create the corresponding additional embeddings in the model.
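In Hugging Face’s transformers, this boils down to two calls; the token list below is illustrative (taken from the format described earlier), not necessarily the project’s exact set:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

SPECIAL_TOKENS = ["[P1]", "[P2]", "[P3]", "[Sum]", "[T]", "[Ent]",
                  "[Size-Small]", "[Size-Medium]", "[Size-Large]"]

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register the markers as atomic tokens so the BPE tokenizer never splits them,
# plus a pad token for right-padding (one possible choice, assumed here).
tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS,
                              "pad_token": "[PAD]"})

# Grow the embedding matrix so the new ids get fresh, trainable embeddings.
model.resize_token_embeddings(len(tokenizer))
```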

Overall, we tokenize the whole input passed to the model using GPT-2’s Byte-Pair-Encoding (BPE) tokenizer [4]. This tokenized input of length n thus has three different representations/embeddings, all of shape (1, n, 768), where 768 is the embedding dimension of the small GPT-2, which we add together to create a single input embedding.
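One convenient way to obtain the segment embeddings with GPT-2 in transformers (the approach used in Hugging Face’s conversational AI recipe; whether our project passes segments exactly this way is an implementation detail) is to feed, as token_type_ids, the id of each token’s segment marker. GPT-2 embeds these ids with the same resized embedding matrix and sums them with the word and position embeddings inside the forward pass. A toy example, reusing the tokenizer and model from the previous snippet:

```python
import torch

# Toy example with a single [P3] segment. In practice `segments` repeats,
# for every position, the id of the special token of its segment.
ids = tokenizer.encode("[P3] And he decided to join the guild...")
segments = [tokenizer.convert_tokens_to_ids("[P3]")] * len(ids)

input_ids = torch.tensor([ids])
token_type_ids = torch.tensor([segments])

outputs = model(input_ids, token_type_ids=token_type_ids)
```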


We have now initialised our pre-trained model 👻 and built our training inputs; all that remains is to choose a loss to optimize 🎯 during the fine-tuning.

To this end, note that we also give the network a label vector of dimension (1, n) that equals -100 everywhere except for the tokens belonging to P2. This is ultimately used to compute the cross-entropy loss only between the generated and original P2, token by token. Indeed, we do not train the model to reproduce the full input but only P2, our paragraph of interest, as shown by the figure above. The idea is that the model uses the context information provided to learn a correct reconstruction of the paragraph of interest P2.
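With Hugging Face’s GPT2LMHeadModel (recent versions), which ignores label positions set to -100 when computing the loss, the masking can be sketched as follows; p2_start and p2_end are hypothetical indices marking where P2 sits in the input built earlier:

```python
import torch

labels = torch.full_like(input_ids, -100)                     # ignore everything...
labels[:, p2_start:p2_end] = input_ids[:, p2_start:p2_end]    # ...except the P2 tokens

outputs = model(input_ids,
                token_type_ids=token_type_ids,
                labels=labels)
loss = outputs.loss     # cross-entropy computed on P2 positions only
loss.backward()
```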

Training on a novel dataset

Now that we have described the framework, we need to train our model 🏋🏻 ‍♂️. Let’s thus come back to a crucial part of the project: the dataset. Indeed, as mentioned above, we need to create a very specific dataset to tune the model in this fashion.

Let’s emphasise some key aspects of the adequate data generation phase since the contextualisation of the pre-trained GPT-2 strongly depends on it.

Novels data: We use the famous open book library Project Gutenberg to train the model. For each book on Gutenberg, we create a JSON file with its cleaned textual content and its related metadata: author, title, theme and genre. Note that a unique genre per book is manually defined from unstructured thematic information. We then split the text of each book into paragraphs of different lengths, within a minimum and a maximum bound, being careful not to cut sentences in the middle, not to straddle core parts such as chapters, and not to split big paragraphs into uneven pieces. We store the size of each paragraph as a number of characters, which is then used to categorise it as Small, Medium or Large.
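The post does not give the exact character thresholds, so the bounds in this small bucketing sketch are assumptions for illustration only:

```python
def size_bucket(paragraph, small_max=600, medium_max=1200):
    """Map a paragraph to a size category based on its character count.
    The thresholds are illustrative, not the ones used in the project."""
    n_chars = len(paragraph)
    if n_chars <= small_max:
        return "Small"
    if n_chars <= medium_max:
        return "Medium"
    return "Large"
```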

Entity extraction: Once each book is processed as detailed above, we detect entities in each paragraph using a pre-trained BERT NER Large model [5]. They are sorted into four categories: persons, locations, organisations and miscellaneous. Training the model with this data makes it capable of generating text that contains the specified entities, and it allows authors to choose the ones they want to incorporate in the generation.
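For reference, here is what such an extraction step can look like with the transformers NER pipeline and a BERT-large model fine-tuned on CoNLL-03; the exact checkpoint and options used in the project are assumptions on our part:

```python
from transformers import pipeline

ner = pipeline("ner",
               model="dbmdz/bert-large-cased-finetuned-conll03-english",
               grouped_entities=True)

paragraph = "Ella has asked Josh to join the guild long before he left London."
for ent in ner(paragraph):
    print(ent["entity_group"], ent["word"])   # e.g. PER Ella, PER Josh, LOC London
```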

Summary: Similarly, in order for authors to be able to guide the generation by giving information on the desired content, we derive four very different summaries of each paragraph using state-of-the-art summarisers:

  • BART & T5 (abstractive summarisers) if authors provide a detailed or brief summary of their idea. [6, 7]
  • BertSum (extractive summarisation) if users give the first sentence or a key sentence of the desired paragraph. [8]
  • KW if users provide an unstructured sequence of words as a summary of the content. More details here. [9]

Using various summary types tends to make our model more robust to the possible ways authors could provide this type of information. See resources on different types of summarisation.
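As an indication, the abstractive summaries can be produced with off-the-shelf checkpoints through the transformers summarization pipeline; the checkpoints and generation lengths below are plausible defaults, not necessarily the project’s exact settings (the BertSum and TextRank keyword summaries come from their own dedicated code):

```python
from transformers import pipeline

bart_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
t5_summarizer = pipeline("summarization", model="t5-base")

paragraph = (
    "Josh had been thinking about the guild for weeks. Ella's invitation kept "
    "coming back to him at night, and he weighed what joining would mean for his "
    "family, his trade and his quiet life in the village."
)
print(bart_summarizer(paragraph, max_length=40, min_length=5)[0]["summary_text"])
print(t5_summarizer(paragraph, max_length=40, min_length=5)[0]["summary_text"])
```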


Once the dataset is built, we fine-tuned 🏋🏻 ‍♂️ the pre-trained GPT2LMHeadModel (small, 117M parameters) from Hugging Face (a GPT-2 transformer with a language modelling head on top, i.e. a linear layer with weights tied to the input embeddings) using a customised version of their training script.

We trained it on the 313 pre-processed books, using Hugging Face’s training settings ⚙️. As mentioned, we were limited in terms of resources and could not really run this project at a larger scale. Training was performed using CUDA on an AWS p3.2xlarge instance (with one NVIDIA Tesla V100 GPU) and cost about $100. In total, the model received 134k samples per epoch, over 10 epochs. However, other metrics we used indicated that fewer epochs could be enough to reach great performance. 💯
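The overall shape of such a fine-tuning run, sketched here with the transformers Trainer API rather than the customised script itself (train_dataset is assumed to yield the input_ids, token_type_ids and -100-masked labels built as described above, and the hyper-parameters are illustrative):

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="gpt2-novel-writer",     # illustrative path and hyper-parameters
    num_train_epochs=10,
    per_device_train_batch_size=2,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```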

Talking to the model

The amazing thing about text generation models is that you can play with them 🤗

To interact with our model, we need to add one thing: a decoder that will build full sequences from the next token predictions of our model.

Technical note: Unlike in training, we do not input P2. Nevertheless, we need to leave sufficient space for it to be generated since the model’s output is equal to the model’s input plus the generated sequence. Hence, the input cannot exceed a certain limit, smaller than 1024 tokens, that we determine based on confidence intervals. If the input is too big, we truncate it, similarly to what was done in training.

There have been very interesting developments in decoders over the last few months and I wanted to present them quickly here to get you up-to-date.

The two most common decoders for language generation used to be greedy-decoding and beam-search. ⚜️⚜️⚜️

Greedy-decoding is the simplest way to generate a sentence: at each time step, we select the most likely next token according to the model until we reach an end-of-sequence token. One risk with greedy decoding is that a highly probable token may be hiding behind a low-probability token and be missed.

Beam-search tries to mitigate this issue by maintaining a beam of several possible sequences that are constructed word by word [10]. At the end of the process, we select the best sentence among the beams. Over the last few years, beam-search has been the standard decoding algorithm for almost all language generation tasks. Overall, it leads to more fluent output but is often quite repetitive, which is particularly undesirable in story generation.

On top of that, a recent study by Ari Holtzman et al. [11] showed that the distribution of words in texts generated using beam-search and greedy decoding is very different from the distribution of words in human-written texts. Clearly, beam-search and greedy decoding fail to reproduce some distributional aspects of human texts, and they have thus been replaced by top-k and nucleus (or top-p) sampling, currently the two most promising approaches. The general principle of these two methods is to sample from the next-token distribution after filtering it to keep only the top k tokens (top-k) or the top tokens whose cumulative probability just exceeds a threshold (nucleus/top-p).

In other words, top-k sampling builds on plain sampling: it simply selects the k most probable words and re-scales the probability mass across these k words. This approach yields very nice results (Radford et al., 2019). Its sole limitation is that k is fixed whether the distribution is narrow or wide, while we might want to distinguish between those two cases.

This is why nucleus sampling was created: the size of the candidate set (i.e. the number of words kept) can dynamically grow and shrink according to the next word’s probability distribution.

Note that lowering the temperature 🌡️ sharpens the probability mass distribution, while increasing it fosters diversity, which suits us better since we want creative outputs that give ideas to the writer, even if some mistakes are made.
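Putting it together, decoding with the transformers generate() method might look like the following; the sampling values are illustrative, not the interface’s defaults, and input_ids is the context built as in training but without P2:

```python
# The model continues from the final [P2] marker and writes the new paragraph.
output_ids = model.generate(
    input_ids,
    do_sample=True,          # sample instead of greedy / beam search
    top_k=50,                # keep only the 50 most probable tokens
    top_p=0.9,               # nucleus (top-p) filtering threshold
    temperature=1.2,         # >1 to encourage diversity
    max_length=1024,
    pad_token_id=tokenizer.eos_token_id,
)
generated = output_ids[0][input_ids.shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```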

We are now ready to play with our model 🚀

Web service

For this project, we also wanted to innovate and offer a proposal of what an AI-enhanced interface could look like in terms of user experience. So we designed an interface and linked it to elastic instances in the back-end. We then opened it to a small public to test our model. 🎬

We have now closed it due to limited resources – yeah it’s expensive. But you can still run it on your local computer and use it.

Technical note: To gain flexibility in the choice of instances performing the heavy computations and to allow load balancing across several instances, we decoupled the master instance (serving the JavaScript front-end and general data) from the computational instances (performing NER and text generation on demand). Clients can also run the servers locally to avoid delays and server overloads. The figure below gives an idea of the general architecture of our service.

In this interface, 🏝️ users are invited to write some text in a simple editor. Named entities such as characters, locations and organisations are detected on the fly by the NER back-end and displayed in the left panel. Users can easily edit them manually if needed.

Users can select several options: the length of the desired paragraph, the genre of their writing and the list of entities they want to see appear in the generation. They can also highlight a small part of the text that will act as a summary (or a list of keywords). When they are ready, they simply press the Generate button and let the magic happen.

Further development and conclusion

In summary, we presented an end-to-end pipeline from Project Gutenberg’s raw text files 📚 to a web service intended for book writers. The latter embeds a controllable and contextualised GPT-2 for text generation specific to novels, fine-tuned with our pipeline on a few hundred novels over ten epochs 🎳. Overall, despite having limited computational resources, we managed to build a final model that is able to take into consideration the context specified by the user.

With the constant improvement of computing capacity and the recent research trend aimed at reducing model size without damaging generation quality, we strongly believe such controllable generative frameworks will be easily accessible in the future and will greatly enhance the creativity of writers.

Some tracks to go even further:

  • Add a PPLM model on top of our fine-tuned GPT-2 model to steer the generation even further, controlling for instance the theme and sentiment of the paragraph.
  • Change the loss to penalise, for instance, the absence of a requested entity.
  • Use a large GPT-2 model to improve the quality of the generated text.

We’ve come to the end of this post describing how you can build a state-of-the-art controlled and contextualised writing tool aimed at novel authors, using transfer learning and a large-scale language model like OpenAI GPT-2.

Be sure to check out the associated demo and code:

  • the interface is here (you now need to run it locally as we cut our instance)
  • the open-sourced code and pretrained models are here.

Written with Thomas Lamson and Gael de Léséleuc


References

[1] Language models are unsupervised multitask learners, by Alec Radford et al. (2019). https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

[2] Attention is all you need, by Ashish Vaswani et al. (2017). https://arxiv.org/pdf/1706.03762.pdf

[3] Defending against neural fake news, by Rowan Zellers, Ari Holtzman et al. (2019). https://grover.allenai.org/

[4] Neural machine translation of rare words with subword units, by Rico Sennrich, Barry Haddow and Alexandra Birch (2015). https://arxiv.org/pdf/1508.07909.pdf

[5] BERT: Pre-training of deep bidirectional transformers for language understanding, by Jacob Devlin et al. (2018). https://arxiv.org/abs/1810.04805

[6] BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, by Mike Lewis et al. (2019). https://arxiv.org/abs/1910.13461

[7] Exploring the limits of transfer learning with a unified text-to-text transformer, by Colin Raffel et al. (2019). https://arxiv.org/abs/1910.10683

[8] Fine-tune BERT for extractive summarisation, by Yang Liu (2019). https://arxiv.org/abs/1903.10318

[9] TextRank: Bringing order into text, by Rada Mihalcea and Paul Tarau (2004). https://www.aclweb.org/anthology/W04-3252/

[10] Importance of search and evaluation strategies in neural dialogue modeling, by Ilia Kulikov et al. (2019). https://arxiv.org/abs/1811.00907

[11] The curious case of neural text degeneration, by Ari Holtzman, Jan Buys, Maxwell Forbes and Yejin Choi (2019). https://arxiv.org/abs/1904.09751

