Building Gmail-style smart compose with a char n-gram language model

“OpenAI built a language model so good, it’s considered too dangerous to release” — TechCrunch

Prithivi Da
Towards Data Science


Image Source: OpenAI’s GPT-2 (SoTA language model)

Let’s build a simple and powerful language model from scratch and use it for text generation.

If you are a Gmail user, by now you would have experienced the smart compose feature (maybe even without knowing you are using it). It’s the automatic sentence completion feature that takes email productivity to an exciting new level, and it was released at Google I/O 2018. For the benefit of those who have not used smart compose, we will start by contrasting it with the conventional predictive text and autocompletion features.

Contents:

  • Smart compose is smarter than you think
  • Language models, RNNs and Seq2Seq Modelling
  • The Experiment
  • Future work and possible extensions
  • Demo
  • References and Inspirations

Smart compose is smarter than you think

We engage in a lot of text-based communication on a daily basis, and most web and mobile apps today come with great features to improve productivity. For instance, WhatsApp offers predictive text, and Google Search autocompletes our queries with trending searches as we type. Both offer a simple, model-based prefix search, i.e., the text typed in so far is used as the prefix to predict the next word the user might want to type (in WhatsApp’s case) or the user’s search intent (in the Google Search case). You wouldn’t be wrong to speculate that smart compose is a nifty extension of predictive text and autocomplete, but there is more to it. The differences are nuanced, as shown below.

WhatsApp Predictive Text

As you can see, WhatsApp predicts the next possible word and presents you with the top 3 possibilities. While it is model based, it only predicts the next word (a unigram) or at most the next word pair (a bigram), but nothing further. To be fair, this is good enough for the messenger use case.

WhatsApp predictive text
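To make the unigram/bigram idea concrete, here is a toy sketch of count-based next-word prediction. This is only an illustration of the general idea, not WhatsApp’s actual model:

```python
# Toy next-word prediction: rank candidate next words by how often they
# followed the previous word in a small corpus (a stand-in for real usage data).
from collections import Counter, defaultdict

corpus = "call me later tonight . call me when you land . call you later".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def top_next_words(prev_word, k=3):
    """Return the k most frequent words seen after prev_word."""
    return [word for word, _ in bigram_counts[prev_word].most_common(k)]

print(top_next_words("call"))  # ['me', 'you']
```

A real keyboard uses far richer models and personalisation, but the interaction is the same: given what you just typed, surface the top few likely next words.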

Google autocomplete

The query autocomplete (shown below) is also a model-based solution: it takes the search phrase typed in so far and runs a prefix search over trending searches. (I am glossing over a lot of intricate details, but for the scope of this post this is good enough.)

Google Search Autocomplete
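Similarly, here is a toy sketch of prefix search over a list of trending queries. Again, this is just an illustration of the idea, not how Google Search actually works:

```python
# Toy query autocomplete: return the top-k trending queries that start
# with the prefix typed so far.
trending = ["weather today", "weather tomorrow", "world cup schedule", "wordle answer"]

def autocomplete(prefix, candidates, k=3):
    prefix = prefix.lower()
    return [query for query in candidates if query.startswith(prefix)][:k]

print(autocomplete("wea", trending))  # ['weather today', 'weather tomorrow']
```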

Smart compose

Word-level predictive text is great, but it is only a good fit where user input comes in short spurts. With emails, the user is going to type a lot of text in one go, across multiple emails. So the less the user needs to type, the better the experience and the productivity. Also, if you look closely at emails, we end up repeating a lot of sentences, from greetings to basic pleasantries to closing notes (they are the email equivalent of the dreaded boilerplate code programmers write).

Here is how smart compose is different and a true productivity lifter: it uses not only the current text but also the subject and the previous message (if the user is replying) to complete the sentence for you. If you watch closely you will notice that completions are delivered at different granularities: some appear after 2 or 3 words, while others appear after just a few characters.

How smart compose saved me at least 25 keystrokes in one email. (Notice how the To and Subject lines are factored into the prediction.)

Motivation

NMT (Neural Machine Translation) has become the canonical use case for explaining sequence-to-sequence (Seq2Seq) models. While there is nothing wrong with that, IMHO NMT doesn’t do justice to all the magic the Seq2Seq paradigm can offer. Besides, machine translation is not a well-solved problem (at least as of this writing). But look at smart email response (Kannan et al., 2016) and smart compose in Gmail: they are practical solutions that work really well. So, I thought building an experimental prototype of smart compose could give a fresher perspective on Seq2Seq models.

Before getting into the technical details, a friendly reminder: the rest of the post assumes a basic understanding of Seq2Seq models, the Encoder-Decoder architecture, and recurrent neural networks (RNNs and their cousins LSTM and GRU). That being said, I promise to keep this as simple as possible, so hang tight (and maybe grab a drink).

Language models, RNNs and Seq2Seq Modelling

Language models: Intuition

The modern incarnation of the Language Model (LM) is a critical milestone in the progression of NLP as a field. So, let’s start by understanding the intuition behind an LM.

Image Source: Sebastian Ruder (http://ruder.io/a-review-of-the-recent-history-of-nlp/)

LMs learn a representation of a text corpus, similar to a word embedding but a better one. Now, what do I mean by that? Simply put, the goal of an LM is to break up a text corpus and assign probabilities to text sequences, typically one word at a time (though there are other variants).

So, how does this work? For a text sequence “The dog barked at the stranger” in our corpus, a word-based LM estimates the probability P(“The dog barked at the stranger”) one word at a time using the chain rule of probability. It starts with the probability of the first word (“The”), continues with the probability of the second word (“dog”) given the first word (“The”), then the probability of the third word given the first and second words, and so on. Mathematically speaking, the LM estimates the following, where n is the number of words in the sequence.
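That is, by the chain rule of probability:

P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

For the example sentence this expands to P(The) · P(dog | The) · P(barked | The, dog) · … · P(stranger | The, dog, barked, at, the).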

LMs have many different use cases. One of them: an LM can tell how probable a text sequence is for a given corpus. For instance, the text sequence “A dog flew upside down without wings” will yield a very low probability P(“A dog flew upside down without wings”) unless the corpus came out of some fiction novel. But this use case is less interesting to us. What is more interesting is that LMs are a natural fit for sequence generation (in this case, text generation):

  1. We can ask an LM to generate a random sequence of arbitrary length, with or without a prefix (also referred to as a seed or prompt). The prefix or prompt can be anything: a word or another sequence of arbitrary length. For instance, we can prompt an LM with a word to generate a random sentence, or prompt it with a sentence to generate a sentence or a paragraph. The special case of generating an arbitrary-length sequence given an arbitrary-length sequence as input is called conditioned generation, as the output is conditioned on the input. That is in fact why this paradigm is called Seq2Seq modelling. Machine translation is the canonical example of conditioned generation because it drives the point home easily. Shown below is the layout of a classical Seq2Seq model, where x1, x2, …, xn is the input sequence and y1, y2, …, ym is the output sequence (<start> and <end> are the delimiters used for teacher forcing).
Layout of an RNN-based Conditioned Generator

2. So if you connect the dots, Gmail smart compose is nothing but “conditioned generation”, with the input sequence = current email message + the subject line + the previous email message (if you are replying), and the output sequence = the predicted sentence completion. (I encourage you to pause here and try composing a message in Gmail to see how conditioned generation works.) The diagram below shows one possible smart compose model architecture.

Conditioned generation in Smart Compose

OK, what does all this have to do with RNNs and the Encoder-Decoder architecture, and how does it all connect? Let’s look at the connections one at a time.

  1. How are LMs and RNNs related? LMs aren’t new. People have used Markov models to learn LMs in the past, but they weren’t the best choice since they come with many disadvantages (the details are out of scope for this post). Long story short, RNNs emerged as the go-to architecture for sequence modelling, and hence for language modelling, as they promise to overcome the shortcomings of Markov models. In particular, RNNs can defy the Markov limitation and factor in long-range dependencies.
  2. How are RNNs and conditioned generation related? The figure below shows a couple of different Seq2Seq modelling flavours RNNs offer. If you compare the conditioned-generator layout above with these images, it will be obvious that the many-to-many (n:m) flavour, i.e. where the input sequence length is not equal to the output sequence length, is exactly the same as a conditioned generator. (For this post we will consider RNN-based conditioned generators, but there are more advanced architectures, like Transformers, for implementing conditioned generators.)

3. How are conditioned generators and encoder-decoders related? Conditioned generator is just another name for the Encoder-Decoder architecture; generator and conditioned generator are terms from the NLP literature. In fact, the term “conditioned generator” explains what the architecture does, while the term “Encoder-Decoder” simply names the components in the architecture.

Now that we have tied everything together nicely, let’s look at how we can implement one such conditioned generator.

The Experiment

Synthetic dataset creation

Emails are a corpus of generic text sequences mixed with some specific information. Extracting and preparing a training dataset for this use case can be tricky. Here are the top two reasons:

  1. A greeting or an opening sentence could be “Hi Rita, how are you doing?”. We don’t want to learn the name “Rita”, but we should still be able to help in composing opening sentences or greetings.
  2. Private information like phone numbers, locations, addresses or even bank account numbers can be part of a message. So care must be taken to identify and exclude sensitive information while creating the training dataset. In fact, it is OK to compromise on the quality of prediction in an effort to prevent divulging sensitive or irrelevant information in predictions. Besides, serious violations can attract legal issues.

I used the Enron email dataset as a source to extract and prepare a few email messages for this experiment. That being said, there is no single good way to extract a training dataset from an email corpus, so if you want to play around, get creative. But remember that fairness and privacy are of utmost importance.

In the Enron dataset, the entire email content (To, Subject, Body, etc.) is in one CSV field. Here is the link to the code to split those fields if you would like to save some time.
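If you want to roll your own, here is a rough sketch of one way to split that field. It assumes the Kaggle-style emails.csv layout with a single “message” column and single-part plain-text messages, so adjust it to your copy of the dataset:

```python
# Split the raw Enron "message" field into To, Subject and Body columns.
import pandas as pd
from email import message_from_string

df = pd.read_csv("emails.csv")  # assumed columns: file, message

def split_email(raw_message):
    msg = message_from_string(raw_message)
    return {
        "to": msg.get("To", ""),
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),  # plain-text body for single-part messages
    }

parsed = pd.DataFrame(df["message"].apply(split_email).tolist())
parsed.to_csv("enron_split.csv", index=False)
```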

Implementation details

Sample of some short messages used in this experiment.

Listed above is a small sample of possible email messages, with or without the subject. A couple of important things to note:

One, if you check the examples I showed earlier to demonstrate how Gmail does sentence completion, you can see the predictions don’t actually wait for you to finish an entire word. If you have typed “call me wi”, it would predict “with questions if you have any”. So we need an LM that is much more granular than a word-level one; hence I chose to build it at the char n-gram level. One way to accomplish that is to prepare the dataset as shown below.
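Here is a rough sketch of that preparation (my own illustration; the exact script used for this post may differ). Each sentence is exploded into (prefix, completion) pairs where the prefix grows character by character, so the model learns to complete even mid-word inputs like “call me wi”:

```python
# Explode one training sentence into (prefix, completion) pairs at character granularity.
def make_pairs(sentence, min_prefix_chars=3):
    pairs = []
    for i in range(min_prefix_chars, len(sentence)):
        prefix = sentence[:i]       # what the user has typed so far
        completion = sentence[i:]   # what the model should learn to predict
        pairs.append((prefix, completion))
    return pairs

for prefix, completion in make_pairs("call me with questions if you have any")[:3]:
    print(repr(prefix), "->", repr(completion))
```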

Two, the subject text is added (prefixed) to the email body text as part of data preparation, so a single training record = subject text (if one is available) + message text.

Again, this is purely for experimental purposes, because with a large corpus this style of data preparation can be very memory-intensive, as one text sequence turns into multiple sequences with char n-gram prefixes.

Now, how does a char n-gram level LM change our Encoder-Decoder architecture? It doesn’t; the architecture stays the same, but how it is trained changes (shown below).

Char n-gram conditioning for generation (shown in red)

Brass tacks: the architecture choices that worked well were a BiLSTM encoder and an LSTM decoder. For simplicity I wrote this in Python with Keras. I also built a simple HTML app and a REST endpoint to demonstrate how the model works (and whether it is even usable). I have not hosted the app anywhere yet, so you can check out the demo for now (below). I will update the link to the demo as soon as I host it.
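To make the architecture concrete, here is a minimal Keras sketch of a BiLSTM encoder feeding an LSTM decoder. The vocabulary size, embedding dimension and unit counts are illustrative assumptions, not the exact hyperparameters used in this experiment:

```python
# Minimal BiLSTM-encoder / LSTM-decoder sketch (illustrative hyperparameters).
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Dense, Concatenate)
from tensorflow.keras.models import Model

VOCAB_SIZE, EMB_DIM, UNITS = 10000, 128, 256

# Encoder: reads the prefix (subject + text typed so far) and returns its final states.
enc_inputs = Input(shape=(None,), name="encoder_tokens")
enc_emb = Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(enc_inputs)
_, fwd_h, fwd_c, bwd_h, bwd_c = Bidirectional(LSTM(UNITS, return_state=True))(enc_emb)
state_h = Concatenate()([fwd_h, bwd_h])
state_c = Concatenate()([fwd_c, bwd_c])

# Decoder: generates the completion, conditioned on the encoder states.
dec_inputs = Input(shape=(None,), name="decoder_tokens")
dec_emb = Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(dec_inputs)
dec_out, _, _ = LSTM(UNITS * 2, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
dec_preds = Dense(VOCAB_SIZE, activation="softmax")(dec_out)

model = Model([enc_inputs, dec_inputs], dec_preds)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

The forward and backward encoder states are concatenated so the decoder (which is twice as wide) can be initialised with them directly; at training time the decoder input is the target completion and the decoder target is the same sequence shifted by one step (teacher forcing).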

Code

Future work and possible extensions

  • Measure the perplexity of the above LM.
  • Add an attention mechanism.
  • Replace the LSTMs with Transformers.
  • Try longer messages and longer prompts.

Demo

Over to you now,

The ability to generate text sequences from a prompt at different granularities is super powerful. Where can we go from here? Can we look beyond just improving productivity? Build a smart prescription-authoring feature for doctors? A smart contract-authoring feature for lawyers? See? The possibilities are endless. I hope I have piqued your curiosity, so go ahead, extend this idea, and build a great solution. I would love to hear how you would extend it.

References and Inspirations

[1] Jason Brownlee, How to Develop Word-Based Neural Language Models in Python with Keras

[2] TF Team, Original Tensorflow NMT Tutorial

[3] Sebastian Ruder, A Review of the Neural History of Natural Language Processing

[4] OpenAI, GPT-2 generated random texts based on a seed prompt

[5] Yoav Goldberg, NLP book

I welcome feedback and constructive criticism. I can be reached on Twitter @prithivida, or you could leave a comment here.
