Dmitri Iourovitski
Towards Data Science
5 min read · Mar 11, 2017


Generating text with deep learning

Deep learning seems to be everywhere these days: it's now even in your phone, powering all sorts of applications, especially voice assistants. In this blog post, we will explore how to generate text using deep learning. To do that, we first need to build a model of language.

Language Models

Language models are models of a natural language. The goal of a language model is to capture the flow of words and predict which word or piece of punctuation comes next, given either a single word or a long series of words. They have gotten a lot of attention lately due to interest in building chatbots and machine translation systems, which use language models to generate responses. At their core, language models model P(W|T), where W is the next word and T is the context of all the words and punctuation that have come before it.
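To make this concrete, here is a minimal sketch of the idea under a big simplification: a count-based bigram model where the context T is only the previous word (the toy corpus and function names are mine, not from a real system):

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model would be trained on far more text.
corpus = "the cat sat on the mat . the cat ran .".split()

# Count bigram occurrences: counts[previous_word][next_word]
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(context_word):
    """Estimate P(W | T) when the context T is just the previous word."""
    c = counts[context_word]
    total = sum(c.values())
    return {word: n / total for word, n in c.items()}

print(next_word_probs("the"))  # {'cat': 0.67, 'mat': 0.33} (approximately)
```

The deep learning model described below plays the same role, but it conditions on much longer contexts than a single previous word.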

Deep learning Architecture

Word Vectors

Word vectors are a way of representing a given word as a vector of numbers. We could, for example, represent the word "I" as a vector that looks like [0.05, 0.85, -0.25, 0.97]. The motivation behind representing words as vectors is that similar words have similar word vectors; below is a t-SNE projection of word vectors.

t-SNE representation of word vectors: Credit to the University of Toronto

Such representations let the model learn about the regions of the space where words cluster. This also means that if words are encountered that weren't seen during training, the model can still generalize from their word vectors. To learn more, check out the following paper on word vectors.
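As a rough illustration of "similar words have similar vectors", here is a small sketch using cosine similarity; the vectors below are made up for the example rather than trained:

```python
import numpy as np

# Hypothetical 4-dimensional word vectors; real embeddings are learned from data.
vectors = {
    "i":     np.array([0.05, 0.85, -0.25, 0.97]),
    "you":   np.array([0.10, 0.80, -0.20, 0.90]),
    "table": np.array([-0.70, 0.10, 0.60, -0.30]),
}

def cosine(a, b):
    """Cosine similarity: near 1.0 for similar directions, low for unrelated ones."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["i"], vectors["you"]))    # high: the pronouns point the same way
print(cosine(vectors["i"], vectors["table"]))  # low: unrelated words point elsewhere
```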

Recurrent Neural Networks

Recurrent neural nets are networks that maintain memory over time or other sequential inputs. At each time-step, an RNN sees the previous state and the current word embedding, which lets it build representations that encode the past along with the present. The hidden layer of an RNN can be written with the following formula (a small code sketch of this update follows the list below):

h(t) = sigmoid(h(t-1) H + e(t) I + b1)

where:

  • h(t) represents the state at the current time step
  • h(t-1) represents the state at the previous time step
  • H is a square matrix of size h x h, where the hidden size h is a hyperparameter of the model. Every previous time step is multiplied by this same matrix
  • e(t) is the word embedding at the current time step
  • I is a transformation matrix that projects the current embedding into the hidden dimension so that it can be added to the previous time step's contribution. This matrix is of size embed_size x hidden_size
  • b1 is the bias term
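Putting those pieces together, here is a minimal NumPy sketch of the recurrence h(t) = sigmoid(h(t-1) H + e(t) I + b1). The sizes and random inputs are made up for illustration; this is not a trainable implementation:

```python
import numpy as np

hidden_size, embed_size = 8, 4

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Parameters; in a real model these are learned during training.
H  = 0.1 * np.random.randn(hidden_size, hidden_size)  # h x h recurrence matrix
I  = 0.1 * np.random.randn(embed_size, hidden_size)   # embed_size x hidden_size input matrix
b1 = np.zeros(hidden_size)                             # bias term

def rnn_step(h_prev, e_t):
    """Compute h(t) from the previous state h(t-1) and the current embedding e(t)."""
    return sigmoid(h_prev @ H + e_t @ I + b1)

h = np.zeros(hidden_size)                    # initial state
for e_t in np.random.randn(5, embed_size):   # five dummy word embeddings
    h = rnn_step(h, e_t)                     # the state carries the past forward in time
```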

The sigmoid function:

sigmoid(x) = 1 / (1 + e^(-beta * x))

It's a function that squashes inputs between 0 and 1. The beta is a learned parameter that is updated during training. RNNs are very complex and fascinating building blocks for neural networks. While I've only given a bird's-eye view of an RNN, the following blog has an excellent in-depth explanation of RNNs and even more sophisticated variants.
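As a quick numerical sketch of that squashing behaviour (the beta values here are arbitrary, just to show their effect on the curve's steepness):

```python
import numpy as np

def sigmoid(x, beta=1.0):
    # beta scales the input; the post treats it as a parameter learned during training
    return 1.0 / (1.0 + np.exp(-beta * x))

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid(x))             # ~[0.007, 0.5, 0.993]
print(sigmoid(x, beta=3.0))   # larger beta squashes more sharply toward 0 and 1
```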

Softmax

The softmax is the final layer of the language model network, and is the layer that selects which word to output. The softmax normalizes its inputs and outputs a vector that sums to 1. The entry with the highest value is, in essence, the one with the highest probability.
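Here is a minimal sketch of that normalization; the logits are made-up scores, one per word in a tiny vocabulary:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max keeps the exponentials numerically stable.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])  # one score per word in the vocabulary
probs = softmax(logits)
print(probs, probs.sum())           # the probabilities sum to 1; the first word is most likely
```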

Loss Function

Cross-entropy loss is used to train the network with respect to the words it outputs. The cross-entropy loss function has the following formula:

loss = -Σ y_i * log(y_hat_i), summed over every word i in the vocabulary

y is the true target value, while y_hat is the predicted value. The most important thing to remember is that the y vector is 1 at the correct class and 0 everywhere else. The intent is to increase the probability of the correct class and decrease the probabilities of the rival classes. The log of the output is used since the log function drops off steeply in value as its argument approaches 0.

The lower the probability that is outputted for the correct class, the steeper the gradient will be.
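As a small sketch (the one-hot target and the predicted distributions are made up), the loss reduces to the negative log of the probability assigned to the correct class, and it grows quickly as that probability shrinks:

```python
import numpy as np

y     = np.array([0.0, 1.0, 0.0])      # one-hot target: class 1 is the correct word
y_hat = np.array([0.2, 0.7, 0.1])      # predicted distribution from the softmax

print(-np.sum(y * np.log(y_hat)))      # -log(0.7) ≈ 0.36: confident and correct, small loss

y_hat_bad = np.array([0.70, 0.05, 0.25])
print(-np.sum(y * np.log(y_hat_bad)))  # -log(0.05) ≈ 3.0: wrong guess, much larger loss
```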

Evaluation Metric:

Language models use perplexity as the evaluation metric, which can be written as 2 to the power of the loss. Perplexity measures how close the output distributions are to the actual ones: the lower, the better. Perplexity also represents a model's uncertainty. It tells us roughly how many words the model is considering for a given output: a perplexity of 100 means the model is choosing among 100 equally likely words. We would like this number to be as low as possible, since fewer choices correspond to a higher likelihood of picking the correct one.
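A small sketch of that relationship, assuming the loss is the average base-2 cross-entropy over the correct words:

```python
import numpy as np

def perplexity(correct_word_probs):
    """Perplexity = 2 ** (average cross-entropy, in bits, of the correct words)."""
    avg_loss = -np.mean(np.log2(correct_word_probs))
    return 2.0 ** avg_loss

# A model that spreads probability uniformly over 100 candidates has perplexity 100.
print(perplexity(np.full(1000, 1 / 100)))  # ~100.0
# A model that gives every correct word probability 0.5 is much less "perplexed".
print(perplexity(np.full(1000, 0.5)))      # 2.0
```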

SNTNC NET

I invite everyone to explore sntnc net. If you generate a very interesting or funny result, feel free to post it in the comments!
