Seq2Seq model in TensorFlow

Park Chansung
Towards Data Science
9 min read · May 2, 2018



In this project, I am going to build a language translation model called the seq2seq (encoder-decoder) model in TensorFlow. The objective of the model is to translate English sentences into French sentences. I am going to show the detailed steps, and they will answer questions like how to define the encoder model, how to define the decoder model, how to build the entire seq2seq model, and how to calculate the loss and clip gradients.

Please visit the GitHub repo for more detailed information and the actual code in a Jupyter notebook. It covers a few more topics, like how to preprocess the dataset, how to define the inputs, and how to train and get predictions.

This is a part of Udacity’s Deep Learning Nanodegree. Some code/functions (save, load, measuring accuracy, etc.) are provided by Udacity. However, the majority is implemented by myself, along with much richer explanations and references in each section. Also, the base figures (of the model) are borrowed from Luong (2016).

Steps to build Seq2Seq model

You can separate the entire model into two small sub-models. The first sub-model is called the [E] Encoder, and the second sub-model is called the [D] Decoder. [E] takes raw input text data just like any other RNN architecture does. At the end, [E] outputs a neural representation. This is very typical work, but you need to pay attention to what this output really is. The output of [E] is going to be the input data for [D].

That is why we call [E] the Encoder and [D] the Decoder. [E] produces an output encoded in a neural representational form, and we don’t know what it really is; it is somewhat encrypted. [D] has the ability to look inside [E]’s output, and it will create totally different output data (translated into French in this case).

In order to build such a model, there are 8 steps overall. I noted which functions to be implemented are related to each step.

(1) define input parameters to the encoder model

  • enc_dec_model_inputs

(2) build encoder model

  • encoding_layer

(3) define input parameters to the decoder model

  • enc_dec_model_inputs, process_decoder_input, decoding_layer

(4) build decoder model for training

  • decoding_layer_train

(5) build decoder model for inference

  • decoding_layer_infer

(6) put (4) and (5) together

  • decoding_layer

(7) connect encoder and decoder models

  • seq2seq_model

(8) define loss function, optimizer, and apply gradient clipping

Fig 1. Neural Machine Translation / Training Phase

Encoder Input (1), (3)

enc_dec_model_inputs function creates and returns parameters (TF placeholders) related to building model.

The inputs placeholder will be fed with English sentence data, and its shape is [None, None]. The first None means the batch size, and the batch size is unknown since the user can set it. The second None means the lengths of the sentences. The maximum sentence length differs from batch to batch, so it cannot be set to an exact number.

  • One option is to pad every sentence in a batch to the maximum length within that batch; another is to pad every sentence to the maximum length across the entire dataset. No matter which option you choose, you need to add the special <PAD> token in the empty positions. However, with the latter option, there could be unnecessarily many <PAD> tokens.

The targets placeholder is similar to the inputs placeholder except that it will be fed with French sentence data.

The target_sequence_length placeholder represents the length of each sentence, so its shape is [None], a column tensor whose length equals the batch size. This particular value is required as an argument of TrainingHelper to build the decoder model for training. We will see this in (4).

max_target_len gets the maximum value among the lengths of all the target sentences (sequences). As you know, we have the lengths of all the sentences in the target_sequence_length parameter. The way to get the maximum value from it is to use tf.reduce_max.
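As a minimal sketch (using the names described above; the actual notebook may differ slightly), enc_dec_model_inputs could look like this:

```python
import tensorflow as tf

def enc_dec_model_inputs():
    # English sentences as word-index IDs: [batch_size, sentence_length]
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    # French sentences (targets), same layout as inputs
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    # Length of each target sentence in the batch
    target_sequence_length = tf.placeholder(tf.int32, [None], name='target_sequence_length')
    # Longest target sentence in the batch, obtained with tf.reduce_max
    max_target_len = tf.reduce_max(target_sequence_length, name='max_target_len')

    return inputs, targets, target_sequence_length, max_target_len
```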

Process Decoder Input (3)

On the decoder side, we need two different kinds of input, for the training and inference purposes respectively. During the training phase, the input is provided as the target labels, but they still need to be embedded. During the inference phase, however, the output of each time step will be the input for the next time step. The outputs also need to be embedded, and the embedding vector should be shared between the two phases.

Fig 2. <GO> insertion

In this section, I am going to preprocess the target label data for the training phase. It is nothing special. All you need to do is add the <GO> special token in front of all the target data. The <GO> token is a kind of guide token, saying "this is the start of the translation". For this process, you need to know three TensorFlow operations.

TF strided_slice

  • extracts a strided slice of a tensor (generalized Python array indexing).
  • can be thought of as slicing with a striding window from begin to end
  • arguments: TF Tensor, Begin, End, Strides

TF fill

  • creates a tensor filled with a scalar value.
  • arguments: dims (an int32/int64 tensor describing the output shape), the value to fill with

TF concat

  • concatenates tensors along one dimension.
  • arguments: a list of TF Tensors (the tf.fill result and after_slice in this case), axis=1
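Putting these three together, a minimal sketch of process_decoder_input (assuming target_data holds word-index IDs and target_vocab_to_int maps tokens to IDs, as in the notebook) could look like this:

```python
def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    # Drop the last token of every target sentence ...
    after_slice = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    # ... and prepend the <GO> token to every sentence (Fig 2)
    go_id = target_vocab_to_int['<GO>']
    after_concat = tf.concat([tf.fill([batch_size, 1], go_id), after_slice], 1)
    return after_concat
```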

After preprocessing the target label data, we will embed it later when implementing decoding_layer function.

Encoding (2)

Fig 3. Encoding model highlighted — Embedding/RNN layers

As depicted in Fig 3, the encoding model consists of two different parts. The first part is the embedding layer. Each word in a sentence will be represented by the number of features specified as encoding_embedding_size. This layer gives much richer representative power for the words. The second part is the RNN layer(s). You can make use of any kind of RNN-related techniques or algorithms. For example, in this project, multiple LSTM cells are stacked together, with dropout applied to them. You can use different kinds of RNN cells, such as GRU.

Embedding layer

RNN layers

Encoding model
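The three pieces above (embedding layer, RNN layers, encoding model) can be sketched roughly as below; the cell type, dropout placement, and argument names are assumptions based on the description in this section:

```python
def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_vocab_size, encoding_embedding_size):
    # Embedding layer: each word ID becomes a dense vector of size encoding_embedding_size
    embed = tf.contrib.layers.embed_sequence(rnn_inputs,
                                             vocab_size=source_vocab_size,
                                             embed_dim=encoding_embedding_size)

    # RNN layer(s): stacked LSTM cells with dropout applied to each cell
    def make_cell():
        cell = tf.contrib.rnn.LSTMCell(rnn_size)
        return tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)

    stacked_cells = tf.contrib.rnn.MultiRNNCell([make_cell() for _ in range(num_layers)])

    # Encoding model: unroll the RNN; the final state is handed to the decoder
    outputs, state = tf.nn.dynamic_rnn(stacked_cells, embed, dtype=tf.float32)
    return outputs, state
```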

Decoding — Training process (4)

The decoding model can be thought of as two separate processes, training and inference. They do not have different architectures; they share the same architecture and its parameters. What differs is the strategy used to feed the shared model. For this (training) and the next (inference) section, Fig 4 clearly shows what they are.

Fig 4. Decoder shifted inputs

While the encoder uses TF contrib.layers.embed_sequence, it is not applicable to the decoder, even though the decoder also requires its input to be embedded. That is because the same embedding parameters should be shared between the training and inference phases. TF contrib.layers.embed_sequence can only embed a dataset that is prepared before running. What the inference process needs is a dynamic embedding capability: it is impossible to embed the output from the inference process before running the model, because the output of the current time step will be the input of the next time step.

How can we embed it? We will see soon. However, for now, what you need to remember is that the training and inference processes share the same embedding parameters. For the training part, the embedded input should be delivered. For the inference part, only the embedding parameters used in the training part should be delivered.

Let’s see the training part first.

  • tf.contrib.seq2seq.TrainingHelper
    : TrainingHelper is where we pass the embedded input. As the name indicates, this is only a helper instance. This instance should be delivered to the BasicDecoder, which does the actual work of building the decoder model.
  • tf.contrib.seq2seq.BasicDecoder
    : BasicDecoder builds the decoder model. It means it connects the RNN layer(s) on the decoder side and the input prepared by TrainingHelper.
  • tf.contrib.seq2seq.dynamic_decode
    : dynamic_decode unrolls the decoder model so that the actual prediction made by BasicDecoder can be retrieved at each time step.
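A minimal sketch of decoding_layer_train built from these three pieces (the argument list and dropout wrapper are assumptions; dec_embed_input is the embedded, <GO>-prefixed target batch):

```python
def decoding_layer_train(encoder_state, dec_cell, dec_embed_input,
                         target_sequence_length, max_target_sequence_length,
                         output_layer, keep_prob):
    dec_cell = tf.contrib.rnn.DropoutWrapper(dec_cell, output_keep_prob=keep_prob)

    # Helper that feeds the embedded ground-truth targets at each time step
    helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)

    # BasicDecoder wires the RNN cell, the helper, and the output (projection) layer
    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer)

    # dynamic_decode unrolls the decoder up to the longest target sentence in the batch
    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_target_sequence_length)
    return outputs
```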

Decoding — Inference process (5)

  • tf.contrib.seq2seq.GreedyEmbeddingHelper
    : GreedyEmbeddingHelper dynamically takes the output of the current time step and gives it to the next time step’s input. In order to embed each output dynamically, the embedding parameter (just a bunch of weight values) should be provided. Along with it, GreedyEmbeddingHelper asks for the start_of_sequence_id repeated as many times as the batch size, and the end_of_sequence_id.
  • tf.contrib.seq2seq.BasicDecoder
    : same as described in the training process section
  • tf.contrib.seq2seq.dynamic_decode
    : same as described in the training process section
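Correspondingly, a sketch of decoding_layer_infer could look like the following; dec_embeddings is the shared embedding parameter discussed in the next section, and the argument list is again an assumption:

```python
def decoding_layer_infer(encoder_state, dec_cell, dec_embeddings,
                         start_of_sequence_id, end_of_sequence_id,
                         max_target_sequence_length, output_layer,
                         batch_size, keep_prob):
    dec_cell = tf.contrib.rnn.DropoutWrapper(dec_cell, output_keep_prob=keep_prob)

    # Helper that embeds each predicted word and feeds it to the next time step;
    # the start tokens are the <GO> id repeated batch_size times
    helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
        dec_embeddings,
        tf.fill([batch_size], start_of_sequence_id),
        end_of_sequence_id)

    decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer)

    outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
        decoder, impute_finished=True, maximum_iterations=max_target_sequence_length)
    return outputs
```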

Build the Decoding Layer (3), (6)

Embed the target sequences

  • TF contrib.layers.embed_sequence creates an internal representation of the embedding parameter, so we cannot look into or retrieve it. Rather, you need to create the embedding parameter manually as a TF Variable.
  • The manually created embedding parameter is used in the training phase to convert the provided target data (sequences of sentences) with TF nn.embedding_lookup before the training is run. TF nn.embedding_lookup with a manually created embedding parameter returns a result similar to TF contrib.layers.embed_sequence. For the inference process, whenever the output of the current time step is calculated via the decoder, it will be embedded by the shared embedding parameter and become the input for the next time step. You only need to provide the embedding parameter to the GreedyEmbeddingHelper, and it will handle the process.
  • How does embedding_lookup work?
    : In short, it selects the specified rows (see the toy example after this list)
  • Note: Please be careful about setting the variable scope. As mentioned previously, parameters/variables are shared between the training and inference processes. Sharing can be specified via tf.variable_scope.
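As a toy illustration of embedding_lookup selecting rows (the sizes and IDs here are made up for illustration):

```python
# Hypothetical 5-word vocabulary embedded in 3 dimensions
dec_embeddings = tf.Variable(tf.random_uniform([5, 3]))
word_ids = tf.constant([[0, 2, 4]])  # one sentence of word IDs

# Selects rows 0, 2, and 4 of dec_embeddings -> shape [1, 3, 3]
dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, word_ids)
```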

Construct the decoder RNN layer(s)

  • As depicted in Fig 3 and Fig 4, the number of RNN layers in the decoder model has to be equal to the number of RNN layers in the encoder model.

Create an output layer to map the outputs of the decoder to the elements of our vocabulary

  • This is just a fully connected layer to get the probabilities of occurrence of each word at the end.
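Putting the embedding, the decoder RNN, the output layer, and the two processes together, decoding_layer could be sketched as below (the exact signature and the variable-scope reuse pattern are assumptions consistent with the description above):

```python
def decoding_layer(dec_input, encoder_state, target_sequence_length,
                   max_target_sequence_length, rnn_size, num_layers,
                   target_vocab_to_int, target_vocab_size,
                   batch_size, keep_prob, decoding_embedding_size):
    # Manually created embedding parameter, shared by training and inference
    dec_embeddings = tf.Variable(
        tf.random_uniform([target_vocab_size, decoding_embedding_size]))
    dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)

    # Decoder RNN: same number of layers as the encoder
    cells = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.LSTMCell(rnn_size) for _ in range(num_layers)])

    # Fully connected output layer over the target vocabulary
    output_layer = tf.layers.Dense(target_vocab_size)

    with tf.variable_scope('decode'):
        train_output = decoding_layer_train(
            encoder_state, cells, dec_embed_input, target_sequence_length,
            max_target_sequence_length, output_layer, keep_prob)

    # reuse=True so inference shares the parameters learned during training
    with tf.variable_scope('decode', reuse=True):
        infer_output = decoding_layer_infer(
            encoder_state, cells, dec_embeddings,
            target_vocab_to_int['<GO>'], target_vocab_to_int['<EOS>'],
            max_target_sequence_length, output_layer, batch_size, keep_prob)

    return train_output, infer_output
```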

Build the Seq2Seq model (7)

In this section, the previously defined functions, encoding_layer, process_decoder_input, and decoding_layer, are put together to build the big picture: the sequence-to-sequence model.
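A rough sketch of how seq2seq_model could glue those pieces together (the signature is assumed for illustration):

```python
def seq2seq_model(input_data, target_data, keep_prob, batch_size,
                  target_sequence_length, max_target_sentence_length,
                  source_vocab_size, target_vocab_size,
                  enc_embedding_size, dec_embedding_size,
                  rnn_size, num_layers, target_vocab_to_int):
    # Encoder: only its final state is needed to initialise the decoder
    _, enc_state = encoding_layer(input_data, rnn_size, num_layers, keep_prob,
                                  source_vocab_size, enc_embedding_size)

    # Prepend <GO> to the targets before feeding them to the decoder
    dec_input = process_decoder_input(target_data, target_vocab_to_int, batch_size)

    # Decoder returns training logits and inference logits
    train_output, infer_output = decoding_layer(
        dec_input, enc_state, target_sequence_length, max_target_sentence_length,
        rnn_size, num_layers, target_vocab_to_int, target_vocab_size,
        batch_size, keep_prob, dec_embedding_size)

    return train_output, infer_output
```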

Build Graph + Define Loss, Optimizer w/ Gradient Clipping

The seq2seq_model function creates the model. It defines how the feedforward and backpropagation should flow. The last step for this model to be trainable is deciding and applying which optimization algorithm to use. In this section, TF contrib.seq2seq.sequence_loss is used to calculate the loss, then TF train.AdamOptimizer is applied to perform gradient descent on the loss. Let's go over each step in the code cell below.

load data from the checkpoint

  • (source_int_text, target_int_text) are the input data, and (source_vocab_to_int, target_vocab_to_int) are the dictionaries to look up the index number of each word.
  • max_target_sentence_length is the length of the longest sentence from the source input data. This will be used for GreedyEmbeddingHelper when building the inference process in the decoder model.

create inputs

  • inputs (input_data, targets, target_sequence_length, max_target_sequence_length) from enc_dec_model_inputs function
  • inputs (lr, keep_prob) from hyperparam_inputs function

build seq2seq model

  • build the model with the seq2seq_model function. It will return train_logits (logits to calculate the loss) and inference_logits (logits from the prediction).

cost function

  • TF contrib.seq2seq.sequence_loss is used. This loss function is just a weighted softmax cross-entropy loss function, but it is particularly designed to be applied to time series models (RNNs). Weights should be explicitly provided as an argument, and they can be created with TF sequence_mask. In this project, TF sequence_mask creates a variable of size [batch_size, max_target_sequence_length], then masks only the first target_sequence_length elements to 1. It means the padded parts get zero weight, so they do not contribute to the loss.
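In code, the cost could be computed roughly like this (training_logits, targets, and masks are names assumed for illustration):

```python
# Logits from the decoder's training branch: [batch_size, time, vocab_size]
training_logits = tf.identity(train_output.rnn_output, name='logits')

# 1.0 for real tokens, 0.0 for <PAD> positions beyond each sentence's length
masks = tf.sequence_mask(target_sequence_length, max_target_sequence_length,
                         dtype=tf.float32, name='masks')

# Weighted softmax cross entropy over the unmasked positions
cost = tf.contrib.seq2seq.sequence_loss(training_logits, targets, masks)
```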

Optimizer

  • TF train.AdamOptimizer is used, and this is where the learning rate should be specified. You can choose other algorithms as well; this is just a choice.

Gradient Clipping

Fig 5. Gradient Clipping

  • Since recurrent neural networks are notorious for vanishing/exploding gradients, gradient clipping is believed to alleviate the issue.
  • The concept is really easy. You decide a threshold to keep the gradients within a certain boundary. In this project, the threshold range is between -1 and 1.
  • Now, you need to apply this conceptual knowledge to TensorFlow code. Luckily, there is an official guide for this: TF Gradient Clipping How?. In brief, you get the gradient values from the optimizer manually by calling compute_gradients, then manipulate the gradient values with clip_by_value. Lastly, you need to put the modified gradients back into the optimizer by calling apply_gradients.
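A minimal sketch of those three calls (lr is the learning-rate placeholder from hyperparam_inputs; cost is the sequence loss defined earlier):

```python
optimizer = tf.train.AdamOptimizer(lr)

# Get the raw gradients, clip each value into [-1, 1], and re-apply them
gradients = optimizer.compute_gradients(cost)
capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var)
                    for grad, var in gradients if grad is not None]
train_op = optimizer.apply_gradients(capped_gradients)
```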

About myself

My background in deep learning is Udacity {Deep Learning ND & AI-ND with concentrations (CV, NLP, VUI)} and the Coursera Deeplearning.ai Specialization (AI-ND has been split into 4 different parts, which I have finished altogether with the previous version of the ND). Also, I am currently taking the Udacity Data Analyst ND, and I am about 80% done.
