Neural Machine Translation — Using seq2seq with Keras

Translation from English to French using encoder-decoder model

Ravindra Kompella
Towards Data Science


Photo by Kiarash Mansouri on Unsplash

This article is motivated by this Keras example and this paper on encoder-decoder networks. The idea is to gain an intuitive and detailed understanding from that example. My own implementation of this example, referenced in this story, is provided at my github link.

Before we start, it may help to go through my other post on LSTMs, which covers the fundamentals of LSTMs specifically in this context.

Below is the detailed network architecture used for training the seq2seq Encoder — Decoder network. We will refer to this figure throughout.

Fig A — Encoder-Decoder training architecture for NMT — image copyright @Ravindra Kompella

First we will go through training the network. Then we will look at the inference models and how to translate a given English sentence to French. The inference model (used for predicting on the input sequence) has a slightly different decoder architecture, and we will discuss that in detail when we get there.

Training the network —

So how does the training data look?

  • We have 10,000 English sentences and the corresponding 10,000 translated French sentences, so our nb_samples = 10000.

Overall plan for training —

  1. Create one-hot character embeddings for the English and French sentences. These will be the inputs to the encoder and the decoder. The French one-hot character embeds will also be used as the target data for the loss function.
  2. Feed the character embeds into the encoder, character by character, till the end of the English sentence sequence.
  3. Obtain the final encoder states (hidden and cell states) and feed them into the decoder as its initial state.
  4. The decoder will have 3 inputs at every time step — the 2 decoder states and the French character embeds fed to it character by character.
  5. At every step of the decoder, the output of the decoder is sent to a softmax layer whose output is compared with the target data.

Detailed flow for training the network with code —

Refer to snippet 1 — Note that we have appended ‘\t’ for the start of the French sentence and ‘\n’ to signify the end of the French sentence. These appended French sentences will be used as inputs to the decoder. All the English characters and French characters are collected in separate sets. These sets are converted to character-level dictionaries (useful for retrieving the index and character values later).

Snippet 1
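The embedded gist is not reproduced in this text; a minimal sketch of the data preparation described above might look like the following (the data file path and the dictionary variable names are assumptions, not taken from the original snippet):

```python
import numpy as np

nb_samples = 10000
eng_sentences, fra_sentences = [], []
eng_chars, fra_chars = set(), set()

# Hypothetical data file: one "english<TAB>french" pair per line
with open('fra.txt', 'r', encoding='utf-8') as f:
    lines = f.read().split('\n')

for line in lines[:nb_samples]:
    eng, fra = line.split('\t')[:2]
    fra = '\t' + fra + '\n'   # '\t' marks start of sentence, '\n' marks end of sentence
    eng_sentences.append(eng)
    fra_sentences.append(fra)
    eng_chars.update(eng)     # collect all English characters in a set
    fra_chars.update(fra)     # collect all French characters in a set

eng_chars = sorted(eng_chars)
fra_chars = sorted(fra_chars)

# Character-level dictionaries for index <-> character lookups later
eng_char_to_index = {ch: i for i, ch in enumerate(eng_chars)}
fra_char_to_index = {ch: i for i, ch in enumerate(fra_chars)}
fra_index_to_char = {i: ch for i, ch in enumerate(fra_chars)}
```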

Refer to snippet 2 — Prepare the embeds for the encoder input, the decoder input and the target data. We create a one-hot encoding for each character in English and French separately. These are called tokenized_eng_sentences and tokenized_fra_sentences in the code snippet, and they will be the inputs to the encoder and decoder respectively. Note that the target_data French character embeds that we compare against at the softmax layer output are offset by one timestep (t+1) relative to the decoder input embeds, because there is no start tag in the target data (refer to the above architecture diagram for more clarity). Hence the target_data in the code snippet is offset accordingly (note the k-1 in the second dimension of the target_data array below).

Snippet 2

Refer to snippet 2 — As we noted in my other post on LSTMs, the embeds (tokenized_eng_sentences, tokenized_fra_sentences and target_data) are 3D arrays. The first dimension corresponds to nb_samples (= 10,000 in this case), the second dimension corresponds to the maximum length of an English / French sentence, and the third dimension corresponds to the total number of English / French characters.
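A sketch of this one-hot encoding step, following the description above and continuing from the previous sketch (the max-length variable names are assumptions):

```python
max_len_eng = max(len(s) for s in eng_sentences)
max_len_fra = max(len(s) for s in fra_sentences)

# 3D arrays: (nb_samples, max sentence length, number of characters)
tokenized_eng_sentences = np.zeros((nb_samples, max_len_eng, len(eng_chars)), dtype='float32')
tokenized_fra_sentences = np.zeros((nb_samples, max_len_fra, len(fra_chars)), dtype='float32')
target_data = np.zeros((nb_samples, max_len_fra, len(fra_chars)), dtype='float32')

for i in range(nb_samples):
    for k, ch in enumerate(eng_sentences[i]):
        tokenized_eng_sentences[i, k, eng_char_to_index[ch]] = 1.0
    for k, ch in enumerate(fra_sentences[i]):
        tokenized_fra_sentences[i, k, fra_char_to_index[ch]] = 1.0
        if k > 0:
            # target is offset by one timestep: the start tag '\t' is not part of the targets
            target_data[i, k - 1, fra_char_to_index[ch]] = 1.0
```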

Refer to snippet 3 — We will input character by character (of course, their corresponding one-hot embeds) into the encoder network. For the encoder_LSTM, we set return_state = True and leave return_sequences at its default of False. This means we obtain only the final encoded cell state and hidden state at the end of the input sequence, and not the intermediate states at every time step. These are the final encoded states that are used to initialize the state of the decoder.

Snippet 3 — Encoder model for training

Refer to snippet 3 — Also note that the input shape has been specified as (None, len(eng_chars)). This means the encoder LSTM can dynamically unroll over as many timesteps as there are characters, until it reaches the end of the sequence for that sentence.
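A sketch of the encoder definition along these lines (the number of latent units, 256 here, is an assumption):

```python
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# Encoder: only the final hidden and cell states are kept
# (return_sequences is left at its default of False)
encoder_input = Input(shape=(None, len(eng_chars)))
encoder_LSTM = LSTM(256, return_state=True)          # 256 latent units is an assumption
encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]              # used to initialize the decoder
```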

Refer to snippet 4 — The inputs to the decoder are the French character embeds (contained in the tokenized_fra_sentences array), fed one by one at each time step along with the previous state values. The previous states for the first step of the decoder are initialized with the final encoder states that we collected earlier in snippet 3. For this reason, note that initial_state=encoder_states has been set in the code snippet. From the subsequent steps onwards, the state inputs to the decoder are its own cell state and hidden state.

Snippet 4 — Decoder model for training

Also from the above code snippet, notice that the decoder is set up with return_sequences = True along with return_state = True, so we obtain the decoder output and the two decoder states at every timestep. Although return_state = True is declared here, we are not going to use the decoder states while training the model; they are declared because they will be reused when building the decoder inference model (which we will see later). The decoder output is passed through a softmax layer that will learn to classify the correct French character.
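A corresponding sketch of the training-time decoder (again assuming 256 latent units, matching the encoder sketch):

```python
# Decoder: initialized with the final encoder states,
# returns outputs and states at every timestep
decoder_input = Input(shape=(None, len(fra_chars)))
decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)

# Softmax over the French character set at every timestep
decoder_dense = Dense(len(fra_chars), activation='softmax')
decoder_out = decoder_dense(decoder_out)
```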

Refer to snippet 5 — The loss function is categorical cross-entropy, obtained by comparing the predicted values from the softmax layer with the target_data (one-hot French character embeds).

Now the model is ready for training. Train the entire network for the specified number of epochs.

Snippet 5 — y = target_data (containing one-hot French character embeds)
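A sketch of the compile-and-fit step (the optimizer, batch size and validation split are assumptions; 25 epochs matches the results reported below):

```python
model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit(x=[tokenized_eng_sentences, tokenized_fra_sentences],
          y=target_data,          # one-hot French character embeds, offset by one timestep
          batch_size=64,          # batch size and validation split are assumptions
          epochs=25,
          validation_split=0.2)
```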

Testing (Inference mode) —

Below is the architecture used for the inference models. The inference models leverage all the network parameters learnt during training, but we define them separately because the inputs and outputs during inference differ from those used while training the network.

From the figure below, observe that there are no changes on the encoder side of the network. So we feed the new English sentence (as one-hot character embed vectors) as the input sequence to the encoder model and obtain the final encoder states.

Fig B — Encoder-Decoder Inference model architecture for NMT — image copyright @Ravindra Kompella

Contrast figure B with figure A on the decoder side. The major changes are as follows —

  • At the first time step, the decoder has 3 inputs — the start tag ‘\t’ and the two encoder states. We input the first character ‘\t’ (its one-hot embed vector) into the first time step of the decoder.
  • The decoder then outputs the first predicted character (assume it is ‘V’).
  • Observe how the blue lines loop back into the decoder input for the next time step. So this predicted character ‘V’ will be fed as an input to the decoder at the next timestep.
  • Also note that we obtain the index of the predicted character by applying the np.argmax function to the output of the softmax layer at each timestep, and then do a reverse dictionary lookup on that index to obtain the actual character ‘V’ and build its one-hot embed vector.
  • From the next time step onwards, the decoder still has 3 inputs, but they differ from the first time step — the one-hot embed of the previously predicted character, the previous decoder cell state and the previous decoder hidden state.

Given the above understanding, let's now look at the code —

Refer to snippet 6 — The encoder inference model is quite straightforward. It outputs only the encoder_states.

Snippet 6 — Encoder inference model
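A sketch of this model, reusing encoder_input and encoder_states from the training sketches above:

```python
# Encoder inference model: maps an input English sequence to its final states
encoder_model_inf = Model(encoder_input, encoder_states)
```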

Refer to snippet 7 — The decoder model is more elaborate. Note that we create separate ‘Input’ layers for the decoder hidden state and the decoder cell state. This is because we are going to feed these states into the decoder at every time step (other than the first time step — recall that at the first time step we feed only the encoder states), and the decoder inference model is a separate standalone model. Both the encoder and the decoder will be called recursively for each character that is to be generated in the translated sequence.

Snippet 7 — Decoder inference model
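A sketch of the decoder inference model, reusing decoder_LSTM and decoder_dense from the training sketch (the state shape of 256 matches the assumed latent dimension; the layer names are assumptions):

```python
# Separate Inputs for the states that are fed back into the decoder at every timestep
decoder_state_input_h = Input(shape=(256,))
decoder_state_input_c = Input(shape=(256,))
decoder_input_states = [decoder_state_input_h, decoder_state_input_c]

# Reuse the trained decoder LSTM and softmax layer
decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input,
                                                 initial_state=decoder_input_states)
decoder_states = [decoder_h, decoder_c]
decoder_out = decoder_dense(decoder_out)

decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                          outputs=[decoder_out] + decoder_states)
```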

Refer to snippet 8 — We get the encoder states into the states_val variable. On the first pass through the while loop, these hidden and cell states from the encoder are provided directly as inputs to decoder_model_inf to initialize it. Once we predict a character from the softmax output, we feed this predicted character (using the target_seq 3D array to hold its one-hot embed) along with the updated states_val (taken from the previous decoder states) into the next iteration of the while loop. Note that we reset target_seq before creating the one-hot embed of the predicted character in every iteration of the while loop.

Snippet 8 — Function to recursively call the decoder for predicting the translated character sequence.
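A sketch of this decoding loop under the same assumptions as the earlier sketches (the function name and the stopping condition on max_len_fra are assumptions):

```python
def decode_seq(inp_seq):
    # Initial decoder states come from the encoder inference model
    states_val = encoder_model_inf.predict(inp_seq)

    # Seed the decoder with the start tag '\t'
    target_seq = np.zeros((1, 1, len(fra_chars)))
    target_seq[0, 0, fra_char_to_index['\t']] = 1.0

    translated_sent = ''
    stop_condition = False
    while not stop_condition:
        decoder_out, decoder_h, decoder_c = decoder_model_inf.predict(
            [target_seq] + states_val)

        # Reverse dictionary lookup on the argmax of the softmax output
        max_val_index = np.argmax(decoder_out[0, -1, :])
        sampled_fra_char = fra_index_to_char[max_val_index]
        translated_sent += sampled_fra_char

        # Stop at the end tag '\n' or when the sentence gets too long
        if sampled_fra_char == '\n' or len(translated_sent) > max_len_fra:
            stop_condition = True

        # Reset target_seq to the one-hot embed of the predicted character
        target_seq = np.zeros((1, 1, len(fra_chars)))
        target_seq[0, 0, max_val_index] = 1.0

        # Updated decoder states feed the next iteration
        states_val = [decoder_h, decoder_c]

    return translated_sent
```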

That’s it! Now we have a trained model that can translate English sentences to French! Below are the results obtained after training the network for 25 epochs.

Results obtained using some sample training data

If you plan to use any of the above architecture diagrams, please feel free to do so; I only request that you mention my name in the image credit.

If you found any useful takeaway from this article, please show your applause by holding the clapping icon.
