Natural language processing | Deep learning

Machine translation with the seq2seq model: Different approaches

Discussing the two different approaches for machine translation using the seq2seq model.

Dhruvil Shah
Towards Data Science
12 min read · Jun 17, 2020


Machine translation is a sub-field of computational linguistics that examines how software can be used to translate text or speech from one language to another. At its simplest, MT performs mechanical substitution of words in one language for words in another, but this alone rarely yields a good translation, since translating requires comprehending entire sentences and finding their nearest counterparts in the target language. Two given languages may have completely different structures, words in one language may have no equivalent in the other, and many words have more than one meaning. Solving this problem with neural techniques is a fast-growing field that leads to better translations and handles differences in the translation of idioms and in typology.

In this article, we are going to build a translator that can translate an English sentence into a Hindi sentence. You can create your own translator for different languages by simply changing the dataset we use here. We will use a recurrent neural network architecture, seq2seq, i.e. the encoder-decoder model. In the article linked below, the seq2seq model is used to build a generative chatbot.

Machine translation is more or less similar to what is done in that article. The prime difference between building a generative chatbot and a machine translator lies in the dataset and the text preprocessing. That said, the steps we will follow here will be similar to those in the linked article.

There are two approaches we can take when doing machine translation. We will discuss them in the upcoming sections.

Introduction to the seq2seq approach for Machine translation

The seq2seq model, also called the encoder-decoder model, uses Long Short-Term Memory (LSTM) networks for text generation from the training corpus. The seq2seq model is also useful in machine translation applications. What does the seq2seq or encoder-decoder model do, in simple words? It predicts a word given the user input, and then each of the following words is predicted using the likelihood of that word occurring. In building our generative chatbot, we used this approach for text generation given the user input.

Machine translation using the Encoder-Decoder model

The encoder outputs a final state vector (its memory), which becomes the initial state for the decoder. We use a method called teacher forcing to train the decoder, which enables it to predict the following words in a target sequence given the previous words. As shown above, the encoder's states are passed to the decoder at each of its steps. 'I', 'do', 'not', and 'know' are called input tokens, while 'मुझे', 'नहीं', and 'पता' are called target tokens. The likelihood of the token 'पता' depends on the previous words and the encoder states. We add an '<END>' token to let our decoder know when to stop. You can learn more about the seq2seq model here.
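To make this concrete, here is a minimal sketch of how the decoder's input and target tokens line up under teacher forcing, using the tokens from the example above (the '<START>' token is added during the preprocessing step later in this article):

# At each timestep the decoder receives the ground-truth previous token
# and is trained to predict the next token in the target sequence.
decoder_input = ['<START>', 'मुझे', 'नहीं', 'पता']
decoder_target = ['मुझे', 'नहीं', 'पता', '<END>']
for prev_token, next_token in zip(decoder_input, decoder_target):
    print(f"given {prev_token} -> predict {next_token}")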

Let’s start building our translator from scratch! The first task we will have to do is preprocess our dataset.

Preprocessing the dataset

The dataset to be used here is self-created with the help of a dataset available on a public repository on GitHub. You can find the code along with the dataset from the project link given at the end of this article. The dataset contains 10,000 English sentences and the corresponding Hindi translations.

First, we will have to clean our corpus with the help of regular expressions. Then, we will need to make English-Hindi pairs so that we can train our seq2seq model. We will do these tasks as shown below.

import re
import random

data_path = "/Data/English.txt"
data_path2 = "/Data/Hindi.txt"

# Defining lines as a list of each line
with open(data_path, 'r', encoding='utf-8') as f:
    lines = f.read().strip().split('\n')
with open(data_path2, 'r', encoding='utf-8') as f:
    lines2 = f.read().strip().split('\n')

# Keep only alphanumeric tokens in the English lines
lines = [" ".join(re.findall(r"[A-Za-z0-9]+", line)) for line in lines]
# Strip stray symbols and Latin characters from the Hindi lines
lines2 = [re.sub(r"%s|\(|\)|<|>|%|[a-z]|[A-Z]|_", '', line) for line in lines2]

# Grouping lines into (English, Hindi) response pairs
pairs = list(zip(lines, lines2))
random.shuffle(pairs)

After creating the pairs, we can shuffle them before training. Our pairs will now look like this:

[('he disliked that old black automobile', 'उन्होंने उस पुराने काले ऑटोमोबाइल को नापसंद किया।'), ('they dislike peaches pears and apples', 'वे आड़ू, नाशपाती और सेब को नापसंद करते हैं।'),...]

Here, ‘he disliked that old black automobile’ is the input sequence and ‘उन्होंने उस पुराने काले ऑटोमोबाइल को नापसंद किया।’ is the target sequence. We will have to create separate lists for input sequences and target sequences, and we will also need lists of the unique tokens (input tokens and target tokens) in our dataset. For the target sequences, we will add ‘<START>’ at the beginning and ‘<END>’ at the end of each sequence so that our model knows where to start and end text generation. We will do this as shown below.

import numpy as np

input_docs = []
target_docs = []
input_tokens = set()
target_tokens = set()

for line in pairs:
    input_doc, target_doc = line[0], line[1]
    # Appending each input sentence to input_docs
    input_docs.append(input_doc)
    # Splitting words from punctuation
    target_doc = " ".join(re.findall(r"[\w']+|[^\s\w]", target_doc))
    # Redefine target_doc below and append it to target_docs
    target_doc = '<START> ' + target_doc + ' <END>'
    target_docs.append(target_doc)
    # Now we split up each sentence into words and add each unique word to our vocabulary set
    for token in re.findall(r"[\w']+|[^\s\w]", input_doc):
        if token not in input_tokens:
            input_tokens.add(token)
    for token in target_doc.split():
        if token not in target_tokens:
            target_tokens.add(token)

input_tokens = sorted(list(input_tokens))
target_tokens = sorted(list(target_tokens))
num_encoder_tokens = len(input_tokens)
num_decoder_tokens = len(target_tokens)

The two different approaches

A key thing to notice here is that while creating target_doc we split words from punctuation. This means that the target sequence ‘वे आड़ू, नाशपाती और सेब को नापसंद करते हैं।’ becomes ‘व े आड ़ ू , न ा शप ा त ी और स े ब क ो न ा पस ं द करत े ह ै ं ।’. This is what we do when performing character-level predictions. The other option for preprocessing our target sequences is to simply append each sequence as it is. This is done when we want to train our model to predict whole words from the training corpus (word-level prediction). To use that approach, comment out the target_doc = " ".join(re.findall(...)) line in the above code snippet. With character-level prediction we get 200 encoder tokens and 238 decoder tokens, while with word-level prediction we get 200 encoder tokens and 678 decoder tokens. We will discuss the performance difference between these two options in a later section when discussing the accuracy and loss of the model. For now, let's stick to the former (character-level) option.
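To see the difference between the two options concretely, you can run both tokenizations on one of the target sentences. A small sketch (the exact character-level split depends on which characters your Python version's Unicode tables count as word characters):

import re

sample = 'वे आड़ू, नाशपाती और सेब को नापसंद करते हैं।'
# Option 1 (quasi character-level): split words from punctuation. Devanagari
# vowel signs are combining marks that the regex does not treat as word
# characters, so they get split off from their consonants.
print(" ".join(re.findall(r"[\w']+|[^\s\w]", sample)))
# Option 2 (word-level): keep the sequence as it is and split on whitespace.
print(sample.split())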

Now we have the unique input tokens and target tokens for our dataset. Next, we will create an input features dictionary that stores our input tokens as key-value pairs, the token being the key and the index the value. Similarly, for the target tokens, we will create a target features dictionary. The features dictionaries will help us encode our sentences into one-hot vectors; after all, computers only understand numbers. To decode sentences back into text, we will also need reverse features dictionaries that store the index as the key and the token as the value.

input_features_dict = dict(
    [(token, i) for i, token in enumerate(input_tokens)])
target_features_dict = dict(
    [(token, i) for i, token in enumerate(target_tokens)])

reverse_input_features_dict = dict(
    (i, token) for token, i in input_features_dict.items())
reverse_target_features_dict = dict(
    (i, token) for token, i in target_features_dict.items())
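As a quick sanity check, you can round-trip a token through a one-hot vector using these dictionaries. A sketch, assuming the word 'automobile' from the sample pair above is in your corpus:

# token -> index -> one-hot vector -> index -> token
idx = input_features_dict['automobile']
one_hot = np.zeros(num_encoder_tokens)
one_hot[idx] = 1.
print(reverse_input_features_dict[np.argmax(one_hot)])  # prints 'automobile'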

Training setup

To train our seq2seq model we will use three matrices of one-hot vectors: encoder input data, decoder input data, and decoder output data. The reason we use two matrices for the decoder is a method called teacher forcing, which the seq2seq model uses while training. What is the idea behind this? At each timestep, the decoder is given the ground-truth token from the previous timestep as input, which helps the model learn the current target token. Let's create these matrices.

#Maximum length of sentences in input and target documents
max_encoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", input_doc)) for input_doc in input_docs])
max_decoder_seq_length = max([len(re.findall(r"[\w']+|[^\s\w]", target_doc)) for target_doc in target_docs])

encoder_input_data = np.zeros(
    (len(input_docs), max_encoder_seq_length, num_encoder_tokens), dtype='float32')
decoder_input_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens), dtype='float32')
decoder_target_data = np.zeros(
    (len(input_docs), max_decoder_seq_length, num_decoder_tokens), dtype='float32')

for line, (input_doc, target_doc) in enumerate(zip(input_docs, target_docs)):
    for timestep, token in enumerate(re.findall(r"[\w']+|[^\s\w]", input_doc)):
        #Assign 1. for the current line, timestep, & word in encoder_input_data
        encoder_input_data[line, timestep, input_features_dict[token]] = 1.
    for timestep, token in enumerate(target_doc.split()):
        decoder_input_data[line, timestep, target_features_dict[token]] = 1.
        if timestep > 0:
            #decoder_target_data is one timestep ahead of decoder_input_data
            decoder_target_data[line, timestep - 1, target_features_dict[token]] = 1.
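One way to convince yourself that teacher forcing is wired up correctly is to check that the target matrix is the input matrix shifted left by one timestep. A sanity-check sketch:

# decoder_target_data at timestep t holds the token that decoder_input_data
# holds at timestep t + 1 (the zero padding at the tail lines up as well).
assert np.array_equal(decoder_input_data[:, 1:, :], decoder_target_data[:, :-1, :])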

To get a clear understanding of how the dimensions of encoder_input_data work, see the figure below from the above-mentioned article. decoder_input_data and decoder_target_data have similar dimensions.

Training setup for the Encoder-decoder model
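You can also check the dimensions directly; each matrix has the shape (number of documents, maximum sequence length, number of tokens):

print(encoder_input_data.shape)   # (len(input_docs), max_encoder_seq_length, num_encoder_tokens)
print(decoder_input_data.shape)   # (len(input_docs), max_decoder_seq_length, num_decoder_tokens)
print(decoder_target_data.shape)  # (len(input_docs), max_decoder_seq_length, num_decoder_tokens)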

Our encoder model requires an input layer, which defines a matrix for holding the one-hot vectors, and an LSTM layer with some number of hidden states. The decoder model's structure is almost the same as the encoder's, but here we pass in the state data along with the decoder inputs.

# Importing the layers from tensorflow.keras (rather than mixing the
# standalone keras package with tensorflow) keeps the imports consistent
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

#Dimensionality
dimensionality = 256
#The batch size and number of epochs
batch_size = 256
epochs = 100

#Encoder
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_lstm = LSTM(dimensionality, return_state=True)
encoder_outputs, state_hidden, state_cell = encoder_lstm(encoder_inputs)
encoder_states = [state_hidden, state_cell]

#Decoder
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(dimensionality, return_sequences=True, return_state=True)
decoder_outputs, decoder_state_hidden, decoder_state_cell = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

You can learn more about how to code the encoder-decoder model here, as a full explanation of it is out of scope for this article.

Building and training the seq2seq model

Now we will create our seq2seq model and train it with encoder and decoder data as shown below.

#Model
training_model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
#Compiling
training_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'], sample_weight_mode='temporal')
#Training
training_model.fit([encoder_input_data, decoder_input_data], decoder_target_data, batch_size=batch_size, epochs=epochs, validation_split=0.2)
#Saving the trained model so we can rebuild the inference models from it later
training_model.save('training_model.h5')

Here, we use adam as the optimizer and categorical_crossentropy as our loss function. We call the .fit() method, giving it the encoder and decoder input data (X/input) and the decoder target data (Y/label).
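If you want to reproduce the accuracy and loss plots discussed below, one way is to keep the History object returned by .fit(). A sketch, assuming you stored that return value as history and have matplotlib installed:

import matplotlib.pyplot as plt

# history = training_model.fit(...)  (the call above, with its return value kept)
# Note: in older standalone Keras versions the metric key is 'acc', not 'accuracy'
for metric in ('accuracy', 'loss'):
    plt.figure()
    plt.plot(history.history[metric], label='train')
    plt.plot(history.history['val_' + metric], label='validation')
    plt.xlabel('epoch')
    plt.ylabel(metric)
    plt.legend()
plt.show()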

Two different approaches — Performance comparison

After the training process, we get a training accuracy of 53.35% and a validation accuracy of 52.77%, while the training and validation losses are 0.0720 and 0.1137 respectively. Look at the plots of accuracy and loss during the training process.

The training and validation accuracies we get for the word-level prediction are 71.07% and 72.99% respectively while the training and validation losses are 0.0185 and 0.0624 respectively. Look at the plots of accuracy and loss during the training process.

The accuracy curves are very smooth in the case of character-level prediction, while in the case of word-level prediction the curves contain many spikes. We get very high accuracy at the beginning, but the loss is also high, and as the loss goes down the accuracy tends to fluctuate and drop. This tells us not to rely on the latter approach even though it gives higher accuracy than the former, as the spikes introduce uncertainty into the performance.

Testing setup

Now, to handle input that the model has not seen, we will need a model that decodes step by step instead of using teacher forcing, because the model we created only works when the target sequence is known. In the generative chatbot application, we do not know what the generated response will be for the input the user passes in. To do this, we will have to build the seq2seq model in individual pieces. Let's first build an encoder model with the encoder inputs and the encoder output states. We will do this with the help of the previously trained model.

from tensorflow.keras.models import load_model

training_model = load_model('training_model.h5')

encoder_inputs = training_model.input[0]
encoder_outputs, state_h_enc, state_c_enc = training_model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = Model(encoder_inputs, encoder_states)

Next, we will need to create placeholders for decoder input states as we do not know what we need to decode or what hidden state we will get.

latent_dim = 256
decoder_state_input_hidden = Input(shape=(latent_dim,))
decoder_state_input_cell = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_hidden, decoder_state_input_cell]

Now, we will create new decoder states and outputs with the help of the decoder LSTM and Dense layer that we trained earlier.

decoder_outputs, state_hidden, state_cell = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_hidden, state_cell]
decoder_outputs = decoder_dense(decoder_outputs)

Finally, we have the decoder input layer, the final states from the encoder, the decoder outputs from the decoder's Dense layer, and the decoder output states, which carry the network's memory from one word to the next. We can now bring this all together and set up the decoder model as shown below.

decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

Testing our model

At last, we will create a function that accepts our text input and generates a response using the encoder and decoder we built. In the function below, we pass in the NumPy matrix that represents our text sentence and we get the generated response back. I have added comments for almost every line of code for you to understand it quickly. What happens in the function below is this:
1. We retrieve the output states from the encoder.
2. We pass the output states to the decoder (as its initial hidden state) to decode the sentence word by word.
3. We update the hidden state of the decoder after decoding each word, so that previously decoded words help decode new ones.

We stop once we encounter the '<END>' token that we added to the target sequences in our preprocessing step, or once we hit the maximum length of the sequence.

def decode_response(test_input):
    #Getting the output states to pass into the decoder
    states_value = encoder_model.predict(test_input)
    #Generating an empty target sequence of length 1
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    #Setting the first token of the target sequence to the start token
    target_seq[0, 0, target_features_dict['<START>']] = 1.

    #A variable to store our response word by word
    decoded_sentence = ''

    stop_condition = False
    while not stop_condition:
        #Predicting output tokens with probabilities and states
        output_tokens, hidden_state, cell_state = decoder_model.predict([target_seq] + states_value)
        #Choosing the token with the highest probability
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_token = reverse_target_features_dict[sampled_token_index]
        decoded_sentence += " " + sampled_token
        #Stop if we hit the max length (counting tokens, not characters) or find the stop token
        if (sampled_token == '<END>' or len(decoded_sentence.split()) > max_decoder_seq_length):
            stop_condition = True
        #Update the target sequence
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.
        #Update states
        states_value = [hidden_state, cell_state]
    return decoded_sentence

Putting it all together — Machine Translation

Let’s create a class that contains methods required for running our translator.

class Translator:
    exit_commands = ("quit", "pause", "exit", "goodbye", "bye", "later", "stop")

    #Method to start the translator
    def start(self):
        user_response = input("Give in an English sentence. :) \n")
        self.translate(user_response)

    #Method to handle the conversation
    def translate(self, reply):
        while not self.make_exit(reply):
            reply = input(self.generate_response(reply) + "\n")

    #Method to convert user input into a matrix
    def string_to_matrix(self, user_input):
        tokens = re.findall(r"[\w']+|[^\s\w]", user_input)
        user_input_matrix = np.zeros(
            (1, max_encoder_seq_length, num_encoder_tokens),
            dtype='float32')
        for timestep, token in enumerate(tokens):
            if token in input_features_dict:
                user_input_matrix[0, timestep, input_features_dict[token]] = 1.
        return user_input_matrix

    #Method that will create a response using the seq2seq model we built
    def generate_response(self, user_input):
        input_matrix = self.string_to_matrix(user_input)
        chatbot_response = decode_response(input_matrix)
        #Remove <START> and <END> tokens from chatbot_response
        chatbot_response = chatbot_response.replace("<START>", '')
        chatbot_response = chatbot_response.replace("<END>", '')
        return chatbot_response

    #Method to check for exit commands
    def make_exit(self, reply):
        for exit_command in self.exit_commands:
            if exit_command in reply:
                print("Ok, have a great day!")
                return True
        return False

translator = Translator()
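To run the translator, call the start method on the instance; the conversation loop continues until you type one of the exit commands:

translator.start()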

All the methods in the above code are self-explanatory. Below is the final output of our translator!

Two different approaches — Final output comparison

Output for word-level prediction
Output for character-level prediction

The above snapshots show translation done by our translator for two different approaches.


You can find all of the code above, along with the dataset, on GitHub. You can also connect with me on LinkedIn. If any query arises, you can leave a response here or in my LinkedIn inbox.

Conclusion

We managed to get an accuracy of around 53% in the case of character-level prediction and 73% in the case of word-level prediction. Natural language processing is a domain that requires tons of data, especially for the machine translation task. It involves developing and training neural networks that approximate how the human brain approaches language processing. This deep learning strategy allows computers to handle human language much more efficiently. Companies like Google and Microsoft achieve human-level accuracy in the machine translation task; the networks these companies use are far more complex than the one we created here.
