Neural Machine Translation (NMT) with Attention Mechanism

A guide to translating languages with Deep Learning!

Harshil Patel
Towards Data Science


Overview

In this era of globalization, language translation plays a vital role in communication among people of different nations. Moreover, in a multilingual country like India, language differences can be observed between its states themselves! Given the importance of language translation, it becomes valuable to develop a system that can translate an unknown language into a known one.

In this story, we will build a Deep Learning model that translates English sentences into Marathi sentences. I have chosen Marathi because it is a language I can read; you can use any other language you are comfortable with, as the model remains nearly the same. Along the way, I will briefly explain a major concept in language processing called the Attention Mechanism!

Prerequisites

  1. Working of Long Short-Term Memory (LSTM) cells
  2. Familiarity with TensorFlow, Keras and some other essential Python libraries.

What is an Attention Mechanism?

The major drawback of the encoder-decoder model in a sequence-to-sequence recurrent neural network is that it works well only on short sequences. It is difficult for the encoder to compress a long sequence into a single fixed-length vector. Moreover, the decoder receives only one piece of information: the last encoder hidden state. This makes it hard for the decoder to summarize a long input sequence at once. So, how do we overcome this problem?

What if we gave the decoder a vector representation from every encoder step!

This is where the concept of the ‘Attention Mechanism’ comes in. The key intuition is that it predicts the next word by concentrating on a few relevant parts of the input sequence rather than weighing the entire sequence equally.

In layman’s terms, it can be described as an interface between the encoder and the decoder that extracts useful information from the encoder outputs and passes it on to the decoder.

Animation of Attention Layer

Refer here for a more detailed explanation of the attention mechanism.
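To make the idea concrete, here is a tiny toy sketch of how a context vector is built for a single decoder step. The shapes and the dot-product scoring are assumptions chosen purely for brevity; the attention layer we use later computes its scores differently.

import numpy as np

# Toy setup (illustrative only): 6 encoder steps, hidden size 4
encoder_states = np.random.rand(6, 4)   # one hidden vector per source word
decoder_state = np.random.rand(4)       # current decoder hidden state

# 1. Score every encoder step against the decoder state (dot-product scoring for brevity)
scores = encoder_states @ decoder_state            # shape (6,)

# 2. Turn the scores into attention weights with a softmax
weights = np.exp(scores) / np.exp(scores).sum()    # non-negative, sums to 1

# 3. The context vector is the weighted sum of encoder states
context = weights @ encoder_states                 # shape (4,)

The decoder then combines this context vector with its own state to predict the next word.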

There are mainly two types of attention mechanism:

  • Global Attention
  • Local Attention

Global Attention

In global attention, all of the encoder's hidden state vectors are used to compute the context vector.

Local Attention

In local attention, only a few of the encoder's hidden state vectors are considered when generating the context vector.
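Continuing the same toy setup as before (again purely illustrative, with an arbitrary window choice), the only difference between the two is which encoder steps enter the softmax:

import numpy as np

encoder_states = np.random.rand(6, 4)        # toy encoder outputs, as in the sketch above
decoder_state = np.random.rand(4)
scores = encoder_states @ decoder_state      # one alignment score per encoder step

# Global attention: softmax over ALL encoder steps
g_weights = np.exp(scores) / np.exp(scores).sum()
global_context = g_weights @ encoder_states

# Local attention: softmax over a small window around an aligned position p (illustrative choice)
p, D = 3, 1
window = slice(p - D, p + D + 1)
l_weights = np.exp(scores[window]) / np.exp(scores[window]).sum()
local_context = l_weights @ encoder_states[window]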

We will be using global attention in this story. Let's now make use of the attention mechanism and develop a language translator that converts English sentences into Marathi sentences.

Implementation

Library Imports

Open Jupyter Notebook and import some required libraries:

import pandas as pd
from sklearn.model_selection import train_test_split
import string
from string import digits
import re
from sklearn.utils import shuffle
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import LSTM, Input, Dense, Embedding, Concatenate, TimeDistributed
from tensorflow.keras.models import Model, load_model, model_from_json
from tensorflow.keras.utils import plot_model
from tensorflow.keras.preprocessing.text import one_hot, Tokenizer
from tensorflow.keras.callbacks import EarlyStopping
import pickle as pkl
import numpy as np

Download Dataset

We will be working on a language dataset available here.

This site contains datasets for numerous languages and their translations to English. You can download any language dataset you prefer. However, remember to choose a dataset that is reasonably large, so that the model gives better results after training. Here, I will be downloading the Marathi-English dataset, which comprises 38,696 sentence pairs.

After downloading the dataset, load the data as shown below:

with open('mar.txt', 'r') as f:
    data = f.read()

Preprocessing Dataset

Data Transformation

As you can see, it is a raw text file, so it is necessary to clean and transform it to suit our needs. We will separate the English and Marathi sentences into lists, then store them in a dataframe so that they are easy to reuse later.

uncleaned_data_list = data.split('\n')
len(uncleaned_data_list)
uncleaned_data_list = uncleaned_data_list[:38695]
len(uncleaned_data_list)

english_word = []
marathi_word = []
for word in uncleaned_data_list:
    english_word.append(word.split('\t')[:-1][0])
    marathi_word.append(word.split('\t')[:-1][1])

language_data = pd.DataFrame(columns=['English', 'Marathi'])
language_data['English'] = english_word
language_data['Marathi'] = marathi_word
language_data.to_csv('language_data.csv', index=False)

language_data.head()

english_text = language_data['English'].values
marathi_text = language_data['Marathi'].values
len(english_text), len(marathi_text)

Data Cleaning

Now let’s clean the data and make it suitable for our model. In the cleaning process, we will convert the text to lower case and remove punctuation, digits, and other unnecessary characters.

# to lower case
english_text_ = [x.lower() for x in english_text]
marathi_text_ = [x.lower() for x in marathi_text]

# removing apostrophes
english_text_ = [re.sub("'", '', x) for x in english_text_]
marathi_text_ = [re.sub("'", '', x) for x in marathi_text_]

def remove_punc(text_list):
    table = str.maketrans('', '', string.punctuation)
    removed_punc_text = []
    for sent in text_list:
        sentence = [w.translate(table) for w in sent.split(' ')]
        removed_punc_text.append(' '.join(sentence))
    return removed_punc_text

english_text_ = remove_punc(english_text_)
marathi_text_ = remove_punc(marathi_text_)

# removing digits from the English sentences
remove_digits = str.maketrans('', '', digits)
removed_digits_text = []
for sent in english_text_:
    sentence = [w.translate(remove_digits) for w in sent.split(' ')]
    removed_digits_text.append(' '.join(sentence))
english_text_ = removed_digits_text

# removing Devanagari digits from the Marathi sentences
marathi_text_ = [re.sub("[२३०८१५७९४६]", "", x) for x in marathi_text_]
marathi_text_ = [re.sub("[\u200d]", "", x) for x in marathi_text_]

# removing leading and trailing whitespace
english_text_ = [x.strip() for x in english_text_]
marathi_text_ = [x.strip() for x in marathi_text_]

Next, we add ‘start’ and ‘end’ tags to each Marathi sentence. These will tell the decoder where to start decoding and when to stop.

# Adding the start and end tokens to the Marathi sentences
marathi_text_ = ["start " + x + " end" for x in marathi_text_]
# manipulated_marathi_text_
marathi_text_[0], english_text_[0]

(‘start जा end’, ‘go’)

Data preparation for model building

We will split our dataset with a test size of 0.1, i.e. 90% for training and 10% for testing, so that we can evaluate how well the trained model generalizes. X_train and y_train will be our training set, while X_test and y_test will be our testing/validation set.

X = english_text_
Y = marathi_text_
X_train, X_test, y_train, y_test=train_test_split(X,Y,test_size=0.1)

Let’s determine the maximum length of our sentences in both English and Marathi:

def Max_length(data):
    max_length_ = max([len(x.split(' ')) for x in data])
    return max_length_

# Training data
max_length_english = Max_length(X_train)
max_length_marathi = Max_length(y_train)

# Test data
max_length_english_test = Max_length(X_test)
max_length_marathi_test = Max_length(y_test)
max_length_marathi, max_length_english

(26, 32)

Tokenization:

Since a neural network processes numerical data, we need to convert our string inputs into sequences of integers. One way of doing this is to use the Tokenizer provided by the Keras preprocessing library.

Also, remember that all input sequences in a batch must have the same length in this sequence-to-sequence setup. So, we will pad the sequences with extra ‘0’s to bring them to the same length, using pad_sequences.

englishTokenizer = Tokenizer()
englishTokenizer.fit_on_texts(X_train)
Eword2index = englishTokenizer.word_index
vocab_size_source = len(Eword2index) + 1
X_train = englishTokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen=max_length_english, padding='post')
X_test = englishTokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen = max_length_english, padding='post')
marathiTokenizer = Tokenizer()
marathiTokenizer.fit_on_texts(y_train)
Mword2index = marathiTokenizer.word_index
vocab_size_target = len(Mword2index) + 1
y_train = marathiTokenizer.texts_to_sequences(y_train)
y_train = pad_sequences(y_train, maxlen=max_length_marathi, padding='post')
y_test = marathiTokenizer.texts_to_sequences(y_test)
y_test = pad_sequences(y_test, maxlen = max_length_marathi, padding='post')
vocab_size_source, vocab_size_target

(5413, 12789)

X_train[0], y_train[0]

(array([ 1, 157, 5, 134, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0], dtype=int32),
array([ 1, 6, 22, 61, 253, 29, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32))

To save preprocessing time when we reuse this data in the future, we will save the important attributes with the help of the pickle library.

with open('NMT_data.pkl', 'wb') as f:
    pkl.dump([X_train, y_train, X_test, y_test], f)
with open('NMT_Etokenizer.pkl', 'wb') as f:
    pkl.dump([vocab_size_source, Eword2index, englishTokenizer], f)
with open('NMT_Mtokenizer.pkl', 'wb') as f:
    pkl.dump([vocab_size_target, Mword2index, marathiTokenizer], f)

X_train = np.array(X_train)
y_train = np.array(y_train)
X_test = np.array(X_test)
y_test = np.array(y_test)

Model Building

Instead of a simple encoder-decoder architecture, we will be using the attention mechanism discussed earlier in this blog.

Keras does not officially support an attention layer, so we can either implement our own attention layer or use a third-party implementation. For now, we will use a third-party layer: you can download it from here and copy it into a separate file called attention.py. It is an implementation of ‘Bahdanau Attention’.
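If you are curious what such a layer does internally, here is a rough, hand-rolled sketch of Bahdanau-style additive attention. This is only an illustration under my own assumptions, not the downloaded attention.py, whose implementation details differ:

import tensorflow as tf

class SimpleBahdanauAttention(tf.keras.layers.Layer):
    """Illustrative additive (Bahdanau-style) attention; NOT the downloaded AttentionLayer."""
    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.W_enc = tf.keras.layers.Dense(units)  # projects encoder outputs
        self.W_dec = tf.keras.layers.Dense(units)  # projects decoder outputs
        self.V = tf.keras.layers.Dense(1)          # collapses each (encoder, decoder) pair to a score

    def call(self, inputs):
        encoder_out, decoder_out = inputs                         # (B, Te, H), (B, Td, H)
        enc_proj = tf.expand_dims(self.W_enc(encoder_out), 1)     # (B, 1, Te, units)
        dec_proj = tf.expand_dims(self.W_dec(decoder_out), 2)     # (B, Td, 1, units)
        scores = self.V(tf.nn.tanh(enc_proj + dec_proj))          # (B, Td, Te, 1)
        weights = tf.nn.softmax(tf.squeeze(scores, -1), axis=-1)  # attention over encoder steps
        context = tf.matmul(weights, encoder_out)                 # (B, Td, H)
        return context, weights

As used later in this post, the downloaded AttentionLayer follows the same pattern: it takes [encoder_outputs, decoder_outputs] and returns the attention output together with the attention weights.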

Let’s define the structure of our model:

from attention import AttentionLayer
from tensorflow.keras import backend as K

K.clear_session()
latent_dim = 500

# Encoder
encoder_inputs = Input(shape=(max_length_english,))
enc_emb = Embedding(vocab_size_source, latent_dim, trainable=True)(encoder_inputs)

# Encoder LSTM 1
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output1, state_h1, state_c1 = encoder_lstm1(enc_emb)

# Encoder LSTM 2
encoder_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_output2, state_h2, state_c2 = encoder_lstm2(encoder_output1)

# Encoder LSTM 3
encoder_lstm3 = LSTM(latent_dim, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm3(encoder_output2)

# Set up the decoder
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(vocab_size_target, latent_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)

# Decoder LSTM, initialized with the final encoder states
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, decoder_fwd_state, decoder_back_state = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# Attention layer
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

# Concatenate attention output and decoder LSTM output
decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_outputs, attn_out])

# Dense softmax layer over the target vocabulary
decoder_dense = TimeDistributed(Dense(vocab_size_target, activation='softmax'))
decoder_outputs = decoder_dense(decoder_concat_input)

# Define the model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
plot_model(model, to_file='train_model.png', show_shapes=True)

You can modify this model to suit your needs and get better results: change the number of layers, the number of units, or add some regularization (one such tweak is sketched below).
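As one small, hypothetical example of such a tweak (it is not used for the results reported later), dropout can be added to an encoder LSTM for regularization:

# Hypothetical variant: regularize the first encoder LSTM with dropout
encoder_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True,
                     dropout=0.2, recurrent_dropout=0.2)

For the time being, let's move forward and compile the model as defined above: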

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Model Training

We will first define an EarlyStopping callback so that training stops automatically once the validation loss stops improving.

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)

We are using the ‘teacher forcing’ technique for faster training of our model. In teacher forcing, instead of feeding the decoder its own previous prediction at each time step, we feed it the ground-truth previous word from the target sentence. This makes the learning process faster and more stable.
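Concretely, using the example pair from earlier ('go', 'start जा end'), the decoder is fed the target sequence minus its last token and is asked to predict the target sequence minus its first token. This is exactly how y_train is sliced in the fit call below:

# Toy illustration of the teacher-forcing offset used in model.fit below
sample_target = ['start', 'जा', 'end']     # tokenized target sentence
decoder_input = sample_target[:-1]         # ['start', 'जा']  -> corresponds to y_train[:, :-1]
decoder_target = sample_target[1:]         # ['जा', 'end']    -> corresponds to y_train[:, 1:]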

Let's train our Model:

history = model.fit(
    [X_train, y_train[:, :-1]],
    y_train.reshape(y_train.shape[0], y_train.shape[1], 1)[:, 1:],
    epochs=50,
    callbacks=[es],
    batch_size=512,
    validation_data=([X_test, y_test[:, :-1]],
                     y_test.reshape(y_test.shape[0], y_test.shape[1], 1)[:, 1:]))

The execution time was around 39 seconds per epoch on a 12GB NVIDIA Tesla K80 GPU. Early stopping was triggered at the 18th epoch.

We can visualize the training and validation loss as follows:

from matplotlib import pyplot 
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Loss Comparison

We are getting some pretty good results from our model with around 90% validation accuracy and a validation loss of 0.5303.

Model Saving and Loading

Let’s save our trained model along with its weights. Remember to save the model in this way, since we will need to load both the architecture and the weights for the inference model.

model_json = model.to_json()
with open("NMT_model.json", "w") as json_file:
    json_file.write(model_json)

# serialize weights to HDF5
model.save_weights("NMT_model_weight.h5")
print("Saved model to disk")

Load model:

# loading the model architecture and assigning the weights
json_file = open('NMT_model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
model_loaded = model_from_json(loaded_model_json, custom_objects={'AttentionLayer': AttentionLayer})

# load weights into the new model
model_loaded.load_weights("NMT_model_weight.h5")

Inference Model

In machine learning, an inference model uses the weights of a pre-trained model to predict output sequences. In other words, it applies what was learned during the training phase to generate new sequences.

Let’s code our inference model:

latent_dim = 500

# Encoder inference model
encoder_inputs = model_loaded.input[0]  # loading encoder_inputs
encoder_outputs, state_h, state_c = model_loaded.layers[6].output  # loading encoder_outputs
encoder_model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

# Decoder inference model
# These tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_hidden_state_input = Input(shape=(32, latent_dim))  # 32 = max_length_english

# Get the embeddings of the decoder sequence
decoder_inputs = model_loaded.layers[3].output
dec_emb_layer = model_loaded.layers[5]
dec_emb2 = dec_emb_layer(decoder_inputs)

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_lstm = model_loaded.layers[7]
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=[decoder_state_input_h, decoder_state_input_c])

# Attention inference
attn_layer = model_loaded.layers[8]
attn_out_inf, attn_states_inf = attn_layer([decoder_hidden_state_input, decoder_outputs2])
concate = model_loaded.layers[9]
decoder_inf_concat = concate([decoder_outputs2, attn_out_inf])

# A dense softmax layer to generate a probability distribution over the target vocabulary
decoder_dense = model_loaded.layers[10]
decoder_outputs2 = decoder_dense(decoder_inf_concat)

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + [decoder_hidden_state_input, decoder_state_input_h, decoder_state_input_c],
    [decoder_outputs2] + [state_h2, state_c2])

Predictions

Now that we have trained the sequence-to-sequence model and created the inference model from it, let's predict some Marathi sentences from English sentences.

def decode_sequence(input_seq):
    # Encode the input as state vectors.
    e_out, e_h, e_c = encoder_model.predict(input_seq)

    # Generate an empty target sequence of length 1.
    target_seq = np.zeros((1, 1))
    # Choose the 'start' word as the first word of the target sequence.
    target_seq[0, 0] = Mword2index['start']

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + [e_out, e_h, e_c])

        # Sample the most likely token.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        if sampled_token_index == 0:
            break
        else:
            sampled_token = Mindex2word[sampled_token_index]
            if sampled_token != 'end':
                decoded_sentence += ' ' + sampled_token

            # Exit condition: either hit max length (26 = max_length_marathi) or find the stop word.
            if sampled_token == 'end' or len(decoded_sentence.split()) >= (26 - 1):
                stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update internal states.
        e_h, e_c = h, c

    return decoded_sentence

Forming a reverse vocabulary:

Eindex2word = englishTokenizer.index_word
Mindex2word = marathiTokenizer.index_word

We also define some helper functions to convert the padded integer sequences back into readable text:

def seq2summary(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0 and i != Mword2index['start'] and i != Mword2index['end']:
            newString = newString + Mindex2word[i] + ' '
    return newString

def seq2text(input_seq):
    newString = ''
    for i in input_seq:
        if i != 0:
            newString = newString + Eindex2word[i] + ' '
    return newString

Call the necessary functions and let’s test our translation model:

for i in range(10):
    print("Review:", seq2text(X_test[i]))
    print("Original summary:", seq2summary(y_test[i]))
    print("Predicted summary:", decode_sequence(X_test[i].reshape(1, 32)))
    print("\n")

Review: no one will tell you
Original summary: तुला कोणीही सांगणार नाही
Predicted summary: कोणीही तुला सांगणार नाही

Review: look ahead
Original summary: समोर बघा
Predicted summary: तिथे बघ

Review: im going to return this to tom
Original summary: मी हे टॉमला परत करायला जातोय
Predicted summary: मी ते स्वतःहून करणार आहे

Review: an eagle is flying in the sky
Original summary: आकाशात एक गरुड आहे
Predicted summary: न्यूयॉर्क अतिरेकी तो दुसर्याचा

Review: he speaks arabic
Original summary: तो अरबी बोलतो
Predicted summary: तो अरबी बोलतो

Review: clean up this mess
Original summary: हा पसारा साफ करून टाका
Predicted summary: हा पसारा साफ कर

Review: dont speak french in the class
Original summary: वर्गात फ्रेंचमध्ये बोलू नका
Predicted summary: वर्गात जास्त कठीण राहू नकोस

Review: i turned the lights out
Original summary: मी दिवे
Predicted summary: मी दोन हात वर केला

Review: how many rackets do you have
Original summary: तुझ्याकडे किती रॅकेट आहेत
Predicted summary: तुमच्याकडे किती बहिणी आहेत

Review: i gave tom marys phone number
Original summary: मी टॉमला मेरीचा फोन नंबर दिला
Predicted summary: मी टॉमला मेरीचा फोन क्रमांक दिला

Hurrah!!

Our model makes some good translations of English sentences to Marathi sentences.

Project Demo

I have deployed my model with Django and hosted it on Heroku. You can take a look at it here.

Ending Notes

In this story, we learned how the attention mechanism works and implemented a language translation task. This task has many use cases in daily life. For example, we could use this technique to build a multi-language translator that translates various languages into a single language. Also, if we integrate it with an Optical Character Recognition system, we could translate text directly from images.

If you have other use cases or techniques for working with translation data, or if you find an improved model for NMT, do share them in the responses below!

The entire code for this article is available here. If you have any feedback, feel free to reach out to me on LinkedIn.
