Let’s give some ‘Attention’ to Summarising Texts

Text Summarization using an LSTM Encoder-Decoder model with Attention.

Sayak Misra
Towards Data Science


Photo by Romain Vignes on Unsplash

Text Summarization is one of the most challenging and interesting problems in the field of Natural Language Processing (NLP). It is the process of generating a concise and meaningful summary of text from sources such as books, news articles, blog posts, research papers, emails, and tweets. With the huge amount of text now available, summarization has become an even more important task.

So what are the different approaches?

Extractive Summarization

These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stacking them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method. Let’s understand this with an example.

Text: Messi and Ronaldo have better records than their counterparts. Performed exceptionally across all competitions. They are considered as the best in our generation.

Extractive summary: Messi and Ronaldo have better records than their counterparts. Best in our generation.

As you can see above, certain phrases have been extracted from the original text and joined to create the summary — although sometimes the summary can be grammatically strange.

Abstractive Summarization

These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text. Let’s understand this with an example.

Text: Messi and Ronaldo have better records than their counterparts. Performed exceptionally across all competitions. They are considered as the best in our generation.

Abstractive summary: Messi and Ronaldo have better records than their counterparts, so they are considered as the best in our generation.

The abstractive text summarization algorithms create new phrases and sentences that relay the most useful information from the original text — just like humans do.

In this article, we will focus on the abstractive summarization technique, and we will solve it using the Encoder-Decoder architecture.

What’s the Encoder-Decoder Architecture?

The overall structure of a commonly used sequence-to-sequence (encoder-decoder) model is shown below:

A basic Encoder-Decoder architecture

The model consists of 3 parts: Encoder, Intermediate vector and Decoder.

Encoder

  • The encoder basically consists of a series of LSTM/GRU cells (please go through the LSTM/GRU documentation for a better understanding of the architecture).
  • An encoder takes the input sequence and encapsulates the information as the internal state vectors.
  • Outputs of the encoder and the internal states are used by the Decoder.
  • In our text summarization problem, the input sequence is a collection of all words from the text, which needs to be summarized. Each word is represented as x_i where i is the order of that word.

Intermediate (Encoder) Vector

  • This is the final hidden state produced from the encoder part of the model. Like every encoder hidden state, it is computed from the previous hidden state and the current input: h_i = f(W_hh * h_(i-1) + W_hx * x_i).
  • This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions.
  • It acts as the initial hidden state of the decoder part of the model.

Decoder

  • A stack of several recurrent units where each predicts an output y_t at a time step t.
  • Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
  • In the summarization problem, the output sequence is a collection of all words from the summarized text. Each word is represented as y_i where i is the order of that word.
  • Any hidden state h_i is computed using the formula: h_i = f(W_hh * h_(i-1))

As you can see, we are just using the previous hidden state to compute the next one.

  • The output y_t at time step t is computed using the formula: y_t = softmax(W_S * h_t)

We calculate the outputs using the hidden state at the current time step together with the output weight matrix W_S. Softmax is used to create a probability vector which helps us determine the final output (e.g., a word in a question-answering problem).
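
If you prefer code to formulas, here is a tiny NumPy sketch of a single recurrent step and the output projection. The weight matrices W_hh, W_hx and W_S, the tanh activation and the toy sizes are illustrative assumptions, not the exact parameterization Keras uses internally.

```python
import numpy as np

hidden_size, input_size, vocab_size = 4, 3, 6

# Illustrative weight matrices (randomly initialized here).
W_hh = np.random.randn(hidden_size, hidden_size)   # recurrent weights
W_hx = np.random.randn(hidden_size, input_size)    # input weights
W_S  = np.random.randn(vocab_size, hidden_size)    # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder-style step: h_i = f(W_hh * h_(i-1) + W_hx * x_i)
def encoder_step(h_prev, x_i):
    return np.tanh(W_hh @ h_prev + W_hx @ x_i)

# Output: y_t = softmax(W_S * h_t)
def output_step(h_t):
    return softmax(W_S @ h_t)

h = np.zeros(hidden_size)
for x_i in np.random.randn(5, input_size):   # a toy 5-step input sequence
    h = encoder_step(h, x_i)

print(output_step(h))   # probability vector over a toy 6-word vocabulary
```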

Pay some Attention:

First, we need to understand what Attention is.

How much attention do we need to pay to every word in the input sequence for generating a word at timestep t? That’s the key intuition behind this attention mechanism concept.

Let’s understand this with a simple example:

Question: In the last decade, who is the best Footballer?

Answer: Lionel Messi is the best player.

In the above example, the fifth word in the question, “who”, is related to “Lionel Messi” in the answer, and the ninth word, “Footballer”, is related to the sixth word of the answer, “player”.

So, instead of looking at all the words in the source sequence, we can increase the importance of specific parts of the source sequence that result in the target sequence. This is the basic idea behind the attention mechanism.

There are two different classes of attention mechanisms, depending on how the attended context vector is derived:

Global Attention

Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector. In this summarization task, we will be using Global Attention.

Local Attention

Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector.

Let us now understand how this Attention really works:

  • The encoder outputs the hidden state (hj) for every time step j in the source sequence
  • Similarly, the decoder outputs the hidden state (si) for every time step i in the target sequence
  • We compute an alignment score (eij), which measures how strongly the source word at timestep j is aligned with the target word at timestep i. It is computed from the source hidden state hj and the target hidden state si using a score function. This is given by:

eij= score(si, hj)

where eij denotes the alignment score for the target timestep i and source time step j.

  • We normalize the alignment scores using the softmax function to retrieve the attention weights (aij): aij = exp(eij) / Σk exp(eik)
  • We compute the linear sum of products of the attention weights aij and the hidden states of the encoder hj to produce the attended context vector (Ci): Ci = Σj aij * hj
  • The attended context vector and the target hidden state of the decoder at timestep i are concatenated to produce an attended hidden vector Si, where, Si= concatenate([si; Ci])
  • The attended hidden vector Si is then fed into the dense layer to produce yi, yi= dense(Si).

Let’s understand the above attention mechanism steps with the help of an example. Consider the source-text sequence to be [x1, x2, x3, x4] and target-summary sequence to be [y1, y2].

  • The encoder reads the entire source sequence and outputs the hidden state for every timestep, say h1, h2, h3, h4
  • The decoder reads the entire target sequence offset by one timestep and outputs the hidden state for every timestep, say s1, s2, s3

Target timestep i=1

  • Alignment scores e1j are computed from the source hidden states hj and the target hidden state s1 using the score function:
e11= score(s1, h1)
e12= score(s1, h2)
e13= score(s1, h3)
e14= score(s1, h4)
  • Normalizing the alignment scores e1j using softmax produces attention weights a1j:
a11= exp(e11)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a12= exp(e12)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a13= exp(e13)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
a14= exp(e14)/(exp(e11)+exp(e12)+exp(e13)+exp(e14))
  • Attended context vector C1 is derived by the linear sum of products of encoder hidden states hj and alignment scores a1j:
C1= h1 * a11 + h2 * a12 + h3 * a13 + h4 * a14
  • Attended context vector C1 and target hidden state s1 are concatenated to produce an attended hidden vector S1
S1= concatenate([s1; C1])
  • Attentional hidden vector S1 is then fed into the dense layer to produce y1
y1= dense(S1)

In the same way, we can compute y2.
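
The arithmetic above is easy to sanity-check in code. Below is a minimal NumPy sketch of target timestep i=1, using random vectors for h1..h4 and s1 and a simple dot-product score function (the actual score function, e.g. Bahdanau- or Luong-style, is a design choice).

```python
import numpy as np

hidden_size = 4
np.random.seed(0)

# Toy encoder hidden states h1..h4 and decoder hidden state s1.
H = np.random.randn(4, hidden_size)       # rows are h1, h2, h3, h4
s1 = np.random.randn(hidden_size)

# Alignment scores e1j = score(s1, hj); a plain dot product is used here for simplicity.
e1 = H @ s1                               # e11, e12, e13, e14

# Attention weights a1j = softmax(e1j)
a1 = np.exp(e1) / np.exp(e1).sum()

# Attended context vector C1 = a11*h1 + a12*h2 + a13*h3 + a14*h4
C1 = a1 @ H

# Attended hidden vector S1 = concatenate([s1; C1])
S1 = np.concatenate([s1, C1])
# y1 = dense(S1): a final dense + softmax layer would map S1 to a word distribution.

print(a1.round(3), a1.sum())              # weights sum to 1
print(S1.shape)                           # (2 * hidden_size,)
```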

Keras does not provide an attention layer out of the box, so we can either write our own or use one written by someone else. Here we are using this implementation of the attention layer.

Time for Implementation:

The whole code for this summarization task can be found here.

Show the Data:

We will be using the Amazon Fine Food Reviews dataset for this article. Let’s have a snapshot of the data:

Snapshot of the Dataset.
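
If you want to follow along, the Kaggle dataset linked in the references ships as a single CSV. The file and column names below (Reviews.csv, Text, Summary) come from that dataset; the path and the row limit are assumptions for illustration.

```python
import pandas as pd

# Path is an assumption; adjust it to wherever you downloaded the Kaggle file.
data = pd.read_csv("Reviews.csv", nrows=100000)

# We only need the review body and its human-written summary.
data = data[["Text", "Summary"]].dropna().drop_duplicates(subset=["Text"])
print(data.head())
```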

Clean the Data:

We first need to clean our data, so the steps we need to follow are listed below (a short code sketch follows the list):

  • Convert everything to lowercase
  • Remove HTML tags
  • Contraction mapping
  • Remove (‘s)
  • Remove any text inside the parenthesis ( )
  • Eliminate punctuations and special characters
  • Remove stopwords
  • Remove short words
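
A minimal cleaning function covering these steps might look like the sketch below. The contraction map is truncated to a few entries for illustration, and the stopword list assumes NLTK’s English stopwords are available.

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords   # run nltk.download('stopwords') once

# A tiny illustrative contraction map; a real one would have ~100 entries.
contraction_mapping = {"can't": "cannot", "won't": "will not", "it's": "it is"}
stop_words = set(stopwords.words("english"))

def clean_text(text, remove_stopwords=True):
    text = text.lower()                                    # lowercase
    text = BeautifulSoup(text, "html.parser").get_text()   # remove HTML tags
    text = " ".join(contraction_mapping.get(w, w) for w in text.split())
    text = re.sub(r"'s\b", "", text)                       # remove ('s)
    text = re.sub(r"\([^)]*\)", "", text)                  # drop text inside ( )
    text = re.sub(r"[^a-zA-Z ]", "", text)                 # punctuation / special chars
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w not in stop_words]
    words = [w for w in words if len(w) > 2]               # drop very short words
    return " ".join(words)

data["cleaned_text"] = data["Text"].apply(clean_text)
# Stopwords are usually kept in the summaries so they stay grammatical.
data["cleaned_summary"] = data["Summary"].apply(lambda t: clean_text(t, remove_stopwords=False))
```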

Distribution of the Data:

Then, we will analyze the lengths of the reviews and the summaries to get an overall idea of the distribution of text lengths. This will help us fix the maximum length of the sequences.

X-axis: word-count, Y-axis: number of sentences.
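
A quick way to get that picture is to plot histograms of the word counts, as sketched below (the column names follow the cleaning sketch above; the bin count and the final length choices are illustrative).

```python
import matplotlib.pyplot as plt

text_word_count = data["cleaned_text"].str.split().str.len()
summary_word_count = data["cleaned_summary"].str.split().str.len()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.hist(text_word_count, bins=30)
ax1.set(title="Review length", xlabel="word count", ylabel="number of sentences")
ax2.hist(summary_word_count, bins=30)
ax2.set(title="Summary length", xlabel="word count", ylabel="number of sentences")
plt.show()

# Based on such a plot we might fix, say, max_text_len = 80 and max_summary_len = 10.
```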

Tokenize the Data:

A tokenizer builds the vocabulary and converts a word sequence to an integer sequence. We will be using the Keras Tokenizer for tokenizing the sentences.
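
A minimal sketch of this step is shown below. The length limits come from the distribution analysis above, and the sostok/eostok markers wrapped around each summary are the usual start/end tokens the decoder needs; the exact names are an assumption.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_text_len, max_summary_len = 80, 10   # chosen from the length distribution

# Wrap each summary with start/end tokens so the decoder knows where to begin and stop.
summaries = "sostok " + data["cleaned_summary"] + " eostok"

x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(data["cleaned_text"])
x_seq = pad_sequences(x_tokenizer.texts_to_sequences(data["cleaned_text"]),
                      maxlen=max_text_len, padding="post")

y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(summaries)
y_seq = pad_sequences(y_tokenizer.texts_to_sequences(summaries),
                      maxlen=max_summary_len, padding="post")

x_voc = len(x_tokenizer.word_index) + 1   # +1 for the padding index 0
y_voc = len(y_tokenizer.word_index) + 1
```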

Model building:

We are finally at the model building part. But before we do that, we need to familiarize ourselves with a few terms which are required prior to building the model.

  • Return Sequences = True: When return_sequences is set to True, the LSTM returns its output (the hidden state) for every timestep instead of only the last one
  • Return State = True: When return_state is set to True, the LSTM additionally returns the hidden state and cell state of the last timestep (see the snippet after this list)
  • Initial State: This is used to initialize the internal states of the LSTM for the first timestep, for example to pass the encoder’s final states to the decoder
  • Stacked LSTM: A stacked LSTM has multiple LSTM layers stacked on top of each other, which leads to a richer representation of the sequence. I encourage you to experiment with stacking several LSTM layers (it’s a great way to learn this).
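
As a quick illustration of the first two flags, the toy snippet below shows what a Keras LSTM returns when both are set (the batch size, sequence length and feature size are arbitrary assumptions).

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

inp = Input(shape=(None, 8))                       # (timesteps, features)
seq_out, state_h, state_c = LSTM(16, return_sequences=True, return_state=True)(inp)
probe = Model(inp, [seq_out, state_h, state_c])

o, h, c = probe.predict(np.random.randn(2, 5, 8))  # batch of 2, 5 timesteps each
print(o.shape)   # (2, 5, 16) -> hidden state for every timestep (return_sequences)
print(h.shape)   # (2, 16)    -> hidden state of the last timestep (return_state)
print(c.shape)   # (2, 16)    -> cell state of the last timestep   (return_state)
```
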
Model Summary.
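
Below is a minimal sketch of the kind of model used here: a 3-layer stacked LSTM encoder, an LSTM decoder initialised with the encoder’s final states, the third-party AttentionLayer linked earlier (assumed to take [encoder_outputs, decoder_outputs] and return an attended context sequence), and a time-distributed softmax. latent_dim and embedding_dim are illustrative choices, and x_voc, y_voc and max_text_len follow the tokenizer sketch above; the exact notebook code may differ.

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model
from attention import AttentionLayer   # assumed import path for the third-party layer linked above

latent_dim, embedding_dim = 300, 100    # illustrative sizes

# ----- Encoder: 3 stacked LSTMs over the review sequence -----
encoder_inputs = Input(shape=(max_text_len,))
enc_emb = Embedding(x_voc, embedding_dim, trainable=True)(encoder_inputs)
encoder_output1, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(enc_emb)
encoder_output2, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(encoder_output1)
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True, return_state=True)(encoder_output2)

# ----- Decoder: one LSTM initialised with the encoder's final states -----
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc, embedding_dim, trainable=True)
dec_emb = dec_emb_layer(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=[state_h, state_c])

# ----- Global attention over all encoder timesteps -----
attn_layer = AttentionLayer(name="attention_layer")
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])
decoder_concat = Concatenate(axis=-1)([decoder_outputs, attn_out])

# ----- Softmax over the summary vocabulary at every decoder timestep -----
decoder_dense = TimeDistributed(Dense(y_voc, activation="softmax"))
decoder_probs = decoder_dense(decoder_concat)

model = Model([encoder_inputs, decoder_inputs], decoder_probs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
model.summary()
```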

Training and Early Stopping:

This is how the loss decreases during training. We can see a slight increase in the validation loss after epoch 10, so we will stop training the model at that point.

Training and Testing Loss
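
The early stopping itself can be delegated to Keras rather than watched by hand. A minimal sketch is shown below; the validation split, batch size, epoch count and patience are illustrative, and x_seq/y_seq come from the tokenizer sketch above.

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

# Hold out 10% of the padded sequences for validation.
x_tr, x_val, y_tr, y_val = train_test_split(x_seq, y_seq, test_size=0.1, random_state=0)

# Stop as soon as the validation loss stops improving, and keep the best weights.
es = EarlyStopping(monitor="val_loss", mode="min", patience=2,
                   restore_best_weights=True, verbose=1)

# Decoder input is the summary shifted right; the target is the summary shifted left.
# (Older Keras versions may require an extra trailing dimension on the targets.)
history = model.fit(
    [x_tr, y_tr[:, :-1]],
    y_tr[:, 1:],
    validation_data=([x_val, y_val[:, :-1]], y_val[:, 1:]),
    epochs=50, batch_size=128, callbacks=[es],
)
```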

Inference:

Now, we will set up the inference for the encoder and decoder. Here the Encoder and the Decoder work together to produce a summary: the Decoder is stacked on top of the Encoder, and each word predicted by the Decoder is fed back into it to produce the next word, as sketched below.
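
A sketch of that loop follows: we reuse the trained layers to build a standalone encoder model and a one-step decoder model, then feed the decoder its own previous prediction until it emits the end token or hits the maximum summary length. The names (decoder_lstm, dec_emb_layer, attn_layer, decoder_dense, sostok/eostok, and so on) follow the model and tokenizer sketches above, and greedy decoding is used for simplicity.

```python
import numpy as np
from tensorflow.keras.layers import Input, Concatenate
from tensorflow.keras.models import Model

# Encoder inference model: text sequence -> encoder outputs + final states.
encoder_model = Model(encoder_inputs, [encoder_outputs, state_h, state_c])

# One-step decoder inference model.
dec_state_h = Input(shape=(latent_dim,))
dec_state_c = Input(shape=(latent_dim,))
dec_enc_out = Input(shape=(max_text_len, latent_dim))

dec_emb2 = dec_emb_layer(decoder_inputs)
dec_out2, h2, c2 = decoder_lstm(dec_emb2, initial_state=[dec_state_h, dec_state_c])
attn_out2, _ = attn_layer([dec_enc_out, dec_out2])
dec_concat2 = Concatenate(axis=-1)([dec_out2, attn_out2])
dec_probs2 = decoder_dense(dec_concat2)

decoder_model = Model([decoder_inputs, dec_enc_out, dec_state_h, dec_state_c],
                      [dec_probs2, h2, c2])

def decode_sequence(input_seq):
    """Greedy decoding: always pick the most probable next word."""
    enc_out, h, c = encoder_model.predict(input_seq)
    target = np.array([[y_tokenizer.word_index["sostok"]]])
    summary = []
    while len(summary) < max_summary_len:
        probs, h, c = decoder_model.predict([target, enc_out, h, c])
        idx = int(np.argmax(probs[0, -1, :]))
        word = y_tokenizer.index_word.get(idx, "")
        if word == "eostok" or word == "":
            break
        summary.append(word)
        target = np.array([[idx]])        # feed the prediction back in
    return " ".join(summary)
```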

Testing:

Here, finally, we can test our model with our custom inputs.

Review: right quantity japanese green tea able either drink one sitting save later tastes great sweet  
Original summary: great japanese product
Predicted summary: great tea
Review: love body wash smells nice works great feels great skin add fact subscribe save deal great value sold
Original summary: great product and value
Predicted summary: great product
Review: look like picture include items pictured buy gift recipient disappointed
Original summary: very disappointed
Predicted summary: not what expected

Here is the colab-notebook for this article.

Wrapping it up

In this article, we have seen how we can summarize texts using a Sequence-to-Sequence model. We can further improve this model by training on a larger dataset, using a Bidirectional LSTM, using a Beam Search decoding strategy, etc.

In our next story, we will see how we can implement it with Transfer Learning. We will use pre-trained GloVe word embeddings, and see how our model behaves and if it is able to understand the semantics better with the pre-trained embeddings. See you there.

References

  1. https://www.analyticsvidhya.com/blog/2019/06/comprehensive-guide-text-summarization-using-deep-learning-python/
  2. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
  3. https://www.kaggle.com/snap/amazon-fine-food-reviews


