Text Summarization from scratch using Encoder-Decoder network with Attention in Keras

Summarizing text from news articles to generate meaningful headlines

Published in

Towards Data Science

15 min readJun 14, 2020

During our school days, most of us would have encountered the reading comprehension section of our English paper. We would be given a paragraph or Essay based on which we need to answer several questions.

How do we as humans approach this task at hand? We go through the entire text, make sense of the context in which the question is asked and then we write answers. Is there a way we can use AI and deep learning techniques to mimic this behavior of us?

Automatic text summarization is a common problem in machine learning and natural language processing (NLP). There are two approaches to this problem.

Extractive Summarization- Extractive text summarization done by picking up the most important sentences from the original text in the way that forms the final summary. We do some kind of extractive text summarization to solve our simple reading comprehension exercises. TextRank is a very popular extractive and unsupervised text summarization technique.

2. Abstractive Summarization-Abstractive text summarization, on the other hand, is a technique in which the summary is generated by generating novel sentences by either rephrasing or using the new words, instead of simply extracting the important sentences. For example, some questions in the reading comprehension might not be straightforward in such cases we do rephrasing or use new words to answer such questions.

We humans can easily do both kinds of text summarization. In this blog let us see how to implement abstractive text summarization using deep learning techniques.

Problem Statement

Given a news article text, we are going to summarize it and generate appropriate headlines.

Whenever any media account shares a news story on Twitter or in any social networking site, they provide a crisp headlines /clickbait to make users click the link and read the article.

Often media houses provide sensational headlines that serve as a clickbait. This is a technique often employed to increase clicks to their site.

Our problem statement is to generate headlines given article text. For this we are using the news_summary dataset. You can download the dataset here

A tweet from CNN with a headline for an article on COVID 19

Before we go through the code, let us learn some concepts needed for building an abstractive text summarizer.

Sequence to Sequence Model

Techniques like multi-layer perceptron(MLP) work well your input data is vector and convolutional neural networks(CNN) works very well if your input data is an image.

What if my input x is a sequence? What if x is a sequence of words. In most languages sequence of words matters a lot. We need to somehow preserve the sequence of words.

The core idea here is if output depends on a sequence of inputs then we need to build a new type of neural network which gives importance to sequence information, which somehow retains and leverages the sequence information.

Google Translate is a very good example of a seq2seq model application

We can build a Seq2Seq model on any problem which involves sequential information. In our case, our objective is to build a text summarizer where the input is a long sequence of words(in a text body), and the output is a summary (which is a sequence as well). So, we can model this as a Many-to-Many Seq2Seq problem.

A many to many seq2seq model has two building blocks- Encoder and Decoder. The Encoder-Decoder architecture is mainly used to solve the sequence-to-sequence (Seq2Seq) problems where the input and output sequences are of different lengths.

Generally, variants of Recurrent Neural Networks (RNNs), i.e. Gated Recurrent Neural Network (GRU) or Long Short Term Memory (LSTM), are preferred as the encoder and decoder components. This is because they are capable of capturing long term dependencies by overcoming the problem of vanishing gradient.

Encoder-Decoder Architecture

Let us see a high-level overview of Encoder-Decoder architecture and then see its detailed working in the training and inference phase.

Intuitively this is what happens in our encoder-decoder network:

1. We feed in our input (in our case text from news articles) to the Encoder unit. Encoder reads the input sequence and summarizes the information in something called the internal state vectors (in case of LSTM these are called the hidden state and cell state vectors).

2. The encoder generates something called the context vector, which gets passed to the decoder unit as input. The outputs generated by the encoder are discarded and only the context vector is passed over to the decoder.

3. The decoder unit generates an output sequence based on the context vector.

We can set up the Encoder-Decoder in 2 phases:

Training phase
Inference phase

Training phase

A.Encoder

In the training phase at every time step, we feed in words from a sentence one by one in sequence to the encoder. For example, if there is a sentence “I am a good boy”, then at time step t=1, the word I is fed, then at time step t=2, the word am is fed, and so on.

Say for example we have a sequence x comprising of words x1,x2,x3,x4 then the encoder in training phase looks like below:

The initial state of the LSTM unit is zero vector or it is randomly initiated. Now h1,c1 is the state of LSTM unit at time step t=1 when the word x1 of the sequence x is fed as input.

Similarly h2,c2 is the state of the LSTM unit at time step t=2 when the word x2 of the sequence x is fed as input and so on.

The hidden state (hi) and cell state (ci) of the last time step are used to initialize the decoder.

B.Decoder

Now the initial states of the decoder are initialized to the final states of the encoder. This intuitively means that the decoder is trained to start generating the output sequence depending on the information encoded by the encoder.

<start> and <end> are the special tokens that are added to the target sequence(in our case the headlines we want to predict)before feeding it into the decoder.

The target sequence is unknown while decoding the test sequence. So, we start predicting the target sequence by sending the first word into the decoder which would be always the <start> token. And the <end> token signals the end of the sentence.

Inference Phase

Now at the inference phase, we want our decoder to predict our output sequence(in our case headlines). After training, the model is tested on new source sequences for which the target sequence is unknown. So, we need to set up the inference architecture to decode a test sequence

At every time step, the LSTM unit in my decoder gives me outputs y¹,y²,y³…y^k. where k is the length of the output sequence. At time step t=1 output y¹ is generated, at time t=2 output y ^2 is generated and so on.

But in the testing stage as mentioned earlier we do not know what the length of our target sequence would be. How do we tackle this problem? Or in other words, how do we decode the test sequence. We follow the below steps for the same :

Encode the entire input sequence and initialize the decoder with internal states of the encoder
Pass <start> token as an input to the decoder
Run the decoder for one time step with the internal states
The output will be the probability for the next word. The word with the maximum probability will be selected.
Pass the sampled word as an input to the decoder in the next timestep and update the internal states with the current time step
Repeat steps 3–5 until we generate <end> token or hit the maximum length of the target sequence.

Disadvantages of Encoder-Decoder Network

In the encoder-decoder network, a context vector is generated by our encoder which gets passed to the decoder as input. Now if our input sequence is large ( in our case the text from news articles will be mostly large), one single context vector cannot capture the essence of the input sequence.
It is difficult for the encoder to memorize long sequences into a fixed-length vector
.The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence to a reference sentence. A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

Researchers observed that the BLEU score deteriorates as the sentence length for the source and reference text increases. It does a reasonable job up-till a sentence length of 20, after that the score falls.

For our task both the source and target sentence length are higher than 20, hence we need to overcome this shortcoming of the encoder-decoder network.

Concept of Attention

When humans read any lengthy paragraph, they pay attention to certain words then they change their attention to the next few words and so on.
Intuitively think of your teacher correcting your 10 mark answer in your History paper. The teacher would have a key with them in which certain important points /words are there. So in your answer sheet your teacher would look for these important words in the key. More attention is paid to the keywords.
Hence humans change their attention from one sequence of words to another sequence of words when given a lengthy input sequence
So, instead of looking at all the words in the source sequence, we can increase the importance of specific parts of the source sequence that result in the target sequence. This is the basic idea behind the attention mechanism.
Attention mechanism makes use of bidirectional RNNs The regular RNN is unidirectional as the sequence is processed from the first word to the last word. In bidirectional RNN we will have a connection in forward and reverse direction.
So in addition to the forward connection, there is also a backward connection for each of the neurons.
The outputs generated from the forward and the backward connection of neurons are concatenated together to give outputs y1^,y2^and so on. So we will have two back propagations one for the forward path in the backward direction and again for the backward path in the forward direction.
The context vector is nothing but the weighted sum of outputs from the encoder.

There are 2 different classes of attention mechanism depending on the way the attended context vector is derived:

Global Attention-Here, the attention is placed on all the source positions. In other words, all the hidden states of the encoder are considered for deriving the attended context vector:
Local Attention-Here, the attention is placed on only a few source positions. Only a few hidden states of the encoder are considered for deriving the attended context vector.

We will be using global attention for our task at hand.

Code Walkthrough

Now that we have learned all the concepts lets dive deep into code. First, let us import all the necessary libraries

Custom Attention Layer

Keras does not officially support the attention layer. So, we can either implement our attention layer or use a third-party implementation. We will go with the latter option for this blog. You can download the attention layer from here and copy it in a different file called attention.py and then we can import the same.

Now let us read our dataset. Due to computational constraints we shall just load 20000 rows from our dataset.

Reading our dataset -We can see headlines and news article text pairs.

Text Preprocessing

Now we need to clean our text, we perform the following steps for the text and headlines pair:

Remove extra white spaces
Expand contractions
Remove special characters
Lowercase all texts

We use the following function to expand contractions

We preprocess the text and headline pairs using the below function:

The above code snippet does preprocessing for article text. The same code can be used for the headlines' column also.

Here the text is our Source and the headlines are our target. We need to add start and end tokens to our target sequences ie our headlines as we saw earlier.

Adding start and end tokens (sostok and eostok) for headlines.

Now we will add a new feature word count. We will add this feature for text as well as headlines. Then let us see the percentile values of word count for text and headlines. This will help us to get an overall idea about the distribution of length of the text. This will help us fix the maximum length of the sequence

Let us get the percentile values of the word count of text.

Percentile values of the word count for text

Let us see the percentile values from 90 to 100 for the word count of text

90–100 percentile values for the word count of text

We take the 95th percentile value ie 62 to be our maximum length of text. Similarly we plot the percentile values of headlines and we take 15 to be the maximum length of headlines.

max length for text and headlines

Test Train split

Before moving on to build our model, we need to do a test train split for our data. We use the sklearn to do the same. We will use 70 % of the data as training data and evaluate the performance on the remaining 30 %.

train-test split

Pretrained word embedding for text

We earlier saw that to the encoder we send the text sequences at every time step. The words from the input sequence at every time step is passed to the encoder network. Before it is passed as input to the encoder unit we add an embedding layer.

What are word embeddings?

An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

For instance, “coconut” and “polar bear” are words that are semantically quite different, so a reasonable embedding space would represent them as vectors that would be very far apart. But “kitchen” and “dinner” are related words, so they should be embedded close to each other.

Word embeddings are computed by applying dimensionality reduction techniques to datasets of co-occurrence statistics between words in a corpus of text. This can be done via neural networks (the “word2vec” technique), or matrix factorization.

Glove embeddings

One of the most popularly used embedding technique is GloVe embedding. GloVe stands for “Global Vectors for Word Representation”.We will be using GloVe embeddings for our input text sequence(you can read about GloVe here).

Specifically, we will use the 100-dimensional GloVe embeddings of 400k words computed on a 2014 dump of English Wikipedia. You can download them here

Now we need to check how many words from our input text sequence corpus are present in the Glove Corpus. This is very important as our model will be only learning the words present in the Glove corpus. We can check this by using the code below :

We observe that the percentage of words present in both glove vectors and our corpus is nearly 73% which is very good.

Tokenization

Any Machine learning or deep learning model cannot understand text directly. We need to convert it into numbers. We can do this through tokenization.

Tokenization is the process of dividing the text into a set of meaningful pieces. These pieces are called tokens. For example, we can divide a chunk of text into words, or we can divide it into sentences. A tokenizer builds the vocabulary and converts a word sequence to an integer sequence.

We tokenize the text as well as the headline, then we convert it to integers and then pad it to maximum length ( we arrived at this maximum length in the previous section)

Next, let’s build the dictionary to convert the index to word for target and source vocabulary

Now in the earlier section, we had loaded the glove model. Let us now get the weight matrix for the text sequences from the glove model. Words not found in the glove embedding would be initialized to zero vectors. We can use the following piece of code to get the weight matrix :

Building our Model

We build our Encoder by stacking 3 LSTMs vertically with the horizontal axis being the time axis. Stacking helps in increasing the model complexity and also helps in better representation of our input text sequence.

A stacked LSTM model would visually look like this :

(we are using stacking only for our Encoder architecture and not for decoder as shown in the image below.)

Stacked Architecture(Image Source-https://towardsdatascience.com/time-series-forecasting-with-deep-stacked-unidirectional-and-bidirectional-lstms-de7c099bd918)

We can use the below code to build our model:

Some important terminologies

Return Sequences = True: When the return sequences parameter is set to True, LSTM produces the hidden state and cell state for every timestep
Return State = True: When return state = True, LSTM produces the hidden state and cell state of the last timestep only
Initial State: This is used to initialize the internal states of the LSTM for the first timestep
latent_dim: It denotes the number of hidden units.

Let us have a look at our model summary :

Now we see that there are non-trainable parameters. These parameters are from weight vectors of our embedding matrix. We have set trainable as False and hence these weights will not be learned by our model.

We shall now go ahead and compile our model

Now let us checkpoint our model and save the best weights. Also, let us define early stopping for our model with patience of 3 (which means the training stops if the validation loss does not reduce after 3 epochs)

Now let us train our model with a batch size of 128 and

We plot the Epoch vs Val loss plot

We observe that the val loss does not reduce after the 18th Epoch so we stop the training.

Now we build the inference for our model as discussed in the previous sections

Now we are defining a function below which is the implementation of the inference process (decoding test sequence)

Now let us define the functions to convert an integer sequence to a word sequence for text as well as headlines :

Finally we have the prediction function, lets see how our model performs!

The following are some predictions made by our model. The predictions may not be very good as we have trained our model on fewer data points (around 15 K).

Even though the predicted and actual headlines do not match exactly in terms of words but somewhat the same meaning is conveyed. This is quite a decent result we have got having trained our model on a very small dataset. Google /Facebook/Amazon train models on Giga/Tera Bytes of data :).

Future Work

We can still improve our model to a great extent by

Increasing the size of the dataset -We took around 20k points due to computational constraints. If we have high-end GPUs available this model can be trained on a much bigger dataset thereby improving the results.
Using BERT embeddings — BERT (Bidirectional Encoder Representations from Transformers ) is a language representation model that is state of the art. We can use BERT to obtain contextualized embeddings for our text sequence. We can also use the BERT Sentence piece tokenizer to tokenize our input text sequence. BERT does require a lot of computational power though.
Use of Bidirectional LSTMs- We discussed earlier that for Encoder decoder with the attention we make use of Bidirectional LSTMs. However we built our model with unidirectional LSTM. Bidirectional LSTM is capable of capturing the context from both the directions and results in a better context vector.
Try using Beam Search for decoding sequence
Evaluate the performance of your model based on the BLEU score.