A step towards general NLP with Dynamic Memory Networks

Solving different NLP tasks with Dynamic Memory Networks and a Question-Answer format

Anusha Lihala
Towards Data Science


Image by Pete Linforth from Pixabay
Source: [1]

I assume you are already familiar with Recurrent Neural Networks such as LSTMs and GRUs (including the seq2seq encoder-decoder architecture).

An obstacle for general NLP is that different tasks (such as text classification, sequence tagging and text generation) require different sequential architectures. One way to deal with this problem is to view these different tasks as question-answering problems. So, for example, the model could be asked what the sentiment for a piece of text is (traditionally a text classification problem) and the answer could be one of ‘positive’, ‘negative’ or ‘neutral’.

The paper “Ask Me Anything: Dynamic Memory Networks for Natural Language Processing” introduces a new, modularised architecture for question-answering.

For complex question-answering problems, the memory component of LSTMs and GRUs can serve as a bottleneck. It is difficult to accumulate all relevant information in the memory component in one pass, and hence, a key idea behind the paper is to allow the model access to the data as many times as required.

Although the architecture looks extremely complex at first glance, it can be broken down into a number of simple components.

The Modules

Semantic Memory Module

The semantic memory module simply refers to pretrained word embeddings, such as GloVe vectors, into which the input text is transformed before being passed to the input module.
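
As a minimal sketch, this is just an embedding lookup. The matrix and vocabulary below are toy stand-ins; in practice they would be the pretrained GloVe matrix and its accompanying word index loaded from disk.

```python
import torch
import torch.nn as nn

# Toy stand-ins: in practice glove_matrix is the pretrained GloVe matrix
# and word_to_idx the accompanying vocabulary (both loaded from disk).
glove_matrix = torch.randn(5, 50)               # (vocab_size, embed_dim)
word_to_idx = {"john": 0, "went": 1, "to": 2, "the": 3, "hallway": 4}

embedding = nn.Embedding.from_pretrained(glove_matrix, freeze=True)

def embed_text(text):
    # Whitespace tokenisation stands in for whatever preprocessing is actually used.
    ids = torch.tensor([word_to_idx[w] for w in text.lower().split()])
    return embedding(ids)                        # (num_tokens, embed_dim)

token_vectors = embed_text("John went to the hallway")
```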

Input Module

The input module is a standard GRU (or BiGRU), where the last hidden state of each sentence is explicitly accessible.
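
A rough PyTorch sketch of this idea, with my own variable names and sentence-boundary bookkeeping rather than the paper's exact implementation: run a GRU over the word embeddings of the whole input and keep only the hidden states at each sentence's final token.

```python
import torch
import torch.nn as nn

class InputModule(nn.Module):
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, word_embeddings, sentence_end_positions):
        # word_embeddings: (1, num_tokens, embed_dim) for a single example
        # sentence_end_positions: indices of each sentence's final token
        outputs, _ = self.gru(word_embeddings)         # (1, num_tokens, hidden_dim)
        # The "sentence embeddings" s_i are the hidden states at sentence boundaries.
        return outputs[:, sentence_end_positions, :]   # (1, num_sentences, hidden_dim)
```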

Question Module

The question module is also a standard GRU, where the question to be answered is fed as input and the last hidden state is accessible.
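
The question module can be sketched the same way, keeping only the final hidden state (again an illustrative sketch, not the paper's code):

```python
import torch.nn as nn

class QuestionModule(nn.Module):
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, question_embeddings):
        # question_embeddings: (1, num_question_tokens, embed_dim)
        _, last_hidden = self.gru(question_embeddings)  # (1, 1, hidden_dim)
        return last_hidden.squeeze(0)                   # q: (1, hidden_dim)
```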

Episodic Memory Module

This is the module which conducts multiple passes over the input data. On each pass, sentence embeddings from the input module are fed as input to the GRU in the episodic memory module. Here each sentence embedding is assigned a weight corresponding to its relevance to the question being asked.

Different weights may be assigned to the sentence embeddings on different passes. Consider, for instance, an example in which the question asks where the football is, sentence (1) states where John went, and another sentence connects the football to John.

As sentence (1) does not mention the football, it is not directly related to the question and may not be given a high weight on the first pass. However, on the first pass the model finds that the football is connected to John, and hence on the second pass sentence (1) is given a higher weight.

For the first pass (or first ‘episode’), the question embedding ‘q’ is used to compute attention scores for the sentence embeddings from the input module.

The attention score of sentence sᵢ can then be passed through a softmax (so that the attention scores sum to one) or an individual sigmoid to obtain gᵢ. gᵢ is the weight given to sentence sᵢ, and acts as a global gate over the GRU’s output at timestep i.

The hidden state for timestep i and episode t is computed as:

hᵢᵗ = gᵢᵗ · GRU(sᵢ, hᵢ₋₁ᵗ) + (1 − gᵢᵗ) · hᵢ₋₁ᵗ

When gᵢᵗ = 0, the hidden state is simply copied forward. That is, hᵢᵗ = hᵢ₋₁ᵗ.
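
A sketch of this gated update for a single pass, assuming the per-sentence gates have already been computed and using an nn.GRUCell to stand in for the GRU run over the sentence embeddings:

```python
import torch
import torch.nn as nn

def episode_pass(sentences, gates, gru_cell):
    # sentences: (num_sentences, hidden_dim) sentence embeddings s_i
    # gates:     (num_sentences,) relevance weights g_i for this pass
    hidden_dim = sentences.size(1)
    h = torch.zeros(1, hidden_dim)
    for s_i, g_i in zip(sentences, gates):
        h_candidate = gru_cell(s_i.unsqueeze(0), h)  # GRU(s_i, h_{i-1})
        h = g_i * h_candidate + (1 - g_i) * h        # g_i = 0 simply copies h forward
    return h  # m^t: the last hidden state, summarising this episode's relevant facts

# Usage sketch (hypothetical sizes):
# gru_cell = nn.GRUCell(hidden_dim, hidden_dim)
# m_t = episode_pass(sentence_embeddings, gates, gru_cell)
```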

The last hidden state of the GRU for episode t, referred to as mᵗ, can be viewed as an agglomeration of the facts found during episode t. From the second episode onwards, mᵗ is used to compute attention scores over the sentence embeddings for episode t+1, together with the question embedding q.

The calculations are as follows:

zᵢᵗ = [ sᵢ ∘ q ; sᵢ ∘ mᵗ⁻¹ ; |sᵢ − q| ; |sᵢ − mᵗ⁻¹| ]
gᵢᵗ = softmax( W₂ tanh( W₁ zᵢᵗ + b₁ ) + b₂ )

A number of simple similarity measures, namely element-wise multiplication (∘) and absolute difference, are calculated between sᵢ and q, and between sᵢ and mᵗ⁻¹. The concatenated results zᵢᵗ are then passed through a 2-layer neural network to obtain the attention score for sᵢ, and the softmax (or per-sentence sigmoid) turns the scores into the gates gᵢᵗ. For the first episode, m⁰ is replaced with q.
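
A sketch of this scoring function, using only the multiplication and absolute-difference features described above (the paper's full feature vector includes a few additional terms):

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    def __init__(self, hidden_dim, attn_dim=128):
        super().__init__()
        # 2-layer network over the concatenated similarity features
        self.scorer = nn.Sequential(
            nn.Linear(4 * hidden_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )

    def forward(self, sentences, q, m_prev):
        # sentences: (num_sentences, hidden_dim); q, m_prev: (1, hidden_dim)
        # For the first episode, pass m_prev = q.
        z = torch.cat(
            [sentences * q, sentences * m_prev,
             torch.abs(sentences - q), torch.abs(sentences - m_prev)],
            dim=1,
        )                                    # (num_sentences, 4 * hidden_dim)
        scores = self.scorer(z).squeeze(1)   # one attention score per sentence
        return torch.softmax(scores, dim=0)  # gates g_i (a per-sentence sigmoid also works)
```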

The number of episodes can either be a fixed, predefined number or determined by the network itself. In the latter case, a special end-of-passes representation is appended to the input. If this vector is chosen by the gate function, then the iteration is stopped.

Answer Module

The answer module consists of a decoder GRU. At each timestep, the previous output concatenated with the question embedding is fed as input.

The output is generated using a standard softmax over the vocabulary.

The decoder is initialised via a function over the m vectors (the last hidden states of the GRU computations from the episodic memory module).
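
A sketch of the answer module, under the assumption that the final memory vector is used to initialise the decoder and that the embedding of the previously predicted word is what gets concatenated with q (details that vary between implementations):

```python
import torch
import torch.nn as nn

class AnswerModule(nn.Module):
    def __init__(self, hidden_dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.gru_cell = nn.GRUCell(2 * hidden_dim, hidden_dim)  # input: [prev output ; q]
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, m, q, start_token_id, max_len=5):
        # m: final memory vector from the episodic module, q: question embedding
        a = m                                    # decoder state initialised from memory
        prev_token = torch.tensor([start_token_id])
        predicted = []
        for _ in range(max_len):
            step_input = torch.cat([self.embed(prev_token), q], dim=1)
            a = self.gru_cell(step_input, a)
            logits = self.out(a)                 # softmax over the vocabulary gives the word
            prev_token = logits.argmax(dim=1)    # greedy choice for this sketch
            predicted.append(prev_token.item())
        return predicted
```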

Application to Sentiment Analysis

The model achieved state-of-the-art results for sentiment analysis at the time of publication.

For the example below, the model pays attention to all the adjectives and ultimately produces an incorrect prediction when only 1 pass is allowed. However, when 2 passes are allowed, the model pays significantly higher attention to the positive adjectives on the second pass and produces a correct prediction.

Analysis of Attention for Sentiment: [1]

Performance on Other Datasets

Results on other datasets: [1]

Switching Modules

An important benefit of modularity is that it is possible to replace one module with another without modifying any other modules, as long as the replacement module has the correct interface.

The paper “Dynamic Memory Networks for Visual and Textual Question Answering” demonstrates the use of Dynamic Memory Networks to answer questions based on images.

The input module was replaced with one that extracts feature vectors from images using a CNN-based network. The extracted feature vectors were then fed to the episodic memory module, just as the sentence embeddings were before.
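
As an illustrative sketch of that swap, a pretrained ResNet-18 from torchvision stands in here for the CNN-based feature extractor (not the network used in [2]): the spatial feature map is flattened into a sequence of region vectors and projected into the space that the episodic memory module already expects.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VisualInputModule(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        resnet = models.resnet18(weights="IMAGENET1K_V1")  # hypothetical choice of CNN
        # Drop the pooling and classification head to keep the spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.project = nn.Linear(512, hidden_dim)  # map region features to the DMN hidden size

    def forward(self, image):
        # image: (1, 3, H, W) preprocessed image tensor
        features = self.backbone(image)                 # (1, 512, h, w)
        regions = features.flatten(2).transpose(1, 2)   # (1, h * w, 512): one vector per region
        # Each projected region now plays the role a sentence embedding played before.
        return self.project(regions)                    # (1, h * w, hidden_dim)
```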

Source: [2]
Visual Question Answering Accuracy: [2]
Attention visualisations of answers to some questions: [2]

References

[1] A. Kumar, O. Irsoy, J. Su, J. Bradbury, R. English, B. Pierce, P. Ondruska, I. Gulrajani and R. Socher, Ask Me Anything: Dynamic Memory Networks for Natural Language Processing, ICML (2016).

[2] C. Xiong, S. Merity and R. Socher, Dynamic Memory Networks for Visual and Textual Question Answering, ICML (2016).
