Attention and its Different Forms

An overview of generalised attention with its different types and uses.

Anusha Lihala
Towards Data Science


I assume you are already familiar with Recurrent Neural Networks (including the seq2seq encoder-decoder architecture).

The Bottleneck Problem

In the encoder-decoder architecture, all of the information in the input sequence must be compressed into a single fixed-length vector. This makes it difficult to retain information from the beginning of the sequence and to encode long-range dependencies.

The core idea of attention is to focus on the most relevant parts of the input sequence for each output.
By providing a direct path to the inputs, attention also helps to alleviate the vanishing gradient problem.

Computing Attention

Assume you have a sequential decoder, but in addition to the previous cell’s output and hidden state, you also feed in a context vector c.

The context vector cᵢ is a weighted sum of the encoder hidden states:

cᵢ = Σⱼ αᵢⱼ hⱼ

Here αᵢⱼ is the amount of attention the ith output should pay to the jth input and hⱼ is the encoder state for the jth input.

αᵢⱼ is computed by taking a softmax over the attention scores eᵢⱼ of the inputs with respect to the ith output:

αᵢⱼ = exp(eᵢⱼ) / Σₖ exp(eᵢₖ)

where

eᵢⱼ = f(sᵢ₋₁, hⱼ)

Here f is an alignment model which scores how well the inputs around position j and the output at position i match, and sᵢ₋₁ is the decoder hidden state from the previous timestep.

The alignment model f can be parameterised by a small neural network, and the whole model can then be optimised end-to-end with any gradient-based method such as stochastic gradient descent.

Graphic illustration of the attention mechanism (Source)

The context vector cᵢ can also be used to compute the decoder output yᵢ.
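
To make this concrete, here is a minimal NumPy sketch of a single decoder step. It uses an additive alignment model of the form eᵢⱼ = vᵀ tanh(Wₛ sᵢ₋₁ + Wₕ hⱼ), in the spirit of Bahdanau et al.[1]; the weight names and shapes are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())      # subtract the max for numerical stability
    return e / e.sum()

def attention_step(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention.

    s_prev : (d_dec,)       previous decoder hidden state s_{i-1}
    H      : (n, d_enc)     encoder hidden states h_1 .. h_n
    W_s    : (d_att, d_dec) learned projection of the decoder state
    W_h    : (d_att, d_enc) learned projection of the encoder states
    v      : (d_att,)       learned scoring vector
    """
    # alignment scores e_ij = v^T tanh(W_s s_{i-1} + W_h h_j), one per input
    scores = np.tanh(H @ W_h.T + W_s @ s_prev) @ v   # (n,)
    alpha = softmax(scores)                          # attention weights alpha_ij
    context = alpha @ H                              # c_i = sum_j alpha_ij h_j
    return context, alpha

# toy usage with random parameters
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 16
context, alpha = attention_step(
    s_prev=rng.normal(size=d_dec),
    H=rng.normal(size=(n, d_enc)),
    W_s=rng.normal(size=(d_att, d_dec)),
    W_h=rng.normal(size=(d_att, d_enc)),
    v=rng.normal(size=d_att),
)
print(alpha.sum())  # the attention weights sum to 1
```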

Application: Machine Translation

Attention was first proposed by Bahdanau et al.[1] for Neural Machine Translation. The mechanism is particularly useful for machine translation as the most relevant words for the output often occur at similar positions in the input sequence.

Attention matrix between the words of an input sentence and its translation (Source)

The matrix above shows the most relevant input words for each translated output word.
Such attention distributions also help provide a degree of interpretability for the model.

Generalised Attention

Given a query q and a set of key-value pairs (K, V), attention can be generalised to compute a weighted sum of the values dependent on the query and the corresponding keys.
The query determines which values to focus on; we can say that the query ‘attends’ to the values.

In the previous computation, the query was the previous hidden state sᵢ₋₁ while the set of encoder hidden states h₀ to hₙ represented both the keys and the values.

The alignment score between the query and a key can, in turn, be computed in various ways.
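
Common choices include dot-product, scaled dot-product, multiplicative (general) and additive scoring. In LaTeX, with W, W₁, W₂ and v denoting generic learned parameters (the symbols are illustrative, not tied to any single paper):

```latex
\begin{aligned}
f(q, k_j) &= q^\top k_j                      && \text{(dot-product)} \\
f(q, k_j) &= q^\top k_j / \sqrt{d_k}         && \text{(scaled dot-product)} \\
f(q, k_j) &= q^\top W k_j                    && \text{(multiplicative / general)} \\
f(q, k_j) &= v^\top \tanh(W_1 q + W_2 k_j)   && \text{(additive)}
\end{aligned}
```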

Self Attention

With self-attention, each hidden state attends to the previous hidden states of the same RNN.

Here sₜ is the query while the RNN's previous hidden states s₀ to sₜ₋₁ represent both the keys and the values.
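
A small NumPy sketch of this computation, assuming plain dot-product scoring (just one of several possible alignment choices):

```python
import numpy as np

def self_attention_step(S, t):
    """Self-attention at timestep t over the RNN's own previous states.

    S : (T, d) matrix of hidden states; S[t] is the query and
        S[0:t] act as both the keys and the values (t >= 1 assumed).
    Plain dot-product scoring is used here for simplicity.
    """
    query, keys = S[t], S[:t]
    scores = keys @ query                      # e_tj = s_t . s_j
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the previous steps
    return weights @ keys                      # weighted sum of previous states
```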

Application: Language Modelling

The paper ‘Pointer Sentinel Mixture Models’[2] uses self-attention for language modelling.

The basic idea is that the output of the cell ‘points’ to the previously encountered word with the highest attention score. However, the model also uses the standard softmax classifier over a vocabulary V so that it can predict output words that are not present in the input in addition to reproducing words from the recent context.

Illustration of the pointer sentinel mixture model (Source)

The probability assigned to a given word w in the pointer vocabulary distribution is the sum of the attention probabilities given to all token positions where that word appears:

pₚₜᵣ(w) = Σ_{i ∈ I(w, x)} aᵢ

where I(w, x) denotes all positions of the word w in the input x, aᵢ is the attention probability on position i, and pₚₜᵣ ∈ Rⱽ. This technique is referred to as pointer sum attention.

The model combines the softmax vocabulary distribution with the pointer vocabulary distribution using a gate g, which is computed from the inner product of the query and a learned sentinel vector.
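
A rough NumPy sketch of how such a mixture can be assembled. The exact way the sentinel score enters the normalisation differs slightly in the paper; here the sentinel is simply included in the softmax, its probability mass is used as the gate g, and the remaining mass is renormalised into the pointer distribution:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_sentinel_mixture(scores, sentinel_score, input_ids, p_vocab):
    """Combine a pointer distribution with a softmax vocabulary distribution.

    scores         : (L,) attention scores over the L input positions
    sentinel_score : scalar score for the sentinel (query . sentinel vector)
    input_ids      : (L,) vocabulary id of the word at each input position
    p_vocab        : (V,) standard softmax distribution over the vocabulary
    """
    a = softmax(np.append(scores, sentinel_score))   # positions plus the sentinel
    g = a[-1]                                        # gate: mass placed on the sentinel

    # pointer sum attention: a word's pointer probability is the total
    # (renormalised) attention placed on every position where it occurs
    p_ptr = np.zeros_like(p_vocab)
    np.add.at(p_ptr, input_ids, a[:-1] / (1.0 - g))

    return g * p_vocab + (1.0 - g) * p_ptr           # mixture of the two distributions
```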

An output example (Source)

Application: Summarisation

The paper ‘A Deep Reinforced Model for Abstractive Summarization’[3] introduces a neural network model with a novel self-attention mechanism that attends over the input and the continuously generated output separately.

The computations involved can be summarised as follows.

Summary of the attention computations (Source)
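
Very roughly, each decoding step builds one context vector over the encoder states and another over the previously generated decoder states, and feeds both to the output layer. The NumPy sketch below assumes plain dot-product scoring and omits the paper's temporal normalisation of the scores as well as its reinforcement-learning training objective:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, states):
    """Dot-product attention of a single query over a set of states."""
    weights = softmax(states @ query)
    return weights @ states

def decode_step(h_dec_t, H_enc, H_dec_prev):
    """One decoding step that attends over the input (encoder states) and over
    the previously generated outputs (decoder states) separately, then passes
    both context vectors on to the output layer."""
    c_enc = attend(h_dec_t, H_enc)
    c_dec = (attend(h_dec_t, H_dec_prev)
             if len(H_dec_prev) else np.zeros_like(h_dec_t))
    return np.concatenate([h_dec_t, c_enc, c_dec])   # input to the token classifier
```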
An output example (Source)

Multi-Head Attention

Multiple Queries

When we have multiple queries q, we can stack them in a matrix Q.

If we compute alignment using basic dot-product attention, the set of equations used to calculate context vectors can be reduced as follows.

C = softmax(QKᵀ) V

where the softmax is applied row-wise, so that each row of C is the context vector for the corresponding query.
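
In NumPy, this matrix form takes only a few lines (the shapes are illustrative):

```python
import numpy as np

def dot_product_attention(Q, K, V):
    """Basic (unscaled) dot-product attention for a stack of queries.

    Q : (m, d)   m queries
    K : (n, d)   n keys
    V : (n, d_v) n values
    Returns an (m, d_v) matrix: one context vector per query.
    """
    scores = Q @ K.T                                          # (m, n) alignment scores
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted sums of the values
```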

Multi-head attention takes this one step further.

Q, K and V are mapped into lower-dimensional vector spaces using weight matrices, and the results are then used to compute attention (the output of which we call a ‘head’).

We have h such sets of weight matrices, which gives us h heads.

The h heads are then concatenated and transformed using an output weight matrix.
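
In the notation of Vaswani et al.[4]:

```latex
\mathrm{head}_i = \mathrm{Attention}\left(Q W_i^{Q},\ K W_i^{K},\ V W_i^{V}\right),
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O}
```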

Transformers

The Transformer was first proposed in the paper ‘Attention Is All You Need’[4]. It is based on the idea that recurrence can be dispensed with entirely and that the outputs can be computed using attention mechanisms alone.

In its self-attention layers, the Transformer uses the word representations themselves as the keys, the values and the queries.

Transformer’s Multi-Head Attention block (Source)

It contains blocks of Multi-Head Attention, while the attention computation itself is Scaled Dot-Product Attention.

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

where dₖ is the dimensionality of the query/key vectors.

The scaling is performed so that the arguments of the softmax function do not become excessively large with keys of higher dimensions.
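
A quick numerical illustration of why this matters: for queries and keys with independent, unit-variance components, the dot products have a standard deviation of roughly √dₖ, and dividing by √dₖ brings this back to roughly 1, keeping the softmax away from its saturated, small-gradient region.

```python
import numpy as np

# Dot products of random d_k-dimensional vectors grow with d_k;
# the 1/sqrt(d_k) factor keeps their scale roughly constant.
rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.normal(size=(10_000, d_k))
    k = rng.normal(size=(10_000, d_k))
    dots = (q * k).sum(axis=1)
    print(d_k, round(dots.std(), 1), round((dots / np.sqrt(d_k)).std(), 2))
```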

Below is a diagram of the complete Transformer model, annotated with some additional details. For more in-depth explanations, please refer to the additional resources.

The complete Transformer architecture (Source)

Additional Resources

References

[1] D. Bahdanau, K. Cho, and Y. Bengio, Neural Machine Translation by Jointly Learning to Align and Translate (2014)

[2] S. Merity, C. Xiong, J. Bradbury, and R. Socher, Pointer Sentinel Mixture Models (2016)

[3] R. Paulus, C. Xiong, and R. Socher, A Deep Reinforced Model for Abstractive Summarization (2017)

[4] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention Is All You Need (2017)
