Brief Introduction to Attention Models

Abhishek Singh
Towards Data Science
4 min read · Sep 6, 2019


There have been recent developments in NLP and machine translation, and most State Of The Art (SOTA) results have been achieved using attention models. The top models on the GLUE Benchmark leaderboard use Transformers, which are built on self-attention.

To understand attention in a very basic sense, let's take an example from Prof. Andrew Ng's course on Deep Learning:

Source — Deep Learning Coursera

The attention model above is based on the paper by Bahdanau et al., 2014, "Neural Machine Translation by Jointly Learning to Align and Translate". It is an example of sequence-to-sequence sentence translation using a bidirectional recurrent neural network with attention. The symbol alpha in the picture represents the attention weights for each time-step of the output. There are several methods to compute these weights, such as a dot product or a neural network with a single hidden layer. The weights are multiplied by each of the source words, and this weighted sum is fed to the language model along with the output from the previous time-step to produce the output for the present time-step. The alpha values determine how much importance each source word should be given when generating the output sentence.
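As a concrete sketch of the mechanism for one output time-step (a minimal NumPy illustration, not the exact model from the paper; the score values and state dimensions are made up):

```python
import numpy as np

def attention_context(scores, encoder_states):
    """Turn raw alignment scores into attention weights and a context vector."""
    # scores: (T_src,) raw alignment scores for one output time-step
    # encoder_states: (T_src, d) hidden vector for each source word
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = exp / exp.sum()               # attention weights, sum to 1
    context = alpha @ encoder_states      # weighted sum of source states
    return alpha, context

# toy example: three source words, 4-dimensional hidden states
scores = np.array([2.0, 0.5, 0.1])
H = np.random.randn(3, 4)
alpha, context = attention_context(scores, H)
```

The context vector is what gets fed to the decoder at this time-step; the higher a word's score, the more it contributes.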

Source — Lilian Weng's GitHub post.

The table lists different functions that can be used to compute the attention weights (alpha), more popularly known as alignment scores. In the additive function, the previous decoder state s&lt;t-1&gt; and the source encoding h&lt;i&gt; are concatenated and passed through a single-layer neural network whose output is the attention weight (alpha). This neural network is trained jointly with the RNN model, and its weights are updated accordingly.
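The additive (Bahdanau-style) score can be sketched as follows — a toy NumPy version where the parameter names W_s, W_h, v and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_dec, d_enc, d_att = 4, 4, 8               # illustrative dimensions
W_s = rng.standard_normal((d_att, d_dec))   # hypothetical weights for s<t-1>
W_h = rng.standard_normal((d_att, d_enc))   # hypothetical weights for h<i>
v = rng.standard_normal(d_att)              # output layer of the small network

def additive_score(s_prev, h_i):
    # score(s<t-1>, h<i>) = v . tanh(W_s s<t-1> + W_h h<i>)
    return v @ np.tanh(W_s @ s_prev + W_h @ h_i)

s_prev = rng.standard_normal(d_dec)
H = rng.standard_normal((5, d_enc))         # five source positions
scores = np.array([additive_score(s_prev, h) for h in H])
```

In training, these scores would be pushed through a softmax to get the alphas, and W_s, W_h, v would be learned jointly with the rest of the network.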

Types of Attention Models:

  1. Global and Local attention (local-m, local-p)
  2. Hard and Soft Attention
  3. Self-attention

Global Attention Model

This is the same attention model as discussed above. To compute the output, the model takes into account every source (encoder) state and the decoder states prior to the current one. Below is the diagram for the global attention model.

Source — Luong et al., 2015

As seen in the figure, the align (attention) weights a&lt;t&gt; are calculated using each encoder step and the previous decoder step h&lt;t&gt;. The context vector is then computed as the product of the global align weights and the encoder states, and is fed to the RNN cell to produce the decoder output.
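A minimal sketch of one global attention step, assuming the simple dot-product score from the table above (all dimensions are made up):

```python
import numpy as np

def global_attention_step(h_t, encoder_states):
    # dot-product score between the decoder state and EVERY encoder state
    scores = encoder_states @ h_t
    exp = np.exp(scores - scores.max())
    a_t = exp / exp.sum()                 # global align weights a<t>
    c_t = a_t @ encoder_states            # context vector over all source steps
    return a_t, c_t

rng = np.random.default_rng(0)
H = rng.standard_normal((6, 4))           # six encoder steps, dim 4
h_t = rng.standard_normal(4)
a_t, c_t = global_attention_step(h_t, H)
```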

Local Attention Model

It differs from the global attention model in that only a few positions from the source (encoder) are used to calculate the align weights a&lt;t&gt;. Below is the diagram for the local attention model.

Source — Luong et al., 2015

As can be seen from the picture, first a single aligned position p&lt;t&gt; is found; then a window of words from the source (encoder) layer, along with h&lt;t&gt;, is used to calculate the align weights and the context vector.

Local attention comes in two types: monotonic alignment and predictive alignment. In monotonic alignment we simply set the position p&lt;t&gt; to t, whereas in predictive alignment p&lt;t&gt; is not assumed to be t but is predicted by a small predictive model.
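A rough sketch of predictive (local-p) alignment, following the formulas from the Luong et al. paper; the parameters W_p and v_p and all dimensions are illustrative assumptions:

```python
import numpy as np

def local_p_attention(h_t, encoder_states, W_p, v_p, D=2):
    S = len(encoder_states)
    # predictive alignment: p<t> = S * sigmoid(v_p . tanh(W_p h<t>))
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = encoder_states[lo:hi]        # only a window of source positions
    scores = window @ h_t                 # dot-product score inside the window
    exp = np.exp(scores - scores.max())
    a_t = exp / exp.sum()
    # favour positions near p<t> with a Gaussian (sigma = D/2, as in the paper)
    pos = np.arange(lo, hi)
    a_t = a_t * np.exp(-((pos - p_t) ** 2) / (2 * (D / 2) ** 2))
    c_t = a_t @ window
    return p_t, a_t, c_t

rng = np.random.default_rng(1)
H = rng.standard_normal((10, 4))          # ten source positions
h_t = rng.standard_normal(4)
W_p, v_p = rng.standard_normal((4, 4)), rng.standard_normal(4)
p_t, a_t, c_t = local_p_attention(h_t, H, W_p, v_p)
```

Monotonic alignment (local-m) would simply replace the predicted p&lt;t&gt; with the current time-step t.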

Hard and Soft Attention

Soft attention is almost the same as the global attention model.

The difference between the hard and local attention models is that the local model is differentiable almost everywhere, whereas hard attention is not. Local attention can be seen as a blend of hard and soft attention. A link for further study is given at the end.

Self-attention Model

A self-attention model relates different positions of the same input sequence. In theory, self-attention can adopt any of the score functions above, simply replacing the target sequence with the same input sequence.
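A minimal scaled dot-product self-attention sketch — queries, keys, and values all come from the same sequence. The projection matrices Wq, Wk, Wv and all shapes are assumptions for illustration; the scaling by sqrt(d_k) follows the Transformer paper:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # queries, keys and values are projections of the SAME sequence X
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)    # each position attends over all positions
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # sequence of 5 tokens, dim 8
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, A = self_attention(X, Wq, Wk, Wv)
```

Row i of the attention matrix A tells you how much token i attends to every other token in the same sequence.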

Transformer Network

The Transformer network is built entirely upon self-attention mechanisms, without using any recurrent architecture. It is made of multi-head self-attention blocks.

Source — Attention Is All You Need.

Each encoder layer consists of two sub-layers: a multi-head attention layer and a feed-forward neural network. Each decoder layer has three sub-layers: two multi-head attention networks followed by a feed-forward network. The decoder's second attention network takes the encoder output as input, along with the output from the previous decoder sub-layer.
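The multi-head attention building block can be sketched by splitting the model dimension into heads, attending independently per head, and concatenating the results (a toy NumPy version; the weight names and shapes are assumptions):

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    T, d = X.shape
    d_h = d // n_heads                    # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * d_h, (h + 1) * d_h)
        s = Q[:, sl] @ K[:, sl].T / np.sqrt(d_h)
        s -= s.max(axis=-1, keepdims=True)
        a = np.exp(s)
        a /= a.sum(axis=-1, keepdims=True)
        heads.append(a @ V[:, sl])        # each head attends independently
    # concatenate the heads and mix them with an output projection
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))           # 5 tokens, model dimension 8
Wq, Wk, Wv, Wo = (rng.standard_normal((8, 8)) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads=2)
```

Multiple heads let the model attend to different kinds of relationships in the same sequence at once.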

Reference Links

  1. Attention? Attention!
  2. Effective approaches to Neural Machine Translation
  3. Coursera Attention
  4. Beginners guide to attention
  5. GLUE Benchmark Leaderboard
  6. Attention on CNN

Citations

  1. Neural Machine Translation by Jointly Learning to Align and Translate
  2. Effective approaches to Neural Machine Translation.

Please comment below if something is incorrect or if I should add something more.


Software Engineer at Samsung. I like to work on Machine Learning applications.