Attending to Attention

A summary of the revolutionary paper “Attention Is All You Need” and an implementation of the Transformer in PyTorch

Akash Agnihotri
Towards Data Science


Vincent van Gogh Self-Portrait with Grey Felt Hat, Winter 1887/88. (Source)

I have been a Machine Learning Engineer for almost 4 years now. I started with what are now called the “classical models” — logistic regression, tree-based models, Bayesian methods, and so on — and since last year I have moved into neural networks and deep learning. I would say I did pretty well, until my attention landed on “Attention” (pun intended). I tried reading through tutorials, lectures, and guides, but nothing ever fully helped me grasp the core idea.

So I decided it was time to face the bull head-on: I sat down, read the paper on arXiv, and wrote my own implementation in Python and PyTorch. This helped me get a full understanding of the core concepts. Now, as they say, the best way to test your learning is to try to explain it to someone else, so here I am, trying to break down what I learned.

History

A comparison between NLP designs Source

To understand why Transformers are all the hype today, we must understand what came before them. The dominant approach was to use Recurrent Neural Networks such as LSTMs or GRUs to build a sequence-to-sequence model, where an encoder produced a context vector meant to contain all the information of the source sequence, and a decoder then used this context vector to generate new tokens one at a time.

The problem with this method, as the table above shows, was that it could not encode pairwise relationships into the context vector. In other words, when we need to look at multiple tokens across the sequence to make a prediction, RNN-based models struggle, because they cannot capture those pairwise relationships well.

Architecture

The architecture of the Transformer (Source)

In 2017, the Google Brain team came up with a new way to model sequences, the Transformer, presented in their paper “Attention Is All You Need”. The impact of this paper continues today, as most language models use this approach, including industry favorites like BERT and GPT-2.

The Transformer design consists of an Encoder and a Decoder, both built from Multi-Head Attention modules and FeedForward modules. Each of these sub-layers is wrapped in a residual connection, whose output is summed with the sub-layer’s result and then layer-normalized to improve training. This pair of attention and feedforward sub-layers is repeated N times in both the Encoder and the Decoder.

Encoder

Encoder Architecture (Source)

The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension 512. (Source)

The input tokens pass through a token embedding layer, and because the model has no recurrent layers, positional embeddings are used to supply information about where each token sits in the sequence. The positional embedding does not depend on the token itself, only on its position. The token embedding is scaled by the square root of the model’s hidden dimension and then summed elementwise with the positional embedding.

The scaling is done to control the variance of the embedding vector, which would otherwise make the model harder to train. Dropout is applied to the resulting embedding, which is then passed through N encoder layers to produce the final context vectors used by the decoder. The source mask has the same shape as the source sequence: it contains 1 where the token is not <pad> and 0 otherwise, which stops the attention layers from attending to padding tokens.
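To make this concrete, here is a minimal sketch of the embedding step in PyTorch. It is not the exact code from my repository: the class and argument names (TransformerEmbedding, hid_dim, max_length) are illustrative, and it assumes learned positional embeddings, as many tutorial implementations do, rather than the sinusoidal encodings of the original paper.

```python
import math
import torch
import torch.nn as nn

class TransformerEmbedding(nn.Module):
    """Token embedding + learned positional embedding, scaled and regularized."""
    def __init__(self, vocab_size, hid_dim, max_length=100, dropout=0.1):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        self.dropout = nn.Dropout(dropout)
        self.scale = math.sqrt(hid_dim)  # sqrt of the hidden dimension, as in the paper

    def forward(self, src):
        # src: [batch_size, src_len] of token indices
        batch_size, src_len = src.shape
        # position indices 0..src_len-1, one row per batch element
        pos = torch.arange(src_len, device=src.device).unsqueeze(0).repeat(batch_size, 1)
        # scale the token embeddings, add positional information, then apply dropout
        return self.dropout(self.tok_embedding(src) * self.scale + self.pos_embedding(pos))
```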

The embedding vector first passes through the Multi-Head Attention layer; its output is summed elementwise with the residual connection and passed through layer normalization. The Multi-Head Attention layer is given the source sentence as the Key, Value, and Query (more on that later), so that the attention mechanism attends to the source sentence itself — hence the name Self-Attention.

The resulting vector is then passed through a position-wise feedforward network, again followed by a residual connection and layer normalization. The output is then passed on to the next encoder layer.
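A whole encoder layer can be sketched as follows. For brevity this uses PyTorch’s built-in nn.MultiheadAttention rather than a hand-rolled attention module (a custom one is sketched in the next section); the class, argument names, and hyperparameters are illustrative, not the exact code from my repository.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and a position-wise feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, hid_dim, n_heads, pf_dim, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hid_dim, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(hid_dim, pf_dim),
            nn.ReLU(),
            nn.Linear(pf_dim, hid_dim),
        )
        self.attn_norm = nn.LayerNorm(hid_dim)
        self.ff_norm = nn.LayerNorm(hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_key_padding_mask=None):
        # src: [batch_size, src_len, hid_dim]
        # src_key_padding_mask: True marks <pad> positions to ignore
        # (note: the opposite of the 0/1 convention described in the text)
        # self-attention: the source attends to itself (query = key = value = src)
        attn_out, _ = self.self_attn(src, src, src,
                                     key_padding_mask=src_key_padding_mask)
        src = self.attn_norm(src + self.dropout(attn_out))  # residual + layer norm
        ff_out = self.ff(src)
        src = self.ff_norm(src + self.dropout(ff_out))      # residual + layer norm
        return src
```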

Attention

Attention had been in use before this paper came along, but the way it is used here has been crucial to the architecture’s success and wide adoption. Instead of serving as a support module that enhances an RNN’s context vector, the Transformer places attention at its core.

Attention Mechanism (Source)

Scaled Dot-Product Attention

A single attention head takes three inputs: a Query (Q), a Key (K), and a Value (V). Attention can be thought of as a function that maps a query to a set of key-value pairs: the dot product of the query with each key measures how relevant that key-value pair is to the query. These scores are scaled by the square root of the key dimension and passed through a softmax to obtain normalized weights, and the output is the weighted sum of the values.
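Here is a minimal PyTorch sketch of scaled dot-product attention; the function name and the optional mask argument are my own additions, not code from the paper.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    """Computes softmax(Q K^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    # similarity of every query with every key
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # masked positions (mask == 0) get -inf so softmax assigns them ~0 weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # normalized attention weights
    return torch.matmul(weights, value), weights
```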

Multi-Head Attention

Instead of running the Query, Key, and Value through a single scaled dot-product attention, we split the model dimension into h heads. Each head computes attention in parallel, and the results are concatenated and projected to produce the final output. This allows the model to learn several kinds of relationships at once rather than focusing on a single one.

One analogy that helps me understand this goes like this. Imagine a room full of experts; we don’t know yet who is an expert in which topic, so we pass a question into the room and get an answer from each expert along with a confidence score. Initially every expert has the same confidence for every topic, but as we backpropagate over examples, we learn which expert’s answer is most useful for which topic.

For example, when I ask a question about cars, every expert offers advice, but over time we learn whose answer is most useful for this topic, while other experts turn out to be useful for other topics. In this analogy, a single scaled dot-product attention head is an expert, the multi-head attention layer is the room full of experts, and our Query, Key, and Value are the questions we ask.
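Putting the two previous ideas together, a multi-head attention layer might look like the following sketch. It reuses the scaled_dot_product_attention function from the earlier snippet, and the class and parameter names are illustrative rather than the exact code from my repository.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Splits hid_dim into n_heads 'experts', runs scaled dot-product attention
    in parallel for each head, then concatenates and projects the results."""
    def __init__(self, hid_dim, n_heads):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = hid_dim // n_heads
        self.fc_q = nn.Linear(hid_dim, hid_dim)
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        self.fc_o = nn.Linear(hid_dim, hid_dim)

    def split(self, x):
        # [batch, seq_len, hid_dim] -> [batch, n_heads, seq_len, head_dim]
        batch_size, seq_len, _ = x.shape
        return x.view(batch_size, seq_len, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, query, key, value, mask=None):
        q = self.split(self.fc_q(query))
        k = self.split(self.fc_k(key))
        v = self.split(self.fc_v(value))
        # each head attends independently (scaled_dot_product_attention from above)
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # concatenate the heads back into a single hid_dim-sized vector
        batch_size = out.size(0)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.head_dim)
        return self.fc_o(out)
```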

Decoder

Decoder Architecture (Source)

The decoder works much like the encoder, except that it applies attention to the target tokens and has two Multi-Head Attention layers instead of one. The first attends over the target embeddings, while the second uses the encoder output as the Key and Value and the previous sub-layer’s output as the Query.

Before the decoder layers, we pass the targets through a standard token embedding and sum it elementwise with the positional embedding, which plays the same role as in the encoder. The result is then passed through N decoder layers; it is worth noting that the number of layers in the Encoder and Decoder does not have to be the same, although the paper uses N = 6 for both.

The decoder layer consists of two Multi-Head Attention layers: one self-attention and one encoder attention. The first takes the target tokens as the Query, Key, and Value and performs self-attention, while the second takes the output of the self-attention layer as the Query and the encoder output as the Key and Value.

The first attention module uses a target sequence mask. Because all target tokens are processed in parallel, this mask prevents the model from seeing the next token in the sequence. The second attention layer uses the output of the self-attention layer as the Query and the encoder output as the Key and Value; it is also given a source mask, which stops the model from paying attention to <pad> tokens.
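A rough sketch of how these two masks can be built in PyTorch, following the convention used above (1/True means “attend”, 0/False means “ignore”); the function names and the pad_idx argument are my own.

```python
import torch

def make_src_mask(src, pad_idx):
    # src: [batch, src_len]; 1 where the token is real, 0 where it is <pad>
    # shape [batch, 1, 1, src_len] so it broadcasts over heads and query positions
    return (src != pad_idx).unsqueeze(1).unsqueeze(2)

def make_trg_mask(trg, pad_idx):
    # padding mask: [batch, 1, 1, trg_len]
    pad_mask = (trg != pad_idx).unsqueeze(1).unsqueeze(2)
    # "subsequent" mask: lower-triangular, so position i can only see positions <= i
    trg_len = trg.size(1)
    sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=trg.device)).bool()
    # combined mask broadcasts to [batch, 1, trg_len, trg_len]
    return pad_mask & sub_mask
```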

Both Multi-Head Attention layers are followed by dropout, a residual connection, and layer normalization. The result is then passed through a position-wise feedforward network with another residual connection and layer normalization, and that output is passed on to the next decoder layer.
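A sketch of a full decoder layer, reusing the MultiHeadAttention module from the earlier snippet; as before, the names and hyperparameters are illustrative rather than the exact code from my repository.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: masked self-attention over the target, encoder attention
    over the encoder output, then a position-wise feed-forward network;
    every sub-layer is followed by dropout, a residual connection, and layer norm."""
    def __init__(self, hid_dim, n_heads, pf_dim, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(hid_dim, n_heads)  # from the sketch above
        self.enc_attn = MultiHeadAttention(hid_dim, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(hid_dim, pf_dim),
            nn.ReLU(),
            nn.Linear(pf_dim, hid_dim),
        )
        self.norms = nn.ModuleList([nn.LayerNorm(hid_dim) for _ in range(3)])
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, enc_out, trg_mask, src_mask):
        # masked self-attention: the target attends to itself, future positions hidden by trg_mask
        x = self.norms[0](trg + self.dropout(self.self_attn(trg, trg, trg, trg_mask)))
        # encoder attention: Query comes from the decoder, Key/Value from the encoder output
        x = self.norms[1](x + self.dropout(self.enc_attn(x, enc_out, enc_out, src_mask)))
        # position-wise feed-forward, again with residual + layer norm
        return self.norms[2](x + self.dropout(self.ff(x)))
```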

Conclusion

Understanding this paper has allowed me to understand the new Transformer-based models being published lately, such as BERT, RoBERTa, GPT-3, etc. It has also given me the confidence to read and implement more papers on my own. I hope this has been a useful read for you. Please let me know if there are any mistakes on my part, as that would help me improve my understanding of the concept.

Hope you enjoyed the article.

You can find the implementation on my GitHub

You can follow me on LinkedIn

You can read my other articles on Medium
