Hands-on Tutorials

Attention and Transformer Models

A complex algorithm, simply explained

Helene_k
Towards Data Science
8 min read · Nov 16, 2020



“Attention Is All You Need” by Vaswani et al., 2017 was a landmark paper that proposed a completely new type of model: the Transformer. Nowadays, the Transformer model is ubiquitous in machine learning, but its algorithm is quite complex and hard to digest. This blog post will hopefully give you some more clarity about it.

The Basic Architecture

So what do all those colourful rectangles in the image below mean? In general, the Transformer model is based on an encoder-decoder architecture. The encoder is the grey rectangle on the left-hand side, the decoder the one on the right-hand side. The encoder and the decoder consist of two and three sub-layers, respectively: multi-head self-attention, a fully connected feed-forward network and, in the case of the decoder, encoder-decoder attention (named multi-head attention in the visualization below).

The Transformer architecture (Source: Vaswani et al., 2017)

What cannot be seen as clearly in the picture is that the Transformer actually stacks multiple encoders and decoders (which is denoted by Nx in the image, i.e., encoders and decoders are stacked n times). This means that the output of one encoder is used as the input for the next encoder — and the output of one decoder as the input for the next decoder.

Multi-Head Self-Attention

What’s new with the Transformer is not really its encoder-decoder architecture but that it does away with traditionally used recurrent layers. Instead, it entirely relies on self-attention.

So what is self-attention? In short, it is the model’s way to make sense of the input it receives.

One of the problems of recurrent models is that long-range dependencies (within a sequence or across several sequences) are often lost. That is, if a word at the beginning of a sequence carries importance for a word at the end of a sequence, the model might have forgotten the first word once it reaches the last word. Not really that smart, those RNNs, are they? ;-) Transformer models use a different strategy to memorize the whole sequence: self-attention!

Long-term dependencies in a sequence: if the model does not remember “boy” from the beginning, it might not know which pronoun to use at the end. Him, her, it? (Image by author)

In the self-attention layer, an input x (represented as a vector) is turned into a vector z via three representational vectors of the input: q(ueries), k(eys) and v(alues). These are used to calculate a score that shows how much attention that particular input should pay to other elements in the given sequence.

What I have just vaguely expressed in words can be defined by the following formula much more precisely:

Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V (scaled dot-product attention, Vaswani et al., 2017)

Since formulae are not always very intuitive, a step-by-step visualization of the calculation should make things a little clearer.

Step-by-step visualization of the self-attention calculation. (Image by author)

Say we want to calculate self-attention for the word “fluffy” in the sequence “fluffy pancakes”. First, we take the input vector x1 (representing the word “fluffy”) and multiply it by three different weight matrices Wq, Wk and Wv (which are continually updated during training) in order to get three different vectors: q1, k1 and v1. Exactly the same is done for the input vector x2 (representing the word “pancakes”). We now have a query, key and value vector for both words.

The query is the representation of the word we want to calculate self-attention for. So since we want to get the self-attention for “fluffy”, we only consider its query, not that of “pancakes”. Once we have finished calculating the self-attention for “fluffy”, we can also discard its query vector.

The key is a representation of each word in the sequence and is used to match against the query of the word for which we currently want to calculate self-attention.

The value is the actual representation of each word in a sequence, the representation we really care about. Multiplying the query and key gives us a score that tells us how much weight each value (and thus, its corresponding word) obtains in the self-attention vector. Note that the value is not directly multiplied with the score; first, the scores are divided by the square root of dk, the dimension of the key vector, and softmax is applied.

The result of these calculations is one vector for each word. As a final step, these two vectors are summed up, and voilà, we have the self-attention for the word “fluffy”.
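
To make this recipe concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a toy two-word sequence. All numbers and dimensions are made up purely for illustration; in a real model, the weight matrices would be learned during training.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for a sequence of input vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys and values
    d_k = K.shape[-1]                           # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)             # how strongly each word attends to every other word
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ V                          # weighted sum of the value vectors

# Toy example: "fluffy pancakes" as two made-up 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                     # x1 = "fluffy", x2 = "pancakes"
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))

Z = self_attention(X, Wq, Wk, Wv)
print(Z.shape)                                  # (2, 3): one output vector per word; Z[0] belongs to "fluffy"
```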

Self-attention for the word “his”. The lines indicate how much attention the word “his” pays to other words in the sequence. (Image by author)

You may have noticed that it’s called multi-head self-attention. This is because the process above is carried out multiple times with different weight matrices, which means we end up with multiple vectors (called heads in the formulae below). These heads are then concatenated and multiplied by a weight matrix Wo. This means that each head learns different information about a given sequence, and this knowledge is combined at the end.

MultiHead(Q, K, V) = Concat(head_1, …, head_h) · Wo, where head_i = Attention(Q·Wq_i, K·Wk_i, V·Wv_i) (Vaswani et al., 2017)
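
Here is a rough sketch of how several heads could be combined. The attention function is repeated from the snippet above so this one runs on its own, and the number of heads and all matrix sizes are arbitrary toy choices.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    # Same scaled dot-product self-attention as above, repeated so this snippet is self-contained.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, heads, Wo):
    """Run attention once per head, concatenate the head outputs and project them with Wo."""
    outputs = [attention(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(1)
n_heads, d_model, d_head = 2, 4, 3                 # arbitrary toy sizes
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))  # maps the concatenation back to the model dimension

X = rng.normal(size=(2, d_model))                  # two-word toy sequence
print(multi_head_attention(X, heads, Wo).shape)    # (2, 4)
```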

So far, I haven’t mentioned the most important thing: all these calculations can be parallelized. Why is this a big deal? Let’s look at RNNs first. They need to process sequential data in order, i.e. each word of a sequence is passed to the model one by one. Transformer models, however, can process all inputs at once. This makes these models incredibly fast, allowing them to be trained with huge amounts of data. You may now wonder how the Transformer knows the correct order of a sentence if it receives all of it at once. I’ll explain that in the section about Positional Encodings below.

As we saw in the very first picture showing the Transformer architecture, self-attention layers are integrated in both the encoder and the decoder. We just had a look at what self-attention looks like in the encoder. The decoder, however, uses what is called masked multi-head self-attention. This means that some positions in the decoder input are masked and thus ignored by the self-attention layer. Why do they get masked? When predicting the next word of a sentence, the decoder should not know which word comes after the predicted word. Instead, only the words up until the current position should be known to the decoder. After all, when actually using the model to get real next-word predictions, the decoder cannot see future positions either. So by masking them during training, we don’t allow the decoder to cheat.

In the first sentence (masked), the next word is far more difficult to predict than in the second sentence (unmasked). The words “its tail” make it clear the word to predict is probably “wiggled”. (Image by author)
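
To see how such masking could be implemented, here is a minimal NumPy sketch: the attention scores of all future positions are set to minus infinity before the softmax, so they end up with an attention weight of zero. The sizes and numbers are, again, made-up toy values.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Self-attention where each position may only attend to itself and earlier positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Causal mask: future positions get a score of -inf, so softmax assigns them weight 0.
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))                             # three toy decoder inputs
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
print(masked_self_attention(X, Wq, Wk, Wv).shape)       # (3, 3)
```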

One crucial aspect of the model is still missing: how does information flow from the encoder to the decoder? This is what the encoder-decoder attention layer is here for. This layer works very similarly to the self-attention layer in the encoder. However, the query vectors come from the previous masked self-attention layer, while the key and value vectors come from the output of the top-most encoder. This allows the decoder to take into account all positions in the input sequence of the encoder.
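
The following sketch illustrates this idea in the same toy NumPy style: the queries are computed from the decoder inputs, while the keys and values are computed from the encoder output. All names and sizes are illustrative assumptions, not the paper’s actual implementation.

```python
import numpy as np

def encoder_decoder_attention(decoder_X, encoder_out, Wq, Wk, Wv):
    """Queries come from the decoder; keys and values come from the encoder output."""
    Q = decoder_X @ Wq                                   # what the decoder is looking for
    K, V = encoder_out @ Wk, encoder_out @ Wv            # what the encoder offers
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # each decoder position sees all encoder positions

rng = np.random.default_rng(3)
decoder_X = rng.normal(size=(2, 4))                      # two decoder positions
encoder_out = rng.normal(size=(5, 4))                    # five encoder positions
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
print(encoder_decoder_attention(decoder_X, encoder_out, Wq, Wk, Wv).shape)  # (2, 3)
```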

Feed-Forward Networks

So we now know what the self-attention layers do in each encoder and decoder. This leaves us with the other sub-layer we have not talked about: the fully connected feed-forward networks. They further process the outputs of the self-attention layer before passing them on to the next encoder or decoder.

FFN(x) = max(0, x·W1 + b1)·W2 + b2 (Vaswani et al., 2017)

Each feed-forward network consists of two linear layers with a ReLU function in between. The weights and biases W1, W2, b1 and b2 are the same across different positions in the sequence but different in each encoder and decoder.
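
A minimal sketch of such a position-wise feed-forward network. The toy dimensions are my own choice; the original paper uses a model dimension of 512 and an inner dimension of 2048.

```python
import numpy as np

def feed_forward(Z, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU in between."""
    hidden = np.maximum(0, Z @ W1 + b1)          # first linear layer + ReLU
    return hidden @ W2 + b2                      # second linear layer

rng = np.random.default_rng(4)
d_model, d_ff = 4, 16                            # toy sizes; the paper uses 512 and 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Z = rng.normal(size=(2, d_model))                # self-attention output for a two-word sequence
print(feed_forward(Z, W1, b1, W2, b2).shape)     # (2, 4)
```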

Positional Encodings

As we already said, the Transformer model can process all words in a sequence in parallel. However, this means that some important information is lost: the position of each word in the sequence. To retain this information, the position and order of the words must be made explicit to the model. This is done via positional encodings. These positional encodings are vectors with the same dimension as the input vectors and are calculated using sine and cosine functions. To combine the information of the input vector and the positional encoding, the two are simply summed up.
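
Here is a small NumPy sketch of the sinusoidal positional encodings and of how they could be added to the input vectors; the sequence length and dimension are arbitrary toy values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as described in Vaswani et al., 2017."""
    positions = np.arange(seq_len)[:, np.newaxis]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]          # the even dimensions 0, 2, 4, ...
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                            # cosine on odd dimensions
    return pe

rng = np.random.default_rng(5)
X = rng.normal(size=(5, 8))                                 # five made-up word embeddings
X_with_position = X + positional_encoding(5, 8)             # positional information is simply summed up
```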

For a more detailed explanation of how positional encodings work exactly, I recommend the article by Kazemnejad [2].

Layer Normalization

One small but important aspect of Transformer models is layer normalization, which is performed after every sub-layer in each encoder and decoder.

(Image by author)

First, the input and the output of the respective encoder or decoder layer are summed up. This means that in the bottom-most layer, the input vector X and the output vector Z1 are summed up; in the second layer, the input vector Z1 and the output vector Z2, and so forth. The summed-up vector is then normalized to a mean of zero and unit variance. This prevents the range of values in a given layer from fluctuating too much and thus allows the model to converge faster.
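
A minimal sketch of this “add & norm” step, ignoring the learned gain and bias parameters that a full layer-normalization implementation would also include.

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-6):
    """Residual connection followed by layer normalization over the feature dimension."""
    summed = x + sublayer_output                             # add the layer's input and output
    mean = summed.mean(axis=-1, keepdims=True)
    std = summed.std(axis=-1, keepdims=True)
    return (summed - mean) / (std + eps)                     # zero mean and unit variance per position

rng = np.random.default_rng(6)
x, z1 = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))     # toy layer input and sub-layer output
print(add_and_norm(x, z1).mean(axis=-1))                     # approximately zero for each position
```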

For a more in-depth explanation of layer normalization, I recommend the article by Mao [3].

Final Linear Layer and Softmax

Finally, in order to get the output predictions, we somehow need to transform the output vector of the last decoder into words. So we first feed the output vector into a fully connected linear layer and get a logits vector of the size of the vocabulary. We then apply the softmax function to this vector in order to get a probability score for each word in the vocabulary. Finally, we choose the word with the maximum probability as our prediction.
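
A toy sketch of this last step with a made-up four-word vocabulary; a real model would of course project onto a vocabulary of tens of thousands of words.

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["wiggled", "barked", "slept", "ran"]               # tiny made-up vocabulary
d_model = 4

decoder_output = rng.normal(size=(d_model,))                # output vector of the last decoder
W_vocab = rng.normal(size=(d_model, len(vocab)))            # final fully connected linear layer

logits = decoder_output @ W_vocab                           # one logit per word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                        # softmax: a probability for each word

print(vocab[int(np.argmax(probs))])                         # the word with the highest probability
```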

Probability scores for the next-word prediction of “The little black dog ___”. (Image by author)

Summary

The Transformer model is a new kind of encoder-decoder model that uses self-attention to make sense of language sequences. This allows for parallel processing, which makes it much faster to train than recurrent models of comparable quality. Transformers have thus paved the way for modern language models (such as BERT and GPT) and, more recently, also image generation models.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser and I. Polosukhin, Attention Is All You Need (2017), NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems
[2] A. Kazemnejad, Transformer Architecture: The Positional Encoding (2019), Amirhossein Kazemnejad’s Blog
[3] L. Mao, Layer Normalization Explained (2019), Lei Mao’s Log Book

Credits

Special thanks to Philip Popien for suggestions and corrections to this article.
