
The Transformer Model

A Step by Step Breakdown of the Transformer's Encoder-Decoder Architecture


Introduction

In 2017, Google researchers released the paper "Attention is All You Need", which introduced the Transformer model. In the paper, the transformer achieved a new state of the art on translation tasks over previous natural language processing (NLP) model architectures. Given its current dominance in the field of NLP, this article dives into the details of the transformer’s architecture with the objective of highlighting what makes it such a powerful model.

The General Architecture

The architecture of the transformer model is inspired by the attention mechanism used in the encoder-decoder architecture of RNNs for sequence-to-sequence (seq2seq) tasks, but it eliminates sequentiality: unlike RNNs, the transformer does not process data in sequence (i.e. in order), which allows for more parallelization and reduces training time.

Figure 1 illustrates the overall architecture of the transformer. The transformer is made up of two main components:

  1. The encoder stack – Nx identical encoder layers (in the original paper, Nx = 6)
  2. The decoder stack – Nx identical decoder layers (in the original paper, Nx = 6)

Since the model does not contain any recurrence or convolution, it adds a positional encoding layer at the bottom of the encoder and decoder stacks to take advantage of the order of the sequence.

Figure 1: General Overview of the Transformer Architecture – Image on the Left is a Simplified Form ([source](https://arxiv.org/pdf/1706.03762.pdf)) and Image on the Right is the Detailed Architecture (source)

A Deep Dive Into the Transformer

This section details the different components of the Transformer by explaining the steps that the inputs go through to generate the outputs.

In this article, we consider the classic example of translating from English to French using the transformer. The input sentence is "I am a student", and the expected output is "Je suis un étudiant".

The Encoder

We will start by taking a closer look at the encoder side, and discover what is happening at each step.

The Input

The raw data is English text; however, the transformer, like any other model, does not understand the English language, so the text is processed to convert every word into a unique numeric ID. This is done using a vocabulary dictionary, which can be generated from the training data and which maps each word to a numeric index.

Figure 2: Numerical Representation of the Raw Text (Image by Author)
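
To make this concrete, here is a minimal sketch of this lookup step in Python. The vocabulary below is a made-up toy example; in practice it is built from the full training corpus and contains thousands of entries.

```python
# Hypothetical toy vocabulary; in practice it is built from the training data.
vocab = {"<pad>": 0, "<start>": 1, "<end>": 2, "i": 3, "am": 4, "a": 5, "student": 6}

def tokenize(sentence: str) -> list[int]:
    """Lowercase the sentence, split on whitespace, and look up each word's ID."""
    return [vocab[word] for word in sentence.lower().split()]

print(tokenize("I am a student"))   # [3, 4, 5, 6]
```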

Embedding Layer

As in other models, the transformer uses learned embeddings to transform the input tokens into vectors of dimension d = 512. During training, the model updates the numbers in the vectors to better represent the input tokens.

Figure 3: Embeddings of d=512 by The Embedding Layer (Image by Author)
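
As a rough illustration, the embedding layer can be sketched with PyTorch's `nn.Embedding`; the token IDs below reuse the toy vocabulary from the previous snippet.

```python
import torch
import torch.nn as nn

d_model = 512        # embedding dimension used in the original paper
vocab_size = 7       # size of the toy vocabulary from the previous sketch

# nn.Embedding holds a learnable (vocab_size x d_model) weight matrix; each token ID
# selects one row, and the rows are updated during training like any other weights.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[3, 4, 5, 6]])       # "I am a student" as IDs
embeddings = embedding(token_ids)              # shape: (1, 4, 512)
```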

Positional Encoding

One aspect that differentiates the transformer from previous sequence models is that it does not take in the input embeddings sequentially; on the contrary, it takes in all the embeddings at once. This allows for parallelization and significantly decreases training time. However, the drawback is that the information related to word order is lost. To preserve the advantage of word order, positional encodings are added to the input embeddings. Since the positional encodings and the embeddings are summed, they both have the same dimension of d = 512. There are different ways to choose positional encodings; the creators of the transformer used sine and cosine functions. The sine formula is applied at even dimension indices and the cosine formula at odd dimension indices. Figure 4 shows the formulas used to obtain the positional encodings.

Figure 4: Positional Encodings Formula (source)
Figure 5: Adding Positional Encodings to the Embeddings to Generate Positional Embeddings (ep) (Image by Author)
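
A minimal sketch of these sinusoidal encodings, assuming the same d = 512 as above, could look like this:

```python
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings from the original paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))   -> even dimension indices
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))   -> odd dimension indices
    """
    position = torch.arange(max_len).unsqueeze(1)                         # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even indices
    pe[:, 1::2] = torch.cos(position / div_term)   # odd indices
    return pe

pe = positional_encoding(max_len=4, d_model=512)   # one 512-dimensional encoding per position
# positional_embeddings = embeddings + pe          # added element-wise to the embeddings (figure 5)
```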

The Multi-Head Attention Layer – Self-Attention

Figure 6: The Multi-Head Attention Layer (source)

There are two terms that need to be addressed in this section, self-attention and multi-head.

Self-Attention:

We will start by looking at what self-attention is and how it is applied. The goal of self-attention is to capture contextual relationships between words in the sentence by creating an attention-based vector of every input word. The attention-based vectors help to understand how relevant every word in the input sentence is with respect to other words in the sentence (as well as itself).

The scaled dot-product attention illustrated on the left side of figure 6 is applied to calculate the attention-based vectors. Below is a detailed explanation of how these vectors are created from the positional embeddings.

The first step is to obtain the Query (Q), the Key (K) and the Value (V). This is done by passing the same copy of the positional embeddings through three different linear layers, as seen in the figure below.

Figure 7: Generating the Query, Key and Value (Image by Author)

The second step is to create an attention filter from the Query (Q) and the Key (K). The attention filter will indicate how much each word is attended to at every position. It is created by applying the formula found in figure 8.

Figure 8: Generating an Attention Filter from the Query (Q) and the Key (K) (Image by Author)

Finally, to obtain the attention-based matrix (the final output of the self-attention layer), a matrix-to-matrix multiplication (matmul) is performed between the attention filter and the Value (V) matrix generated previously, resulting in the following final formula:

Figure 9: Scaled-Dot Product Attention (source)
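
Putting the three steps together, a minimal sketch of scaled dot-product self-attention might look as follows; the input `x` is a random stand-in for the positional embeddings of "I am a student".

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # how much each word attends to every other word
    attention_filter = F.softmax(scores, dim=-1)        # each row sums to 1
    return attention_filter @ v                         # weighted sum of the values

d_model = 512
w_q = torch.nn.Linear(d_model, d_model)   # linear layer that produces the Query
w_k = torch.nn.Linear(d_model, d_model)   # linear layer that produces the Key
w_v = torch.nn.Linear(d_model, d_model)   # linear layer that produces the Value

x = torch.randn(1, 4, d_model)            # stand-in for the positional embeddings
out = scaled_dot_product_attention(w_q(x), w_k(x), w_v(x))   # (1, 4, 512)
```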

Multi-Head Attention:

As seen on the right side of figure 6, the scaled dot-product attention (i.e. self-attention) is not applied only once but several times in parallel (in the original paper it is applied 8 times, once per head). The objective is to generate several attention-based matrices for the same word, which helps the model capture different representations of the relations between words in a sentence.

The different attention-based matrices generated from the different heads are concatenated together and passed through a linear layer to shrink the size back to that of a single matrix.
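
A rough sketch of the multi-head mechanism is shown below. It uses PyTorch's built-in `F.scaled_dot_product_attention` (available from PyTorch 2.0) in place of the hand-written function above, and all layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Applies scaled dot-product attention h times in parallel on smaller heads."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.d_head = d_model // num_heads            # 512 / 8 = 64 dimensions per head
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)        # final linear layer after concatenation

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        # Project x, then split the last dimension into (num_heads, d_head).
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        heads = F.scaled_dot_product_attention(q, k, v)                  # (batch, heads, seq, d_head)
        concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)  # concatenate the heads
        return self.w_o(concat)                                          # shrink back to d_model

mha = MultiHeadSelfAttention()
out = mha(torch.randn(1, 4, 512))   # (1, 4, 512)
```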

Residual Connections, Add & Norm and the Feed-Forward Network

As one can notice from figure 1, the architecture includes residual connections (RC). Their goal is to avoid losing important information carried by the original input by allowing it to bypass the multi-head attention layer. Therefore, the positional embeddings are added to the output of the multi-head attention and then normalized (Add & Norm) before being passed into a regular feed-forward network.
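
Putting the pieces together, one encoder layer can be sketched as follows. `nn.MultiheadAttention` stands in for the multi-head attention described above, and `d_ff = 2048` is the feed-forward hidden size used in the original paper.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention and a feed-forward network,
    each wrapped in a residual connection followed by layer normalization (Add & Norm)."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.feed_forward = nn.Sequential(           # two linear layers with a ReLU in between
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attention(x, x, x)        # self-attention: Q, K and V all come from x
        x = self.norm1(x + attn_out)                 # residual connection, then Add & Norm
        x = self.norm2(x + self.feed_forward(x))     # same pattern around the feed-forward network
        return x

layer = EncoderLayer()
out = layer(torch.randn(1, 4, 512))                  # (batch, sequence length, d_model)
```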

The Decoder

The decoder side shares many components with the encoder side, so this section will not be as detailed as the previous one. The main differences are that the decoder takes in two inputs and applies multi-head attention twice, one of the two being "masked". Also, the final linear layer in the decoder has a size (i.e. number of units) equal to the number of words in the target vocabulary (in this case, the French vocabulary). Each unit is assigned a score; the softmax is then applied to convert these scores into probabilities indicating how likely each word is to appear in the output.
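
As a small illustration of that final step, the projection and softmax could be sketched as follows (the target vocabulary size here is an arbitrary placeholder):

```python
import torch
import torch.nn as nn

d_model = 512
target_vocab_size = 30000          # hypothetical size of the French vocabulary

# Projects each 512-dimensional decoder output to one score per target word,
# then softmax turns the scores into a probability distribution over the vocabulary.
generator = nn.Sequential(
    nn.Linear(d_model, target_vocab_size),
    nn.Softmax(dim=-1),
)

decoder_output = torch.randn(1, 4, d_model)     # stand-in for the decoder's output
probabilities = generator(decoder_output)       # (1, 4, 30000), each row sums to 1
```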

Inputs

The decoder takes in two inputs:

  1. The output of the encoder – these are the Keys (K) and the Values (V) on which the decoder performs multi-head attention (the second multi-head attention in figure 1). In this multi-head attention layer, the Query (Q) is the output of the masked multi-head attention.
  2. The output text shifted to the right – this ensures that the prediction at a specific position "i" can only depend on the positions before i (see figure 10). Therefore, the decoder takes in all the words already predicted (positions 0 to i-1) before predicting the word at position i. Note that the first token passed to the decoder is a special start token, and the prediction process continues until the decoder generates a special end token.
Figure 10: Outputs Shifted to the Right as Inputs to the Decoder in the Inference Stage (Image by Author)

Masked Multi-Head Attention

The process of the masked multi-head attention is similar to that of the regular multi-head attention. The only difference is that after multiplying the matrices Q and K and scaling the result, a special mask is applied to the resulting matrix before applying the softmax (see the left diagram of figure 6, "Mask (opt.)"). The objective is for every word at a specific position "i" in the text to only attend to the positions up to and including its own (position 0 to position i). This is important in the training phase: when predicting the word at position i+1, the model should only pay attention to the words before that position. Therefore, all positions after i are masked and set to negative infinity before the softmax operation, which results in 0s in the attention filter (see figure 11).

Figure 11: Masked-Attention Filter (Image by Author)
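
A minimal sketch of such a mask and its effect on the attention scores:

```python
import torch

seq_len = 4
# Lower-triangular mask: row i has 1s for positions 0..i and 0s for every later position.
mask = torch.tril(torch.ones(seq_len, seq_len))
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# The scores at masked (0) positions are set to -inf before the softmax,
# so they receive a weight of exactly 0 in the attention filter.
scores = torch.randn(seq_len, seq_len)                       # stand-in for Q @ K^T / sqrt(d_k)
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
attention_filter = torch.softmax(masked_scores, dim=-1)      # zeros above the diagonal
```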

Conclusion

The Transformer is a deep learning model that has been in the field for five years now and that has led to several top-performing, state-of-the-art models such as BERT. Given its dominance in the field of NLP and its expanding usage in other fields such as computer vision, it is important to understand its architecture. This article covered the different components of the transformer and highlighted their functionality.

Important Resources

Attention is All You Need (A. Vaswani et al., 2017)

The Illustrated Transformer (J. Alammar, 2018)

