
"I need attention. I like the attention." — Bill Foley
Introduction
In this article, we will analyze the structure of a classic Sequence-to-Sequence (Seq2Seq) model and demonstrate the advantages of using an Attention decoder. These two concepts lay the foundation for understanding the Transformer proposed in the paper Attention Is All You Need.
Table of contents:
- What is a Seq2Seq model?
- How does a classic Seq2Seq model work?
- Attention
What is a Seq2Seq model?
In a Seq2Seq model, a neural machine translation system receives an input in the form of a word sequence and generates a word sequence as output. For example, "Cosa vorresti ordinare?" in Italian as input becomes "What would you like to order?" as output in English. Alternatively, the input can be an image (image captioning) or a long sequence of words (text summarization).

How does a classic Seq2Seq model work?
A Seq2Seq model usually consists of:
- an Encoder
- a Decoder
- a Context (vector)
Please note: in neural machine translation, both the encoder and the decoder are RNNs.
The encoder processes all the inputs by transforming them into a single vector, called context (usually with a length of 256, 512, or 1024). The context contains all the information that the encoder was able to detect from the input (remember that the input is the sentence to be translated in this case). Finally, the vector is sent to the decoder which formulates the output sequence.
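To make this concrete, here is a minimal PyTorch sketch of a classic encoder (the framework choice and the sizes, such as a hidden size of 256, are illustrative assumptions, not part of the original model description): it reads the whole input sequence and keeps only the last hidden state as the context.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Classic Seq2Seq encoder: compresses the whole input sentence
    into a single fixed-size context vector (the last hidden state)."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)

    def forward(self, src):                    # src: (batch, src_len) token ids
        embedded = self.embedding(src)         # (batch, src_len, embed_size)
        outputs, hidden = self.rnn(embedded)   # hidden: (1, batch, hidden_size)
        context = hidden                       # the single vector sent to the decoder
        return outputs, context
```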
Time steps in neural machine translation
Now that we have a high-level overview of the sequence to sequence model, let’s briefly analyze how the input is processed.

- Time step #1: the Italian word "Stai" is sent to the ENCODER, which updates its hidden state (h1) based on its input and its previous hidden state.
- Time step #2: the word "attento" is sent to the ENCODER, which updates its hidden state (h2).
- Time step #3: the word "Thomas" is sent to the ENCODER, which updates its hidden state (h3).
- Time step #4: the last hidden state becomes the context that is sent to the DECODER, which produces the first output, "Be".
- Time step #5: the DECODER produces the second output, "careful".
- Time step #6: the DECODER produces the third output, "Thomas".
Each step of the encoder or decoder is that RNN processing its input and generating an output for that time step. As you may notice, the last hidden state (h3) becomes the context that is sent to the decoder. Herein lies the limitation of classic sequence-to-sequence models: the encoder is "forced" to send a single vector, regardless of the length of the input, i.e., of how many words compose the sentence. Even if we use a large number of hidden units in the encoder with the aim of obtaining a larger context, the model overfits on short sequences, and we take a performance hit as the number of parameters increases.
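A matching decoder sketch, under the same illustrative assumptions: its only view of the source sentence is the fixed-size context vector used as its initial hidden state, which is exactly the bottleneck described above.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Classic Seq2Seq decoder: generates one word per time step,
    conditioned only on the single context vector and its own outputs."""
    def __init__(self, vocab_size, embed_size=256, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, prev_token, hidden):     # prev_token: (batch, 1), hidden: the context at step 1
        embedded = self.embedding(prev_token)  # (batch, 1, embed_size)
        output, hidden = self.rnn(embedded, hidden)
        logits = self.out(output.squeeze(1))   # (batch, vocab_size) scores over the target vocabulary
        return logits, hidden
```

No matter how long the source sentence is, everything the decoder knows about it must fit inside that one context vector.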
This is the problem that attention solves!
Attention
At this point, we understand that the problem to be solved lies in the context vector: if the input is a sentence made up of a considerable number of words, the model gets into trouble. A solution was proposed by Bahdanau et al. and Luong et al. These two publications introduced and refined the concept of "Attention". This technique allowed a considerable improvement in machine translation systems by focusing on the relevant parts of the input sequence.
Intuition
The encoder in the Seq2Seq model with Attention works similarly to the classic one: it receives one word at a time and produces a hidden state which is used in the next step. Then, unlike before, not only the last hidden state (h3) is passed to the decoder, but all the hidden states.

Let’s focus better on the processes that take place inside the encoder and Attention decoder.
Encoder
Before reaching the encoder, every single word of our sentence is transformed into a vector (with a size of 200 or 300) through an embedding process. The first word, "Stai" in our case, once converted into a vector, is sent to the encoder. Here the first step of the RNN produces the first hidden state. The same scenario occurs for the second and third words, always taking the previous hidden state into account. Once all the words of our sentence have been processed, the hidden states (h1, h2, h3) are passed to the Attention decoder.
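On the encoder side, the only practical change is that every per-word hidden state is kept and handed to the Attention decoder, not just the last one. A minimal sketch (the token ids and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10_000, 256)              # placeholder vocabulary and embedding sizes
rnn = nn.GRU(256, 256, batch_first=True)

src = torch.tensor([[12, 47, 305]])                # e.g. ids for "Stai", "attento", "Thomas"
encoder_states, last_hidden = rnn(embedding(src))  # encoder_states: (1, 3, 256) -> h1, h2, h3
# All of h1, h2, h3 (not only h3) are passed to the Attention decoder.
```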

Attention Decoder
First of all, an important process takes place in the Attention Decoder:
- Each hidden state is assigned a score.
- The scores go through a softmax function.
- Hidden states and related softmax scores are multiplied with each other.
- Finally, the hidden states obtained are added to obtain a single vector, the context vector.
This process allows us to amplify the important parts of our sequence and reduce the irrelevant parts. At this point, we have to understand how the score is assigned to each hidden state. Do you remember Bahdanau and Luong? Well, to better understand what happens inside the Attention Decoder and how the scores are assigned, we need to say something more about Multiplicative Attention.
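Before looking at how the scores themselves are computed, here is a tiny sketch of steps 2-4 above, assuming the scores are already given: softmax turns them into weights, and the weighted sum of the hidden states gives the context vector (all sizes and values are arbitrary illustrations).

```python
import torch

encoder_states = torch.randn(3, 4)        # h1, h2, h3 with hidden size 4, just for readability
scores = torch.tensor([4.0, 0.5, 1.0])    # one (unnormalized) score per hidden state

weights = torch.softmax(scores, dim=0)    # amplifies relevant states, shrinks irrelevant ones
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)   # weighted sum -> context vector
```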
Multiplicative Attention was developed building on the earlier work on Additive Attention. In the paper "Effective Approaches to Attention-based Neural Machine Translation," Luong introduced several scoring functions:
- dot
- general
- concat

In this article, we will analyze the general score. This is because each language tends to have its own embedding space, so the encoder and the decoder do not share the same embedding space.
With this scoring function, the score is obtained by multiplying the decoder hidden state, a weight matrix, and the set of encoder hidden states.
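In code, this "general" score is a bilinear product: the decoder hidden state is multiplied by a learned weight matrix and then dotted with every encoder hidden state. A sketch, with randomly initialized tensors standing in for learned values:

```python
import torch

hidden_size = 4
W_a = torch.randn(hidden_size, hidden_size)   # learned weight matrix of the general score
decoder_hidden = torch.randn(hidden_size)     # current decoder hidden state (e.g. h4)
encoder_states = torch.randn(3, hidden_size)  # h1, h2, h3

# score(h_dec, h_enc_s) = h_dec^T  W_a  h_enc_s, computed for all s at once
scores = encoder_states @ (W_a @ decoder_hidden)   # shape (3,): one score per encoder state
```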
Now that we know how the score can be calculated, let’s try to understand how the Attention decoder works in a Seq2Seq model.
At the first time step, the Attention decoder RNN takes in the embedding of the <END> token and an initial decoder hidden state. The RNN processes its inputs, producing an output and a new decoder hidden state vector (h4). The output is discarded. From here the Attention step begins:

1- Each encoder hidden state is assigned a score obtained with the general scoring function.

2- The scores go through a softmax function.

3- Encoder hidden states and their softmax scores are multiplied. The resulting vectors are summed to obtain the context vector (c4).

4- The context vector (c4) is concatenated with the decoder hidden state (h4). The vector resulting from the concatenation is passed through a fully connected layer, which essentially multiplies it by a weight matrix (Wc) and applies a tanh activation. The output of this fully connected layer is the first word of the output sequence (input: "Stai" -> output: "Be").
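Putting the four steps together, here is a sketch of a single Attention decoder time step in PyTorch. The module and parameter names (W_a, W_c) are my own labels for the scoring and output matrices; the sizes and wiring are illustrative, not the exact implementation from the papers.

```python
import torch
import torch.nn as nn

class AttentionStep(nn.Module):
    """One decoder time step with general (multiplicative) attention - a sketch."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)      # scoring matrix
        self.W_c = nn.Linear(2 * hidden_size, hidden_size, bias=False)  # applied after concatenation
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, decoder_hidden, encoder_states):
        # decoder_hidden: (hidden,)   encoder_states: (src_len, hidden)
        scores = encoder_states @ self.W_a(decoder_hidden)      # 1) general score per encoder state
        weights = torch.softmax(scores, dim=0)                  # 2) softmax
        context = weights @ encoder_states                      # 3) weighted sum -> c4
        combined = torch.tanh(                                  # 4) concat, multiply by Wc, tanh
            self.W_c(torch.cat([context, decoder_hidden])))
        return self.out(combined)                               # scores over the vocabulary ("Be")
```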

The second time step begins with the output of the first step ("Be") and with the newly produced decoder hidden state (h5), followed by the Attention step described above. The same process is repeated for the following time steps.

Conclusion
Congratulations if you managed to get here! Big thanks for the time spent reading this article. I hope it has given you a good initial understanding of the classic Seq2Seq model and of the Seq2Seq model with Attention. If you have noticed any mistakes in the reasoning, formulas, or images, please let me know. Last but not least, if you want to go deeper into the topics covered, I leave you some very useful resources below:
- C5W3L07 Attention Model Intuition by Andrew Ng
- An Attentive Survey of Attention Models by Chaudhari et al.
- Visualizing a Neural Machine Translation Model by Jay Alammar
- Deep Learning 7. Attention and Memory in Deep Learning by DeepMind
Thanks again for reading my article. For any questions or information, you can contact me on LinkedIn, or leave a comment below.