Transformer models have revolutionised the field of Natural Language Processing. But how did it all start? To understand current state-of-the-art architectures and genuinely appreciate why these models became a breakthrough, we have to go back to where NLP as we know it began: the introduction of neural networks to NLP.
The introduction of neural models to NLP opened ways to overcome challenges that traditional methods couldn't solve. One of the most remarkable advances was the Sequence-to-Sequence (Seq2Seq) model, which generates an output sequence by predicting one word at a time. Seq2Seq models encode the source text to reduce ambiguity and achieve context-awareness.

In any language task, context plays an essential role. To understand what words mean, we have to know something about the situation in which they are used. Seq2Seq models capture context at the token level: previous words and sentences are used to generate the next ones. Representing context in an embedding space brought multiple advantages, such as reducing data sparsity, since data with similar contexts is mapped close together, and providing a way to generate synthetic data.
However, context in language is far more sophisticated. Most of the time, you can't capture it by focusing only on the previous sentence; long-range dependencies are needed to achieve context-awareness. Seq2Seq models are built on Recurrent Neural Networks, typically LSTMs or GRUs. These networks have memory mechanisms that regulate the flow of information when processing sequences, giving them a form of "long-term memory." Despite this, if a sequence is long enough, they have a hard time carrying information from earlier time steps to later ones.
RNNs fall short when trying to process entire paragraphs of text because they suffer from the vanishing gradient problem. Gradients are the values used to update the weights of a neural network and thus to learn. The vanishing gradient problem occurs when the gradient shrinks as it backpropagates through time: once a gradient becomes extremely small, it no longer contributes much to learning. Moreover, RNN training is very time-consuming because, for every backpropagation step, the network needs to see the entire sequence of words.
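To make the decay concrete, here is a toy numeric illustration, not a real RNN: backpropagation through time multiplies the gradient by a recurrent factor at every step, so factors below 1 shrink the signal from early tokens exponentially. The factor of 0.9 is an arbitrary stand-in.

```python
# Toy illustration of the vanishing gradient problem (not a real RNN).
steps = 50
recurrent_factor = 0.9   # stand-in for the norm of the recurrent Jacobian
grad = 1.0               # gradient signal at the final time step

history = []
for _ in range(steps):
    grad *= recurrent_factor   # one step of backpropagation through time
    history.append(grad)

print(f"after 10 steps: {history[9]:.4f}")    # ~0.3487
print(f"after 50 steps: {history[49]:.6f}")   # ~0.005154 -- barely any signal left
```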

As a way to address these problems, Convolutional Neural Networks were introduced in NLP. By stacking convolutions, the network can "observe" the entire sequence through a logarithmic number of convolutional layers. However, this raised a new challenge: positional bias. How do we make sure that the positions we are observing in the text are the ones that give the most insight? Why focus on position X of the sequence and not X-1?
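One way to see where the logarithmic number of layers comes from is with dilated convolutions whose dilation doubles at each layer (an assumption here, since no specific CNN architecture is named above): the receptive field doubles per layer, so roughly log2(n) layers cover a sequence of length n.

```python
# Receptive-field growth of stacked dilated convolutions
# (kernel size 2, dilation doubling each layer) -- illustrative only.
import math

def layers_needed(seq_len, kernel=2):
    receptive_field, dilation, layers = 1, 1, 0
    while receptive_field < seq_len:
        receptive_field += (kernel - 1) * dilation  # each layer widens the view
        dilation *= 2                               # dilation doubles per layer
        layers += 1
    return layers

for n in [16, 256, 4096]:
    print(f"sequence length {n:>4} -> {layers_needed(n)} layers (log2 = {math.log2(n):.0f})")
```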
Besides, the challenge is not only to find a way of encoding long text sequences but also to determine which parts of that text are essential to gain context-awareness. Not all of the text is equally important for understanding. To address this, the attention mechanism was introduced into Seq2Seq models.
The attention mechanism is inspired by the visual attention of animals, which focus on specific parts of their visual input to compute adequate responses. Attention in Seq2Seq architectures seeks to give more contextual information to the decoder: at every decoding step, the decoder is told how much "attention" it should give to each input word.
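A minimal sketch of one decoding step (not any specific paper's formulation): each encoder state is scored against the current decoder state, the scores are turned into weights with a softmax, and the weighted sum becomes the context vector fed to the decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # 6 source tokens, hidden size 8
decoder_state = rng.normal(size=(8,))      # current decoder hidden state

scores = encoder_states @ decoder_state            # one score per input word
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> "how much attention"
context = weights @ encoder_states                 # weighted sum of encoder states

print(np.round(weights, 3))   # attention weights over the input words
print(context.shape)          # (8,) context vector passed to the decoder
```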

Despite the improvements in context awareness, there was still substantial room for improvement. The most significant drawback of these methods is the complexity of the architectures.
This is where the transformer model came into the picture. Instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, the transformer simplifies the solution by discarding everything else and focusing solely on attention.
This model removes recurrence and relies only on matrix multiplications. It processes all the inputs at once rather than in a sequential manner. To avoid losing order information, it uses positional embeddings that encode the position of each element in the sequence. And despite removing recurrence, it still provides an encoder-decoder architecture like the one seen in Seq2Seq models.
So, after reviewing the challenges of previous models, let's dive into what the transformer solves in comparison to Seq2Seq models.
Transformer technical deep dive

While RNNs fell short when we needed to process entire paragraphs to gain context, transformers can identify long-range dependencies and achieve context-awareness. We also saw that RNNs on their own struggle to determine which parts of the text carry the most information; to do so, they needed an extra layer, a bidirectional RNN, to implement the attention mechanism. The transformer, by contrast, works only with attention, so it can determine the essential parts of the context at different levels.
Another critical difference is that the transformer removes recurrence. Eliminating recurrence reduces the number of sequential operations and lowers the computational cost. In RNNs, for every backpropagation step, the network needs to see the entire sequence of words; in the transformer, all of the input is processed at once. This also brings a new advantage: training can be parallelised. Being able to split training examples into several tasks processed independently boosts training efficiency.
So how does the model keep the sequence order without using recurrence?
By using positional embeddings. The model takes a sequence of n word embeddings and, to model position information, adds a positional embedding to each word embedding.

Positional embeddings are created using sine and cosine functions at different dimensions. Positions are encoded by the pattern that the combination of these functions produces, which results in a continuous counterpart to a binary encoding of positions in a sequence.
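A sketch of this scheme, following the sinusoidal formulation from "Attention Is All You Need" (PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))); the word embeddings here are random placeholders.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

seq_len, d_model = 10, 16
word_embeddings = np.random.default_rng(0).normal(size=(seq_len, d_model))
inputs = word_embeddings + positional_encoding(seq_len, d_model)  # add position info
print(inputs.shape)   # (10, 16)
```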
The transformer model uses multi-head attention to encode the input embeddings. When doing so, it attends to the inputs in both a forward and backward manner, so the order of the sequence is lost; this is why it relies on the positional embeddings just described.
The transformer has three different attention mechanisms: the encoder attention, the encoder-decoder attention and the decoder attention. So how does the attention mechanism work? At its core it is a vector (dot-product) multiplication, where the angle between vectors determines the importance of each value: if the angle between two vectors is close to 90 degrees, their dot product is close to zero, but if the vectors point in the same direction, the dot product returns a larger value.
Each key has an associated value, and for every new input (query) vector, we determine how strongly it relates to each key vector; a softmax function then turns these scores into weights over the corresponding values.
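A minimal sketch of (scaled) dot-product attention along these lines: queries are compared with keys, the scores are normalised with a softmax, and the resulting weights mix the values. Shapes and numbers are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # dot products: large when vectors align,
                                      # near zero when they are ~orthogonal
    weights = softmax(scores)         # normalise scores into attention weights
    return weights @ V, weights       # weighted sum of values

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))   # 4 query vectors
K = rng.normal(size=(6, 8))   # 6 key vectors
V = rng.normal(size=(6, 8))   # one value per key
out, weights = attention(Q, K, V)
print(weights.shape, out.shape)   # (4, 6) (4, 8)
```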

Transformers use multi-head attention; we can think of the heads as filters in CNNs: each one learns to pay attention to a specific group of words. One head can learn to identify short-range dependencies while others learn to identify long-range dependencies. This improves context-awareness: we can resolve what a term refers to when it is not obvious, for example with pronouns.
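A compact sketch of the multi-head idea: the model dimension is split into several heads, each head runs its own attention over the sequence, and the heads are concatenated again. The projection matrices here are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    outputs = []
    for _ in range(num_heads):
        # each head gets its own (random, untrained) projections
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        outputs.append(weights @ V)          # each head attends differently
    return np.concatenate(outputs, axis=-1)  # back to (seq_len, d_model)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 16))                # 10 tokens, d_model = 16
print(multi_head_self_attention(X, num_heads=4, rng=rng).shape)   # (10, 16)
```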
The transformer architecture facilitates the creation of powerful models trained on massive datasets. Even though it is not feasible for everyone to train these models, we can now leverage transfer learning to take these pre-trained language models and fine-tune them for our specific tasks.
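A minimal sketch of that workflow, assuming the Hugging Face transformers library is available; the checkpoint name and the two-class task are illustrative choices, not part of this article.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"                      # a pre-trained transformer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("Transformers made transfer learning practical in NLP.",
                   return_tensors="pt")
outputs = model(**inputs)                             # pre-trained body, new task head
print(outputs.logits.shape)                           # fine-tune this head on your data
```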
Transformer models have revolutionised the field. They have surpassed RNN-based architectures in a wide range of tasks, and they will continue to have a tremendous impact on the area of NLP.