In the era of natural language processing, enhancements to the Transformer architecture have been at the heart of a revolution, with breakthroughs occurring at an unprecedented pace. This article introduces the notion of Feedback Transformers, which overcome drawbacks of both the traditional Transformer architecture and recurrent neural networks: they enable shallower models with faster decoding, require less memory and computation, and, above all, can make use of the information computed at all previous layers, unlike decoder-only Transformers, which sacrifice much of that information in order to train in parallel.
Let’s revise the Transformer architecture before getting into the weeds of feedback memory!
The Transformer architecture in natural language processing aims at solving sequence-to-sequence tasks while handling long-range dependencies with ease. The model relies entirely on self-attention to compute representations of its input and output, without using sequence-aligned RNNs or convolutions.
Let’s have a glimpse of the architecture here; to delve deeper, it is recommended to read the paper Attention Is All You Need [2]!
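As a quick refresher, here is a minimal sketch (in PyTorch, purely for illustration) of the scaled dot-product self-attention that the Transformer is built on; the single-head setup, projection matrices, and shapes below are illustrative assumptions, not the exact configuration from the paper.

```python
# Minimal single-head scaled dot-product self-attention (illustrative only).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_*: (d_model, d_model) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # pairwise similarities, scaled
    weights = F.softmax(scores, dim=-1)       # attention distribution per position
    return weights @ v                        # weighted sum of value vectors

seq_len, d_model = 5, 16
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([5, 16])
```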

Limitations of RNNs and Transformer Architecture
In both RNNs and the Transformer architecture, not all of the computed information is used. Technically one could make use of it, but this sacrifice is made in order to train in parallel. The novel Feedback Transformer architecture, in contrast, exposes information from all previous representations to all future representations, allowing shallow models to achieve comparably stronger performance.
Recurrent Neural Networks and their Limitations

In an RNN, information flows only upward (through the layers) and to the right (through the time steps), so much information is lost. Moreover, information often has to flow through many computation steps before it can be combined with other information.
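To make this concrete, here is a toy sketch (PyTorch, with hypothetical dimensions) of the sequential hidden-state update in a vanilla RNN: information from an early time step can only influence a later one by being passed through every intermediate hidden state.

```python
# Toy RNN step loop: early inputs must survive many repeated updates
# before they can be combined with information from later time steps.
import torch

d_in, d_hid, seq_len = 8, 16, 10
w_ih = torch.randn(d_in, d_hid)   # input-to-hidden weights
w_hh = torch.randn(d_hid, d_hid)  # hidden-to-hidden weights

x = torch.randn(seq_len, d_in)
h = torch.zeros(d_hid)
for t in range(seq_len):
    # h carries everything the model knows about x[0..t]; early inputs are
    # repeatedly squashed through tanh before reaching later steps.
    h = torch.tanh(x[t] @ w_ih + h @ w_hh)
print(h.shape)  # torch.Size([16])
```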
Transformers and their Limitations
The Transformer architecture has several limitations, such as its ineffectiveness at tracking long sequences and at processing hierarchical inputs. Some of its paramount constraints are:
(A) Limited access to higher-level representations
(B) Transformers cannot maintain an internal state for a long time if it has to be frequently updated

Introducing Transformers with Feedback Memory
To overcome these limitations of the Transformer architecture, the concept of feedback memory is introduced. As we can see in Figure-3, there are a lot of arrows (i.e., attention connections) for every hidden representation, and this number of connections could overwhelm any system. To avoid this, feedback memory merges all the information for a particular time step into a single memory representation. Instead of attending to the individual representations of previous layers, the subsequent layers now attend only to this single memory representation, as in the sketch below.
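The following is a rough sketch (PyTorch, with illustrative names and sizes) of how the layer outputs at one time step could be collapsed into a single memory vector through a learned softmax-weighted sum, which is the spirit of the feedback memory described above.

```python
# Merge all layer outputs at one time step into a single memory vector
# via a learned softmax-weighted sum (illustrative sketch).
import torch
import torch.nn.functional as F

n_layers, d_model = 4, 16
layer_outputs = torch.randn(n_layers, d_model)            # one hidden state per layer, same time step
mix_weights = torch.nn.Parameter(torch.zeros(n_layers))   # learned mixing weights

memory = (F.softmax(mix_weights, dim=0).unsqueeze(-1) * layer_outputs).sum(dim=0)
print(memory.shape)  # torch.Size([16]) — a single vector summarizing every layer
```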

The noteworthy benefit of condensing the information into a single memory representation is that it uses less memory overall and requires fewer computations: this approach shares the key and value computation of the attention mechanism across layers, which in turn reduces GPU memory usage.
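Below is a sketch of why this saves work: since every layer attends over the same sequence of past memory vectors, the key and value projections can be computed once per time step and shared across layers, with only the query being layer-specific. The names, shapes, and single-head setup are assumptions made for illustration.

```python
# Shared keys/values over past memory vectors; only the query is per-layer.
import torch
import torch.nn.functional as F

d_model, n_layers, past_len = 16, 4, 10
memories = torch.randn(past_len, d_model)             # one memory vector per past time step
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)
w_q = [torch.randn(d_model, d_model) for _ in range(n_layers)]

k, v = memories @ w_k, memories @ w_v                 # computed once, shared by all layers

x = torch.randn(d_model)                              # current token's hidden state
for layer in range(n_layers):
    q = x @ w_q[layer]                                # layer-specific query
    attn = F.softmax(q @ k.T / d_model ** 0.5, dim=-1)
    x = x + attn @ v                                  # residual update using the shared values
print(x.shape)  # torch.Size([16])
```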

Advantages of Feedback Transformer Architecture
Introducing feedback memory into the Transformer architecture yields the following major performance benefits:
1. The ability to achieve better performance with small, shallow models

Examining the results in Figure-6, we can see that as decoding speed increases (i.e., as the networks become shallower, with fewer layers), the standard Transformer's performance falls off far more sharply than the Feedback Transformer's.
2. Feedback Transformers converge to a higher average reward in reinforcement learning than Transformers

Reinforcement learning tasks generally require long memory to be solved optimally. Figure-7 shows that the Feedback Transformer converges to a higher reward at any given training step than the Transformer architecture.
Conclusion
Despite using substantially less memory at training and inference time, Feedback Transformers achieve stronger performance than a Transformer of the same size, with smaller, shallower models and faster decoding. Retaining long-range information flow and providing immediate access to the available higher-level representations are some of the major contributions of Transformers with feedback memory.
References:
[1] Fan, Angela, et al. "Addressing Some Limitations of Transformers with Feedback Memory." arXiv preprint arXiv:2002.09402 (2020).
[2] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
[3] Kilcher, Yannic. "Feedback Transformers" (video explanation).
I hope you enjoyed reading this article as much as I enjoyed writing it!
If you found the content meaningful, do check out my other articles here.