
Feedback Transformers from Facebook AI

Addressing Some limitations of Transformers with Feedback Memory

Photo by Paul Cuoco on Unsplash

In natural language processing, enhancements to the Transformer architecture have been at the heart of a revolution, with breakthroughs occurring at an unprecedented pace. This article introduces the Feedback Transformer, which addresses drawbacks of both the standard Transformer architecture and recurrent neural networks. It enables shallower models with faster decoding, lower memory usage, and less computation, and, above all, it can use the information computed by all previous layers, unlike decoder-only Transformers, which sacrifice that information in order to train in parallel.


Let’s Revise the Transformer Architecture before getting into the weeds of feedback memory!

Transformer architecture in natural language processing aims at solving sequence-to-sequence tasks while handling long-range dependencies with ease. This model completely relies on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Let’s have a glimpse of the architecture here; to delve deeper, it’s recommended to read the paper Attention Is All You Need!

Figure-1: Transformer Architecture Comprehensive View

Limitations of RNNs and Transformer Architecture

In both RNN and Transformer architectures, not all of the computed information is used. Technically one could make use of it, but that sacrifice is made in order to train in parallel. The novel Feedback Transformer architecture, in contrast, exposes information from all previous representations to all future representations, enabling shallow models with comparably stronger performance.

Recurrent Neural Networks and their Limitations

Figure-2: Intuitive Study of RNN architecture and its Drawback

In an RNN, information flows only upward (through layers) and to the right (through time), so much information is lost. Moreover, information often has to pass through many computation steps before it can be combined with other information.

Transformers and their Limitations

The Transformer architecture has several limitations, such as its ineffectiveness at tracking long sequences and at processing hierarchical inputs. Some of the paramount constraints of this architecture are:

(A) Limited Access to Higher Level Representations

(B) Transformers cannot maintain an internal state over a long time if it must be frequently updated

Figure-3: Intuitive Study of Transformers and their Drawbacks

Introducing Transformers with Feedback Memory

To overcome these limitations of the Transformer architecture, the concept of feedback memory is introduced. As Figure-3 shows, every hidden representation has a large number of arrows (i.e., attention connections), and this number of connections can overwhelm any system. To address this, feedback memory merges all the information at a particular time step into a single memory representation. Instead of attending to the individual representations of previous layers, the succeeding layers now attend only to this single memory representation, as sketched in the code after Figure-4.

Figure-4: (Left) The Feedback Transformer merges past hidden representations from all layers into a single vector and stores it in memory. (Right) Difference between Feedback and Transformer. t indicates the timestep and l indicates the layer.
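To make the merging concrete, here is a minimal sketch in PyTorch, assuming the memory vector is a learned, softmax-weighted sum of all layer outputs (plus the input embedding) at a single timestep. The module and parameter names (FeedbackMemory, layer_weights) are illustrative, not taken from the official implementation.

```python
# A minimal sketch, assuming the memory vector is a learned softmax-weighted
# sum of all layer outputs (plus the input embedding) at a single timestep.
# Module and parameter names here are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class FeedbackMemory(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per layer; a softmax turns them into mixing weights.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers + 1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers + 1, batch, d_model),
        # i.e. the embedding plus every layer's output at timestep t.
        weights = torch.softmax(self.layer_weights, dim=0)            # (L + 1,)
        memory_t = (weights[:, None, None] * hidden_states).sum(dim=0)
        return memory_t                                               # (batch, d_model)
```

The resulting memory_t would then be appended to the memory that all layers at future timesteps attend to.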

A noteworthy benefit of storing the information in a single memory representation per timestep is that it requires less memory and fewer computations overall: the key and value computations of the attention mechanism are shared across layers, which in turn reduces GPU memory usage.

Figure-5: Explanation to the difference in computation of next layer representation in Transformer versus Feedback Transformer
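The sketch below illustrates this sharing under the same illustrative assumptions: the key and value projections over the stored memory vectors are computed once per timestep and reused by every layer, while only the query projection differs per layer. It is a simplified single-head attention without layer normalization, feed-forward sublayers, or the output projection of a full Transformer block.

```python
# A hedged sketch of the sharing: keys and values over the stored memory are
# projected once and reused by every layer; only the queries are per-layer.
# All names and sizes are illustrative; multi-head details are omitted.
import torch
import torch.nn as nn

d_model, num_layers = 512, 4
k_proj = nn.Linear(d_model, d_model)   # shared across all layers
v_proj = nn.Linear(d_model, d_model)   # shared across all layers
q_projs = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layers)])

memory = torch.randn(10, 1, d_model)           # (past timesteps, batch, d_model)
keys, values = k_proj(memory), v_proj(memory)  # computed once, then cached

x = torch.randn(1, 1, d_model)                 # current timestep input
for l in range(num_layers):
    q = q_projs[l](x)                                        # per-layer query
    scores = torch.einsum('qbd,kbd->bqk', q, keys) / d_model ** 0.5
    attn = torch.softmax(scores, dim=-1)
    x = x + torch.einsum('bqk,kbd->qbd', attn, values)       # residual update
```

In a standard Transformer, each layer would instead keep its own key/value cache over its own past hidden states; that per-layer cache is exactly the extra memory the feedback variant avoids.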

Advantages of Feedback Transformer Architecture

Introducing feedback memory to the Transformer architecture yields the following major performance gains:

  1. Ability to accomplish better performance with small, shallow models
Figure-6: Machine Translation on WMT14 En-De, test set BLEU and decoding speed in words-per-second for varying decoder depths.

Examining the results in Figure-6, we can see that as decoding speed increases, owing to shallower networks (i.e., fewer layers in the architecture), the standard Transformer's performance drops far more sharply than the Feedback Transformer's.

2. Feedback Transformers converge to a higher average reward in reinforcement learning than standard Transformers

Figure-7: Maze Navigation in Gridworld. Display of average reward comparing Feedback Transformer to standard Transformers.

Reinforcement learning tasks generally require long memory to be solved optimally. Figure-7 shows that the Feedback Transformer converges to a higher reward at any given training step than the standard Transformer architecture.

Conclusion

Despite using substantially less memory during training and inference, Feedback Transformers achieve stronger performance than Transformer architectures of the same size, with smaller and shallower models and faster decoding speed. Retaining long-range information flow and giving immediate access to the available higher-level representations are the major contributions of Transformers with feedback memory.

References:

[1] Fan, Angela, et al. "Addressing Some Limitations of Transformers with Feedback Memory." arXiv preprint arXiv:2002.09402 (2020).

[2] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).

[3] Feedback Transformers by Yannic Kilcher.

Hoping that you enjoyed reading this article as much as I did writing it!

If you found the content meaningful, do check out my other articles here.

