
Introduction
Facebook AI Research has proposed a new architecture called Expire-Span [1], which deletes contextual information that the Attention [2] mechanism deems irrelevant. Intuitively, this should work well: we do not want irrelevant information to be modelled and remembered, especially now that sequence lengths have grown so long. Let's get into the details quickly, because Expire-Span trains faster, uses less memory, and can scale to very long sequences without a drop in performance!
Background
The Attention mechanism took the Deep Learning world by storm, finding its way into a plethora of domains, mainly NLP and Computer Vision. It models long sequences extremely well and has set the benchmark on most NLP tasks. Recently it has also made its way into Computer Vision with Vision Transformers [3], and into GANs with the Generative Adversarial Transformer [4] and TransGAN [5].
This performance comes at a cost: heavy memory and compute requirements! Research focus has now shifted towards reducing the use of attention modules in architectures, because attention's computation grows quadratically with sequence length. MLP-Mixer showed us how to do without the attention mechanism in Computer Vision tasks; I have already written a blog on MLP-Mixer [6]. But doing away with attention in NLP tasks might be disastrous, as the Attention mechanism has time and again proven to be the best there. Several improved Transformer variants, such as Adaptive-Span [7], Compressive Transformers [8], and Performers [9], have tried to reduce the computation and memory overhead for longer sequences. Feel free to have a read about them; it is very interesting to study how research around the Attention mechanism is evolving.
In the above-listed architectures, one thing is constant:
All memories are treated equally, without regard to their importance to the task
as the authors of Expire-Span put it. They set out to build an architecture that still uses Transformers but enables them to forget whatever is irrelevant to the task at hand. This reduces the memory and computational overhead significantly, resulting in a faster architecture that at times performs better than conventional attention-based transformer architectures. The main baseline they compare against is Transformer XL [10], which divides the sequence into blocks and attends to the past through cached states from previous blocks. Adaptive Span and the Compressive Transformer are also used in the comparison.
Note: A good understanding of the Attention mechanism is recommended for this article.
Idea & Architecture
Is this time step worth retaining or not? If not, why not delete it and give more weight to the context that actually adds value to the task?
This is the main idea that led to the conceptualization of Expire-Span. A game-changer, isn't it? Let's try to understand the idea with a diagram.

We can see time steps _h_1, _h_2, _h_3, and so on up to _h_t, with each of the tokens carrying an expiration value e. This expiration value e denotes the number of time steps after which the contextual information is deleted. In the above figure, the memories whose spans _e_2 and _e_3 have run out are no longer valid as per the model, so at time step t they are not factored into, or attended to in, the calculation. They are simply not adding enough contextual information going forward and are deleted.
Conventionally, transformers attend to all previous tokens for contextual information, which costs O(n²) computation for a sequence of length n. Consider the small example sentence "The dog sitting on the mat": we do not want to carry the context of the word "the", which adds little information, and would rather remember the words "dog" and "sitting" when processing the word "mat". This leads to a complexity lower than O(n²), which is being hotly pursued these days. Expire-Span does exactly this, giving more time steps to words like "dog" and "sitting" than to the word "the"; "the" receives a lower value of e than "dog". This expiration of tokens is quite different from fixed-window attention, which slides over the sequence and attends to a fixed number of tokens.
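To make this concrete, here is a minimal toy sketch in Python (purely illustrative: the sentence and the hand-picked spans are mine, not learned values) showing how a per-token expiration value decides which memories a later token can still attend to:

```python
# Toy illustration of per-token expiration; spans are hand-picked, not learned.
tokens = ["The", "dog", "sitting", "on", "the", "mat"]
spans  = [1,      6,     6,         1,    1,     6]   # e_i: how many steps each memory survives

def visible_memories(t):
    """Return the memories i < t that have not yet expired, i.e. e_i - (t - i) > 0."""
    return [tokens[i] for i in range(t) if spans[i] - (t - i) > 0]

# At the word "mat" (t = 5), the short-span words "The"/"on"/"the" have already expired.
print(visible_memories(5))  # ['dog', 'sitting']
```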
This is closely related to both Transformers and LSTMs [11]: the former attends to all tokens irrespective of their importance, while the latter smartly forgets and retains things across a sequence. We can call Expire-Span a hybrid of the two 😛
The Expire-Span value e is calculated by the formula given below,

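Following the paper's definition (σ is the sigmoid function),

_e_i = L · σ(w · _h_i + b)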
where _h_i is the memory at step i, w and b are trainable parameters, and L is the maximum span. This value determines how long _h_i should be kept in the context _C_t. A second quantity, _r_ti = _e_i − (t − i), measures the remaining span: whenever _r_ti becomes negative, _h_i has expired and can be removed from _C_t. Inside an attention layer, this is implemented with a binary mask. We all know that the attention matrix has its upper triangle masked while training transformers; that masking is now extended into the lower triangle as well, since we also need to mask the cells that have expired for a particular token. The formula for this binary mask function _m_ti is given below,

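Concretely, the hard mask just checks whether the remaining span is still positive:

_m_ti = 1 if _r_ti > 0, and 0 otherwise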
This hard mask is not differentiable, so there would be no gradients to train e; hence a soft masking function is used instead, with the formula shown below.

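Per the paper, the hard step is replaced by a linear ramp of length R:

_m_ti = max(0, min(1, 1 + _r_ti / R))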
In the above formula, R is a hyperparameter giving the length of the ramp over which the mask falls from 1 to 0. It ensures that the mask decreases linearly after a memory expires at a particular time step, and this slope provides the gradient needed to tune the e values during training.

There is also an L1 penalty added to the expire-span values e to shrink the spans of memories that are not contributing to the main task. This results in smaller memory usage and keeps the model tightly focused on the relevant information.
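To tie the pieces together, here is a small self-contained sketch of the masking math described above (a toy re-implementation for illustration only, not the authors' code; the renormalization step and the exact normalization of the penalty are my assumptions):

```python
import torch

def expire_span_mask(spans, t, R=16.0):
    """Soft mask m_ti = clamp(1 + r_ti / R, 0, 1) over the past memories i = 0..n-1."""
    i = torch.arange(spans.shape[0], dtype=spans.dtype)
    r = spans - (t - i)                       # remaining span r_ti = e_i - (t - i)
    return torch.clamp(1.0 + r / R, min=0.0, max=1.0)

# Toy usage: six past memories with made-up spans, queried at time step t = 6.
spans = torch.tensor([1.0, 40.0, 35.0, 2.0, 1.0, 30.0])
attn = torch.softmax(torch.randn(6), dim=-1)          # stand-in attention weights
m = expire_span_mask(spans, t=6)
masked = attn * m
masked = masked / masked.sum().clamp_min(1e-8)        # renormalize (assumed detail)

# Auxiliary L1-style span penalty: alpha times the average span (alpha is a hyperparameter).
alpha = 1e-6
span_penalty = alpha * spans.mean()
```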
Accommodation of Expire-Span in the existing Self-Attention Layers of Transformers
The Expire-Span mechanism can be included within a self-attention layer with a few minor changes described below.
- The heads of self-attention share the same underlying memory, so the expire-span can be computed once per layer and shared across the heads (a short sketch follows this list)
- A block-parallel mechanism is adopted, which enables parallel computation; hidden-state values are cached so that future blocks can attend to them
- The chances of overfitting are high when the network scales to sequence lengths in the tens of thousands; this is countered by randomly shortening the memory span during training, acting as a regularizer
- The value of L in the Expire-Span formula can be very large and can exert a strong influence on e; for very large values of L, the paper modifies the span formula to stabilize training

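For the first point above (and the random span shortening), here is a hedged sketch of what a per-layer span predictor could look like, assuming a single linear projection shared by every attention head; the class name and the shortening scheme are illustrative, not taken from the authors' code:

```python
import torch
import torch.nn as nn

class SpanPredictor(nn.Module):
    """Predicts one span e_i = L * sigmoid(w . h_i + b) per memory, shared by all heads of a layer."""
    def __init__(self, d_model: int, max_span: float):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)
        self.max_span = max_span

    def forward(self, h, random_shorten: float = 0.0):
        # h: (batch, seq, d_model) -> spans: (batch, seq)
        spans = self.max_span * torch.sigmoid(self.linear(h)).squeeze(-1)
        if self.training and random_shorten > 0:
            # Regularizer (assumed form): randomly scale spans down during training.
            spans = spans * (1.0 - random_shorten * torch.rand_like(spans))
        return spans
```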
Results & Benchmarks
The model is trained on a variety of tasks: the Corridor task, Portal task, and Instruction task (all reinforcement learning tasks), Extreme Long Copy, character-level language modelling, and frame-by-frame processing (the Object Collision task). I will not go into detail about these tasks, as the article would become too long and exhausting. Instead, I will share one highlight of Expire-Span and a few numbers that justify its use over other architectures on the above-mentioned tasks.

As we can see from the above diagram, the model remembers proper nouns like Egypt and Humpty Dumpty, and if Egypt is replaced by the word "somewhere", that word is given a shorter span and is eventually not attended to by future tokens (it is forgotten faster).

Expire-Span roughly matches the performance of Transformer XL, the Compressive Transformer, and Adaptive-Span while using relatively little training time per batch and GPU memory. Similar benchmarks are given for the Extreme Long Copy, Corridor, and Instruction tasks too, where Expire-Span performs better than the other models in terms of accuracy, memory, and scalability.
My Thoughts
The model must assign an e value right when it encounters a word, so the expire value might not be effective at modelling a word that matters throughout the document. Say I have a word that is a name and remains very important right until the end of the document; if the model assigned it a limited expire value, the contextual information for that word might be lost entirely.
At times you might infer that a word is not that important going forward in the text, but after a while the word you forgot becomes very important, and you try to remember where you first saw it and what its context was. The model, however, assigns the e value independently of what happens in the future.
Conclusion
Expire-Span is an architecture that can forget information that is not useful or relevant for the given data and task. This results in lower memory and compute requirements than conventional transformers, enabling the architecture to scale to sequence lengths in the tens of thousands! Despite consuming less memory and compute, it can match, and in a few cases exceed, the performance of state-of-the-art transformer architectures like Transformer XL, Adaptive Span, and the Compressive Transformer. Research in this space is heading towards reducing or eliminating the attention mechanism for the sake of scalability, memory, and compute (the carbon footprint left behind by training massive attention-based language models is huge, and we need to think about that too).
Efficient Transformers: A Survey [12] is one of the best ways to keep up with all the research that has happened in the Transformers space over the years. I would highly recommend reading this paper.
P.S: For more exciting technical blogs on AI, IoT and Bio visit https://appliedsingularity.com/blog/!!!
References
[1] Not All Memories are Created Equal: Learning to Forget by Expiring: https://arxiv.org/pdf/2105.06548.pdf
[2] Attention is all you need: https://arxiv.org/pdf/1706.03762
[3] ViT (Vision Transformer): https://arxiv.org/pdf/2010.11929
[4] Generative Adversarial Transformer: https://arxiv.org/pdf/2103.01209
[5] TransGAN: https://arxiv.org/pdf/2102.07074
[6] MLP-Mixer: https://arxiv.org/pdf/2105.01601
[7] Adaptive Span: https://arxiv.org/pdf/1905.07799
[8] Compressive Transformers: https://arxiv.org/pdf/1911.05507
[9] Performers: https://arxiv.org/pdf/2009.14794
[10] Transformer XL: https://arxiv.org/pdf/1901.02860
[11] LSTM: https://www.bioinf.jku.at/publications/older/2604.pdf
[12] Efficient Transformers: A Survey: https://arxiv.org/pdf/2009.06732.pdf