Speeding up Transformer training by up to 7x with the magic of Fourier transforms

By replacing the attention sublayer with linear transformations, we are able to reduce the complexity and memory footprint of the Transformer architecture. We show that FNet offers an excellent compromise between speed, memory footprint, and accuracy, achieving 92% of the accuracy of BERT in a common classification transfer learning setup on the GLUE benchmark (Wang et al., 2018), but training seven times as fast on GPUs and twice as fast on TPUs.
Source: Arxiv
Many recent ML papers have focused on fiddling with transformer layers, and it is quite interesting to see what works and what doesn't (even though we probably only see what works from those papers). Given how widely transformers are used, the last 6–12 months have largely been about optimizing them: swapping out layers, reducing the number of self-attention blocks, and shrinking the networks themselves. This paper review is about changing the layers to improve training speed, and the most interesting part is that it was done using Fourier transforms.
What are Fourier transforms (FTs)?
Fourier Transform is a mathematical concept that can decompose a signal into its constituent frequencies. Fourier transform does not just give the frequencies present in the signal; it also gives the magnitude of each frequency present in the signal.
Source: TDS
The Fourier transform is one of the most widely used transforms in signal processing. It essentially "wraps" the graph of a signal around a circle in a specific way so that you can more easily extract the signal's properties and features (and thus process it). If you are a math nerd like me, you have surely heard of Fourier transforms. But what intrigued me is: what do they have to do with neural networks? Transformers typically deal with NLP, not signal processing, and this is largely why I chose to review this paper.
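To make that concrete, here is a minimal NumPy sketch (the toy signal, its sample rate, and its amplitudes are made up purely for illustration): we build a signal out of two sine waves and use the FFT to recover both their frequencies and their magnitudes.

```python
import numpy as np

# A toy signal: a 50 Hz component (amplitude 3) plus a 120 Hz component
# (amplitude 1.5), sampled for one second at 1 kHz. All values are
# made up purely for illustration.
sample_rate = 1000
t = np.arange(0, 1, 1 / sample_rate)
signal = 3.0 * np.sin(2 * np.pi * 50 * t) + 1.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)                      # FFT for real-valued input
freqs = np.fft.rfftfreq(len(signal), 1 / sample_rate)
magnitudes = 2 * np.abs(spectrum) / len(signal)     # scale back to amplitudes

# The two largest peaks sit at 50 Hz and 120 Hz with magnitudes ~3 and ~1.5,
# i.e. the FFT recovers both the frequencies and their magnitudes.
top = np.argsort(magnitudes)[-2:]
print(np.sort(freqs[top]), np.round(magnitudes[top], 2))
```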
However, to my surprise, it turns out that Fourier transforms have had many applications in deep learning, such as [1]:
- Solving Partial Differential Equations (using Neural Networks!)
- Speeding up convolutions (see the sketch right after this list)
- Stabilizing Recurrent Neural Networks (Fourier RNN and ForeNet)
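The convolution speed-up in particular comes from the convolution theorem: convolving two signals in the time domain is the same as multiplying their Fourier transforms pointwise in the frequency domain. Here is a minimal NumPy sketch of that idea (the arrays are arbitrary, and the direct O(N²) loop is only there as a reference):

```python
import numpy as np

x = np.random.randn(256)   # an arbitrary input signal
k = np.random.randn(256)   # an arbitrary kernel

# Circular convolution via the FFT: transform, multiply pointwise, invert.
# This costs O(N log N) instead of O(N^2).
fft_conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)))

# Reference: the same circular convolution computed directly in O(N^2).
direct = np.array([sum(x[j] * k[(i - j) % len(x)] for j in range(len(x)))
                   for i in range(len(x))])

print(np.allclose(fft_conv, direct))  # True
```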
Why change self-attention layers?
Although self-attention blocks result in higher accuracy and provide useful contextual information to the network, they are computationally expensive. The research here points to an important finding: the benefit of those self-attention blocks saturates fairly quickly. After reviewing several recent transformer optimization papers, I started noticing this trend: self-attention is quite powerful, but it is essentially becoming a bottleneck, and there is a race to replace it without hurting performance.
In fact, a recent paper by Nvidia found that for most architectures it is more efficient, performance-wise, to have fewer than 50% of the layers be self-attention blocks, and for Transformer-XL they suggest around 20%. Furthermore, those blocks should sit only in the first two-thirds of the network rather than being interleaved throughout. This results in 63% fewer self-attention blocks and 35% lower compute time.
The FNet paper examines a different way to mix the transformer's tokens (which is what self-attention does) and proposes a new linear method for doing so using FTs. Without further ado, let's dive into it.
The FNet architecture

The architecture is actually quite simple, which is why this section isn't going to be that long. The only difference from the classic Transformer is that the self-attention sublayer of each encoder block is replaced with a Fourier sublayer [1] that applies a 2D Fourier transform along the sequence and hidden dimensions of the input, keeping only the real part of the result. The main intuition is that the Fourier transform might provide a better way of mixing the input tokens (through its transformation).
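Here is a minimal NumPy sketch of that Fourier mixing sublayer: a 2D DFT along the hidden and sequence dimensions with only the real part kept. This is a sketch of the idea, not the paper's actual JAX implementation, and the tensor shapes are arbitrary:

```python
import numpy as np

def fourier_mixing(x: np.ndarray) -> np.ndarray:
    """Token mixing via a 2D DFT: FFT along the hidden dimension, then along
    the sequence dimension, keeping only the real part so the output stays
    real-valued. Note that there are no learnable weights here."""
    return np.real(np.fft.fft(np.fft.fft(x, axis=-1), axis=-2))

# Arbitrary shapes for illustration: (batch, sequence length, hidden dimension)
x = np.random.randn(8, 128, 768)
mixed = fourier_mixing(x)
print(mixed.shape)  # (8, 128, 768) -- same shape as the input, tokens now mixed
```

In the full encoder block, the output of this mixing step still goes through the usual residual connection, layer normalization, and feed-forward sublayer, just like the self-attention output would in a standard Transformer.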
From a computer science perspective, self-attention has quadratic complexity in the sequence length, O(N²), because every token attends to every other token. The Fourier transform, on the other hand, is a linear transformation, and the fast Fourier transform (FFT) computes it in O(N log N), which translates into faster computation. Moreover, the Fourier transform is highly optimized on GPUs/TPUs compared to the self-attention computation [1]. Furthermore, unlike self-attention, the Fourier layer has no learnable weights, so it also uses less memory (the model size goes down).
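To get a feel for the memory argument, here is a back-of-the-envelope comparison (my numbers, not the paper's), assuming BERT-base's hidden size of 768 and ignoring bias terms:

```python
# Rough parameter count of one self-attention sublayer vs. the Fourier sublayer.
# Assumes BERT-base's hidden size (d_model = 768) and ignores bias terms.
d_model = 768
attention_params = 4 * d_model * d_model  # Q, K, V and output projection matrices
fourier_params = 0                        # the DFT has no learnable weights
print(attention_params - fourier_params)  # 2359296 parameters saved per encoder layer
```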
Brief Results
Okay, this sounds like a simple enough change; it only touches one type of layer in the network. Let's see the effects of such a change.


Note that FNet has slightly lower accuracy than BERT; however, it is much faster per training step and uses much less memory. This is one of the things I liked about this paper: they show both the pros and cons of their approach (there are no magical trade-offs anyway!).
Final thoughts
In all fairness, even if it had only achieved the same performance, I would still have been interested in this paper simply because it offers quite a novel approach to replacing self-attention layers. Personally, my main reason for reviewing these papers and writing about them is to provoke your thoughts and show how optimizations are being made at large companies. Before reading this paper, I would never have thought that Fourier transforms could be used to optimize transformers. I guess this pushes us to be a bit more well-rounded as data scientists and to expand our knowledge horizons, because you never know where two different ideas might interleave and become one.
On another note, smaller models are becoming much more important simply because of the need for offline machine learning (where models are deployed on small devices), and this paper brings us a step closer to smaller Transformers. I think it's important to recognize the direction in which the machine learning / data science industry is heading.
Finally, if you are interested in Fourier Transforms (which you should be), I suggest checking out this amazing video by 3Blue1Brown:
References:
[1] J. Lee-Thorp, J. Ainslie, I. Eckstein, and S. Ontañón, "FNet: Mixing Tokens with Fourier Transforms," arXiv, 2021.