MetaFormer: The De Facto Need for Vision?

What is the key to the success of Vision Transformers?

LA Tran
Towards Data Science


Photo by Aditya Vyas on Unsplash

Vision Transformer

Recently, Transformers have gained popularity in computer vision since Alexey Dosovitskiy et al. proposed the Vision Transformer (ViT). ViT has been shown to outperform Convolutional Neural Networks (CNNs) on image classification tasks while requiring fewer computational resources to train.

CNNs have worn the crown of computer vision for the last decade, yet the young ViT, stepping into the field for the first time, surprisingly defeated them. That raises one big question:

What truly contributes to the success of ViT?

Image from paper

The figure above shows the structure of the Transformer encoder, which is the core of ViT. Since then, many follow-up works have focused on improving performance by completely replacing Multi-Head Attention with other structures, such as Multi-Layer Perceptrons (MLPs), and have achieved competitive results on image classification benchmarks. This trajectory has drawn many researchers into the search for novel attention modules, or more generally, token mixers.

MetaFormer

Image from paper

Alternatively, Weihao Yu et al. from the National University of Singapore have argued that the success of Transformers in computer vision relies mostly on the general architecture rather than on the design of the token mixer. To verify this, they proposed using an embarrassingly simple non-parametric operation, average pooling, as the token mixer, and still achieved state-of-the-art performance on various computer vision tasks such as image classification, object detection, and instance segmentation.

In the paper, they term the general architecture of the Transformer MetaFormer and investigate the potency of PoolFormer, in which average pooling serves as the token mixer. Surprisingly, PoolFormer outperforms Transformer architectures that adopt more sophisticated modules such as attention and spatial MLPs.
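The MetaFormer abstraction boils down to two residual sub-blocks, where the token mixer is the only free design choice. A minimal sketch in PyTorch (the specific norm and channel-MLP modules in the usage snippet below are illustrative choices of mine, not prescribed by the abstraction):

```python
import torch
import torch.nn as nn

# The MetaFormer block in two lines: `token_mixer` is the pluggable part.
# Attention, a spatial MLP, or pooling can all be dropped in here.
def metaformer_block(x, token_mixer, norm1, norm2, channel_mlp):
    x = x + token_mixer(norm1(x))    # sub-block 1: mix information across tokens
    x = x + channel_mlp(norm2(x))    # sub-block 2: mix information across channels
    return x

# With an identity "mixer" the block reduces to a residual channel MLP,
# which is exactly what makes the token mixer easy to swap out.
x = torch.randn(2, 64, 14, 14)       # (batch, channels, height, width)
out = metaformer_block(
    x,
    token_mixer=nn.Identity(),
    norm1=nn.GroupNorm(1, 64),
    norm2=nn.GroupNorm(1, 64),
    channel_mlp=nn.Sequential(nn.Conv2d(64, 256, 1), nn.GELU(), nn.Conv2d(256, 64, 1)),
)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The residual connections guarantee the block preserves the token layout, so any shape-preserving module can be tested as a token mixer.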

PoolFormer

Image from paper

The overall architecture of PoolFormer is shown in the figure above. The model has four main stages, each with the same design: one patch embedding layer followed by a stack of PoolFormer blocks.

Image by author

The details of each stage are depicted in the figure above. Patch embedding is implemented as a strided convolution (a 7x7 conv with stride 4 in the first stage, and 3x3 convs with stride 2 between later stages). The Norm layer can be Batch Normalization, Layer Normalization, or Group Normalization; the authors applied Group Normalization as it showed better results in their experiments. The Channel MLP module is implemented with two 1x1 Conv layers with an expansion ratio and one GELU activation layer in between. The official code can be found at: https://github.com/sail-sg/poolformer
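The patch embedding step can be sketched as a single strided convolution that both splits the feature map into overlapping patches and projects them to the stage's embedding dimension. The kernel/stride/channel values below follow what I recall of the official repository's defaults, so treat them as assumptions:

```python
import torch
import torch.nn as nn

# Stem: 7x7 conv, stride 4 -> 4x spatial downsampling into token embeddings.
stem = nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=2)
# Between stages: 3x3 conv, stride 2 -> halve the token resolution.
downsample = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)

img = torch.randn(1, 3, 224, 224)
tokens = stem(img)           # -> (1, 64, 56, 56): each spatial position is one token
tokens = downsample(tokens)  # -> (1, 128, 28, 28): next stage, half the resolution
print(tokens.shape)
```

Because the tokens stay in a 2D grid, the 1x1 convs of the Channel MLP act exactly like a per-token linear layer.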

Image from paper

Results

PoolFormer has been compared with other state-of-the-art models on various benchmarks of computer vision tasks including image classification, object detection, and instance segmentation. The results are summarized as follows:

Image Classification

Image from paper

Object Detection & Instance Segmentation

Image from paper

PyTorch-like Code for the PoolFormer Block

Image from paper
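To make the pseudocode concrete, here is a self-contained PyTorch sketch of a PoolFormer block, assuming a pool size of 3 and an MLP expansion ratio of 4 (hyperparameter names are my own, not necessarily the official ones). Note the key trick: the token mixer is average pooling minus the input, so with the residual connection only neighbourhood information is added:

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """PoolFormer's token mixer: average pooling minus the identity."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):
        # Subtracting x means the mixer contributes only neighbour averages;
        # on a constant input it outputs exactly zero.
        return self.pool(x) - x

class PoolFormerBlock(nn.Module):
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)       # group norm over all channels
        self.token_mixer = Pooling(pool_size)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(               # channel MLP as two 1x1 convs
            nn.Conv2d(dim, dim * mlp_ratio, 1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        x = x + self.token_mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```

A quick check: `PoolFormerBlock(64)(torch.randn(2, 64, 14, 14))` returns a tensor of the same shape, since both sub-blocks are residual and shape-preserving.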

Conclusions

In this post, I have briefly reviewed MetaFormer, the general Transformer-based architecture that, according to the authors, is what truly accounts for the success of Transformers and their variants in computer vision. The authors proposed MetaFormer based on the hypothesis that the competence of Transformer and MLP-like models comes from their general structure rather than from the token mixer. To test this hypothesis, an embarrassingly simple non-parametric operator, pooling, was used as the token mixer, and it still outperformed more advanced modules such as attention. The potency of MetaFormer has also been validated on image classification, object detection, and instance segmentation benchmarks.

Readers are welcome to visit my Facebook fan page, where I share things about Machine Learning: Diving Into Machine Learning. Further notable posts from me can also be found here:

Thanks for spending your time here!


PhD in Machine Learning and Computer Vision | Once the WHY is clear, the HOW goes easy.