
Recent Developments and Views on Computer Vision x Transformer

On the differences between Transformer and CNN, why Transformer matters, and what its weaknesses are.

About this post

This article discusses some of the interesting research and insights in Transformer x Computer Vision research since the advent of Vision Transformer. The four themes of this article are as follows:

  1. Differences in receptive field size and behavior between Transformer and CNN
  2. Is Self-Attention essential for Transformer?
  3. Weaknesses of Vision Transformers and directions for improvement
  4. The rapid expansion of the Transformer and why it is so important

As a summary of this article, my view is as follows:

  • Since Vision Transformer, the scope of application of the Transformer has expanded rapidly. I personally think this is because it can be applied to a wide variety of data and makes it easy to correlate different modalities.
  • One of the major differences between Transformer and CNN is the wide field of view. Perhaps because of this, Transformers are more robust than CNNs to changes in texture and generate different patterns of adversarial perturbations.
  • According to recent studies, Self-Attention may not be essential to the Transformer. In my opinion, what matters is that the encoder block has two parts: one to handle global information and another to propagate it locally.
  • The weaknesses of Vision Transformer are that it requires a lot of memory and a lot of data, but they are rapidly being addressed.

Transformer and Vision Transformer

Transformer

First, I would like to explain the Transformer Encoder used in Vision Transformer. The Transformer is a model proposed in the paper "Attention Is All You Need" [2]. The title of the paper was provocative to those who had been using LSTMs and CNNs.

It uses neither CNN nor LSTM, but a mechanism called dot-product Attention, and the model built on it (the Transformer) outperformed existing methods by a large margin.

Taken from [2], an overview of the Transformer model

There are three variables in the (dot-product) Attention used in the Transformer: Query, Key, and Value. Simply put, the mechanism computes an Attention Weight between the Query and each Key, and takes a weighted sum of the Values associated with those Keys.

Dot-product attention
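As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention as described above. The shapes and variable names are my own, not taken from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of dot-product attention.
    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax -> attention weights
    return weights @ V                                 # weighted sum of values

# toy example: 4 query tokens, 6 key/value tokens, 8-dim features
Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)            # shape (4, 8)
```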

Multi-Head Attention, which uses multiple Attention Heads (in MLP terms, this is like increasing the "number of hidden layers"), is defined as follows. The (Single-Head) Attention in the figure above uses Q and K as they are, but in Multi-Head Attention each head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and the features projected by these matrices are used to compute the Attention.

Multi-Head Attention. Top left image is taken from [2]
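A rough NumPy sketch of Multi-Head Self-Attention, with each head using its own projection matrices W_i^Q, W_i^K, W_i^V (random matrices here stand in for learned parameters; sizes are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention (softmax over keys)."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

def multi_head_attention(X, num_heads, rng):
    """Each head projects X with its own W_i^Q, W_i^K, W_i^V, attends,
    and the head outputs are concatenated and projected by W^O."""
    n, d = X.shape
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        Wq, Wk, Wv = (rng.standard_normal((d, d_head)) for _ in range(3))
        heads.append(attention(X @ Wq, X @ Wk, X @ Wv))
    Wo = rng.standard_normal((d, d))
    return np.concatenate(heads, axis=-1) @ Wo         # (n, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 64))                      # 10 tokens, 64-dim features
out = multi_head_attention(X, num_heads=8, rng=rng)    # (10, 64)
```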

When the Q, K, and V used in this dot-product Attention are all derived from the same data, it is called Self-Attention. In the Transformer, the Encoder and the lower part of the Decoder use Self-Attention. The upper part of the Decoder is not Self-Attention, because the Q comes from the Decoder while the K and V are brought from the Encoder. The following figure shows an example of attention weights. Here, the word "making" is used as the query, and the Attention Weight for each key word is calculated and visualized. Each attention head learns different dependencies; the key words are colored in multiple ways to represent the Attention Weight of each head.

Quoted from [2], the weights of the Transformer's Self-Attention. I have added the subtitle.

How Vision Transformer works

Vision Transformer is a model that applies the Transformer to the image classification task and was proposed in October 2020. Its contents are almost the same as the original Transformer, but it introduces an ingenious way to handle images in the same manner as natural language.

Vision Transformer architecture, quoted from [1].

The Vision Transformer divides the image into N patches of size 16×16. Since each patch is still three-dimensional data (height × width × channels), it cannot be handled directly by the Transformer, which processes sequences of vectors. Therefore, after flattening the patches, a linear projection converts them into two-dimensional data (N tokens × embedding dimension). By doing so, each patch can be treated as a token (like a word) and fed into the Transformer.
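A minimal PyTorch sketch of this patch embedding step, assuming the standard 16×16 patch size and a 768-dimensional embedding (the module name and exact tensor manipulation are my own, not the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split the image into non-overlapping P x P patches, flatten each patch,
    and linearly project it to the model dimension so it can be used as a token."""
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, embed_dim)

    def forward(self, x):                         # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        x = x.unfold(2, P, P).unfold(3, P, P)     # (B, C, H//P, W//P, P, P)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P*P*C)
        return self.proj(x)                       # (B, N, embed_dim)

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))  # (1, 196, 768)
```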

In addition, Vision Transformer uses a pre-training → fine-tuning strategy: it is pre-trained on JFT-300M, a dataset containing 300 million images, and fine-tuned on downstream tasks such as ImageNet. Vision Transformer was the first pure Transformer model to achieve SotA performance on ImageNet, and this marked the beginning of a large increase in research on Transformer x Computer Vision.

Why is Vision Transformer so accurate?

Research on Transformer x Computer Vision had been around for a while, but it had not been able to achieve SotA performance on ImageNet. The authors interpret the reason for this in terms of the inductive bias of the model and the amount of data. Inductive bias is an assumption that the model makes about the data. For example, CNNs process data with 3×3 kernels, which is based on the assumption that information is locally aggregated. In RNNs, data at the current time step is strongly correlated with data at the previous time step, and data from two steps back influences the current step only through the previous one; this processing is likewise based on the assumption that data is strongly correlated with the previous time step. On the other hand, since Self-Attention only correlates every pair of elements, its inductive bias is relatively low compared to CNNs and RNNs.

(Left) CNN, which has a strong inductive bias that information is locally aggregated. (Center) RNN, which has a strong inductive bias in that it is strongly correlated with the previous time step. (Right) Self-Attention, which has a relatively weak inductive bias because it simply correlates all features.

The authors interpret the strength of ViT as follows: "In situations with little data, models with a strong inductive bias do better than those with a weak inductive bias because their assumptions about the data help. On the other hand, when there is a lot of data, those assumptions become a hindrance, so models with a weak inductive bias become stronger." The following figure reinforces this interpretation: Vision Transformer and CNN are compared across pre-training dataset sizes. When pre-trained on JFT-300M, ViT outperforms the CNN (a model with a strong inductive bias).

Data amount and accuracy

Trends in Transformer x Computer Vision

From here, I will introduce research trends in Transformer x Computer Vision and interesting behaviors of Vision Transformer models found in recent research. There are four themes, listed below. Please note that they contain many of my personal views.

  1. Expanding application areas of Transformer and why.
  2. Differences in the receptive field range and behavior of Transformer and CNN
  3. Is Self-Attention essential for Transformer?
  4. Weaknesses of Vision Transformer and directions for improvement

1. Expanding application areas of Transformer and why

Since the introduction of Vision Transformer, there has been a rapid increase in the number of studies applying the Transformer to various data and tasks, especially in Computer Vision. Vision Transformer targeted image classification, but there are also applications such as Swin Transformer, which is applied to semantic segmentation and object detection [13], and DPT, which is applied to depth estimation [17]. In terms of different data formats, there are Point Transformer [18], which is applied to point cloud data, and Perceiver [19], which can be applied to audio, video, and point clouds as well as images.

In the field of Vision & Language, which combines Computer Vision with natural language, the field where the Transformer was originally used, there are also many applications, such as UniT, which can execute multiple tasks, including Computer Vision tasks, simultaneously.

As you can see, Transformers have come to be used widely in various fields. Why are Transformers being adopted in this way? In my opinion, there are the following reasons.

1–1. It was found that it can be applied not only to natural language processing but also to image processing, so it has expanded greatly.

1–2. No need to change the network depending on the data, which is convenient.

1–3. It is easy to correlate between different data.

1–1. It was found that it can be applied not only to natural language processing but also to image processing, so it has expanded greatly.

Originally, the Transformer was used in the field of natural language processing, and its effectiveness was also confirmed in speech recognition, for example with the success of the Transformer Transducer [20]. Thanks to Vision Transformer, we now know that it can be even better than CNNs in the image domain. The research fields of speech and natural language processing are broad on their own. Now, with the addition of Computer Vision, which also covers a wide range of research, and the ability to use it in combined fields such as speech + images and Vision & Language, the scope of application has expanded rapidly.

1-2. No need to change the network depending on the data, which is convenient.

In the past, it was necessary to use separate networks, such as image → CNN and natural language → Transformer, but now all of the data can be processed by a Transformer, which makes models such as Perceiver [19] easy to use.

They propose Perceiver, a Transformer model that can handle high-dimensional inputs with more than 100,000 features and many data formats such as video + audio, images, and point clouds. It reduces the amount of computation by retrieving Q from a latent space. It not only achieved high performance for images and point clouds, but also obtained SotA performance for video + audio. The images are quoted from [19].
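A rough sketch of the core idea of querying from a small latent array: the attention matrix becomes (latents × inputs) instead of (inputs × inputs), so the cost grows linearly with the input size. The sizes below are illustrative, much smaller than the >100,000 inputs the paper targets:

```python
import numpy as np

def cross_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True)); w /= w.sum(-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
num_inputs, num_latents, d = 20_000, 256, 64     # toy sizes to keep the demo light
inputs  = rng.standard_normal((num_inputs, d))   # K and V come from the raw inputs
latents = rng.standard_normal((num_latents, d))  # Q comes from a small latent array

# attention matrix is (256 x 20,000) instead of (20,000 x 20,000)
out = cross_attention(latents, inputs, inputs)   # (256, 64)
```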

1-3. Easy to correlate between different data.

Since everything, including images and language, can be processed as tokens, there is no need to have one branch network for images and another for natural language; this may be another reason for its growing use. Even in the lower layers, different modalities can be correlated with Self-Attention. Typical examples are ViLT [21], VATT [22], and VL-T5 [23].

By treating data of different modalities as tokens, it is easy to take correlations between different data. (Top) VL-T5 [23], (bottom) ViLT [21].
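A toy sketch of this idea (not the architecture of any specific paper): once image patches and text are both embedded as tokens of the same dimension, a single Transformer encoder can attend across the two modalities directly.

```python
import torch
import torch.nn as nn

d_model = 256
image_tokens = torch.randn(1, 196, d_model)   # e.g. 14x14 patch embeddings (hypothetical)
text_tokens  = torch.randn(1, 20,  d_model)   # e.g. 20 word embeddings (hypothetical)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)
# concatenate the token sequences and let Self-Attention correlate the modalities
fused = encoder(torch.cat([image_tokens, text_tokens], dim=1))  # (1, 216, 256)
```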

2. Differences in the receptive field range and behavior between Transformer and CNN

Transformer is now as good as or better than CNN in Computer Vision, but what are the differences between the two?

First of all, there is a difference in the field of view: CNNs use 3×3 or 7×7 kernels, so each layer can only see a correspondingly small region. The field of view expands as information propagates through the layers, but the expansion is only linear with depth.

The Transformer, on the other hand, uses Self-Attention, which allows the network to see the entire image from the first layer. Since each patch is treated as a token and all tokens are correlated with each other, it can learn global features from the beginning.

The difference in the size of the field of view when applying a CNN (kernel_size=3) and Self-Attention to a 32×32 image. The CNN's field of view increases linearly, while Self-Attention (Vision Transformer) has the entire field of view from the beginning.
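A small back-of-the-envelope script for the comparison in the figure: the receptive field of stacked 3×3 convolutions (stride 1) grows linearly with depth, while self-attention over patches covers the whole image from the first layer.

```python
image_size, kernel = 32, 3

for layer in range(1, 6):
    cnn_rf = 1 + layer * (kernel - 1)            # 3, 5, 7, 9, 11, ...
    print(f"layer {layer}: CNN receptive field = {cnn_rf}x{cnn_rf}, "
          f"self-attention = {image_size}x{image_size}")
```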

In fact, the following figure shows the expansion of the receptive field of a trained Vision Transformer. Most of it expands linearly, but some heads acquire global information from the initial layers.

Field of view of a trained Vision Transformer. Image taken from [1].

Also, perhaps due to the size of its field of view, research has shown that the Transformer makes object decisions based on shape more than CNNs do. A study [3] claims that CNN decisions are more texture-based than shape-based, which differs from the conventional view and suggests that CNNs classify objects based on texture rather than shape. For example, in the following figure (c), humans tend to judge the image to be a cat based on its shape, but CNN-based models judge it to be an Indian elephant based on its texture.

CNNs judge by texture rather than shape. Image taken from [3].

This is shown quantitatively in the figure below, which reports classification accuracy on images whose texture has been perturbed while preserving the shape. The red points are humans, and the blue points are CNN-based models. Human judgments are robust to texture perturbations, while the CNNs' accuracy drops greatly. In other words, the CNN criterion depends strongly on texture, not shape.

Quantitative evidence that CNNs judge by texture rather than shape. Comparison of classification accuracy between humans (red) and CNN-based models (blue) on images with texture perturbations that preserve the shape. Image taken from [3].

Compared to CNN-based models, the Transformer model (ViT) is relatively robust to texture perturbations [4]. This may have something to do with the Transformer's wide field of view.

Transformer is less texture dependent than CNN. Image taken from [4].

There are other interesting properties that may be attributed to the size of the field of view. There is a technique called the adversarial example [5], which uses intentionally crafted noise to make the model misclassify, and it has been found that the noise differs between CNN-based models and the Transformer [6]. In the noise that causes the dog image to be misclassified, the CNN (ResNet) noise has many high-frequency components and a local structure. On the other hand, the noise for the Transformer (ViT) has relatively low-frequency components and a large-scale structure. This may also be due to the size of the field of view. (It is very interesting that you can clearly see the boundaries of the 16×16 patches.)

An example of an adversarial example. By adding deliberately crafted noise, the model misclassifies the panda image as a gibbon. The image is taken from [5].
Comparison of adversarial noise: ViT's noise has a large structure with relatively low-frequency components, while ResNet's consists of high-frequency components. The image is taken from [6].

3. Is Self-Attention Essential for Transformers?

ViT has had great success in Computer Vision, but there is also a lot of research exploring whether there is a better structure than Self-Attention. For example, MLP-Mixer [7] does not use Self-Attention, but instead uses the Multi-Layer Perceptron (MLP), the most basic building block of deep learning, with results comparable to Vision Transformer. Like Vision Transformer, it requires pre-training on huge datasets such as JFT-300M, but it achieves high performance without using a complex mechanism such as Self-Attention.

The basic structure is the Mixer Layer, a block containing MLP1, which processes information across patches, and MLP2, which processes information within each patch; these blocks are stacked in place of the Transformer Encoder. This method is similar to ViT in that it divides the image into patches and linearly projects them into 2D data, essentially replacing ViT's Transformer with MLPs.

Structure of MLP-Mixer. The Mixer Layer is a block containing MLP1, which processes information across patches, and MLP2, which processes information within each patch; these blocks are stacked. The image is taken from [7].
Comparison of MLP-Mixer, ViT, and Big Transfer (BiT, CNN-based). Like ViT, MLP-Mixer is inferior to CNNs under low-data conditions, but exceeds them when the data is large. The image is taken from [7].
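A rough PyTorch sketch of one Mixer layer (simplified; the LayerNorm placement and hidden sizes follow the general idea, not the exact paper configuration). Here `token_mlp` mixes information across patches (MLP1 in the text) and `channel_mlp` mixes information within each patch (MLP2):

```python
import torch
import torch.nn as nn

class MixerLayer(nn.Module):
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(                 # acts along the patch axis
            nn.Linear(num_patches, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(               # acts along the feature axis
            nn.Linear(dim, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, dim))

    def forward(self, x):                               # x: (B, num_patches, dim)
        # token mixing: transpose so the MLP mixes information across patches
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # channel mixing: MLP processes each patch's features independently
        return x + self.channel_mlp(self.norm2(x))

out = MixerLayer(num_patches=196, dim=512)(torch.randn(2, 196, 512))
```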

Influenced by this, a paper [8] appeared claiming that Self-Attention is not an essential element. The gMLP proposed in that paper also uses only MLPs, similar to MLP-Mixer, but with a Squeeze-and-Excitation-like gating structure; it too has both a part that processes information in the channel direction and a part that processes information in the spatial direction. Even without a huge dataset such as JFT-300M, it reaches accuracy comparable to EfficientNet by training on ImageNet alone. It is also notable that it is about as accurate as BERT, not only in Computer Vision tasks but also in natural language processing.

The structure of gMLP, which is divided into a mechanism for processing information in the channel direction and a mechanism for processing information in the spatial direction. The image is taken from [8].
Results of gMLP. In image recognition, ViT could not surpass CNNs without JFT-300M, but gMLP achieves results comparable to EfficientNet with ImageNet alone. In natural language processing, gMLP produces results comparable to BERT. The image is taken from [8].
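A rough sketch of the gMLP block and its spatial gating, simplified relative to the paper (module names, hidden sizes, and the exact normalization placement are my own):

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """Split the channels in half, apply a linear projection along the *token*
    (spatial) axis to one half, and use it to gate the other half elementwise."""
    def __init__(self, dim_ffn, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim_ffn // 2)
        self.spatial_proj = nn.Linear(num_tokens, num_tokens)

    def forward(self, x):                        # x: (B, N, dim_ffn)
        u, v = x.chunk(2, dim=-1)
        v = self.spatial_proj(self.norm(v).transpose(1, 2)).transpose(1, 2)
        return u * v                             # (B, N, dim_ffn // 2)

class gMLPBlock(nn.Module):
    """Channel projection -> spatial gating -> channel projection, with a residual."""
    def __init__(self, dim, dim_ffn, num_tokens):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_in = nn.Linear(dim, dim_ffn)
        self.sgu = SpatialGatingUnit(dim_ffn, num_tokens)
        self.proj_out = nn.Linear(dim_ffn // 2, dim)

    def forward(self, x):                        # x: (B, N, dim)
        shortcut = x
        x = torch.nn.functional.gelu(self.proj_in(self.norm(x)))
        return shortcut + self.proj_out(self.sgu(x))

out = gMLPBlock(dim=256, dim_ffn=1024, num_tokens=196)(torch.randn(2, 196, 256))
```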

There is also a study that makes an even more radical claim: Self-Attention can be replaced by a mechanism with no learnable parameters at all, as long as it can mix information over the entire range. In FNet [9], a study in the field of natural language processing, Self-Attention is replaced by a Fourier transform. The Fourier transform learns nothing, since it only changes the basis, and adding the features before and after the transform gives a structure that is hard to interpret physically, yet even this produces reasonable results. The authors argue that it is enough to mix the information between tokens, and they also experiment with a network that uses a random matrix (with fixed parameters) instead of Self-Attention, in addition to the Fourier transform.

Structure of FNet, where the Self-Attention part of the Transformer is replaced by a Fourier transform. Image taken from [9].
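A minimal sketch of the idea of an FNet-style encoder block: the attention sublayer is replaced by a parameter-free 2D Fourier transform over the token and hidden dimensions, keeping only the real part, followed by the usual residual connections and feed-forward sublayer (hidden sizes are illustrative):

```python
import torch
import torch.nn as nn

class FNetBlock(nn.Module):
    def __init__(self, dim, hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):                         # x: (B, N, dim)
        # token mixing with no learnable parameters: just a change of basis
        mixed = torch.fft.fft2(x, dim=(-2, -1)).real
        x = self.norm1(x + mixed)                 # add features before and after the FFT
        return self.norm2(x + self.ff(x))         # local propagation via feed-forward

out = FNetBlock(dim=256)(torch.randn(2, 128, 256))
```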

Recently, there has been a lot of research on networks that achieve good results without Self-Attention, such as MLP-Mixer, gMLP, and FNet. All three share a mechanism that processes global information (Self-Attention in ViT, MLP1 in MLP-Mixer, the Fourier transform in FNet) and one that propagates it locally (the Feed-Forward layer in ViT and FNet, MLP2 in MLP-Mixer).

Although this is partly my personal opinion, it may be important to have a block structure that not only mixes information between tokens (patches in the case of images), as FNet claims, but also propagates it locally after taking the global information into account.


4. Weaknesses of Vision Transformer and directions for improvement

Although Vision Transformer has achieved great results in surpassing CNN, it has two major weaknesses. However, improvements are rapidly being made and these weaknesses are being overcome.

  • Due to its weak inductive bias, a dataset such as JFT-300M (300 million images), far larger than ImageNet (1.3 million images), is required to obtain good accuracy
  • Due to the nature of Self-Attention, the memory required grows with the fourth power of the edge length of the image.

Because of its weak inductive bias, JFT-300M (300 million images), which is far larger than ImageNet (1.3 million images), is required to obtain good accuracy. Vision Transformer was able to surpass CNNs because of its weak inductive bias, but conversely, accuracy drops if the amount of data is not large enough to take advantage of it. Specifically, it could not surpass CNN-based models without 300 million images. To overcome this, various improvements have been proposed.

Using CNNs

There are attempts to reduce the amount of data required by using CNNs: DeiT [10] uses a knowledge distillation framework in which a CNN serves as the teacher model and its knowledge is fed to the Transformer model. By doing so, even using ImageNet alone, the results exceed not only ViT but also EfficientNet. It has also been reported that distillation brings the model's decision tendencies much closer to those of the CNN.

(Left) Conceptual diagram of the knowledge distillation performed in DeiT. (Right) Results of DeiT, a ViT trained with knowledge distillation using a CNN as the teacher model. The image is taken from [10].
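A simplified sketch of hard-label distillation with a CNN teacher, in the spirit of DeiT-style training (DeiT additionally attaches a dedicated distillation token to the student; that detail is omitted here, and the function name and alpha value are my own):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """The student (Transformer) is supervised both by the true labels and by
    the hard predictions of a frozen CNN teacher."""
    teacher_labels = teacher_logits.argmax(dim=-1)        # hard teacher decisions
    loss_true = F.cross_entropy(student_logits, labels)
    loss_teacher = F.cross_entropy(student_logits, teacher_labels)
    return (1 - alpha) * loss_true + alpha * loss_teacher

# toy usage with random logits for a batch of 8 images and 1000 classes
student = torch.randn(8, 1000, requires_grad=True)        # Transformer outputs
teacher = torch.randn(8, 1000)                            # e.g. from a frozen CNN
labels = torch.randint(0, 1000, (8,))
loss = distillation_loss(student, teacher, labels)
```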

Also, since Vision Transformer handles local information in a very simple way by linearly projecting 16×16 patches, there is research into using CNNs there, since they are better at capturing local information.

(Left) ViT uses a patch-wise embedding representation in the Transformer, while CeiT uses an abstracted embedding representation obtained with CNN convolutions. (Right) CNNs are placed inside the Transformer to make it more robust in acquiring local features. Images are taken from [11], [12]; annotations added by me.

Due to the nature of Self-Attention, the memory requirement grows with the fourth power of the edge length of the image. Because Self-Attention computes the correlation between all pairs of patches, it requires memory proportional to the fourth power of the edge length, which makes it difficult to handle high-resolution images. However, some studies address this problem by adopting a hierarchical structure, as in CNNs, where the initial layers handle high-resolution feature maps and the resolution is gradually reduced with depth. For example, PVT [14] uses a high-to-low-resolution hierarchy like a CNN, while Swin [13] not only uses a hierarchical structure, but also narrows the field of view of Self-Attention, reducing the patch size to obtain more detailed information.
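The quick arithmetic behind the "fourth power of the edge length" claim: the number of patches N grows with the square of the edge length, and Self-Attention builds an N × N attention matrix, so memory scales with edge⁴.

```python
patch = 16
for edge in (224, 448, 896):
    n_tokens = (edge // patch) ** 2
    attn_entries = n_tokens ** 2               # per head, per layer
    print(f"{edge}x{edge} image -> {n_tokens} tokens -> "
          f"{attn_entries:,} attention entries")
# doubling the edge length multiplies the attention entries by 16 (= 2**4)
```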

(Left) ViT, with a fixed, low resolution. (Center) PVT, which can handle high-resolution information at first by gradually decreasing the resolution. (Right) Swin Transformer, which handles high-resolution information by gradually expanding the field of view like a CNN, without attending over the entire image. Images were taken from [13], [14].

Other Improvements

In addition to CNN-related improvements, various other improvements unique to the Transformer have been proposed. For example, the T2T module [15] proposes to mix the image embedding with the surrounding patches (tokens), allowing overlap. The Swin Transformer [13] uses local attention, unlike ViT's global attention, but the attention groups are varied from layer to layer.

In addition, ViT has the problem of generating the same kind of attention maps as it gets deeper. In "DeepViT: Towards Deeper Vision Transformer" [16], the authors focus on maintaining the diversity of attention across heads: they introduce a parameter that mixes the attention maps of different heads, and succeed in maintaining that diversity even in deeper models.

(Left) Considering that image tokenization (embedding) in ViT is too simple, they propose a T2T module that re-tokenizes by mixing the surrounding tokens, allowing overlap. (Upper right) Unlike ViT, which uses global attention, Swin takes local attention within the red frame, but propagates information by changing the attention groups at each layer. (Bottom right) Introducing a learnable parameter that mixes the attention of different heads improves attention diversity.

Vision Transformer used to have problems such as "needing a lot of data" and "requiring a large memory size", but many improvements have been proposed in the past few months. CNNs are still the mainstream in practical use, but it may not be long before Transformers replace them there as well.


Conclusion

In this article, I have discussed some interesting research and insights in Transformer x Computer Vision research since the advent of Vision Transformer. My summary is as follows. I am very excited to see what future research on the Transformer will bring.

  • Since Vision Transformer, the scope of application of the Transformer has expanded rapidly. I personally think this is because it can be applied to a wide variety of data and makes it easy to correlate different modalities.
  • One of the major differences between Transformer and CNN is the wide field of view. Perhaps because of this, Transformers are more robust than CNNs to changes in texture and generate different patterns of adversarial perturbations.
  • According to recent studies, Self-Attention may not be essential to the Transformer. In my opinion, what matters is that the encoder block has two parts: one to handle global information and another to propagate it locally.
  • The weaknesses of Vision Transformer are that it requires a lot of memory and a lot of data, but they are rapidly being addressed.

— – – – – – – – – – – – – – – – – – –

🌟 I post weekly newsletters! Please subscribe!🌟

Akira’s Machine Learning News – Revue

— – – – – – – – – – – – – – – – – – –

About Me

Manufacturing Engineer / Machine Learning Engineer / Data Scientist / Master of Science in Physics / http://github.com/AkiraTOSEI/

Akihiro FUJII – Technical Lead – ExaWizards Inc. | LinkedIn

Twitter, I post one-sentence paper commentary.

— – – – – – – – – – – – – – – – – – –

Reference

  1. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv(2020)
  2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv(2017)
  3. Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv(2018)
  4. Shikhar Tuli, Ishita Dasgupta, Erin Grant, Thomas L. Griffiths. Are Convolutional Neural Networks or Transformers more like human vision? arXiv(2021)
  5. Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy. Explaining and Harnessing Adversarial Examples. arXiv(2014)
  6. Srinadh Bhojanapalli, Ayan Chakrabarti, Daniel Glasner, Daliang Li, Thomas Unterthiner, Andreas Veit. Understanding Robustness of Transformers for Image Classification. arXiv(2021)
  7. Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy. MLP-Mixer: An all-MLP Architecture for Vision. arXiv(2021)
  8. Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le. Pay Attention to MLPs. arXiv(2021)
  9. James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon. FNet: Mixing Tokens with Fourier Transforms. arXiv(2021)
  10. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv(2020)
  11. Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, Lei Zhang. CvT: Introducing Convolutions to Vision Transformers. arXiv(2021)
  12. Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, Wei Wu. Incorporating Convolution Designs into Visual Transformers. arXiv(2021)
  13. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv(2021)
  14. Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv(2021)
  15. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, Shuicheng Yan. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. arXiv(2021).
  16. Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, Jiashi Feng. DeepViT: Towards Deeper Vision Transformer. arXiv(2021)
  17. René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. Vision Transformers for Dense Prediction. arXiv(2021)
  18. Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip Torr, Vladlen Koltun. Point Transformer. arXiv(2020)
  19. Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira. Perceiver: General Perception with Iterative Attention. arXiv(2021)
  20. Qian Zhang, Han Lu, Hasim Sak, Anshuman Tripathi, Erik McDermott, Stephen Koo, Shankar Kumar. Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss. arXiv(2021)
  21. Wonjae Kim, Bokyung Son, Ildoo Kim. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. arXiv(2021)
  22. Hassan Akbari, Linagzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text. arXiv(2021)
  23. Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal. Unifying Vision-and-Language Tasks via Text Generation. arXiv(2021)
