Thoughts and Theory
Vision Transformer (ViT) has been gaining momentum in recent years. This article explains the paper "Do Vision Transformers See Like Convolutional Neural Networks?" (Raghu et al., 2021), published by Google Research and Google Brain, and explores the differences between conventionally used CNNs and the Vision Transformer.
The main claims of the paper and the content of this blog
The paper makes six central claims, comparing ResNet (He et al., 2016) as a representative of CNN-based networks with ViT as a representative of Transformer-based networks:
- ViT has more similarity between the representations obtained in shallow and deep layers compared to CNNs
- Unlike CNNs, ViT obtains global representations even in its shallow layers, but the local representations obtained in those shallow layers are also important.
- Skip connections in ViT are even more influential than in CNNs (ResNet) and substantially impact the performance and similarity of representations.
- ViT retains more spatial information than ResNet
- ViT can learn high-quality intermediate representations with large amounts of data
- MLP-Mixer’s representation is closer to ViT than to ResNet
In this blog, I will first briefly review the structures of ResNet, a representative CNN-based model, and ViT, and then take a closer look at the differences in the acquired representations described in this paper.
ResNet Basics
ResNet is a prevalent model for computer vision (CV) tasks. As shown in the figure below, the output of a block of weighted layers in ResNet is summed with a skip connection that bypasses those layers. This summation with skip connections alleviates problems such as vanishing gradients and allows the network to be much deeper than previous architectures.

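To make this concrete, here is a minimal sketch of a residual block in PyTorch (my own illustration, not code from the paper); the important part is the summation `out + identity` with the skip connection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal ResNet-style basic block: two 3x3 convolutions plus a skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                          # information carried by the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                  # summation with the skip connection
        return self.relu(out)

# Example: a 32x32 feature map with 64 channels passes through with unchanged shape.
x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```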
Vision Transformer (ViT) Basics
First, let’s look at the transformer encoder used in Vision Transformer (ViT).
Transformer
The Transformer is a model proposed in the paper "Attention Is All You Need" (Vaswani et al., 2017). It relies on a mechanism called self-attention, using neither convolutions (CNN) nor recurrence (LSTM), and the resulting Transformer model significantly outperformed existing methods.
Note that the part labeled Multi-Head Attention in the figure below is the core of the Transformer, but the model also uses skip connections, just like ResNet.

The attention mechanism used in the Transformer works with three variables: Q (Query), K (Key), and V (Value). Simply put, it calculates the attention weight between a Query token (a token is something like a word) and each Key token, and multiplies the Value associated with each Key by that weight; the attention weight expresses how strongly the Query token is associated with each Key token.

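As a reference, here is a minimal sketch of this (scaled dot-product) attention, Attention(Q, K, V) = softmax(QKᵀ/√d)V; the shapes and names are my own simplification.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d), k: (n_k, d), v: (n_k, d_v).
    Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (n_q, n_k) similarity between Query and Key
    weights = F.softmax(scores, dim=-1)           # attention weights: how much each Key matters
    return weights @ v, weights                   # weighted sum of Values, plus the weights

q = torch.randn(5, 64)   # 5 query tokens
k = torch.randn(7, 64)   # 7 key tokens
v = torch.randn(7, 64)   # one value per key
out, w = scaled_dot_product_attention(q, k, v)
print(out.shape, w.shape)  # torch.Size([5, 64]) torch.Size([5, 7])
```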
Defining the Q, K, V calculation above as a single head, multi-head attention is defined as follows. The (single-head) attention mechanism in the figure above uses Q, K, and V as they are, whereas in multi-head attention each head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and computes the attention weights using the features projected with these matrices.

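And here is a minimal sketch of the multi-head version; the head count and dimensions are arbitrary choices of mine, not the paper's settings.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Each head projects Q, K, V with its own matrices W_i^Q, W_i^K, W_i^V,
    runs scaled dot-product attention, and the head outputs are concatenated."""
    def __init__(self, d_model: int = 64, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)  # packs all heads' W_i^Q into one matrix
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # final output projection

    def forward(self, q, k, v):
        def split(x):  # (batch, tokens, d_model) -> (batch, heads, tokens, d_head)
            b, n, _ = x.shape
            return x.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = scores.softmax(dim=-1) @ v                      # per-head attention
        out = out.transpose(1, 2).reshape(q.size(0), -1, self.n_heads * self.d_head)
        return self.w_o(out)

x = torch.randn(1, 10, 64)                  # 10 tokens; self-attention means Q = K = V = x
print(MultiHeadAttention()(x, x, x).shape)  # torch.Size([1, 10, 64])
```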
If the Q, K, and V used in this attention mechanism are calculated from the same input, it is specifically called self-attention. In contrast, the upper attention block of the Transformer's decoder is not "self-" attention, since it calculates attention with Q coming from the decoder and K and V coming from the encoder.
The figure below shows what this looks like in practice: a visualization of the attention weights calculated for each Key token with the word "making" as the Query. The Transformer propagates information to later layers using multi-head self-attention, and each head learns different dependencies. The Key words in the figure are colored according to the attention weight of each head.

Vision Transformer (ViT)
Vision Transformer (ViT) is a model that applies the Transformer to the image classification task and was proposed in October 2020 (Dosovitskiy et al. 2020). The model architecture is almost the same as the original Transformer, but with a twist to allow images to be treated as input, just like natural language processing.

First, ViT divides the image into N patches of a fixed size such as 16×16. Since the patches themselves are 3D data (height × width × number of channels), they cannot be handled directly by a Transformer, which processes 2D data (a sequence of token embeddings). ViT therefore flattens each patch and applies a linear projection to convert it into 2D data, so that each patch can be treated as a token and fed into the Transformer.
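Here is one simple way this patching could be implemented (a sketch of my own; real implementations often use an equivalent strided convolution instead):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, flatten each, and project it to d_model."""
    def __init__(self, patch: int = 16, channels: int = 3, d_model: int = 768):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * channels, d_model)  # linear projection

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        b, c, h, w = img.shape
        p = self.patch
        # (b, c, h, w) -> (b, num_patches, p*p*c): each flattened patch is one "token"
        patches = img.unfold(2, p, p).unfold(3, p, p)             # (b, c, h/p, w/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        return self.proj(patches)                                 # (b, num_patches, d_model)

img = torch.randn(1, 3, 224, 224)
tokens = PatchEmbedding()(img)
print(tokens.shape)  # torch.Size([1, 196, 768]): 14 x 14 patches of size 16 x 16
```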
In addition, ViT uses the strategy of pre-training first and then fine-tuning: it is pre-trained on JFT-300M, a dataset containing 300 million images, and then fine-tuned on downstream tasks such as ImageNet. ViT was the first pure Transformer model to achieve SotA performance on ImageNet, which led to a massive surge in research applying Transformers to computer vision tasks.
However, training ViT requires a large amount of data. Transformers are less accurate with less data, but become more accurate with more data, and outperform CNNs when pre-trained on the JFT-300M. For more details, please refer to the original paper.

Comparing the representations acquired by ResNet and ViT
So far, we have seen an overview of ResNet and ViT, both of which perform well on image recognition tasks, but what is the difference between them? This is exactly what the authors of "Do Vision Transformers See Like Convolutional Neural Networks?" investigate.
Let’s take a closer look at each of the following six points, as mentioned in the introduction.
- ViT has more similarity between the representations obtained in shallow and deep layers compared to CNNs
- Unlike CNNs, ViT obtains global representations even in its shallow layers, but the local representations obtained in those shallow layers are also important.
- Skip connections in ViT are even more influential than in CNNs (ResNet) and substantially impact the performance and similarity of representations.
- ViT retains more spatial information than ResNet
- ViT can learn high-quality intermediate representations with large amounts of data
- MLP-Mixer’s representation is closer to ViT than to ResNet
1. ViT has more similarity between the representations obtained in shallow and deep layers compared to CNNs
One of the major differences between ViT and ResNet is the large field of view of ViT's initial layers.

CNNs (ResNet) have only a fixed-size kernel field of view (e.g., 3×3 or 7×7), and they gradually expand the field of view by repeatedly convolving neighboring information layer by layer. In contrast, ViT uses a self-attention mechanism that gives the model a global field of view even at the lowest layer. The field of view therefore differs depending on the structure of the network.
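To make "gradually expanding the field of view" concrete, here is a small back-of-the-envelope calculation of my own for the receptive field of stacked 3×3 convolutions:

```python
# Receptive field of n stacked 3x3 convolutions with stride 1: 1 + 2*n pixels.
# (Each extra 3x3 layer adds kernel_size - 1 = 2 pixels of context in total.)
def receptive_field(num_layers: int, kernel: int = 3, stride: int = 1) -> int:
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

for n in (1, 2, 5, 10):
    print(f"{n:2d} conv layers -> receptive field {receptive_field(n)} px")
# 1 -> 3, 2 -> 5, 5 -> 11, 10 -> 21: growth is slow and stays local.
# A ViT self-attention layer, by contrast, can attend to every patch from layer 1.
```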
The figure below shows the actual field of view (the effective distance of the self-attention mechanism) of ViT. In the shallow layers, some heads have a local field of view like a CNN, but many heads already have a global field of view.

So, what are the structural differences between ResNet and ViT in the representations acquired at each layer depth? To find out, the authors plotted the similarity of the acquired representations for each layer in the figure below (Figure 1).

In the figure above, the authors plot the similarity of the representations obtained at each layer using a measure called CKA similarity (I won't go into the technical details of CKA similarity, so please refer to the original paper if you want to know more about it). The diagonal of the figure is naturally high because each layer is compared with itself, so let's look at the other parts.
First of all, in ViT (two on the left), the overall coloration suggests that similar representations are acquired regardless of layer depth. In CNN (two on the right), on the other hand, there is little similarity between the representations acquired in the shallow and the deep layers. This may be because ViT obtains global representations from the beginning, while CNN needs to propagate through many layers to obtain them.
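For reference, here is a minimal sketch of the linear variant of CKA (based on Kornblith et al., 2019); this is only an illustration of the idea, not the exact estimator used in the paper.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representations of the same n examples.
    x: (n, p1), y: (n, p2). Returns a similarity in [0, 1]."""
    x = x - x.mean(axis=0)                      # center each feature
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2  # agreement between the two feature spaces
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 64))         # activations of layer A for 1000 examples
print(linear_cka(a, a))                 # 1.0: a representation compared with itself
print(linear_cka(a, rng.normal(size=(1000, 32))))  # much lower: unrelated random features
```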
The similarity between ViT and ResNet is plotted in the figure below (Figure 2).

We can see that the similarity between ViT's layers 1 to 40 and ResNet's layers 1 to 70 is high: ResNet needs 70 layers to acquire representations that ViT acquires in 40 layers, so the way representations are built up in the shallow layers is very different. In addition, the similarity between the deep layers of ViT and the deep layers of ResNet is low, so the abstract representations of the image are also very different between ViT and ResNet.
Incidentally, some studies have been motivated by the observation that ViT does not benefit from deepening because its self-attention maps become similar across blocks (Zhou et al., 2021). That study focuses on the high diversity between heads and proposes a mechanism called Re-Attention, which introduces learnable parameters to mix features between different heads, and achieves good results with it (DeepViT).

![DeepViT with Re-Attention benefits from deepening (Zhou et al., 2021)](https://miro.medium.com/1*4XHGiZnzkm8KuiowxNSBXw.png)
2. Unlike CNNs, ViT obtains global representations even in its shallow layers, but the local representations obtained in those shallow layers are also important
The figure below (Figure 3) shows the effective distance of the self-attention mechanism (averaged over 5,000 data points) after pre-training on JFT-300M (300 million images) and fine-tuning on ImageNet (1.3 million images).

In the shallow layers (encoder_block0, 1), we can see that the model acquires both local and global representations, whereas in the deep layers (encoder_block22, 23, 30, 31) all the representations have a global view.
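The "effective distance" plotted here is the mean attention distance: the distance between a query patch and every key patch, averaged with the attention weights. Here is a minimal sketch under my own simplifying assumptions (one head, a square patch grid):

```python
import numpy as np

def mean_attention_distance(attn: np.ndarray, grid: int, patch_px: int = 16) -> float:
    """attn: (N, N) attention weights of one head over N = grid*grid image patches
    (rows sum to 1). Returns the attention-weighted average pixel distance."""
    coords = np.array([(i // grid, i % grid) for i in range(grid * grid)], dtype=float)
    # pairwise Euclidean distance between patch centers, in pixels
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1) * patch_px
    return float((attn * dist).sum(axis=1).mean())   # average over query patches

grid = 14                                            # 224x224 image, 16x16 patches
local = np.eye(grid * grid)                          # each patch attends only to itself
uniform = np.full((grid * grid,) * 2, 1 / grid**2)   # each patch attends everywhere equally
print(mean_attention_distance(local, grid))          # 0.0   -> purely "local" head
print(mean_attention_distance(uniform, grid))        # large -> "global" head
```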
As we saw in the description of the Vision Transformer, training ViT requires a large amount of data (e.g., JFT-300M), and if the data is insufficient, accuracy decreases. The following figure (Figure 4) shows the same analysis for that case.

Comparing Figure 3 and Figure 4, we can see that the model cannot acquire local representations in the shallow layers when the dataset is small. Combining this result with the fact that ViT does not reach high accuracy when data is scarce, we can conclude that the local representations acquired by a ViT trained with sufficient data have a significant effect on accuracy.
But what is the relationship between the amount of data and the representations acquired? The figure below (Figure 12) illustrates this.

For the shallow-layer representations, about 10% of the data is enough to raise the similarity to the representations obtained with all of the data to some extent. For the deep-layer representations, however, even with 30% of the data the similarity stays below 0.2. From this, we can say that the deep-layer representations, which contribute to accuracy, can only be learned with a large amount of data. It was mentioned earlier that local representations are important, but the global representations obtained in the deep layers also seem to be important.
Although it is not specified here, the experiment was probably conducted with JFT-300M, so even 3% of the total data is still about 10M images (roughly 10 times the size of ImageNet). In my opinion, 30% of the data (about 100M images) is enough to obtain the local representations that should be acquired in the shallow layers, and with even more data it becomes possible to acquire the important parts of the global representations.
3. Skip connections in ViT are even more influential than in CNNs (ResNet) and substantially impact the performance and similarity of representations
Next, let’s take a look at the relationship between skip-connections and similarity of acquired expressions. This is shown in the figure below (Figure 8).

In the experiment shown in this figure, the authors compute the similarity of the acquired representations when the skip connection of a given layer i is removed. Comparing this figure with the left side of Figure 1 (ViT), you can see that the similarity structure of the acquired representations changes drastically after the layer i whose skip connection is removed. In other words, skip connections have a significant impact on how representations are propagated, and removing one changes the similarity between layers substantially. Incidentally, when a skip connection in the middle layers is removed, accuracy drops by about 4%.
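As an aside, here is a minimal sketch of what "removing the skip connections of layer i" could look like for a simplified ViT encoder block; this is my own illustration of the ablation, not the authors' code.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A simplified ViT encoder block whose skip connections can be switched off."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, use_skip: bool = True):
        super().__init__()
        self.use_skip = use_skip
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out if self.use_skip else attn_out   # skip connection (or not)
        h = self.mlp(self.norm2(x))
        return x + h if self.use_skip else h

# Ablation: stack 12 blocks and drop the skip connections only in block i = 6.
blocks = nn.Sequential(*[EncoderBlock(use_skip=(i != 6)) for i in range(12)])
print(blocks(torch.randn(1, 197, 64)).shape)   # torch.Size([1, 197, 64])
```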
Next, let's see what role skip connections play in information propagation. Take a look at the figure below (Figure 7).

The left panel of Figure 7 plots, for each token, the ratio ||z_i|| / ||f(z_i)||, where z_i is the input to the self-attention in layer i (the information carried by the skip connection) and f(z_i) is the feature obtained after z_i is transformed by the self-attention and the multilayer network (note that token 0 is the class token, not an image patch). The larger the ratio, the more information is propagated through the skip connection. The left panel shows that in the early layers the class token is propagated mainly through the skip connection, while the image tokens are propagated mainly through the self-attention and the multilayer network; the trend is reversed in the deeper layers.
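Here is a minimal sketch of how this ratio could be measured per token, given the block input z and the branch output f(z) before the residual addition (names, shapes, and numbers are mine):

```python
import torch

def skip_vs_branch_ratio(z: torch.Tensor, f_z: torch.Tensor) -> torch.Tensor:
    """z: (tokens, d) input carried by the skip connection,
    f_z: (tokens, d) output of the self-attention/MLP branch before the addition.
    Returns ||z_i|| / ||f(z_i)|| for each token; large values mean the skip
    connection dominates information propagation for that token."""
    return z.norm(dim=-1) / f_z.norm(dim=-1)

z = torch.randn(197, 768)            # 196 image patches + 1 class token (token 0)
f_z = 0.1 * torch.randn(197, 768)    # a weak branch output, purely for illustration
ratios = skip_vs_branch_ratio(z, f_z)
print(ratios[0], ratios[1:].mean())  # class token vs. average over image tokens
```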
The right panel shows a comparison with ResNet. The green line is ResNet; ViT has much larger values, indicating that information propagation through the skip connections plays a larger role in ViT.
Although it was not specifically mentioned in the paper, the fact that skip connections play such a major role in information propagation may be why accuracy dropped significantly when a skip connection in the middle layers was removed in Figure 8.
4. ViT retains more spatial information than ResNet
Next, let’s compare how much location information is retained by ViT and ResNet. Look at the figure below.

In this experiment, the authors test whether ViT and ResNet retain positional information by plotting the CKA similarity between a patch of the input image at a given position and the final-layer feature map at each position. If positional information is retained, the similarity should be high only at the position of the feature map that corresponds to the input patch.
First, let's look at ViT (top and middle rows). As expected, in the final layer the similarity is high at the corresponding positions, which means that ViT propagates representations while retaining positional information. Next, let's look at ResNet (bottom row). Here the similarity is also high at unrelated locations, indicating that it does not retain positional information as well.
This difference in trend is probably due to the difference in network structure. Look at the figure below (taken from Wang et al., 2021).
![Figure from Wang et al., 2021](https://towardsdatascience.com/wp-content/uploads/2021/10/12-h_oXk5-qZGSZSY4cxqAQ.png)
ResNet and other CNN-based image classification networks propagate representations with decreasing resolution. For example, ResNet has five stages, each of which halves the resolution, so the final feature map is 1/32 × 1/32 of the input size (left figure above). ViT, on the other hand, tokenizes the image into 16×16 patches first, which reduces the resolution by that factor, but then propagates to the final layer at that resolution. Therefore, ViT is more likely to retain positional information than ResNet. That said, image classification tasks do not require positional information for the classification decision, so retaining positional information does not by itself give ViT an advantage over ResNet.
In addition, the tactic of gradually decreasing the resolution, as in ResNet, is now often used in Vision-Transformer-related research; the Pyramid Vision Transformer shown on the right above is one example. Transformer models use self-attention, whose memory consumption grows quadratically with the number of tokens, i.e., roughly with the fourth power of the image side length. This makes it difficult to handle large resolutions, but by gradually decreasing the resolution, as CNNs do, it becomes possible to handle high-resolution information in the first layers while saving memory.
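A quick back-of-the-envelope calculation of my own shows why full-resolution self-attention is expensive: the attention matrix has (number of tokens)² entries, and the number of tokens itself grows with the square of the image side.

```python
# Number of tokens and attention-matrix entries for 16x16 patches at various resolutions.
for side in (224, 448, 896):
    tokens = (side // 16) ** 2              # tokens grow with the square of the side
    print(f"{side}x{side}: {tokens:6d} tokens, "
          f"{tokens ** 2:12,d} attention entries per head")
# 224x224:    196 tokens,        38,416 entries
# 448x448:    784 tokens,       614,656 entries  (side x2 -> entries x16)
# 896x896:   3136 tokens,     9,834,496 entries
# With the finer 4x4 patches used in the first stage of PVT, even a 224x224 image
# already gives 3136 tokens, hence the need to reduce resolution in later stages.
```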
5. ViT can learn high-quality intermediate representations with large amounts of data
Next, let’s look at the quality of the middle layer representation. The following figure (Figure 13) shows the experiment.

In this experiment, the authors test whether the representations of the intermediate layers can be used for classification with a linear model. The higher the accuracy obtained with such a simple linear model, the better the representation learned by that layer.
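Here is a minimal sketch of such a linear probe using scikit-learn; the feature-extraction step is only hinted at, and all names and numbers are mine.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(feats_train, y_train, feats_test, y_test) -> float:
    """Train a linear classifier on frozen intermediate features and report accuracy.
    feats_*: (n_examples, feature_dim) activations taken from one layer of the model."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(feats_train, y_train)
    return clf.score(feats_test, y_test)

# Toy illustration with random "features" from two hypothetical layers
# (random features, so the accuracies will be near chance):
rng = np.random.default_rng(0)
y_tr, y_te = rng.integers(0, 10, 500), rng.integers(0, 10, 200)
layer_feats = {"block_3": (rng.normal(size=(500, 256)), rng.normal(size=(200, 256))),
               "block_11": (rng.normal(size=(500, 256)), rng.normal(size=(200, 256)))}
for name, (tr, te) in layer_feats.items():
    print(name, linear_probe_accuracy(tr, y_tr, te, y_te))  # higher = better representation
```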
First, let's look at the relationship between the size of the dataset and the representations obtained (left figure). Here, the results on ImageNet (dotted line), which contains 1.3 million images, are compared with JFT-300M (solid line), which contains 300 million images. As you can see, the representations trained on the huge JFT-300M dataset are better. Next is a comparison of the models, including ResNet: the larger models produce better representations.
As a side note, in the right figure, the accuracy of the ResNet-based model suddenly increases near the final layer. Why is this?
A study by Frosst and colleagues provides a hint (Frosst et al., 2019). They introduced a Soft Nearest Neighbor Loss with a temperature term into the intermediate layers of ResNet and studied its behavior. The Soft Nearest Neighbor Loss measures how entangled the features of different classes are: a large value indicates that features of different classes are intertwined, while a small value indicates that they are well separated.


[Soft Nearest Neighbor Loss]
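For reference, the Soft Nearest Neighbor Loss of a batch of b features x with labels y and temperature T is defined in Frosst et al. (2019) as:

$$
\ell_{sn}(x, y; T) = -\frac{1}{b} \sum_{i=1}^{b} \log
\frac{\sum_{j \ne i,\; y_j = y_i} \exp\!\left(-\lVert x_i - x_j \rVert^2 / T\right)}
     {\sum_{k \ne i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / T\right)}
$$

The numerator sums only over examples with the same label as x_i, so the loss is small when same-class features sit close together.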
The figure below shows the value of the Soft Nearest Neighbor Loss in each block of ResNet. Although ResNet is known to be a high-performance image classification network, it does not separate the features of each class except in the last layers. In my opinion, this property of ResNet may be the reason for the rapid improvement in accuracy near the last layer shown in Figure 13.

6. MLP-Mixer’s representation is closer to ViT than to ResNet
Recently, highly accurate image classification models using multilayer perceptrons (MLPs), i.e., networks of dense layers, have been proposed instead of Transformers. A typical example is a network called MLP-Mixer (Tolstikhin et al., 2021). The structure of this network is shown in the figure below.

The MLP-Mixer mixes information between patches using MLP1, then mixes information within each patch using MLP2, and propagates by stacking blocks that combine the two. MLP-Mixer can reach the same or higher accuracy than ViT. The following figure compares the representations of MLP-Mixer in the same way as before; comparing it with Figure 1 and Figure 2, the authors say that the general trend is similar to ViT.
MLP-Mixer propagates the image by dividing it into patches like ViT, so it is structurally closer to ViT than to ResNet. This structural similarity may be the reason for such a result.


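Here is a minimal sketch of a Mixer block (MLP1 mixing across patches, MLP2 mixing within each patch); the dimensions are arbitrary choices of mine and only loosely follow Tolstikhin et al. (2021).

```python
import torch
import torch.nn as nn

def mlp(dim_in: int, dim_hidden: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(dim_in, dim_hidden), nn.GELU(),
                         nn.Linear(dim_hidden, dim_in))

class MixerBlock(nn.Module):
    """MLP1 mixes information *between* patches, MLP2 mixes *within* each patch."""
    def __init__(self, n_patches: int = 196, d_model: int = 512):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.token_mlp = mlp(n_patches, 256)   # MLP1: acts along the patch axis
        self.channel_mlp = mlp(d_model, 2048)  # MLP2: acts along the channel axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, patches, channels)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

tokens = torch.randn(1, 196, 512)     # patch embeddings, as in ViT
print(MixerBlock()(tokens).shape)     # torch.Size([1, 196, 512])
```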
Conclusion
In this article, I have looked at the differences between ViT and CNNs in detail. To recap, here are the main differences between the two. Transformers will continue to be a major influence in the field of computer vision, and I hope this article helps you understand them.
- ViT has more similarity between the representations obtained in shallow and deep layers compared to CNNs
- Unlike CNNs, ViT obtains global representations even in its shallow layers, but the local representations obtained in those shallow layers are also important.
- Skip connections in ViT are even more influential than in CNNs (ResNet) and substantially impact the performance and similarity of representations.
- ViT retains more spatial information than ResNet
- ViT can learn high-quality intermediate representations with large amounts of data
- MLP-Mixer’s representation is closer to ViT than to ResNet