Swin/Vision Transformers — Hacking the Human Eye

One ML architecture to rule them all? Perhaps not…

Allohvk
Towards Data Science


When computer vision meets NLP — Photo by Dmitry Ratushny on Unsplash

The colossal squid possibly has the largest eyes that have ever existed in the entire history of the animal kingdom. They live at great depths and sightings are rare. It is only in recent years that these beautiful eyes could be studied. Squids fall under the class of cephalopods along with the octopus and are invertebrates (no backbone). The Great Teacher was quite fascinated with one particular aspect of these creatures' eyes. Invertebrates, as a rule, do not have an advanced vision system, unlike vertebrates, whose eyes are very similar to ours. The sole exceptions are the cephalopods. Their eyes have many unique features, the most interesting being how closely they resemble the architecture of the vertebrate retina.

Now, this is interesting because their retina evolved independently from that of vertebrates (the lineages separated by millions of years). The fact that nature independently produced two immensely complex pieces of vision architecture that are very close to each other is amazing in itself, but there is one peculiar difference. In the cephalopod eye, the cells which are sensitive to light sit facing the light and the cells which do all the calculations sit behind them, rather than the "inside out" arrangement of the human eye. The Great Teacher used to end his lecture by saying something to the effect that, barring the one minor error that Mother Nature straightened out the second time around, she converged upon nearly the same complex vision architecture from a completely different starting point!

A third kind of vision architecture came into prominence in 2012: Convolutional Neural Networks (CNNs) began to routinely break the records for vision computing year after year. Though the software roots had been laid much earlier, the hardware had taken a while to catch up. The lower layers of a CNN start by detecting edges and contours, the subsequent layers learn more complex patterns, until finally the last layers look at the picture as a whole. The ventral visual pathway of vertebrates is somewhat similar (monkeys were the subjects studied): it is a layered pathway, LGN-V1-V2-V4-IT, consisting of multiple information-processing stages. As information flows through this visual pathway, the features learned become more complex, just as in a CNN. The most interesting resemblance is the receptive visual field size: in both cases it increases across layers as more and more pieces of information about the picture are aggregated. This intuitively makes sense: to recognize an object like a car, the network has to first recognize simple features like edges, then corners and contours, then aggregate these into shapes like wheels, windows and the bonnet, before aggregating all those and concluding that the object is likely to be a car.

Thus it somewhat felt natural when computer vision models leveraging CNNs started achieving human-level performance on tasks such as image classification. After all, weren't CNNs inspired by the vision architecture that Mother Nature had so carefully evolved over eons, twice! It therefore came as a shock to many when, in Oct 2020, Andrej Karpathy tweeted about a new and (then) anonymous paper which showed that a totally different type of architecture, one based on transformers, could be trained to yield better results than CNNs on vision tasks!

Now, this is amazing in many respects. Let us first try to understand why transformers should not have behaved the way they did. (When I say transformers, I refer to the original ViT architecture initially, and later focus specifically on Swin, which is the state of the art at the time of writing this article. A very basic understanding of attention is assumed. An idea of how transformers work in NLP would help, but is not a prerequisite for most of the article.)

  1. Transformers were specifically architected for the neural machine translation (NMT) problem in NLP. Each component of the transformer was carefully chosen for a specific reason. Intuitively, it does not make sense to take this architecture, hand-crafted with such meticulous design for the task of NMT, and apply it to a totally different domain like vision. The researchers of ViT, on the other hand, had other ideas. They seemed to be particularly determined to deviate as little as possible from the NLP transformer. They even treat the image as a 1D sequence and do not embody 2D spatial information in the input data fed to the transformer!
  2. A naive application of self-attention to images would require each pixel to attend to every other pixel. With the cost growing quadratically in the number of pixels, such a model is just not scalable. (The ViT team solved this by using patches of 16 x 16 pixels instead of individual pixels, but more on that later; see the quick arithmetic right after this list.)
  3. More importantly, by its very design, locality, 2D neighborhood structure, and translation equivariance are baked into each layer of a CNN. When you look at a photo of your smiling friend, you will recognize your friend irrespective of whether they are standing to the left or to the right, or in the bottom or the top half of the frame. You will recognize your friend even if the angle of the picture is different, the lighting is dimmer, or the photo is inverted or zoomed, no matter what the variation is. This is invariance. Like your eyes, CNNs are translation-invariant (though equivariance, which has a slightly different meaning, is the better word to use). If they weren't, we could spend enormous amounts of time training them on a task and they would still end up performing badly if the image varied even in a small way. The other aspect that a CNN respects is locality. Every CNN operation only takes into account a local region of the image. This intuitively makes sense: the image is understood by starting with small receptive fields focussing on low-level local information and then aggregating these across layers until the complete image can be understood. It just doesn't make sense for a first layer to try to connect pixels between two far corners of an image. Yet both these strong features, baked into a CNN at every level by design, are almost entirely discarded by the transformer architecture!
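To get a feel for just how punishing pixel-level attention would be (point 2 above), here is a quick back-of-the-envelope calculation. It is a rough sketch assuming a standard 224 x 224 input and ViT's 16 x 16 patches:

```python
# Rough cost of one self-attention layer, counted as pairwise interactions.

# Pixel-level attention: every pixel attends to every other pixel.
pixels = 224 * 224            # 50,176 tokens
pixel_pairs = pixels ** 2     # ~2.5 billion interactions per layer

# ViT-style attention over 16x16 patches instead of pixels.
patches = (224 // 16) ** 2    # 196 tokens
patch_pairs = patches ** 2    # 38,416 interactions per layer

print(f"{pixel_pairs:,} vs {patch_pairs:,} -> {pixel_pairs // patch_pairs:,}x cheaper")
# 2,517,630,976 vs 38,416 -> 65,536x cheaper
```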

We see 3 good reasons why transformers should not have behaved the way they did. Yet the ViT paper achieved SOTA results (as of early 2021). The third point above is a particularly strong one. Transformer architectures (almost) do not make implicit assumptions about the locality or 2D structure or about translation equivariance. The self-attention layers are global right from the word go (and indeed it can be seen that the model is trying to make connections between patches from one part of the image to another seemingly unrelated part far away).

The SOTA results show that transformers seem to be very generic machines. The reason they work in vision is that they learn these biases automatically during training, even though they make no presumptions beforehand. This makes them multi-purpose. This makes them powerful. It is like finding an architecture capable of learning any task on any type of input data, a 'REAL Artificial Intelligence', oxymoronic as that sounds. One architecture can now be used for any combination of tasks, be it image, text, time-series, videos, etc. They can be trained with more data now than ever before. These architectures can connect pieces of information across totally different domains and come out with a startling level of understanding like never before. A unified architecture for ML is now possible, and we could see exciting developments in this area in the next few years.

“An AI program should be able to deal with images, text or a combination of both, it should be able to tackle insane amounts of data, do simple linear algebra, or crunch video, audio or time-series data from diverse domains and carve out data patterns from them; it should be scalable, resilient to attacks and should be generic enough to perform without any priors. Specialisation is for insects!“

(The AI equivalent of the human in Robert A. Heinlein's Time Enough for Love.)

We took a while to set the stage and background needed for the rest of the article. We now focus for a short while on ViT before moving on to Swin, which is the current SOTA. Towards the end, we revisit the generic nature of transformers and see what potential they could possibly hold.

The Essence of ViT and how it holds up against CNN models:

The ViT architecture is straightforward… simply because the authors have tried to deviate very minimally from the original NLP transformer architecture (the encoder part). In NLP, the model takes in 1D word tokens as inputs. Here too the model does the same. How does an image get converted to a 1D sequence? Well, they do it by cutting up the image into patches of 16 by 16 pixels and feeding them to a linear layer, which transforms each patch into an embedding. The linear layer applies the same weights across all patches. Position information is learned by the model itself, and the learned position embeddings have been shown to correspond remarkably closely to the actual position of each patch in the original image. Why couldn't this 1D or 2D information about the position of the patch simply be fed to the model along with the patch embedding? Well, the authors say there is no benefit in doing so, as the model learns this information itself anyway.
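Here is a minimal sketch of what such a patch embedding could look like in PyTorch. The class and variable names are mine, not the official ViT code; a Conv2d whose kernel and stride both equal the patch size is mathematically the same thing as a per-patch linear projection with shared weights:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into 16x16 patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Conv2d with kernel = stride = patch_size acts as a per-patch linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable 1D position embeddings, one per patch (learned, not hand-fed).
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)      # (B, 196, 768): a 1D token sequence
        return x + self.pos_embed

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                           # torch.Size([1, 196, 768])
```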

Similar to BERT's class token, the authors prepend a learnable embedding to the sequence of patches. This class token serves as a summary of the image representation, and a classification head is attached to it to predict the class. This classification head is a simple MLP with one hidden layer during pre-training; for fine-tuning it is replaced by a single linear layer. This is the essence of ViT. There is a minor procedure (a best practice) of fine-tuning the trained model at higher resolutions, but we need not be concerned with that now. The remaining NLP transformer-block components, like residual connections, LayerNorm and multi-head self-attention, exist here too with no major differences. In other words, there is pretty much no difference between this and the original transformer architecture by Vaswani et al.
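Putting the pieces together, here is a hedged sketch of the classification path. This is illustrative code rather than the authors' implementation; in particular, PyTorch's stock TransformerEncoder stands in for ViT's encoder, whereas the real ViT uses pre-norm blocks with GELU activations:

```python
import torch
import torch.nn as nn

embed_dim, num_classes, num_patches = 768, 1000, 196

# Learnable [class] token, prepended to the patch sequence (as in BERT).
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

# Stand-in encoder; ViT proper uses pre-norm transformer blocks with GELU.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=12,
)

# Fine-tuning style head: a single linear layer on the [class] token.
head = nn.Linear(embed_dim, num_classes)

patch_tokens = torch.randn(1, num_patches, embed_dim)  # output of the patch embedding
x = torch.cat([cls_token.expand(1, -1, -1), patch_tokens], dim=1)  # (1, 197, 768)
x = encoder(x)
logits = head(x[:, 0])                                 # read off the [class] token only
print(logits.shape)                                    # torch.Size([1, 1000])
```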

Let us now understand why this model works and how it compares to CNN:

  1. Because ViT is not inherently designed to look for locality or translation equivariance (at least not at every layer; though don't forget the MLPs), it needs to learn on its own that these features are important when dealing with “image inputs”. Of course, we say “image inputs” here from a human perspective; to the transformer, they are just embeddings and could come from a word token or an image patch. CNNs, on the other hand, are designed by default to appreciate locality and translation equivariance. So transformers have weaker (inductive) biases compared to CNNs. The ramification of this difference is that transformers need far more data to learn (these biases) and start performing well. On smaller datasets CNNs beat transformers, but on larger ones the power of transformers starts to show. The ViT paper had an interesting observation tucked inside. They explored hybrid models: basically, instead of using a linear projection to convert the patches to an embedding, they used convolutions to generate feature maps, thus harnessing the power of both models. While one might intuitively expect convolutional local feature processing to improve performance for any data size, this did not happen for larger models.
  2. Let us look at another interesting difference that may sound counter-intuitive initially. The claim is that transformers are more like human vision than CNNs. Let us analyze the perspective from which the authors make this statement. We discussed how CNNs work by aggregating local information as it moves from lower to higher levels, increasing the receptive field of vision till they are able to analyze the image as a whole. Yet reports to the contrary kept cropping up every now and then. For example, one study showed that CNNs can continue to classify images perfectly even if the global shape structure is destroyed (but the texture is kept intact). Others showed that CNNs were bad at the opposite… they could not function effectively if object shapes were preserved but the texture was messed up. One study showed that CNNs with constrained receptive field sizes throughout ALL layers were still able to reach high accuracies. All these seem to indicate that CNNs tend to give more weightage to local textures while making predictions. Since they were not explicitly designed to do so, it can only be assumed that the model must have found a shortcut during training. Theoretically, it should then be possible to force the model to train harder by augmenting data in such a way that texture alone is no longer enough to make predictions, as in this wonderfully written paper. The authors start by modifying a tabby cat's texture. It continues to be a cat to the human eye but becomes an elephant to CNNs.
Source: https://arxiv.org/pdf/1811.12231.pdf

They then go on to show that the texture bias in standard CNNs can be overcome and changed towards a shape bias if trained on a suitable data set. This increases the model's robustness. ViTs, on the other hand, seem to demonstrate a higher shape bias by default, without any special data augmentation. It was in this narrow context that the authors of the first paper made the claim that ViTs mimic human vision more than CNNs do. This is a fair statement, because humans too give more importance to shape rather than texture. Why transformers don't take the lazy route of relying on textures rather than shapes is an interesting question; the reduced inductive biases could be at play.

3. This opens up a larger discussion w.r.t. the ability to handle “adversarial images”: images that have been altered with a carefully calculated addition of (what looks to us like) noise, such that the altered image looks the same to a human but is treated completely differently by the model. The Panda example is a classic one.

Source: https://arxiv.org/pdf/1412.6572.pdf

These kinds of distortions could arise in natural settings too. In a severe thunderstorm, the outline of a human in front of an autonomous car should carry far more weightage than the texture. By their natural tendency towards shape bias, vision transformers seem to be inherently more robust to (certain types of) image distortions, despite not being explicitly trained for it. Whether other factors are also at play is an area of active research. It must be kept in mind that adversarial studies have mainly focussed on exploiting the texture bias of CNNs, which were the ruling models until now. Models with shape bias are not necessarily immune to such attacks; for example, even with a distinct shape bias, the human visual system can be fooled. While the right balance between texture and shape could be the key, there may be other secret ingredients that the transformer architecture might have discovered!
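The Panda example from the paper cited above was produced with the fast gradient sign method (FGSM) introduced in that same paper. Here is a minimal PyTorch sketch of the idea; the function name and epsilon value are illustrative, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.007):
    """Nudge every pixel a tiny step in the direction that most increases
    the loss (Goodfellow et al., 2014). The change is imperceptible to us
    but can flip the model's prediction."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adv = image + epsilon * image.grad.sign()   # the adversarial perturbation
    return adv.clamp(0, 1).detach()
```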

Cat or Dog? Original paper: https://arxiv.org/pdf/1802.08195.pdf — (I feel that) A subtle change in the shape of the snout makes all the difference

<Why NLP adversarial studies have not already been used to make human communication more impactful is a relevant question. Perhaps it has something to do with the fact that, in vision, we are forced to make decisions on the nature of the object instantly, whereas such classification time constraints are not strictly imposed on language. Enough time is given for inference :) >

In spite of all their advantages, ViT transformers, while good at classification, have a major limitation when used in other areas like semantic segmentation, image restoration, etc. As they divide the input image into small patches of fixed size and process each patch independently, certain information along the patch borders could be lost. More importantly, they could fail at tasks that need fine-grained analysis of the pixels inside patches. Newer models would soon come along to fix these issues, but the trend of using CNN models as the vertebra for vision tasks seems finally broken, and transformer-based architectures could well be the new backbone (pun intended). The stage is set for a paper that takes things further along this new path.

SWIN’NING with the tide:
The researchers here break up the image into windows (not to be confused with patches, which continue to be there). Self-attention is then applied between all patches within each window, but not across windows. This tweak brings in immense computational improvements, especially for large image sizes. But how do they recover the global picture? Well, this is the crux of the paper: a shifted windowing scheme allows for cross-window attention connections, and a hierarchical, stage-wise feature map gradually builds up a global representation of the image. The original paper is written in simple language and the same content has been replicated in multiple places, so to keep things short and interesting, let us directly take a trained Swin model and analyze what it does. When a child is born, parents spend a lot of time thinking about what to name their labour of love. Let us consider the model named swin_large_patch4_window7_224_22kto1k. It has an interesting name and gives away a lot of information about itself.
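If you want to poke at this model yourself, a checkpoint of it ships with the timm library. The model string below reflects timm's naming at the time of writing (the 22K-pretrained, 1K-finetuned variant) and may differ across releases; downloading the pretrained weights requires an internet connection:

```python
import timm
import torch

model = timm.create_model("swin_large_patch4_window7_224", pretrained=True)
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # dummy 224x224 RGB input
print(logits.shape)                              # torch.Size([1, 1000]) - ImageNet-1K classes
```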

  1. Obviously, this is the Large version of the model; it has more parameters. For this version, the channel number (C) of the hidden layers in the first stage is 192. In simple words, C is the size of the embedding when the image patches are initially converted to 1D tokens. Each patch is thus represented by a 192-dimensional embedding.
  2. The ‘S’ in swin stands for ‘S’hifted — perhaps the authors specifically want to emphasize that this is different from ‘S’liding (and more efficient as well). We will see this shifting in action soon.
  3. ‘Win’ stands for the concept of windows, which limit the attention and make the model scalable, which in turn makes it Win competitions. It may also be a subtle way of linking this latest benchmark in AI with the other well-known brand from the same org.
  4. 22Kto1K: This model is pre-trained on ImageNet-22K, consisting of 14 million images and 22K class labels. It is then fine-tuned on ImageNet-1K for a small number of epochs. We have seen this type of approach earlier with ViT.
  5. 224: This stands for the input image size; it is 224 x 224 across 3 channels for this model.
  6. Patch size: 4. The concept of breaking the image into patches is the same as in ViT. The incoming 224x224x3 image is broken into 4x4 patches, so each patch carries 48 pixel values (4x4 pixels times the 3 RGB channels). Each of these pixel values is a feature. As discussed earlier, these 48 features are converted to a 1D linear embedding of size C = 192. We will have 224/4 x 224/4 = 56x56 such patches, and each of these 3136 patches is converted to a corresponding 192-dimensional token (a short sanity check of all these shapes follows right after this list). So far, Swin has not deviated too much from ViT.
  7. Window 7: In Swin, windows are arranged so that they partition the image evenly and without overlap. In other words, the 224 x 224 image is broken into non-overlapping windows such that each window contains M x M patches. In this model we have 7x7 = 49 patches per window, so there are 8x8, or 64, windows for the whole image. We already discussed that each patch is 4x4 pixels. The math adds up perfectly in this case, but if it doesn't, i.e. if an M is chosen that cannot tile the image into non-overlapping windows, some padding happens at the edges (the bottom-right ones).
  8. Attention is done locally, between the patches inside each window boundary. In our case, attention happens among the 49 patches in each of the 64 windows. The attention is local, i.e. each patch attends only to the 49 patches in its window (including itself), so 49x49 = 2401 dot products per window. Notice that, within a window, the key set for every query has only 49 entries and this does not vary; this fixed key set facilitates memory access in hardware. With a sliding window this is not the case: the key set would differ with each slide. The second advantage is that, since the number of patches in each window is fixed, the complexity becomes linear in image size, not quadratic as in the case of ViT. While we have loosely used the term ‘attention’ for simplicity's sake, in reality this is a customized multi-head self-attention block (which we will examine shortly), and several such blocks are applied. The dimensions and the number of tokens don't change across these blocks. We are done with Stage 1.
  9. Two things now happen to let information flow beyond a single window. First, within each stage, the window-attention blocks alternate between the regular window partition and one displaced by half a window (M/2 patches), which is what connects neighbouring windows (this is the 'S' in Swin; we look at its mechanics shortly). Second, between stages, a patch merging layer downsamples the feature map. Since all self-attention calculations are completed in step 8, we can (for some time) forget about the concept of windows and instead just talk about the 56x56 patches that we have in the image. By the end of step 8, each of these 56x56 patches (3136 in all) is encoded in C=192 dimensions. Now merging happens: the first patch merging layer concatenates the features of each group of 2x2 neighboring patches, so the dimension increases to 4C, and a linear layer then scales it back down to 2C. The net effect is that the number of patches (or tokens, a better term to use from here on) is reduced to a quarter and the embedding size is doubled. In other words, the embedding size of each token is now 2*192 = 384 and we are down to 28x28 (i.e. 784) tokens for the whole image.
  10. Now the self-attention blocks are applied again. Of course, when we talk of self-attention, we need to bring back the windowing scheme. Given the current sizes, we have 4x4, i.e. 16, windows, each holding 7x7 tokens. However, note that each token now covers twice the width and height of an original patch. The self-attention blocks preserve the sizes, and after a series of them we still have 28x28, i.e. 784, tokens for the whole image with an embedding size of 2*192. This is Stage 2.
  11. The entire procedure is repeated twice, as “Stage 3” and “Stage 4”, with output resolutions of H/16 x W/16 and H/32 x W/32, while the embedding size doubles each time. For the model we chose, after stage 4 we end up with 224/32 x 224/32, i.e. 7x7, tokens, each with an embedding size of (192*2)*2*2, i.e. 1536, dimensions. What next? Well, there is no next, because our chosen window size M is 7x7 tokens and that is exactly what we are left with. The merging process ends. The image has been distilled into 49 tokens, each a 1536-dimensional embedding.
  12. The last layer of Swin is a simple (adaptive) average pooling followed by a Norm. The image has now been successfully converted into one representation with 1536 embeddings.
  13. A simple classification head is attached to convert these 1536 embeddings into the right class!
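Here is a small sanity check, in plain Python, of the shapes we just walked through for swin_large_patch4_window7_224. This is back-of-the-envelope arithmetic, not the model code:

```python
# Trace token counts and embedding sizes through the four Swin stages.
img, patch, window, C = 224, 4, 7, 192

tokens_per_side = img // patch   # 56 -> 56x56 = 3136 patches after patch embedding
dim = C                          # each patch becomes a 192-dim token

for stage in range(1, 5):
    windows_per_side = tokens_per_side // window
    print(f"Stage {stage}: {tokens_per_side}x{tokens_per_side} tokens of dim {dim}, "
          f"{windows_per_side ** 2} windows of {window}x{window} patches")
    if stage < 4:
        # Patch merging between stages: 2x2 neighbours concatenated (4C)
        # and projected back down to 2C -> tokens quartered, dim doubled.
        tokens_per_side //= 2
        dim *= 2

# Stage 1: 56x56 tokens of dim 192,  64 windows of 7x7 patches
# Stage 2: 28x28 tokens of dim 384,  16 windows of 7x7 patches
# Stage 3: 14x14 tokens of dim 768,   4 windows of 7x7 patches
# Stage 4:  7x7  tokens of dim 1536,  1 window  of 7x7 patches
```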

The beauty of Swin’s design lies in its simplicity. As mentioned, the self-attention blocks are no longer quadratic w.r.t image size, yet the global picture is not lost. There are a couple of very nice aspects that we want to pay particular attention to:

  • When the patch merging occurs, the window borders automatically widen compared to the previous boundaries; every boundary is shifted ahead. For the sake of illustration, the authors take a simple case of M=4 and a small image that can be broken into 8x8 patches, i.e. 64 patches. The first windowing scheme (image on the left) is straightforward: we have 4 windows, each with 4x4 patches. Attention is done, stage 1 is completed, we move to stage 2, and merging starts to happen. We discussed that the merging process concatenates the features of each group of 2x2 neighboring patches. When we now apply the windowing scheme, each (old) window border ends up pushed ahead by 2 patches, i.e. by half a window. The new border scheme is shown in the image on the right below.
Source: https://arxiv.org/pdf/2103.14030.pdf

Perhaps, the merging process is better understood in the below illustration. This is how the image on the above right is derived.

Image by the author

Though not shown in the original paper, note that after the 2x2 merging of patches we are left with 16 (merged) patches across the image.

Image by the author
  • Now comes a very important optimization done by Swin. It is an optimization only and should not be confused with the merging process described above. To calculate attention under the new windowing scheme, they don't use padding; instead, they cyclically re-arrange the blocks so that they still have only 4 windows to deal with. Self-attention within windows is then calculated. However, due to the rearrangement, some windows need attention masks to ensure that attention does not span unrelated blocks. After the self-attention calculations, the blocks are re-arranged back. This small hack provides neat savings in compute power.
Source: https://arxiv.org/pdf/2103.14030.pdf
  • Another aspect that needs to be noted is w.r.t positional embedding. Like in ViT, the patch position information is learnt by the model itself and not provided to it. They use a relative position bias approach while calculating self-attention to achieve this effect.
    Attention = SoftMax (Q.K_transpose / √d + B) V
    Since this approach is not unique to Swin, we will not discuss it at length. Basically, the intuition is to extend the concept of self-attention to relative distances as well. Since Q, K, and V have the shape <M_square, d>, the relative position bias B above has the shape <M_square, M_square>. Within an M x M window (i.e. a 7x7 patch window here), a patch can at most be 6 patches away from another patch along each axis, so the relative distance along an axis can only range from -6 to +6 through 0. The authors therefore parametrize a much smaller 13x13 matrix and look up the values of the B matrix from it.
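To make the index bookkeeping concrete, here is a minimal sketch of how such a relative position bias table could be built for a single attention head (variable names are mine and the learnable table is left at zero; a rough illustration, not the official Swin code):

```python
import torch

M = 7                                  # window size: 7x7 patches
num_rel = (2 * M - 1) ** 2             # 13 * 13 = 169 learnable bias values

# Learnable table: one bias per possible relative displacement (single head).
rel_bias_table = torch.nn.Parameter(torch.zeros(num_rel))

# For every pair of patches in a window, compute their relative displacement
# and map it to an index into the table.
coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))
coords = coords.flatten(1)                          # (2, 49)
rel = coords[:, :, None] - coords[:, None, :]       # (2, 49, 49), values in [-6, 6]
rel = rel + (M - 1)                                 # shift to [0, 12]
index = rel[0] * (2 * M - 1) + rel[1]               # (49, 49), values in [0, 168]

B = rel_bias_table[index]                           # (49, 49), added inside the softmax
print(B.shape)                                      # torch.Size([49, 49])
```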

Return of the biases

The astute reader may have noted that Swin transformers have brought back some of the inductive biases discarded earlier. The paper itself is unapologetic: “While recent models abandon translation invariance…,…we find that inductive bias that encourages certain translation invariance is still preferable”. There is no doubt that Swin beats other models in terms of accuracy on certain tasks, but a key question is: has the addition of this bias (to whatever limited extent) reversed some of the advantages associated with transformers discussed earlier w.r.t. robustness to adversarial attacks? It is too early to tell, but there already seems to be evidence to the contrary. Swin seems to perform exceptionally well w.r.t. corruption robustness. Interestingly, the same study also showed that the shape bias of a SWIN-T-SMALL was 27.43 and that of ALEXNET 29.80, with nearly the same parameter counts. Thus, Swin models seem to have a shape bias far lower than ViT while not suffering from some of the disadvantages of CNN models w.r.t. being vulnerable to (certain types of) adversarial attacks due to excessive reliance on texture.

Have we stumbled upon the right delicate mix of shape and texture? Further studies can tell! But even if these advantages were not there, transformer-backed architectures will continue to have the ability to combine tokens from different domains (text, images, even time series) during training. This puts them in a different league altogether.

Why did Mother Nature persist with convolutions? Are there advantages to convolutions that AI models have not exploited yet? Perhaps by studying transformer models very closely we might be able to figure out what these factors are. Another aspect to consider is that our AI models focus disproportionately on the training side of things, whereas retinal vision needs to be optimized for inference more than anything else. Then there is the curious phenomenon where forms of sensory processing that we evolved to perform depend heavily on the data we are exposed to: for example, cats exposed only to horizontal edges early in life cannot discern vertical edges later in life. This suggests that some sort of fine-tuning, or even training, happens early in our lives and a ready-to-use model is not handed over at birth. If so, convolutions, being more thrifty, may be the natural preference. Also, since these wirings happen very early in life, data is in short supply, and again convolutions are the go-to solution.

Deep goes the Rabbit-hole

‘Attention’ started in 2014, when a paper by Bahdanau et al. was published. The authors showed great improvements in NMT results by allowing the model to automatically search for parts of a source sentence that are relevant to predicting a target word. The word ‘attention’ was used in only one short para in the entire 15-page paper. Nonetheless, the term caught on, and various refinements were introduced over the next 4 years, culminating in the landmark 2017 paper ‘Attention is all you need’, where Vaswani et al. write: ‘We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.’ The authors had obviously meant the convolutions employed in NLP, but this historic sentence is worth revisiting given the developments that unfolded over the next 4 years.

Transformers have turned out to be pretty generic architectures as we discussed. A sequence goes in with no particular order and a sequence comes out magically extracting all the relevant patterns from the input. They have weak inductive biases. They seem to have an inherent ability to capture useful patterns in any type of data regardless of the domain. They are scalable. Their performance seems to increase with depth. The components from which they are assembled ensure that rank collapse does not happen easily (though it was not known at that time). How deep can they go? Well, ViT showed clearly visible improvements up to 64 layers or so. Techniques such as re-attention could help transformers go deeper.

The first inkling about the generic nature of transformers (that I experienced) actually did not come from ViT or vision but from the time-series transformer models just prior to that. It became increasingly effective to use transformers for time-series problems like the Influenza Prevalence Case, with some tweaks to handle longer sequences, even though they were originally not architected for this domain. Kaggle time-series competitions in 2020/21, like Riiid or the interesting Ventilator Pressure Prediction competition (a plain-English summary here), were dominated by transformer models. The results from ViT and Swin only re-affirm this further. Far more surprising is the application of transformers to other domains like linear algebra computations, as in this paper where transformers were trained to perform matrix transposition, find eigenvalues and eigenvectors, singular value decomposition, and inversion. More importantly, these trained models seemed to generalize out of their training distribution pretty well.

Bahdanau’s profile page talks about his involvement in developing the “core tool used in deep-learning-based NLP.” He now only has to update this sentence to include audio, video, images, time-series, and nearly all domains — such has been the far-reaching impact of ‘Attention’.

Vision as controlled hallucinations

There was an interesting tweet by Karpathy two weeks back. He was talking about how ML models are consolidating and how improvements in any one domain can quickly be cut-pasted into other domains. But it was a follow-up tweet that caught my attention. Karpathy says (when talking of transformers and their similarity with the human neocortex) ‘Perhaps nature has stumbled by a very similar powerful architecture and replicated it in a similar fashion, varying only some of the details’. The neocortex is the part of the human brain responsible for higher-order functions like sensory perception, cognition, and language. It has a very similar structure throughout and has been hypothesized to be uniformly composed of general-purpose data-processing modules with a standard pattern of connectivity between them.

There are reasons to believe that this structure embodies a basic computational module so remarkably general, versatile, and flexible that it can simply be repeated and combined for any higher-order task. This may explain how humans perform forms of sensory processing we haven't evolved to perform: blind people can learn to see with their tongues, and can learn to echolocate well enough to discern density and texture. This means (if the above hypothesis is true) that the human brain has the same ‘basic circuitry’ for all higher-order functions including perception, cognition, and language… which is the very direction that transformer-based architectures are heading in! I, however, believe that nature did not stumble upon this model but arrived at it by evolutionary design. A more pertinent question is: did humans stumble upon the transformer architecture and accidentally discover its usefulness? A more Wachowskian question is: would we have discovered it even if we were not looking for it?

Source — A dialog from the Matrix movie

Fun aside, let us re-visit the historic moment in Dec 2017 when the face of NLP was changed forever by Vaswani and team. Were they aware of the far-reaching impact their paper would have beyond the world of NLP? The closing statement of their paper is very forward-looking — “We plan to extend the Transformer to problems involving input and output modalities other than text and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio, and video”, but did they actually believe that within less than 4 years all this would come true? Only the researchers themselves can tell!

<For a plain English discussion on the neocortex circuitry, I would recommend this highly interesting blog and the links within. Let me whet your appetite with this sentence —” The cortex has been hypothesized to consist of canonical microcircuits that implement predictive coding. The cortex learns the hierarchical structure of the data it receives and uses this structure to encode predictions about ‘future sense inputs’, resulting in ‘controlled hallucinations’ that we interpret as a direct perception of the world.”>

‘In general, we’re least aware of what our minds do best, … we’re more aware of simple processes that don’t work well than of complex ones that work flawlessly’ — Marvin Minsky

“Specialization is for insects” ? Nooo…

For a long, long while (all of four years) it was believed that attention was the secret sauce of transformers. Yet recent research has shown that it is quite possible to replace the costly self-attention computations with other approaches (for example, MLPs). What about the other carefully chosen components like skip connections, norms, etc.? Will we see transformers being made even more generic (and more efficient, because they would definitely need to learn more in that scenario)? Will they become so generic that they will no longer be called architectures but start getting referred to as “generalized data-crunching engines”? Will one model rule over them all?

The Kaggle Ventilator pressure prediction competition was an interesting one in many respects. We discussed that the competition was dominated by a number of solutions leveraging transformers. Yet, the competition winner was a simple hand-crafted LSTM solution. I believe that while the transformer model continues to evolve to be more generic and continues to share its secrets little by little, it will learn to live alongside and be complemented by small, yet elegant, hand-crafted solutions leveraging beautiful feature engineering on conventional architectures!

Source: https://digital.archives.caltech.edu/islandora/object/image%3A2545. — The Great Teacher’s blackboard on the day of his death — While the first sentence is oft-quoted, I find the second mesmerizing!

Please feel free to connect with me on LinkedIn.
