Going back to MLPs for image processing: simple but effective (with competitive results)

Image processing is one of the most interesting subareas of machine learning. It started with multi-layer perceptrons (MLPs), moved on to convolutions, then to self-attention (transformers), and now this paper brings us back to MLPs. If you are like me, your first question would be: how can an MLP achieve almost the same results as transformers and CNNs? That is the question we will be answering in this article. The newly proposed "MLP-Mixer" achieves results very close to SOTA models trained on tons of data, at almost three times the speed. Throughput (images/sec/core) was also an interesting metric reported in the paper.
The proposed MLP-Mixer doesn’t use any convolutions or self-attention layers, and yet it achieves near-SOTA results, which is quite thought-provoking.
The MLP-Mixer architecture
Before discussing how the network works, let’s start off by discussing the components of the network, and then put them all together (like breaking down a car engine and analyzing the pieces).
We present MLP-Mixer, an architecture based exclusively on multi-layer perceptrons (MLPs). MLP-Mixer contains two types of layers: one with MLPs applied independently to image patches (i.e. "mixing" the per-location features), and one with MLPs applied across patches (i.e. "mixing" spatial information).
Source: MLP-Mixer on arxiv
The first thing to note here is how the input image is modeled/represented: it is split into non-overlapping patches and stored as a table of patches × channels. The first type of layer (we will call it the channel-mixing layer [1]) operates on each patch independently and allows communication between its channels (hence channel-mixing) during learning. The second type (we will call it the patch-mixing layer) works the same way but across patches, allowing communication between different locations of the image.
The core idea behind modern image processing networks is to mix features at a given location or to mix features between different locations [1]. CNNs perform both types of mixing with convolutions, kernels, and pooling, while Vision Transformers perform them with self-attention. The MLP-Mixer, however, tries to perform the two in a more "separate" fashion (explained below), using only MLPs. The main advantage of using only MLPs (which are basically matrix multiplications) is the simplicity of the architecture and the speed of computation.
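To make the patches × channels representation concrete, here is a minimal sketch in PyTorch that splits an image into non-overlapping patches and flattens each one. The 224×224 input and 16×16 patch size are illustrative assumptions on my part, not values prescribed by the paper.

```python
import torch

image = torch.randn(3, 224, 224)  # (channels, height, width), illustrative sizes
patch = 16                        # assumed patch size

# Unfold height and width into a 14x14 grid of 16x16 patches
patches = image.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, 14, 14, 16, 16)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)

print(patches.shape)  # torch.Size([196, 768]) -> 196 patches, 768 raw values each
```

Each row of this table is one patch; the network's first layer (discussed below) projects these raw values into the hidden "channel" dimension of the table.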
How does it work?
This is the interesting part, where we discuss how the input becomes an output and what happens to the image as it passes through the network.

The first fully connected layer projects the non-overlapping patches into a desired hidden dimension (according to the size of the layer). This projection is applied to every patch independently, with the same weights shared across all patches, so no mixing happens here yet. You can think of this as encoding the image; it is a widely used compression trick in neural networks (as in autoencoders) to reduce the dimensionality of the input and keep only the most crucial features. Once this is done, a "table" is constructed with the image patches as rows and the hidden dimension (the channels) as columns.
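As a rough sketch of this per-patch projection, continuing from the patch-splitting code above (the hidden width of 512 and the name patch_embed are my own illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

num_patches, patch_dim, hidden_dim = 196, 768, 512  # illustrative sizes

# One shared projection matrix: the same Linear layer maps every flattened patch
# to a hidden vector, producing the patches x channels "table".
patch_embed = nn.Linear(patch_dim, hidden_dim)

patches = torch.randn(num_patches, patch_dim)  # stand-in for the patch-splitting output above
X = patch_embed(patches)                       # (196, 512): patches x hidden channels
```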
Inside each Mixer layer, the patch-mixing MLP performs its matrix operations on the columns of the table (the table is transposed so the MLP runs along the patch dimension), while the channel-mixing MLP performs its matrix operations on the rows (this is the "Mixer Layer" in the paper's figure). Each MLP block is simply two fully connected layers with a non-linear activation (GELU) applied in between [1]. This might sound a bit confusing, but intuitively, you can see that the Mixer is trying to find the best way to mix and encode the channels and patches of the image into a meaningful output.
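Putting the two mixing MLPs together, here is a minimal PyTorch sketch of one Mixer layer, using the same illustrative sizes as above; the hidden widths tokens_hidden and channels_hidden are assumptions of mine, not the paper's exact values. It also shows the layer normalisation and skip connections mentioned a bit further down.

```python
import torch
import torch.nn as nn

class MlpBlock(nn.Module):
    """Two fully connected layers with a GELU in between (one 'MLP block')."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x):
        return self.net(x)

class MixerLayer(nn.Module):
    """One Mixer layer: patch (token) mixing on the columns, channel mixing on the rows."""
    def __init__(self, num_patches, channels, tokens_hidden, channels_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mix = MlpBlock(num_patches, tokens_hidden)
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mix = MlpBlock(channels, channels_hidden)

    def forward(self, x):                             # x: (batch, patches, channels)
        # Patch mixing: transpose so the MLP runs along the patch dimension
        y = self.norm1(x).transpose(1, 2)             # (batch, channels, patches)
        x = x + self.token_mix(y).transpose(1, 2)     # skip connection
        # Channel mixing: the MLP runs along the channel dimension of each patch
        x = x + self.channel_mix(self.norm2(x))       # skip connection
        return x

layer = MixerLayer(num_patches=196, channels=512, tokens_hidden=256, channels_hidden=2048)
out = layer(torch.randn(1, 196, 512))                 # same shape in, same shape out
```

In the paper, the full network is essentially this: a per-patch projection, a stack of identical Mixer layers, global average pooling, and a linear classifier on top.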
One important point to note here is that the hidden widths of the mixing MLPs are chosen independently of the number of input patches. Going through the details here would make this article much longer than I wanted, so feel free to check the paper. Essentially, though, this leads to a very important difference in computational cost between the MLP-Mixer and other architectures:
The computational complexity of the MLP-Mixer is linear in the number of input patches, unlike ViT whose complexity is quadratic.
Source: MLP-Mixer on arxiv
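A back-of-the-envelope way to see this (my own rough multiply counts, not the paper's analysis; it ignores the channel-mixing MLP and attention's Q/K/V projections, which scale linearly in both cases):

```python
# Rough multiply counts as the number of patches S grows, with widths held fixed.
C, D_S = 512, 256  # channels and patch-mixing hidden width, illustrative

for S in (196, 784, 3136):                # finer and finer patch grids
    patch_mixing = 2 * C * S * D_S        # two linear maps R^S -> R^{D_S} -> R^S, once per channel
    self_attention = 2 * (S ** 2) * C     # Q.K^T plus the attention-weighted sum over V
    print(f"S={S:5d}  patch-mixing ~{patch_mixing:.1e}  self-attention ~{self_attention:.1e}")
```

Doubling the number of patches doubles the patch-mixing cost but quadruples the attention cost, which is exactly the linear-vs-quadratic difference quoted above.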
The MLP-Mixer also has a few properties that keep its architecture very simple:
- All layers have an identical size
- Each layer consists of only two MLP blocks
- Each layer takes an input of the same size
- All image patches are projected linearly with the same projection matrix
This makes reasoning about the network and working with it a bit simpler than with CNNs, which typically have a pyramidal structure [1]. I remember my first time trying to design a CNN: figuring out when to scale the image down, when to scale it up, and by how much, can be a bit difficult. These problems simply aren’t present in this architecture.
One thing to note is that the model also uses skip connections, layer normalisation, and regularisation, but I don’t think we need to discuss these concepts since they are widely used and explained in many other resources.
Final thoughts

In terms of results, the paper presents multiple tables; the headline comparison shows that the Mixer's performance is quite similar to that of other architectures while being noticeably faster. It has a throughput of 105 images/sec/core, compared to 32 for the Vision Transformer. In all fairness, this may sound like quite a weird metric, and it sometimes feels like ML researchers try to find the one metric that makes their network look much better than others. However, I think we can all objectively agree that achieving the same level of performance with only MLP blocks is still impressive.
I hope I provided a good balance of low-level and high-level details without causing too much confusion. Let me know down in the comments if there is anything you didn’t understand, and I will do my best to break it down further.
Thank you
Thank you for reading thus far 🙂 If you enjoyed this article, please consider buying me a coffee here (coffee helps me write):
If you want to receive regular paper reviews about the latest papers in AI & Machine Learning, add your email here & Subscribe!
https://artisanal-motivator-8249.ck.page/5524b8f934
References
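[1] Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., et al. "MLP-Mixer: An all-MLP Architecture for Vision." arXiv:2105.01601, 2021. https://arxiv.org/abs/2105.01601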