Deep Learning
MLP Mixer Is All You Need?
Understanding MLP-Mixers from beginning to end, with TF Keras code
Earlier this May, a group of researchers from Google released a paper "MLP-Mixer: An all-MLP Architecture for Vision", introducing their MLP-Mixer ( Mixer, for short ) model for solving computer vision problems. The research suggests that MLP-Mixer attains competitive scores on image classification benchmarks such as ImageNet.
One thing that will catch every ML developer's eye is that they haven't used convolutions in their architecture. Convolutions have long reigned over computer vision, as they are efficient at extracting spatial information from images and videos. Recently, Transformers, which were originally designed for NLP problems, have shown remarkable results on computer vision problems as well. The MLP-Mixer paper states,
In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary.
There has been some controversy over whether MLP-Mixers are truly "conv-free" or not. Explore this blog from Weights and Biases to know more:
Weāll discuss more on MLP-Mixerās architecture and underlying techniques involved. Finally, we provide a code implementation for MLP-Mixer using TensorFlow Keras.
Also, this blog has been showcased on the Google Dev Library.
You can now find pretrained MLP-Mixer models on TensorFlow Hub, https://tfhub.dev/sayakpaul/collections/mlp-mixer/1
I have used MLP-Mixers for text classification as well:
Dominance of Convolutions, advent of Transformers
The use of convolutions in computer vision was popularized by Yann LeCun, and since then convolutions have served as the backbone of computer vision models. Each filter is convolved over the input volume to compute an activation map made of neurons, as depicted below.
Each neuron in the output map is connected to a specific part of the input volume, which can be observed clearly in fig. 1. The output map is then passed through an activation function ( such as ReLU ). In order to decrease the dimensionality of the output maps, a Pooling operation is used. Convolutions are used to learn local features in an image, which is the goal of computer vision problems. Nearly all architectures like MobileNets, Inception, ResNet, DenseNet etc. use convolutional layers ( convolution + activation ) to learn image features.
Transformers were created for NLP problems, but have shown considerable results in image classification as well. I'll leave some resources here for Vision Transformers ( ViTs ):
Multilayer Perceptron ( MLP ) and the GELU activation function
Multilayer Perceptron ( MLP )
If you're an experienced ML developer, you probably learnt this ages ago.
A multilayer perceptron is an artificial neural network with an input layer, multiple hidden layers and an output layer. Except for the input nodes, every node uses a non-linear activation function.
In our case, the research paper suggests an MLP with 2 fully-connected ( Dense ) layers and a GELU activation function ( we'll discuss more on this in the coming sections ).
Each Mixer layer will consist of two MLPs, one for token mixing and another for channel mixing. We discuss token mixing and channel mixing in later sections of the story. Here's the code we'll use to stack two Dense layers ( with a GELU activation ), thereby adding an MLP on top of the existing layers ( x ):
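Here's a minimal sketch of such an MLP block ( not the author's exact gist; hidden_dims and dropout_rate are assumed names/values ):

```python
import tensorflow as tf
from tensorflow.keras import layers

# A sketch of the MLP block used inside each Mixer layer:
# Dense -> GELU -> Dense (back to the input's last dimension) -> Dropout.
def mlp(x, hidden_dims, dropout_rate=0.2):
    input_dims = x.shape[-1]              # dimension we project back to
    y = layers.Dense(hidden_dims)(x)      # expand
    y = tf.nn.gelu(y)                     # GELU non-linearity
    y = layers.Dense(input_dims)(y)       # project back
    y = layers.Dropout(dropout_rate)(y)   # regularization (see the note below)
    return y
```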
Note: the dropout isn't shown in figure 3. We add it to regularize our model. You can also notice it in this Keras example.
Note, most of us would expect Dense layers to accept inputs of shape ( batch_size , input_dims ) and output tensors of shape ( batch_size , output_dims ). But in our case, these Dense layers will receive 3-dimensional inputs of shape ( batch_size , num_patches , channels ) or its transpose ( batch_size , channels , num_patches ). We'll learn more about num_patches and channels in later sections of the story.
GELU ( Gaussian Error Linear Unit ) Activation
Modern but not-so-popular activation function
The Gaussian Error Linear Unit is an activation function that weighs its inputs by the standard Gaussian cumulative distribution function, i.e. GELU( x ) = x * Φ( x ). In the case of ReLU ( Rectified Linear Units ), the inputs are gated by their sign instead.
Under this view, inputs have a higher probability of being dropped as x decreases, so the transformation applied to x is stochastic yet depends on the value of x.
A Normal distribution is chosen because neuron inputs tend to follow a Normal distribution, especially when a Batch Normalization layer is used. The GELU activation is widely used in Transformer models for solving NLP problems.
As observed in snippet 1, we'll use tf.nn.gelu to add the GELU activation to the MLPs. If you want a Keras layer instead, there is a tfa.layers.GELU layer in the TensorFlow Addons package.
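As a quick sanity check ( a toy example, not part of the original snippets ):

```python
import tensorflow as tf

# GELU weighs inputs by the Gaussian CDF: gelu(x) = x * Φ(x).
x = tf.constant([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tf.nn.gelu(x).numpy())
# Negative inputs are shrunk towards zero instead of being hard-clipped like ReLU.
```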
Here's a nice blog explaining various activation functions ( including GELU ):
MLP-Mixer Architecture Components
We'll discuss each component in detail and then integrate all of them in one piece of code.
Mixer Layers
Mixer layers are the building blocks of the MLP-Mixer architecture. Each Mixer layer contains two MLPs, one for token mixing and another for channel mixing. Alongside the MLPs, you'll notice Layer Normalization, skip-connections and a "T" written above an arrow; it refers to the transpose* of the tensor, keeping the batch dimension intact.
transpose*: We'll use the tf.keras.layers.Permute layer to perform the transposition, by setting dims=[ 2 , 1 ] in the arguments of this layer. We won't discuss this in detail in the coming sections.
I recommend you absorb the diagram thoroughly, as I'll refer to it now and then while discussing the components. In the following sections, we discuss:
- What are Patches ( inputs of the Mixer layer )
- Token Mixing MLPs
- Channel Mixing MLPs
- Layer Normalization
What are Patches?
And how do we create them from an RGB image ( which is the typical input of your MLP-Mixer model )?
A Mixer layer takes in a tensor of shape ( batch_size , num_patches , num_channels ) and also produces an output of the same shape. You might wonder how we can produce such a tensor from an RGB image ( which is the actual input of the MLP-Mixer model ). Refer to the diagram below.
Suppose we are given an RGB image of size 4 * 4. We create patches, which are non-overlapping*, using a 2D convolution. Suppose we need square patches of size 2 * 2. As seen in fig. 9, we can create 4 non-overlapping patches from the 4 * 4 input image ( a patch is shaded for you in the diagram ). Also, using C filters, we transform the input image of size image_dims * image_dims * 3 to a tensor of shape num_patches * num_patches * C , where num_patches = image_dims / patch_size . Note, we assume that image_dims is perfectly divisible by patch_size . Considering our example*, num_patches = 4 / 2 = 2 , so the tensor has shape 2 * 2 * C and the total number of patches is num_patches^2 = ( 4 * 4 ) / ( 2 * 2 ) = 4 .
our example*: Special thanks to our reader Dr. Abder-Rahman Ali for pointing out the mistake in the calculation of num_patches. We really appreciate his effort towards improving this story.
non-overlapping*: In order to create non-overlapping patches, we set kernel_size=patch_size and strides=patch_size in Keras' Conv2D layer.
Finally, we reshape the patches of shape num_patches * num_patches * C to num_patches^2 * C .
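Here's a minimal sketch of this patch-creation step ( the layer arguments follow the example above; the variable names are assumptions, not the author's exact gist ):

```python
import tensorflow as tf
from tensorflow.keras import layers

patch_size = 2   # size of each square patch
C = 64           # number of Conv2D filters = channel dimension of each patch

inputs = tf.keras.Input(shape=(4, 4, 3))
# Non-overlapping patches: kernel_size == strides == patch_size.
x = layers.Conv2D(filters=C, kernel_size=patch_size, strides=patch_size)(inputs)
# (batch, 2, 2, C) -> (batch, num_patches^2, C) = (batch, 4, C)
patches = layers.Reshape((-1, C))(x)
```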
Token Mixing MLPs
As discussed earlier, each Mixer layer consists of a token-mixing MLP. We would like to understand the meaning of tokens, which is highlighted in the paper as,
It [ MLP-Mixer ] accepts a sequence of linearly projected image patches (also referred to as tokens) shaped as a "patches × channels" table as an input, and maintains this dimensionality.
Here's the code for the token-mixing MLP ( the role of LayerNormalization and Permute can be observed in fig. 8 ):
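A sketch of what this block could look like, reusing the mlp helper from earlier ( token_mixing_mlp_dims and the function name are assumptions ):

```python
from tensorflow.keras import layers

def token_mixing(x, token_mixing_mlp_dims):
    # x has shape (batch_size, num_patches, C)
    y = layers.LayerNormalization()(x)
    # Transpose to (batch_size, C, num_patches) so the MLP mixes across patches.
    y = layers.Permute((2, 1))(y)
    y = mlp(y, token_mixing_mlp_dims)      # MLP block sketched earlier
    # Transpose back and add the skip-connection.
    y = layers.Permute((2, 1))(y)
    return layers.Add()([x, y])
```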
As the name suggests, it mixes tokens or, in other words, allows communication between different patches within the same channel. As observed in fig. 11, the number of channels C isn't modified; only P, i.e. the number of patches, is expanded to some dimension ( token_mixing_mlp_dims ) and brought back to P.
Channel Mixing MLPs
Channel mixing MLPs do a job similar to token mixing MLPs. They mix channel information, thereby enabling communication among channels.
As observed in fig. 12, the number of patches P isn't modified; only C, i.e. the number of channels, is expanded to some dimension ( channel_mixing_mlp_dims ) and brought back to C.
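A corresponding sketch for the channel-mixing block ( again, the names are assumptions ); no transposition is needed here because the MLP already acts on the last ( channel ) dimension:

```python
from tensorflow.keras import layers

def channel_mixing(x, channel_mixing_mlp_dims):
    # x has shape (batch_size, num_patches, C); the MLP acts on the channel axis.
    y = layers.LayerNormalization()(x)
    y = mlp(y, channel_mixing_mlp_dims)    # MLP block sketched earlier
    return layers.Add()([x, y])            # skip-connection
```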
Layer Normalization
which is different from batch normalization
Batch Normalization uses the mean and variance computed over the whole batch to normalize activations. Layer Normalization ( originally proposed with RNNs in mind ) instead uses the mean and variance of all the summed inputs to the neurons in a layer, computed on a single training case. As mentioned in the Layer Normalization paper,
In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity.
The TF-Keras team provides a tf.keras.layers.LayerNormalization layer to perform this operation. Here are some resources to understand Layer Normalization:
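As a quick illustration ( a toy example, not part of the original snippets ), each sample is normalized using its own statistics, independently of the rest of the batch:

```python
import tensorflow as tf

ln = tf.keras.layers.LayerNormalization()
x = tf.constant([[1.0, 2.0, 3.0],
                 [10.0, 20.0, 30.0]])
# Each row is normalized with its own mean and variance, so both rows
# map to roughly [-1.22, 0.0, 1.22] regardless of their scale.
print(ln(x).numpy())
```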
Now, with the complete knowledge of Mixer layers, we can go ahead and implement our MLP-Mixer model for classification. This model accepts an input RGB image and outputs class probabilities.
The End Game
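Below is a minimal sketch of how the full model could be assembled, reusing the token_mixing and channel_mixing helpers from above ( the hyperparameter values and variable names are assumptions, not the author's exact gist ):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed hyperparameters, chosen only for illustration.
image_dims = 32
patch_size = 4
C = 128                        # channel dimension of each patch
num_mixer_layers = 4
token_mixing_mlp_dims = 64
channel_mixing_mlp_dims = 128
num_classes = 10

inputs = tf.keras.Input(shape=(image_dims, image_dims, 3))

# 1. Create non-overlapping patches with a strided Conv2D and flatten them.
x = layers.Conv2D(C, kernel_size=patch_size, strides=patch_size)(inputs)
x = layers.Reshape((-1, C))(x)             # (batch_size, num_patches^2, C)

# 2. Stack the Mixer layers (token mixing followed by channel mixing).
for _ in range(num_mixer_layers):
    x = token_mixing(x, token_mixing_mlp_dims)
    x = channel_mixing(x, channel_mixing_mlp_dims)

# 3. Classification head.
x = layers.LayerNormalization()(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.summary()
```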
We'll go through this code snippet line by line.
- First, create an Input layer which takes in RGB images of the desired size.
- Implement a Conv2D layer which creates the patches ( remember, we discussed this decades ago ). Also, add a Reshape layer to reshape the patches into a 3D tensor of shape ( batch_size , num_patches^2 , C ).
- Add num_mixer_layers Mixer layers to the model.
- Next, a LayerNormalization layer along with a GlobalAveragePooling1D layer.
- Finally, a Dense layer with our favorite softmax activation.
Here's the output of tf.keras.utils.plot_model depicting a single Mixer layer:
The output of model.summary():
That's all, we've just implemented an MLP-Mixer model in TensorFlow!
More projects/blogs/resources from the author
Thanks
Hope you liked the story! Feel free to reach me at equipintelligence@gmail.com. Thank you and have a nice day ahead!