🤖 Deep Learning

MLP Mixer Is All You Need?

Understanding MLP-Mixers from beginning to the end, with TF Keras code

Shubham Panchal

Published in

Towards Data Science

10 min readJun 13, 2021

Source: ***“MLP-Mixer: An all-MLP Architecture for Vision”***

Earlier this May, a group of researchers from Google released a paper “MLP-Mixer: An all-MLP Architecture for Vision” introducing their MLP-Mixer ( Mixer, for short ) model for solving computer vision problems. The research suggests that MLP-Mixer attains competitive scores on image classification benchmarks such as the ImageNet.

One thing that would catch every ML developer’s eyes, is that they haven’t used convolutions in their architecture. Convolutions have reigned computer vision since long as they are efficient in extracting spatial information from images and videos. Recently, Transformers, that were originally used for NLP problems, have shown remarkable results in computer vision problems as well. The research paper for MLP-Mixer suggests,

In this paper we show that while convolutions and attention are both sufficient for good performance, neither of them are necessary.

There had been some controversies whether MLP-Mixers are “conv-free” or not. Explore this blog from Weights and Biases to know more,

Is MLP-Mixer a CNN in Disguise?

Recently, a new kind of architecture - MLP-Mixer: An all-MLP Architecture for Vision ( Tolstikhin et al, 2021) - was…

wandb.ai

We’ll discuss more on MLP-Mixer’s architecture and underlying techniques involved. Finally, we provide a code implementation for MLP-Mixer using TensorFlow Keras.

Also, this blog has been showcased on the Google Dev Library.

You can now find pretrained MLP-Mixer models on TensorFlow Hub, https://tfhub.dev/sayakpaul/collections/mlp-mixer/1

I have used MLP-Mixers for text classification as well,

Tweet Classification With MLP-Mixers ( TF-Keras )

Explore and run machine learning code with Kaggle Notebooks | Using data from Natural Language Processing with Disaster…

www.kaggle.com

📃 Contents

👉 Dominance of Convolutions, advent of Transformers
👉 Multilayer Perceptron ( MLP ) and the GELU activation function
👉 MLP-Mixer Architecture Components
👉 The End Game
👉 More projects/blogs/resources from the author

👼 Dominance of Convolutions, advent of Transformers

The use of convolutions in computer vision problems was popularized by Yann LeCun and since then convolutions have served as the backbone for computer vision problems. Each filter is convolved over the input volume to compute an activation map made of neurons, as depicted below.

Fig 1: A convolution operation with kernel size=3 and strides=1 ( with not padding ). Source: Convolution arithmetic

Each neuron in the output map is connected to a specific part of the input volume, which can be observed clearly in fig. 1. The output map is then passed to an activation function ( such as ReLU ). In order to decrease the dimensionality of the output maps, a Pooling operation is used. Convolutions are be used to learn local features in an image, which is the goal of computer vision problems. Nearly all architectures like MobileNets, Inception, ResNet, DenseNet etc. use convolutional layers ( which is convolution + activation ) to learn image features.

Transformers were created for NLP problems, but have shown considerable results in image classification. I’ll leave some resources here for Vision Transformers ( ViTs ),

Keras documentation: Image classification with Vision Transformer

Author: Khalid Salama Date created: 2021/01/18 Last modified: 2021/01/18 Description: Implementing the Vision…

keras.io

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformers)

This article was published as a part of the Data Science Blogathon. Introduction While the Transformer architecture has…

www.analyticsvidhya.com

Are You Ready for Vision Transformer (ViT)?

“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” May Bring Another Breakthrough to Computer…

towardsdatascience.com

🤠 Multilayer Perceptron ( MLP ) and the GELU activation function

Multilayer Perceptron ( MLP )

If you’re a experienced ML developer, you might have learnt this in your ancient times.

Fig 2: A Multilayer Perceptron. Source: Multilayer perceptron example

A multilayer perceptron is a artificial neural network with an input layer, multiple hidden layers and an output layer. Except for input nodes, every node uses a non-linear activation function.

In our case, the research paper suggests an MLP with 2 fully-connected ( dense ) layers with a GeLU ( we’ll discuss more on this in the coming sections ) activation function,

Fig 3: MLP for the Mixer architecture. Source: *“MLP-Mixer: An all-MLP Architecture for Vision”*

Each Mixer layer will consist of two MLPs, one for token mixing and another for channel mixing. We discuss more on token mixing and channel mixing in later sections of the story. Here’s the code which we’ll use to stack two Dense layers ( with a GELU ) activation thereby adding a MLP to the existing layers ( x ),

Snippet 1: The MLP

Note: The dropout won’t be seen in the figure 3. We add it so as to regularize our model. You can also notice them in this Keras example.

Note, most of us would be thinking that Dense layers accept inputs of shape ( batch_size , input_dims ) and outputs tensors of shape ( batch_size, output_dims ) . But in our case, these Dense layers will receive inputs shapes of 3 dimensions meaning of shape ( batch_size , num_patches , channels ) or its transpose ( batch_size , channels , num_patches ) .

Fig 4: Information regarding the input/output shapes for the Dense layer. Source: TensorFlow documentation for Dense layer.

We’ll learn more on num_channels and num_patches in later sections of story.

GELU ( Gaussian Error Linear Unit ) Activation

Modern but not-so-popular activation function

Fig 5: Graphs of ReLU, GELU and ELU. Source: Gaussian Error Linear Units (GELUs) on ArXiv

Gaussian Error Linear Units is an activation function which weighs inputs using the standard Gaussian cumulative distribution function. In case of ReLU ( Rectified Linear Units ), the inputs are weighed using their sign.

Fig 6: ReLU and GELU activation functions. Source: Created by Author.

The inputs have a high probability of being dropped out as x decreases. The transformation on x is stochastic yet it depends on the value of x.

Fig 7: Approximation to GELU, provided by the authors. Source: Gaussian Error Linear Units (GELUs) on ArXiv

The reason behind choosing a Normal distribution is that when using a Batch Normalization layer, the outputs of the neurons follow a Normal distribution. The GELU activation is widely used in Transformer models for solving NLP problems.

As observed in snippet 1, we’ll use tf.nn.gelu to add GELU activation to the MLPs. If you want a Keras layer, there is a tfa.layers.GELU layer in the TensorFlow Addons package.

Here’s a nice blog explaining various activation functions ( including GELU ),

Activation Functions Explained - GELU, SELU, ELU, ReLU and more

During the calculations of the values for activations in each layer, we use an activation function right before…

mlfromscratch.com

🔧 MLP-Mixer Architecture Components

We’ll discuss each component in detail and then integrate all of them in one piece of code.

⚙️Mixer Layers

Snippet 2: Mixer layer

Mixer layers are the building blocks of the MLP-Mixer architecture. Each Mixer layer contains two MLPs, one for token mixing and another for channel mixing. Alongside, MLPs, you’ll notice Layer Normalization, skip-connections and the “T” written above the arrow. It refers to the transpose* of the tensor, keeping the batch dimension intact.

transpose*: We’ll use the tf.keras.layers.Permute layer to perform transposition by setting dims=[ 2 , 1 ] in the arguments of this layer. We won’t discuss this in detail in coming sections.

I recommend you to absorb the diagram thoroughly, as I’ll refer it now and then, while discussing the components. In the following sections, we discuss,

What are Patches ( inputs of the Mixer layer )
Token Mixing MLPs
Channel Mixing MLPs
Layer Normalization

🧽 What are Patches?

And how do we create them using an RGB image ( which are the typical inputs of your MLP-Mixer model )?

A Mixer layer would take in a tensor of shape ( batch_size , num_patches , num_channels ) and also produce an output of the same shape. You might wonder that how we can produce such a tensor from an RGB image ( which is the actual input of the MLP-Mixer model )? Refer to the diagram below,

Fig 9: Creating Patches from an image using 2D convolutions. Source: Created by Author

Suppose, we are given an RGB image of size 4 * 4. We create patches, which are non-overlapping*, using a 2D convolution. Suppose, we need square patches of size 2 * 2. As seen in fig. 9, we can create 4 non-overlapping patches from the 4 * 4 input image ( a patch is shaded for you in the diagram ). Also, using C filters, we transform the input image of size image_dims * image_dims * 3 to a tensor of shape num_patches * num_patches * C , where num_patches = image_dims / patch_size . Note, we assume that image_dims is perfectly divisible by patch_size .Considering our example *, num_patches = ( 4 * 4 ) / ( 2 * 2 ) = 4.

In the example above, we have num_patches=2 .

our examples*: Special thanks to our reader Dr. Abder-Rahman Ali for pointing out the mistake in calculation of num_patches . We really appreciate his effort towards improving this story.
non-overlapping*: In order to create non-overlapping patches, we set kernel_size=patch_size and strides=patch_size in Keras’ Conv2D layer.

Fig 10: Resizing patches. Source: Created by Author

Finally, we resize the patches of shape num_patches * num_patches * C to num_patches^2 * C .

🧱 Token Mixing MLPs

As discussed earlier, each Mixer layer consists of a token-mixing MLP. We would like to understand the meaning of tokens, which is highlighted in the paper as,

It [ MLP-Mixer ] accepts a sequence of linearly projected image patches (also referred to as tokens) shaped as a “patches * channels” table as an input, and maintains this dimensionality.

Here’s the code for token-mixing MLPs ( the role of LayerNormalization and Permute could be observed from fig. 8 )

Snippet 3: Token Mixing MLPs

As the name suggests, it mixes tokens or in other words, allows communication between different patches in the same channel. As observed in fig. 11, the number of channels C isn’t modified and only P i.e. the number of patches is expanded to some dimension ( token_mixing_mlp_dims ) and brought back to P.

🧱 Channel Mixing MLPs

Channel mixing MLPs do a job similar to token mixing MLPs. They mix channel information thereby enabling communication among channels.

Snippet 4: Channel Mixing MLPs

As observed in fig. 12, the number of patches P isn’t modified and only C i.e. the number of channels is expanded to some dimension ( channel_mixing_mlp_dims ) and brought back to C.

⚖️ Layer Normalization

which is different from batch normalization

Batch Normalization uses the mean and variance of the whole batch to normalize activations. In case of layer normalization ( especially for RNNs ), the mean and variance of all summed inputs of a neuron are used to perform normalization. As mentioned in the paper,

In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity.

The TF-Keras team provides a tf.keras.layers.LayerNormalization layer to perform this operation. Here are some resources to understand Layer Normalization,

Layer Normalization Explained

Recently I came across with layer normalization in the Transformer model for machine translation and I found that a…

leimao.github.io

Different Normalization Layers in Deep Learning

Presently Deep Learning has been revolutionizing many subfields such as natural language processing, computer vision…

towardsdatascience.com

The difference between Batch Normalization and Layer Normalization - Programmer Sought

There are many explanations on the Internet, including the picture below It is not enough to understand by pictures…

www.programmersought.com

Now, with the complete knowledge of Mixer layers, we can go ahead to implement our MLP-Mixer model for classification. This model would accept an input RGB image and output class probabilities.

⚔️ The End Game

Snippet 5: Assembling the model

We’ll go through this code snippet line by line.

First create an Input layer which takes in RGB images of some desired size.
Implement a Conv2D which creates patches ( remember, we discussed this decades ago ). Also, add a Reshape layer to reshape the tokens and transform them into 3D tensors.
Add num_mixer_layers Mixer layers in the model.
Finally, a LayerNormalization layer along with a GlobalAveragePooling1D layer.
Finally, a Dense layer with our favorite softmax activation.

Here’s the output of tf.keras.utils.plot_model depicted a single Mixer Layer,

Fig 14: Output of plot_model. Source: Created by Author.

The output of model.summary() ,

Fig 15: Output of model.summary(). Source: Created by Author

That’s All, we’ve just implemented a MLP-Mixer model in TensorFlow!

💪🏼 More projects/blogs/resources from the author

Thanks

Hope you liked the story! Feel free to reach me at equipintelligence@gmail.com. Thank You and have a nice day ahead!

🤖 Deep Learning

MLP Mixer Is All You Need?

Understanding MLP-Mixers from beginning to the end, with TF Keras code

Is MLP-Mixer a CNN in Disguise?

Recently, a new kind of architecture - MLP-Mixer: An all-MLP Architecture for Vision ( Tolstikhin et al, 2021) - was…

Tweet Classification With MLP-Mixers ( TF-Keras )

Explore and run machine learning code with Kaggle Notebooks | Using data from Natural Language Processing with Disaster…

📃 Contents

👼 Dominance of Convolutions, advent of Transformers

Keras documentation: Image classification with Vision Transformer

Author: Khalid Salama Date created: 2021/01/18 Last modified: 2021/01/18 Description: Implementing the Vision…

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (Vision Transformers)

This article was published as a part of the Data Science Blogathon. Introduction While the Transformer architecture has…

Are You Ready for Vision Transformer (ViT)?

“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” May Bring Another Breakthrough to Computer…

🤠 Multilayer Perceptron ( MLP ) and the GELU activation function

Multilayer Perceptron ( MLP )

GELU ( Gaussian Error Linear Unit ) Activation

Activation Functions Explained - GELU, SELU, ELU, ReLU and more

During the calculations of the values for activations in each layer, we use an activation function right before…

🔧 MLP-Mixer Architecture Components

⚙️Mixer Layers

🧽 What are Patches?

🧱 Token Mixing MLPs

🧱 Channel Mixing MLPs

⚖️ Layer Normalization

Layer Normalization Explained

Recently I came across with layer normalization in the Transformer model for machine translation and I found that a…

Different Normalization Layers in Deep Learning

Presently Deep Learning has been revolutionizing many subfields such as natural language processing, computer vision…

The difference between Batch Normalization and Layer Normalization - Programmer Sought

There are many explanations on the Internet, including the picture below It is not enough to understand by pictures…

⚔️ The End Game

💪🏼 More projects/blogs/resources from the author

Thanks

Written by Shubham Panchal