Exploring MobileNets: From Paper To Keras
Dismantling MobileNets to see what makes them so lightweight and efficient.
MobileNets are popular architectures used for image classification, face detection, segmentation and many more tasks. They are known for their low latency on mobile and embedded devices, as you may infer from the name “MobileNet”. They have far fewer trainable parameters because they use Separable Convolutions. If you’re running an image classification model on live frames from the camera, you’ll probably need something that is fast, accurate and light on the device’s memory.
Today, we’ll create a MobileNet model right from its research paper to TensorFlow ( Keras! ). You may ensure the points below before reading the story further,
- If you’re untouched by the concept of Separable Convolutions, read “A Basic Introduction to Separable Convolutions”.
- The Colab notebook contains the TF implementation of MobileNet V1 only.
- All terms written in bold and italics, like “sample”, can be found directly in the research paper.
- You may see the MobileNet implementation on tensorflow/models repo on GitHub.
I’d suggest opening the TensorFlow implementation of MobileNet in another tab,
MobileNets use Separable Convolutions. But what are Separable Convolutions? What’s “separable“ in them?
Separable convolutions consist of two (separated) convolutions underneath: depthwise convolutions and pointwise convolutions. A depthwise convolution takes in a feature map and runs a separate kernel on each of the input channels. A pointwise convolution then mixes those channels to produce the desired number of output channels.
I will highly recommend reading A Basic Introduction to Separable Convolutions by Chi-Feng Wang.
In the paper, all convolutions are considered padded, so the input and output feature maps have the same size. Hence, in the following two diagrams, Df − Dₖ + 1 is equal to Df.
Depthwise Convolutions
Suppose we have M square feature maps of size Df. Using a kernel of size Dₖ, we produce an output feature map of size Df − Dₖ + 1 ( assuming no padding and a stride of 1 ). We repeat this for all M input feature maps, and at the end we are left with a feature map of dimensions ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) × M. Note that we use M different kernels for the M channels of the input feature map. This is our “depthwise convolution”.
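To make the shape bookkeeping concrete, here is a minimal NumPy sketch of a depthwise convolution. It uses plain loops for clarity rather than speed, and the function name is mine, not from the Colab notebook:

```python
import numpy as np

def depthwise_conv(feature_map, kernels):
    """Depthwise convolution: each input channel gets its own kernel.

    feature_map: (Df, Df, M), kernels: (Dk, Dk, M).
    No padding, stride 1, so the output is (Df - Dk + 1, Df - Dk + 1, M).
    """
    Df, _, M = feature_map.shape
    Dk = kernels.shape[0]
    out_size = Df - Dk + 1
    out = np.zeros((out_size, out_size, M))
    for m in range(M):  # one kernel per input channel, channels never mix
        for i in range(out_size):
            for j in range(out_size):
                patch = feature_map[i:i + Dk, j:j + Dk, m]
                out[i, j, m] = np.sum(patch * kernels[:, :, m])
    return out

x = np.random.rand(8, 8, 3)   # Df = 8, M = 3
k = np.random.rand(3, 3, 3)   # Dk = 3
print(depthwise_conv(x, k).shape)  # (6, 6, 3): Df - Dk + 1 = 6, still M = 3
```

Notice the output keeps M channels; the depthwise step alone cannot change the depth, which is exactly why a pointwise convolution follows it.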
The number of multiplications ( or “computational cost”, as mentioned in the paper ) will be Dₖ × Dₖ × M × Df × Df.
Pointwise Convolutions
The output feature map produced above has M channels, whereas we require N output channels. So, to change the output depth, we use a 1 × 1 convolution. These are called “pointwise convolutions”. We use a kernel of size 1 × 1 × M and produce a single feature map of size ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) × 1. We repeat this N times and are left with an output feature map of size ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) × N. The computational cost is M × N × Df × Df.
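Since a pointwise convolution applies a 1 × 1 × M kernel at every spatial position, it reduces to a matrix multiplication over the channel axis. A small NumPy sketch (the names are mine, chosen for illustration):

```python
import numpy as np

def pointwise_conv(feature_map, kernels):
    """Pointwise (1x1) convolution: mixes M input channels into N outputs.

    feature_map: (D, D, M), kernels: (M, N). Equivalent to the same
    matrix multiplication applied at every spatial position.
    """
    return feature_map @ kernels  # (D, D, M) @ (M, N) -> (D, D, N)

x = np.random.rand(6, 6, 3)   # e.g. the output of a depthwise step, M = 3
w = np.random.rand(3, 16)     # N = 16 output channels
print(pointwise_conv(x, w).shape)  # (6, 6, 16)
```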
For a standard convolution, the computational cost would have been Dₖ × Dₖ × M × N × Df × Df.
For a separable convolution ( depthwise + pointwise ), the computational cost will be Dₖ × Dₖ × M × Df × Df + M × N × Df × Df.
The paper also calculates the resulting reduction: the separable convolution costs 1/N + 1/Dₖ² times as much as a standard convolution.
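We can check the paper’s reduction with a quick back-of-the-envelope computation. The layer sizes below are illustrative, not taken from the paper:

```python
# Cost in multiplications for one layer (stride 1, padded), using the
# paper's symbols: Dk = kernel size, Df = feature-map size,
# M = input channels, N = output channels.
Dk, Df, M, N = 3, 14, 512, 512

standard = Dk * Dk * M * N * Df * Df
separable = Dk * Dk * M * Df * Df + M * N * Df * Df

ratio = separable / standard
print(ratio)              # ~0.113: roughly 8-9x fewer multiplications
print(1 / N + 1 / Dk**2)  # the paper's closed form, 1/N + 1/Dk^2
```

With a 3 × 3 kernel the 1/Dₖ² term dominates, which is where the often-quoted “8 to 9 times less computation” figure comes from.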
Width and Resolution Multipliers
Although MobileNet already has considerably fewer trainable parameters, you may want it to be even faster. For that, the paper introduces a parameter called the “width multiplier”, denoted by α. With it, any layer in the model receives αM input feature maps and produces αN output feature maps. It makes the MobileNet model thinner and reduces the computational cost ( and hence the latency ). For simplicity, we’ll set α = 1.0, where α ∈ ( 0 , 1 ].
To further reduce the computational cost, they also introduce a “resolution multiplier”, denoted by ρ. It scales down the input image, and thereby the internal representation of every layer.
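Plugging α and ρ into the separable-convolution cost gives the paper’s expression Dₖ × Dₖ × αM × ρDf × ρDf + αM × αN × ρDf × ρDf. A quick sketch with illustrative numbers:

```python
# Effect of the width multiplier (alpha) and resolution multiplier (rho)
# on one depthwise separable layer. Layer sizes are illustrative.
Dk, Df, M, N = 3, 14, 512, 512

def separable_cost(alpha=1.0, rho=1.0):
    aM, aN, rDf = alpha * M, alpha * N, rho * Df
    return Dk * Dk * aM * rDf**2 + aM * aN * rDf**2

base = separable_cost()
print(separable_cost(alpha=0.5) / base)  # roughly quadratic saving in alpha
print(separable_cost(rho=0.5) / base)    # exactly rho^2 = 0.25
```

The resolution multiplier scales the cost by exactly ρ², while the width multiplier scales it roughly by α² (the pointwise term is quadratic in α, the depthwise term linear).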
The TensorFlow Implementation ( With Keras )
First, we’ll have a look at the architecture from the paper itself. It looks like this,
Here, “Conv /s2” represents a standard convolutional layer ( not depthwise ) with a stride of 2. “Conv dw /s1” denotes a depthwise convolution with a stride of 1, paired with the 1 × 1 pointwise convolution in the following row. Every convolution is followed by Batch Normalization and a ReLU activation.
The implementation in Keras will look like this,
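The original gist is not reproduced here, but the stem plus one depthwise separable block could be sketched in Keras as follows. The layer structure (depthwise 3 × 3, BN, ReLU, then pointwise 1 × 1, BN, ReLU) follows the paper; the function names are my own, and the Colab notebook may differ in details:

```python
import tensorflow as tf
from tensorflow.keras import layers

def separable_block(x, filters, strides=1):
    # "Conv dw": one 3x3 kernel per input channel
    x = layers.DepthwiseConv2D(3, strides=strides, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # pointwise 1x1 convolution mixes channels into `filters` outputs
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", use_bias=False)(inputs)  # "Conv /s2"
x = layers.BatchNormalization()(x)
x = layers.ReLU()(x)
x = separable_block(x, 64)  # "Conv dw /s1" followed by its pointwise "Conv /s1"
print(x.shape)  # (None, 112, 112, 64)
```

Biases are omitted ( `use_bias=False` ) because each convolution is immediately followed by Batch Normalization, which has its own offset.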
Finally, the model is made with all 29 layers packed within,
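As a sketch of how those layers could be packed together, the filter-and-stride schedule below follows the architecture table in the paper; the function names are mine, and the notebook may organize things differently. `num_classes=1000` matches ImageNet; swap it for your dataset (e.g. 3 for Rock Paper Scissors):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, strides):
    # the standard "Conv /s2" stem
    x = layers.Conv2D(filters, 3, strides=strides, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def separable_block(x, filters, strides=1):
    # depthwise 3x3 -> BN -> ReLU, then pointwise 1x1 -> BN -> ReLU
    x = layers.DepthwiseConv2D(3, strides=strides, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_mobilenet(num_classes=1000):
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = conv_bn_relu(inputs, 32, strides=2)
    # (filters, stride) for each separable block, per the paper's table
    config = [(64, 1), (128, 2), (128, 1), (256, 2), (256, 1), (512, 2)]
    config += [(512, 1)] * 5
    config += [(1024, 2), (1024, 1)]
    for filters, strides in config:
        x = separable_block(x, filters, strides)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_mobilenet()
print(model.count_params())  # ~4.2M parameters, in line with the paper
```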
We’ll train our model on the Rock Paper Scissors dataset by Laurence Moroney. It’s hosted on TensorFlow Datasets for our ease. The original MobileNet was evaluated on a number of datasets including ImageNet.
You will find the training part in the Colab notebook. Congrats! You’ve just created a MobileNet from scratch.
Exploring Further…
The End
I hope you liked the concept of MobileNet. You may read the research paper thoroughly if you want to train the model on larger datasets; it also includes some hyperparameters that will help you with training. Thanks and goodbye.