Exploring MobileNets: From Paper To Keras

Dismantling MobileNets to see what makes them so lightweight and efficient.

Shubham Panchal
Towards Data Science


MobileNets are popular architectures used for image classification, face detection, segmentation and much more. They are known for their low latency on mobile and embedded devices, as you may infer from the name “MobileNet”. They have far fewer trainable parameters than standard convolutional networks because they use Separable Convolutions. If you’re running an image classification model on live frames from the camera, you’ll probably need something which is fast, accurate and light on the mobile device’s memory.

Uses of MobileNets. Source.

Today, we’ll create a MobileNet model right from its research paper in TensorFlow ( Keras! ). Note the points below before reading the story further,

  1. If you’re unfamiliar with the concept of Separable Convolutions, read “A Basic Introduction to Separable Convolutions”.
  2. The Colab notebook contains the TF implementation of MobileNet V1 only.
  3. All terms written in bold and italics like “sample” can be found in the research paper directly.
  4. You can also see the MobileNet implementation in the tensorflow/models repo on GitHub.

I’d suggest you open the TensorFlow implementation of MobileNet in another tab,

MobileNets use Separable Convolutions. But what are Separable Convolutions? What’s “separable” in them?

Separable convolutions consist of two (separated) convolutions underneath: a depthwise convolution and a pointwise convolution. The depthwise convolution takes in a feature map and runs a kernel over each of its input channels independently. The pointwise convolution then changes the number of output channels.

I will highly recommend reading A Basic Introduction to Separable Convolutions by Chi-Feng Wang.

In the paper, all convolutions are considered padded, so the input and output feature maps of a convolution have the same size. Hence, in the following two diagrams, Df − Dₖ + 1 is equal to Df.

Depthwise Convolutions

Depthwise Convolutions.

Suppose we have M square input feature maps, each of size Df × Df. Using a kernel of size Dₖ × Dₖ, we produce an output feature map of size ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) ( assuming no padding and a stride of 1 ). We repeat this for all M input feature maps, and at the end we are left with a feature map of dimensions ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) × M. Note that we use M different kernels, one for each channel of the input feature map. This is our “depthwise convolution”.
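
As a quick sanity check, here’s a minimal sketch using Keras’ DepthwiseConv2D layer ( the sizes Df = 32, Dₖ = 3 and M = 16 are made up for illustration ),

```python
import tensorflow as tf

# M = 16 input feature maps of size Df = 32.
x = tf.random.normal((1, 32, 32, 16))

# One Dk x Dk kernel per input channel; no padding, stride 1.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3, strides=1, padding="valid")
y = depthwise(x)

print(y.shape)  # (1, 30, 30, 16) -> ( Df - Dk + 1 ) x ( Df - Dk + 1 ) x M
```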

The number of multiplications ( or “computational cost” as mentioned in the paper ) will be,

Dₖ · Dₖ · M · Df · Df

When you have only trained your model for a single epoch and you run out of memory!

Pointwise Convolutions

Pointwise Convolutions.

The output feature map produced above has M channels, whereas we require N output channels. So, to change the output depth, we use a 1 × 1 convolution. These are called “pointwise convolutions”. We use a kernel of size 1 × 1 × M and produce a single feature map of size ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) × 1. We repeat this N times and are left with an output feature map of size ( Df − Dₖ + 1 ) × ( Df − Dₖ + 1 ) × N. The computational cost is,

M · N · Df · Df
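
Continuing the sketch above, a 1 × 1 convolution changes only the channel dimension ( N = 32 is again a made-up size ),

```python
import tensorflow as tf

# y stands in for the ( 1 , 30 , 30 , 16 ) output of the depthwise sketch above.
y = tf.random.normal((1, 30, 30, 16))

# Pointwise convolution: 1 x 1 kernels mapping M = 16 channels to N = 32.
pointwise = tf.keras.layers.Conv2D(filters=32, kernel_size=1, strides=1)
z = pointwise(y)

print(z.shape)  # (1, 30, 30, 32) -> spatial size unchanged, channels M -> N
```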

For a standard convolution, the computational cost would have been,

Dₖ · Dₖ · M · N · Df · Df

For a separable convolution ( depthwise + pointwise ), the computational cost will be,

Dₖ · Dₖ · M · Df · Df + M · N · Df · Df

The paper also calculates the resulting reduction in computation: dividing the separable cost by the standard cost gives 1/N + 1/Dₖ².

Source
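
To get a feel for the savings, here’s a quick back-of-the-envelope check with the 3 × 3 kernels MobileNet uses and an illustrative N = 256,

```python
# Reduction factor of a separable convolution over a standard one: 1/N + 1/Dk^2.
Dk, N = 3, 256
reduction = 1 / N + 1 / Dk ** 2
print(reduction)      # ~0.115
print(1 / reduction)  # ~8.7x fewer multiplications, matching the paper's "8 to 9 times"
```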

Width and Resolution Multipliers

Although MobileNet already has considerably fewer trainable parameters, you may still want it to be faster. For that, the paper introduces a parameter called the “width multiplier”, denoted by α. With it, any layer in the model receives αM input feature maps and produces αN output feature maps. It makes the MobileNet model thinner and reduces both computation and latency. For simplicity, we’ll set α = 1.0, where α ∈ ( 0 , 1 ].

Width multiplier as described in the paper. Source.

To further reduce the computational cost, the paper also introduces a “Resolution Multiplier”, denoted by ρ, with ρ ∈ ( 0 , 1 ]. It scales down the input image and, consequently, the internal representation of every layer.

Resolution multiplier as described in the paper. Source.
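
In code, both multipliers boil down to simple scalings. Here’s a minimal sketch ( the scaled_filters helper and its rounding rule are my assumptions for illustration, not from the paper ),

```python
def scaled_filters(filters: int, alpha: float) -> int:
    # Width multiplier: thin a layer by scaling its channel count.
    return max(1, int(filters * alpha))

alpha, rho = 1.0, 1.0        # the defaults used in this story
input_size = int(224 * rho)  # resolution multiplier shrinks the input image

print(scaled_filters(64, alpha), input_size)  # 64 224
```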

The TensorFlow Implementation ( With Keras )

First, we’ll have a look at the architecture from the paper itself. It looks like this,

MobileNet architecture. Source

Here, “Conv /s2” represents a standard convolutional layer ( not depthwise ) with a stride of 2, while “Conv dw /s1” denotes a depthwise convolution with a stride of 1. Every convolution is followed by Batch Normalization and a ReLU activation.

A standard convolutional layer with BatchNorm and ReLU ( left ). A separable convolution ( right ) with depthwise and 1 × 1 ( pointwise ) convolutions. Source

The implementation in Keras will look like this,

Separable Convolutions.
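
In case the embedded gist doesn’t render for you, here’s a minimal sketch of such blocks with the Keras functional API ( the helper names conv_block and separable_conv_block are mine, not necessarily those of the original notebook ),

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters, strides):
    # "Conv /s" rows of the table: standard convolution -> BatchNorm -> ReLU.
    x = layers.Conv2D(filters, kernel_size=3, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def separable_conv_block(x, filters, strides):
    # "Conv dw /s" rows: depthwise convolution, then a 1 x 1 pointwise
    # convolution, each followed by BatchNorm and ReLU.
    x = layers.DepthwiseConv2D(kernel_size=3, strides=strides,
                               padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, kernel_size=1, strides=1,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)
```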

Finally, the model is assembled with all 28 layers ( as counted in the paper ) packed within,

Assembling the model.
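
A sketch of the assembly, following Table 1 of the paper and reusing the blocks above ( num_classes = 3 anticipates the Rock Paper Scissors dataset used below ),

```python
def mobilenet(input_shape=(224, 224, 3), num_classes=3):
    # Layer sequence from Table 1 of the paper, with alpha = 1.0.
    inputs = tf.keras.Input(shape=input_shape)
    x = conv_block(inputs, 32, strides=2)
    x = separable_conv_block(x, 64, strides=1)
    x = separable_conv_block(x, 128, strides=2)
    x = separable_conv_block(x, 128, strides=1)
    x = separable_conv_block(x, 256, strides=2)
    x = separable_conv_block(x, 256, strides=1)
    x = separable_conv_block(x, 512, strides=2)
    for _ in range(5):  # the five repeated 512-channel blocks
        x = separable_conv_block(x, 512, strides=1)
    x = separable_conv_block(x, 1024, strides=2)
    x = separable_conv_block(x, 1024, strides=1)
    x = layers.GlobalAveragePooling2D()(x)  # "Avg Pool" row of the table
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = mobilenet()
model.summary()
```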

We’ll train our model on the Rock Paper Scissors dataset by Laurence Moroney. It’s hosted on TensorFlow Datasets for convenience. The original MobileNet was evaluated on a number of datasets, including ImageNet.
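
Loading it takes a couple of lines with tensorflow_datasets ( the preprocessing here is a deliberately minimal stand-in; see the Colab notebook for the full training pipeline ),

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Three classes: rock, paper and scissors.
train_ds = tfds.load("rock_paper_scissors", split="train", as_supervised=True)

def preprocess(image, label):
    image = tf.image.resize(image, (224, 224)) / 255.0  # resize and scale to [ 0 , 1 ]
    return image, label

train_ds = train_ds.map(preprocess).shuffle(1024).batch(32)
```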

You will find the training part in the Colab notebook. Congrats! You’ve just created a MobileNet from scratch.

Exploring Further…

The End

Source.

I hope you liked the concept of MobileNet. Do read the research paper thoroughly if you want to train the model on bigger datasets; it also lists the hyperparameters the authors used, which will help you with training. Thanks, and goodbye!
