Understanding Depthwise Separable Convolutions and the efficiency of MobileNets

Explanation of MobileNets and Depthwise Separable Convolutions

Arjun Sarkar
Towards Data Science


Introduction:

In convolutional neural networks (CNNs), the 2D convolution is the most frequently used convolutional layer. MobileNet is a CNN architecture that is both much faster and much smaller than typical models, because it is built on a different kind of convolutional layer, known as the Depthwise Separable convolution. Because of their small size, these models are considered particularly well suited to mobile and embedded devices, hence the name MobileNet.

The original paper is MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications [1].

Table 1. MobileNet parameter and accuracy comparison against GoogLeNet and VGG 16 (Source: Table from the original paper)

Table 1 clearly shows that even the full-size MobileNet architecture has fewer parameters than these famous CNN architectures, and the smaller MobileNets have as few as 1.3 million parameters. The models are also far smaller on disk: while a VGG16 model can take up to 500 MB, a MobileNet needs just 16–18 MB, which makes it ideal for loading onto mobile devices.

Figure 1. MobileNets can be used for Object Detection, Classification, Attribute detection, or even Landmark recognition on mobile devices (Source: Image from the original paper)

Depthwise Convolutions:

Differences

The main difference is that a standard 2D convolution is performed over all input channels at once, mixing them into each output, whereas a Depthwise convolution applies a separate filter to each channel and keeps the channels separate.

Approach

  1. The 3D input tensor is split into its separate channels
  2. Each channel is convolved with its own 2D filter
  3. The per-channel outputs are stacked back together to form the 3D output tensor (a minimal code sketch follows this list)
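
In TensorFlow, these three steps are handled by a single Keras layer. A minimal sketch, with illustrative shapes of my own choosing (an 8x8 input with 3 channels and a 5x5 kernel):

```python
import tensorflow as tf

# Illustrative input: a batch of one 8x8 feature map with 3 channels.
inputs = tf.random.normal([1, 8, 8, 3])

# DepthwiseConv2D applies one 5x5 filter per input channel and never
# mixes information across channels, so the channel count is unchanged.
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=5, padding="same")

outputs = depthwise(inputs)
print(outputs.shape)  # (1, 8, 8, 3): same number of channels as the input
```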

Graphical Description

Figure 2. Diagrammatic explanation of Depthwise Convolutions (Source: Image created by author)

Depthwise Separable Convolutions:

Depthwise convolutions are rarely applied on their own; they form the first half of a Depthwise Separable Convolution, which consists of two parts: 1. Filtering (the depthwise convolution described above) and 2. Combining (a 1x1 pointwise convolution that mixes the channels into any desired number 'n' of output channels; in the example below, the 3 channels are combined into a 1-channel output).

Figure 3. Diagrammatic explanation of Depthwise Separable Convolutions (Source: Image created by author)
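
Continuing the sketch above, the two parts map onto two Keras layers: a DepthwiseConv2D for filtering and a 1x1 Conv2D for combining (Keras also fuses both into SeparableConv2D). The shapes mirror Figure 3 and are illustrative only:

```python
import tensorflow as tf

inputs = tf.random.normal([1, 8, 8, 3])

# 1. Filtering: one 5x5 filter per input channel (the depthwise convolution).
filtered = tf.keras.layers.DepthwiseConv2D(kernel_size=5, padding="same")(inputs)

# 2. Combining: a 1x1 pointwise convolution mixes the 3 channels into the
#    desired number of output channels (1 here, as in Figure 3).
combined = tf.keras.layers.Conv2D(filters=1, kernel_size=1)(filtered)

print(combined.shape)  # (1, 8, 8, 1)

# Keras also offers both steps as a single fused layer:
fused = tf.keras.layers.SeparableConv2D(filters=1, kernel_size=5, padding="same")(inputs)
print(fused.shape)  # (1, 8, 8, 1)
```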

Why is Depthwise Separable Convolution so efficient?

A Depthwise Separable Convolution is a depthwise convolution followed by 1x1 (pointwise) convolutions across all channels.

Let's assume that we have an input tensor of size 8x8x3, convolved with a 5x5 kernel (with padding, so the spatial size is preserved),

and the desired output tensor is of size 8x8x256.

In 2D Convolutions

Number of multiplications required — (8x8) x (5x5x3) x (256) = 1,228,800

In Depthwise Separable Convolutions

The number of multiplications required:

a. Filtering: the input is split into single channels, so a 5x5x1 filter is used in place of a 5x5x3 filter, and since there are three channels, three such 5x5x1 filters are required, so,

(8x8) x (5x5x1) x (3) = 4,800

b. Combining: the desired number of output channels is 256, so,

(8x8) x (1x1x3) x (256) = 49,152

Total number of multiplications = 4,800 + 49,152 = 53,952

So a 2D convolution requires 1,228,800 multiplications, while a Depthwise Separable convolution requires only 53,952 multiplications to produce an output of the same shape.

Finally,

1,228,800 / 53,952 ≈ 23x fewer multiplications required
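
As a quick sanity check, the counts above can be reproduced in a few lines. The ratio also matches the general reduction factor given in the paper, 1/N + 1/Dk² (with N = 256 output channels and Dk = 5, that is 1/256 + 1/25 ≈ 0.044, i.e. roughly 23x fewer multiplications):

```python
# Reproducing the multiplication counts from the example above.
h, w = 8, 8           # output spatial size
k = 5                 # kernel size
c_in, c_out = 3, 256  # input / output channels

standard = (h * w) * (k * k * c_in) * c_out     # standard 2D convolution
filtering = (h * w) * (k * k * 1) * c_in        # depthwise step
combining = (h * w) * (1 * 1 * c_in) * c_out    # pointwise step

print(standard)                            # 1228800
print(filtering + combining)               # 53952
print(standard / (filtering + combining))  # ~22.8, i.e. roughly 23x
```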

This is why Depthwise Separable convolutions are so efficient. They are the layers implemented throughout the MobileNet architecture to cut down computation and power consumption, so that the models can run on mobile and embedded devices that do not have powerful graphics processing units.

MobileNet

The MobileNet architecture achieves its speed and small size through two ideas:

  1. Using Depthwise Separable Convolutions
  2. Using two Shrinking Hyperparameters (a code sketch of both follows Table 3):

a. The width multiplier (called α in the paper), which thins the number of channels at every layer

Table 2. Width multiplier in MobileNet (Source: Table from the original paper)

b. The resolution multiplier (called ρ in the paper), which scales down the spatial dimensions of the input image and, with it, of every feature map

Table 3. Resolution multiplier in MobileNet (Source: Table from the original paper)
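
Both hyperparameters are exposed in tf.keras.applications.MobileNet: alpha is the width multiplier, while the resolution multiplier is applied implicitly by passing a smaller input_shape. A minimal sketch (weights=None because pretrained weights exist only for certain size/alpha combinations):

```python
import tensorflow as tf

# A half-width, reduced-resolution MobileNet.
small_mobilenet = tf.keras.applications.MobileNet(
    input_shape=(128, 128, 3),  # resolution multiplier: 128x128 instead of 224x224
    alpha=0.5,                  # width multiplier: half the channels in every layer
    weights=None,               # random initialization for this sketch
)

# Compare parameter counts against the full-size model.
full_mobilenet = tf.keras.applications.MobileNet(weights=None)
print(small_mobilenet.count_params(), full_mobilenet.count_params())
```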

Architecture: The first layer of MobileNet is a full (standard) convolution, while every following layer is a Depthwise Separable Convolutional layer. All layers are followed by batch normalization and a ReLU activation, except the final fully connected layer, which feeds into a softmax for classification. The full architecture is shown in Table 4.

Table 4. MobileNet architecture (Source: table from the original paper)
Figure 4. Left: Standard Convolutional layer, Right: Depthwise Separable Convolutional layers in MobileNet (Source: image from the original paper)

Figure 4 shows the difference in flow between a standard CNN layer and a MobileNet layer. On the left, a 3x3 convolution is followed by batch normalization and ReLU; on the right, the Depthwise Separable layer consists of a 3x3 depthwise convolution with its batch norm and ReLU, followed by a 1x1 pointwise convolution with its own batch norm and ReLU.
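
This block translates almost line for line into Keras. A minimal sketch of the right-hand side of Figure 4 (the helper name mobilenet_block is my own, not from the paper):

```python
import tensorflow as tf
from tensorflow.keras import layers

def mobilenet_block(x, filters, strides=1):
    """One Depthwise Separable block as drawn on the right of Figure 4:
    3x3 depthwise conv -> BN -> ReLU -> 1x1 conv -> BN -> ReLU."""
    x = layers.DepthwiseConv2D(kernel_size=3, strides=strides,
                               padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # The pointwise convolution sets the number of output channels.
    x = layers.Conv2D(filters, kernel_size=1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x

# Example usage: one block applied to an 8x8 feature map with 3 channels.
x = tf.random.normal([1, 8, 8, 3])
print(mobilenet_block(x, filters=64).shape)  # (1, 8, 8, 64)
```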

Conclusion:

MobileNets are very efficient, small deep learning architectures designed specifically for mobile devices. Their small size comes with a trade-off in accuracy compared with larger convolutional architectures, but the gap is very small. For example, on the Stanford Dogs dataset, Inception V3 reaches an accuracy of 84%, while the largest MobileNet reaches 83.3%; yet Inception V3 has 23.2 million parameters, against MobileNet's 3.3 million. It is also possible to build even smaller and faster MobileNet variants simply by lowering the width multiplier or the resolution multiplier. This makes MobileNets highly sought-after deep learning models for mobile and embedded devices.

Next, I intend to show how to create a MobileNet architecture from scratch in Python, using TensorFlow.

References:

  1. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861
