Review: Xception — With Depthwise Separable Convolution, Better Than Inception-v3 (Image Classification)
In this story, Xception [1] by Google, which stands for Extreme version of Inception, is reviewed. With a modified depthwise separable convolution, it performs even better than Inception-v3 [2] (also by Google, 1st runner-up in ILSVRC 2015) on both the ImageNet ILSVRC and JFT datasets. Though it is a 2017 CVPR paper which was published just last year, it had already received more than 300 citations when I was writing this story. (Sik-Ho Tsang @ Medium)
What Are Covered
- Original Depthwise Separable Convolution
- Modified Depthwise Separable Convolution in Xception
- Overall Architecture
- Comparison with State-of-the-art Results
1. Original Depthwise Separable Convolution
The original depthwise separable convolution is a depthwise convolution followed by a pointwise convolution.
- Depthwise convolution is the channel-wise n×n spatial convolution. Suppose, as in the figure above, we have 5 channels; then we will have 5 separate n×n spatial convolutions, one per channel.
- Pointwise convolution is simply a 1×1 convolution used to change the channel dimension.
Compared with conventional convolution, we do not need to perform convolution across all channels. That means the number of connections is smaller and the model is lighter.
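To see how much lighter, here is a small sketch (my own illustration, not from the paper) that counts the weights of a standard convolution versus a depthwise separable one:

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: every output channel mixes all input channels
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise part: one k×k spatial filter per input channel
    depthwise = k * k * c_in
    # Pointwise part: 1×1 convolution mapping c_in -> c_out channels
    pointwise = c_in * c_out
    return depthwise + pointwise

# Example: 3×3 kernel, 256 input and 256 output channels
print(conv_params(3, 256, 256))                 # 589824
print(depthwise_separable_params(3, 256, 256))  # 67840
```

For this typical layer size, the separable version uses roughly 8.7× fewer weights, which is where the "lighter model" claim comes from.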
2. Modified Depthwise Separable Convolution in Xception
The modified depthwise separable convolution is a pointwise convolution followed by a depthwise convolution. This modification is motivated by the Inception module in Inception-v3, in which the 1×1 convolution is performed before any n×n spatial convolutions. Thus, it is a bit different from the original one. (n=3 here since 3×3 spatial convolutions are used in Inception-v3.)
Two minor differences:
- The order of operations: As mentioned, the original depthwise separable convolution as usually implemented (e.g. in TensorFlow) performs the channel-wise spatial convolution first and then the 1×1 convolution, whereas the modified version performs the 1×1 convolution first, followed by the channel-wise spatial convolution. The authors argue this is unimportant because, when the modules are used in a stacked setting, differences appear only at the beginning and the end of the whole chain.
- The presence/absence of non-linearity: In the original Inception module, there is a non-linearity after the first operation. In Xception's modified depthwise separable convolution, there is NO intermediate ReLU non-linearity.
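The two differences above can be sketched with a toy numpy implementation (my own illustration under simplified assumptions: 'valid' padding, no strides, no batch dimension), showing the modified order with no activation in between:

```python
import numpy as np

def pointwise(x, w):
    # x: (H, W, C_in), w: (C_in, C_out) — a 1×1 convolution is a per-pixel matmul
    return x @ w

def depthwise(x, k):
    # x: (H, W, C), k: (kh, kw, C) — one spatial filter per channel, 'valid' padding
    H, W, C = x.shape
    kh, kw, _ = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1, C))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]        # (kh, kw, C)
            out[i, j] = (patch * k).sum(axis=(0, 1))  # per-channel correlation
    return out

def modified_separable_conv(x, w_point, k_depth):
    # Modified order: 1×1 first, then channel-wise spatial convolution,
    # with NO ReLU in between (the variant Xception found best)
    return depthwise(pointwise(x, w_point), k_depth)
```

A real implementation would of course use an optimized op (e.g. a separable convolution layer in a deep learning framework) rather than explicit loops; this sketch is only meant to make the order of operations concrete.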
The modified depthwise separable convolution is tested with different activation units. As shown in the figure above, Xception without any intermediate activation has the highest accuracy compared with the variants using either ELU or ReLU.
3. Overall Architecture
As in the figure above, SeparableConv is the modified depthwise separable convolution. We can see that SeparableConvs are treated as Inception Modules and placed throughout the whole deep learning architecture.
And there are residual (or shortcut/skip) connections, originally proposed in ResNet [3], placed in all flows.
The authors also test a non-residual version of Xception. From the figure above, we can see that accuracy is much higher when residual connections are used. Thus, the residual connection is extremely important!
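The residual wiring itself is simple: the output of the SeparableConv stack is added element-wise to a shortcut branch. A minimal sketch (my own illustration; the names `residual_block` and `project` are hypothetical, and real shortcuts in Xception use a strided 1×1 convolution when shapes change):

```python
def residual_block(x, f, project=None):
    # x: input feature (here just a list of floats, standing in for a tensor)
    # f: the SeparableConv stack being wrapped
    # project: optional 1×1-conv-like mapping to match shapes when f changes them
    shortcut = project(x) if project is not None else x
    y = f(x)
    # element-wise addition of the main branch and the shortcut
    return [a + b for a, b in zip(y, shortcut)]

# Toy usage: the "stack" just doubles every value
print(residual_block([1.0, 2.0], lambda t: [2 * v for v in t]))  # [3.0, 6.0]
```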
4. Comparison with State-of-the-art Results
Two datasets are tested: ImageNet ILSVRC and JFT.
4.1. ImageNet — ILSVRC
ImageNet is a dataset of over 15 million labeled high-resolution images in around 22,000 categories.
ILSVRC uses a subset of ImageNet of around 1000 images in each of 1000 categories. In all, there are roughly 1.3 million training images, 50,000 validation images and 100,000 testing images.
Xception outperforms VGGNet [4], ResNet [3], and Inception-v3 [2]. (If interested, please also visit my reviews about them, ads again, lol)
It is noted that, in terms of error rate rather than accuracy, the relative improvement is not small!
Of course, from the figure above, Xception has better accuracy than Inception-v3 throughout the gradient descent steps.
But if the non-residual version is used for comparison, Xception underperforms Inception-v3. Would it be fairer to compare against a residual version of Inception-v3? In any case, Xception shows that combining depthwise separable convolutions with residual connections really helps to improve the accuracy.
Xception is claimed to have a model size similar to that of Inception-v3.
4.2. JFT — FastEval14k
JFT is an internal Google dataset for large-scale image classification, first introduced by Prof. Hinton et al., which comprises over 350 million high-resolution images annotated with labels from a set of 17,000 classes.
An auxiliary dataset, FastEval14k, is used. FastEval14k is a dataset of 14,000 images with dense annotations from about 6,000 classes (36.5 labels per image on average).
As multiple objects appear densely in a single image, mean average precision (mAP) is used for measurement.
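As a quick illustration of the metric (my own sketch of a common formulation of average precision; the paper's exact MAP@100 variant may differ in its cutoff and normalization):

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 0/1 relevance flags for predictions sorted by confidence
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each hit
    return sum(precisions) / max(hits, 1)

# mAP is then the mean of average_precision over all classes (or queries)
print(average_precision([1, 0, 1]))  # (1/1 + 2/3) / 2 ≈ 0.8333
```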
Again, Xception has higher mAP compared with Inception-v3.
References
- [2017 CVPR] [Xception] Xception: Deep Learning with Depthwise Separable Convolutions
- [2016 CVPR] [Inception-v3] Rethinking the Inception Architecture for Computer Vision
- [2016 CVPR] [ResNet] Deep Residual Learning for Image Recognition
- [2015 ICLR] [VGGNet] Very Deep Convolutional Networks for Large-Scale Image Recognition