Review: DilatedNet — Dilated Convolution (Semantic Segmentation)

a.k.a. “atrous convolution”, “algorithme à trous” and “hole algorithm”

Sik-Ho Tsang
Towards Data Science

--

This time, Dilated Convolution, from Princeton University and Intel Labs, is briefly reviewed. The idea of dilated convolution comes from wavelet decomposition, where it is also known as “atrous convolution”, “algorithme à trous”, or the “hole algorithm”. This shows that ideas from the past can still be useful when adapted to the deep learning framework.

This dilated convolution approach was published at 2016 ICLR and had more than 1000 citations when I was writing this story. (Sik-Ho Tsang @ Medium)

What Is Covered

  1. Dilated Convolution
  2. Multi-Scale Context Aggregation (The Context Module)
  3. Results

1. Dilated Convolution

Standard Convolution (Left), Dilated Convolution (Right)

The left one is the standard convolution; the right one is the dilated convolution. In the dilated case, the summation runs over s + lt = p, so the filter samples the input with gaps of size l, i.e., some points are skipped during convolution.

When l=1, it is standard convolution.

When l>1, it is dilated convolution.
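The definition above can be made concrete with a small sketch in 1-D (not code from the paper; a minimal NumPy implementation of the summation over s + lt = p):

```python
import numpy as np

def dilated_conv1d(F, k, l):
    """1-D dilated convolution: (F *_l k)(p) = sum over s + l*t = p of F(s) * k(t).

    F : input signal (1-D array)
    k : filter (1-D array)
    l : dilation factor (l = 1 reduces to standard convolution)
    """
    n, m = len(F), len(k)
    out_len = n - l * (m - 1)          # "valid" output length
    out = np.zeros(out_len)
    for i in range(out_len):
        p = i + l * (m - 1)            # output position in input coordinates
        # s + l*t = p  =>  s = p - l*t : the input is sampled every l positions
        out[i] = sum(F[p - l * t] * k[t] for t in range(m))
    return out
```

With l=1 this matches NumPy's standard `np.convolve(..., mode='valid')`; with l>1 the same filter covers a wider span of the input without extra parameters.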

Standard Convolution (l=1)
Dilated Convolution (l=2)

The above illustrates an example of dilated convolution with l=2. We can see that the receptive field is larger than that of the standard convolution.

l=1 (left), l=2 (Middle), l=4 (Right)

The above figure shows more examples of how the receptive field grows: stacking layers with dilations 1, 2, and 4 yields receptive fields of 3×3, 7×7, and 15×15, i.e., the receptive field expands exponentially while the number of parameters grows only linearly.
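As a quick check of those numbers, the receptive field of a stack of stride-1 dilated convolutions can be computed incrementally (a small sketch, not from the paper):

```python
def receptive_field(dilations, kernel=3):
    """Receptive field (one side) of stacked dilated convolutions, stride 1.
    Each layer extends the field by (kernel - 1) * dilation."""
    rf = 1
    for l in dilations:
        rf += (kernel - 1) * l
    return rf

# Dilations 1, 2, 4 give 3x3, 7x7, 15x15 receptive fields, matching the figure.
print(receptive_field([1]), receptive_field([1, 2]), receptive_field([1, 2, 4]))  # 3 7 15
```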

2. Multi-Scale Context Aggregation (The Context Module)

A context module is constructed based on the dilated convolution as below:

The Basic Context Module and The Large Context Module

The context module has 7 layers that apply 3×3 convolutions with different dilation factors: 1, 1, 2, 4, 8, 16, and 1.

The final layer is a 1×1 convolution that maps the number of channels back to that of the input. Therefore, the input and the output have the same number of channels, and the module can be inserted into different kinds of convolutional neural networks.

The basic context module keeps a single channel count (1C) throughout, while the large context module increases the number of channels from 1C at the input to 32C at the 7th layer.
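A minimal PyTorch sketch of the basic context module as described above (the ReLU placement and exact layer details here are assumptions based on the description; see the paper for specifics):

```python
import torch
import torch.nn as nn

class BasicContextModule(nn.Module):
    """Sketch of the basic context module: seven 3x3 dilated convolutions
    (dilations 1, 1, 2, 4, 8, 16, 1) followed by a 1x1 convolution that
    maps back to the input channel count, so the module can be dropped
    into an existing network without changing tensor shapes."""
    def __init__(self, channels):
        super().__init__()
        layers = []
        for d in [1, 1, 2, 4, 8, 16, 1]:
            # padding = dilation keeps the spatial size unchanged for 3x3 kernels
            layers += [nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(channels, channels, 1))  # final 1x1 channel mapping
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

Because every layer preserves both the channel count and the spatial size, the output shape equals the input shape, which is what makes the module pluggable.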

3. Results

3.1. PASCAL VOC 2012

VGG-16 is used as the front-end module. The last two pooling and striding layers are removed entirely, and the context module is plugged in. The padding of the intermediate feature maps is also removed; the authors only pad the input feature maps by a width of 33. Zero padding and reflection padding yielded similar results in their experiments. Also, a weight initialization that accounts for the number of input and output channels is used instead of standard random initialization.
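The initialization mentioned above sets the filters so that the module initially passes information through unchanged. A rough NumPy sketch of such an identity initialization (the scheme in the paper generalizes this to differing input/output channel counts; this simplified version assumes they match):

```python
import numpy as np

def identity_init(out_ch, in_ch, k=3):
    """Identity initialization for a k x k conv layer (a sketch):
    only the center tap of each matching channel pair is 1, so the
    convolution initially copies each input channel through unchanged."""
    w = np.zeros((out_ch, in_ch, k, k))
    c = k // 2
    for i in range(min(out_ch, in_ch)):
        w[i, i, c, c] = 1.0
    return w
```

Starting from an identity mapping lets the added context layers learn residual refinements instead of disrupting the pretrained front end at the start of training.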

PASCAL VOC 2012 Test Set

Comparing against the public models trained by the original authors, Dilated Convolution outperforms both FCN-8s and DeepLabv1 by about 5 percentage points on the test set.

67.6% mean IoU can be obtained.

PASCAL VOC 2012 Validation Set

By training on additional images from the Microsoft COCO dataset as well, an ablation study of Dilated Convolution itself is performed, as shown above.

  • Front end: Front end module
  • Basic: Basic context module
  • Large: Large context module
  • CRF: A post processing step using conditional random field in DeepLabv1 and DeepLabv2
  • RNN: A post processing step using conditional random field via recurrent neural network

We can see that Dilated Convolution (Basic or Large) consistently improves the results, and its gains are complementary to (do not overlap with) the post-processing steps.

And 73.9% mean IoU can be obtained.

PASCAL VOC 2012 Test Set

The front-end module in the above table is also trained on additional images from the Microsoft COCO dataset. With CRF-RNN (i.e., RNN in the previous table), 75.3% mean IoU is obtained.

3.2. Qualitative Results

PASCAL VOC 2012

All models use VGG-16 for feature extraction; the one using dilated convolution produces better-quality segmentation results.

PASCAL VOC 2012

With CRF-RNN as a post-processing step, slightly better results are obtained. But CRF-RNN means the process is no longer end-to-end learning.

Failure Cases

Some failure cases are shown above: when the object is occluded, the segmentation goes wrong.

Different datasets are evaluated in the appendix, i.e., CamVid, KITTI, and Cityscapes; please feel free to read the paper. The authors also published Dilated Residual Networks, which I hope to cover in the future. :)

References

[2016 ICLR] [Dilated Convolutions]
Multi-Scale Context Aggregation by Dilated Convolutions

My Related Reviews

[VGGNet] [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2]
