Review: DeepLabv3 — Atrous Convolution (Semantic Segmentation)

Rethinking DeepLab, Better Than PSPNet (The Winner of the ILSVRC 2016 Scene Parsing Challenge)

Sik-Ho Tsang
Towards Data Science

--

In this story, DeepLabv3, by Google, is presented. After DeepLabv1 and DeepLabv2, the authors tried to RETHINK or restructure the DeepLab architecture and finally came up with the more enhanced DeepLabv3. DeepLabv3 outperforms DeepLabv1 and DeepLabv2 even with the Conditional Random Field (CRF) post-processing step removed, which was originally used in DeepLabv1 and DeepLabv2.

Hence, the paper is named “Rethinking Atrous Convolution for Semantic Image Segmentation”. It is called “Rethinking …” as a companion to the Inception-v3 paper, “Rethinking the Inception Architecture for Computer Vision”, in which Inception-v1 (GoogLeNet) and Inception-v2 (Batch Norm) were restructured into Inception-v3. Here, DeepLabv2 is restructured into DeepLabv3. It is a 2017 arXiv tech report with more than 200 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Atrous Convolution
  2. Going Deeper with Atrous Convolution Using Multi-Grid
  3. Atrous Spatial Pyramid Pooling (ASPP)
  4. Ablation Study on PASCAL VOC 2012
  5. Comparison with State-of-the-art Approaches on PASCAL VOC 2012
  6. Comparison with State-of-the-art Approaches on Cityscapes

1. Atrous Convolution

Atrous Convolution with Different Rates r
Atrous Convolution
  • For each location i on the output y and a filter w, atrous convolution is applied over the input feature map x as y[i] = Σ_k x[i + r·k]·w[k], where the atrous rate r corresponds to the stride with which we sample the input signal.
  • This is equivalent to convolving the input x with upsampled filters produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension. (“Trous” means holes in French.)
  • When r=1, it is standard convolution.
  • By adjusting r, we can adaptively modify the filter’s field-of-view.
  • It is also called dilated convolution (DilatedNet) or the hole algorithm.
Standard Convolution (Top) Atrous Convolution (Bottom)
  • Top: Standard convolution.
  • Bottom: Atrous convolution. We can see that when rate = 2, the input signal is sampled alternately. First, pad = 2 means we pad 2 zeros at both the left and right sides. Then, with rate = 2, we sample the input signal every 2 positions for convolution. Atrous convolution allows us to enlarge the field-of-view of filters to incorporate larger context. It thus offers an efficient mechanism to control the field-of-view and to find the best trade-off between accurate localization (small field-of-view) and context assimilation (large field-of-view).
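As an illustrative check of the two equivalent views above, here is a minimal PyTorch sketch of my own (not code from the paper): a dilated 1D convolution gives the same result as a standard convolution with a filter that has r−1 zeros inserted between its weights.

```python
import torch
import torch.nn.functional as F

# Atrous/dilated convolution in 1D: sampling the input with stride r
# is equivalent to convolving with an "upsampled" filter that has r-1
# zeros inserted between consecutive filter values.
x = torch.randn(1, 1, 16)   # input signal
w = torch.randn(1, 1, 3)    # 3-tap filter
r = 2                       # atrous rate

# Direct dilated convolution.
y_dilated = F.conv1d(x, w, dilation=r)

# Upsampled filter: insert r-1 zeros between consecutive weights.
w_up = torch.zeros(1, 1, (w.shape[-1] - 1) * r + 1)
w_up[..., ::r] = w
y_upsampled = F.conv1d(x, w_up)

print(torch.allclose(y_dilated, y_upsampled))  # True
```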

2. Going Deeper with Atrous Convolution Using Multi-Grid

  • (a) Without Atrous Conv: Standard convolution and pooling are performed, which makes the output stride increase (i.e., the output feature map becomes smaller) when going deeper. However, consecutive striding is harmful for semantic segmentation because location/spatial information is lost at the deeper layers.
  • (b) With Atrous Conv: With atrous convolution, we can keep the output stride constant while enlarging the field-of-view, without increasing the number of parameters or the amount of computation. Finally, we obtain a larger output feature map, which is beneficial for semantic segmentation.
  • For example, when output stride = 16 and Multi Grid = (1, 2, 4), the three convolutions in block4 will have rates = 2×(1, 2, 4) = (2, 4, 8), respectively.
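How the per-layer rates follow from the unit rates (Multi-Grid) and the desired output stride can be sketched as follows. This is my own illustrative Python snippet (not the official implementation), assuming the backbone's nominal output stride is 32 at block4 and would double with each extra cascaded block if striding were kept:

```python
# Derive the atrous rates of a cascaded block from its Multi-Grid unit rates
# and the desired output stride (illustrative sketch, not the authors' code).
def block_rates(block_index, multi_grid=(1, 2, 4), output_stride=16):
    # Stride the network would have at this block without atrous convolution:
    # block4 -> 32, block5 -> 64, block6 -> 128, block7 -> 256.
    nominal_stride = 32 * 2 ** (block_index - 4)
    base_rate = max(nominal_stride // output_stride, 1)
    return tuple(base_rate * r for r in multi_grid)

print(block_rates(4))                    # (2, 4, 8) = 2 x (1, 2, 4), as in the example
print(block_rates(4, output_stride=8))   # (4, 8, 16): rates double at output stride 8
```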

3. Atrous Spatial Pyramid Pooling (ASPP)

Atrous Spatial Pyramid Pooling (ASPP)
  • ASPP was introduced in DeepLabv2. This time, batch normalization (BN) from Inception-v2 is included in ASPP.
  • One motivation is the observation that, as the sampling rate becomes larger, the number of valid filter weights (i.e., the weights that are applied to the valid feature region instead of to padded zeros) becomes smaller; in the extreme case, a 3×3 filter with a very large rate degenerates to a 1×1 filter, since only the center weight falls on the valid region.
  • ASPP consists of one 1×1 convolution and three 3×3 convolutions with rates = (6, 12, 18) when output stride = 16.
  • Image pooling, or the image-level feature, from ParseNet, is also included for global context. (Please read my ParseNet review if interested.)
  • All with 256 filters and batch normalization.
  • Rates are doubled when output stride = 8.
  • The resulting features from all the branches are then concatenated and passed through another 1×1 convolution (also with 256 filters and batch normalization) before the final 1×1 convolution that generates the final logits.
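Putting the above together, a minimal PyTorch sketch of such an ASPP head might look as follows. This is my own illustration (the authors used TensorFlow); the class name and the input channel count of 2048 (ResNet block4 output) are assumptions, while the structure follows the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """One 1x1 conv, three 3x3 atrous convs with rates (6, 12, 18), and
    image-level pooling; all with 256 filters and batch normalization,
    concatenated and projected by another 1x1 conv."""
    def __init__(self, in_channels=2048, out_channels=256, rates=(6, 12, 18)):
        super().__init__()
        def conv_bn_relu(kernel, rate=1):
            return nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel,
                          padding=0 if kernel == 1 else rate,
                          dilation=rate, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(
            [conv_bn_relu(1)] + [conv_bn_relu(3, r) for r in rates])
        self.image_pool = conv_bn_relu(1)   # applied after global average pooling
        self.project = nn.Sequential(
            nn.Conv2d(out_channels * (len(rates) + 2), out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [branch(x) for branch in self.branches]
        pooled = self.image_pool(F.adaptive_avg_pool2d(x, 1))
        feats.append(F.interpolate(pooled, size=(h, w),
                                   mode='bilinear', align_corners=False))
        return self.project(torch.cat(feats, dim=1))
```

The 256-channel output of this head would then be fed to the final 1×1 classifier that produces the logits.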

Others

Upsampling Logits

  • In DeepLabv2, the target ground-truths are downsampled by 8 during training.
  • In DeepLabv3, it is found to be important to keep the ground-truths intact and instead upsample the final logits.
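A minimal sketch of this idea (my own illustration with made-up tensor shapes, not the authors' code): the low-resolution logits are bilinearly upsampled to the ground-truth resolution and the loss is computed there, instead of downsampling the ground truth by 8 as in DeepLabv2.

```python
import torch
import torch.nn.functional as F

# Upsample the logits to the full-resolution ground truth before the loss.
logits = torch.randn(2, 21, 64, 64)             # e.g., output stride 8 on a 512x512 crop
target = torch.randint(0, 21, (2, 512, 512))    # full-resolution ground-truth labels
logits_up = F.interpolate(logits, size=target.shape[1:],
                          mode='bilinear', align_corners=False)
loss = F.cross_entropy(logits_up, target)
```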

4. Ablation Study on PASCAL VOC 2012

4.1. Output Stride

Going deeper with atrous convolution when employing ResNet-50 with block7 and different output stride.
  • ResNet-50 with block7 (i.e., extra block5, block6, and block7) is employed. As shown in the table, in the case of output stride = 256 (i.e., no atrous convolution at all), the performance is much worse.
  • As the output stride gets smaller (i.e., the output feature map gets larger) and atrous convolution is applied correspondingly, the performance improves from 20.29% to 75.18%, showing that atrous convolution is essential when building more blocks cascadedly for semantic segmentation.

4.2. ResNet-101

ResNet-50 vs ResNet-101
  • ResNet-101 is consistently better than ResNet-50 without any surprise.
  • Noticeably, adding block7 to ResNet-50 slightly decreases the performance, while it still improves the performance for ResNet-101.

4.3. Multi-Grid

Employing multi-grid method for ResNet-101 with different number of cascaded blocks at output stride = 16.
  • Applying multi-grid method is generally better than the vanilla version where (r1, r2, r3) = (1, 1, 1).
  • Simply doubling the unit rates (i.e. (r1, r2, r3) = (2, 2, 2)) is not effective.
  • Going deeper with multi-grid improves the performance.
  • The best model is the case where block7 and (r1, r2, r3) = (1, 2, 1) are employed.

4.4. Inference Strategy

Inference strategy on the val set. MG: Multi-grid. OS: output stride. MS: Multi-scale inputs during test. Flip: Adding left-right flipped inputs.
  • The model is trained with output stride = 16.
  • When using output stride = 8 (OS=8) during inference to get a more detailed feature map, the performance is improved by 1.39%.
  • When using multi-scale (MS) inputs with scales = {0.5, 0.75, 1.0, 1.25, 1.5, 1.75} as well as left-right flipped inputs, and averaging the probabilities, the performance is further improved to 79.35%.
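The multi-scale and flip testing described above can be sketched as follows. This is my own minimal PyTorch illustration (the function name and details are assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def ms_flip_inference(model, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Run the model at several scales and on left-right flipped inputs,
    rescale the probabilities to the original size, and average them."""
    _, _, h, w = image.shape
    prob_sum, n = 0.0, 0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s,
                               mode='bilinear', align_corners=False)
        for flip in (False, True):
            inp = torch.flip(scaled, dims=[3]) if flip else scaled
            logits = model(inp)
            if flip:
                logits = torch.flip(logits, dims=[3])  # flip the prediction back
            prob = F.interpolate(logits, size=(h, w), mode='bilinear',
                                 align_corners=False).softmax(dim=1)
            prob_sum, n = prob_sum + prob, n + 1
    return prob_sum / n   # averaged class probabilities
```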

4.5. ASPP

ASPP with MG method and image-level features at output stride = 16.
  • Adopting Multi Grid = (1, 2, 4) in the context of ASPP = (6, 12, 18) is better than Multi Grid = (1, 1, 1) and (1, 2, 1).
  • Using ASPP = (6, 12, 18) is better than ASPP = (6, 12, 18, 24).
  • With image-level feature, the performance is further improved to 77.21%.

4.6. Crop Size, Upsampling Logits, Batch Norm, Batch Size, Train & Test Output Stride

  • Using a larger crop size of 513 is better than 321.
  • With upsampling logits and batch normalization, 77.21% is obtained.
  • Using a batch size of 16 is the best among 4, 8, 12, and 16.
  • Using train & test output stride = (8, 8) gives 77.21%, while using train & test output stride = (16, 8) gives a better result of 78.51%.

4.7. Number of Replicas During Training

Number of Replicas During Training
  • TensorFlow is used for training.
  • Using only 1 replica, 3.65 days of training time is needed.
  • Using 32 replicas, only 2.74 hours of training time is needed.

4.8. All Together

Inference strategy on the val set
  • MG(1, 2, 4) + ASPP(6, 12, 18) + Image Pooling: 77.21% is obtained, which is the same result as in Section 4.5.
  • Inference output stride = 8, 78.51%.
  • Multi-Scale (MS) Testing: 79.45%.
  • Horizontal Flip (Flip): 79.77%.
  • With pretrained using COCO as well: 82.70%.
  • It is noted that after rethinking and restructuring, and without the post-processing CRF (used in DeepLabv2), DeepLabv3 is already better than DeepLabv2 with CRF and COCO pre-training, which achieved 77.69%.

5. Comparison with State-of-the-art Approaches on PASCAL VOC 2012

5.1. PASCAL VOC 2012 Test Set

PASCAL VOC 2012 Test Set
  • DeepLabv3: Further fine-tuned on the PASCAL VOC 2012 trainval set, trained with output stride = 8, with bootstrapping on hard images (in particular, images that contain hard classes are duplicated): 85.7%.
Effect of Bootstrapping
  • The above shows that bootstrapping on hard images improves segmentation accuracy for rare and finely annotated classes such as bicycle.
  • DeepLabv3 outperforms PSPNet, which obtained first place in the ILSVRC 2016 Scene Parsing Challenge.
  • DeepLabv3-JFT: Employing ResNet-101 pretrained on both ImageNet and the JFT-300M dataset: 86.9%.
Qualitative Results (Last Row, Failure Case) on PASCAL VOC 2012

6. Comparison with State-of-the-art Approaches on Cityscapes

6.1. Different Settings

  • Similar to PASCAL VOC 2012, using output stride = 8 for testing, together with multi-scale inputs and horizontal flips, improves the performance.

6.2. Cityscapes Test Set

Cityscapes Test Set
  • In order to obtain better performance for comparison, DeepLabv3 is further trained on the trainval coarse set (i.e., the 3475 finely annotated images and the extra 20000 coarsely annotated images).
  • More scales and a finer output stride during inference are used. In particular, scales = {0.75, 1, 1.25, 1.5, 1.75, 2} and evaluation output stride = 4 contribute an extra 0.8% and 0.1%, respectively, on the validation set.
  • Finally, 81.3% mIOU is achieved on the test set, which is slightly better than PSPNet.
Qualitative Results on Cityscapes

DeepLabv3 outperforms PSPNet by only a very small margin; maybe this is also why it is just an arXiv tech report. But later on, DeepLabv3+ was introduced, which is much better than DeepLabv3. I hope I can review DeepLabv3+ later on. :)
