Review: DRN — Dilated Residual Networks (Image Classification & Semantic Segmentation)

Using Dilated Convolution to Improve ResNet for Image Classification, Object Localization & Semantic Segmentation

Sik-Ho Tsang
Towards Data Science


In this story, DRN (Dilated Residual Networks), from Princeton University and Intel Labs, is reviewed. After publishing DilatedNet in 2016 ICLR for semantic segmentation, the authors invented DRN, which improves not only semantic segmentation but also image classification, without increasing the model’s depth or complexity. It was published in 2017 CVPR with over 100 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Dilated Convolution
  2. Reasons for Dilated Convolution
  3. Dilated Residual Networks (DRN)
  4. Localization
  5. Degridding
  6. Results

1. Dilated Convolution

  • For simplicity, I just quote the equations from DilatedNet:

Standard convolution: $(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s}) \, k(\mathbf{t})$

Dilated convolution: $(F *_{l} k)(\mathbf{p}) = \sum_{\mathbf{s} + l\mathbf{t} = \mathbf{p}} F(\mathbf{s}) \, k(\mathbf{t})$

  • The first is the standard convolution and the second is the dilated convolution. In the dilated case, the summation constraint is $\mathbf{s} + l\mathbf{t} = \mathbf{p}$, so the kernel taps are spaced $l$ apart and some points are skipped during convolution.
  • When l=1, it is standard convolution.
  • When l>1, it is dilated convolution.
Standard Convolution (l=1) (Left) Dilated Convolution (l=2) (Right)
  • The above illustrates an example of dilated convolution with l=2. We can see that the receptive field is larger compared with the standard convolution.
l=1 (left), l=2 (Middle), l=4 (Right)
  • The above figure shows more examples of how the receptive field grows with the dilation factor.
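
To make this concrete, here is a minimal PyTorch sketch (my own illustration, not from the paper): a dilated 3×3 convolution covers a larger neighborhood with the same nine weights and, with matching padding, the same output resolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)  # dummy single-channel feature map

# Standard 3x3 convolution (l=1): each output sees a 3x3 neighborhood.
conv_l1 = nn.Conv2d(1, 1, kernel_size=3, dilation=1, padding=1)

# Dilated 3x3 convolution (l=2): the kernel taps are spaced 2 apart,
# so each output sees a 5x5 neighborhood with the same 9 weights.
conv_l2 = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)

print(conv_l1(x).shape)  # torch.Size([1, 1, 32, 32])
print(conv_l2(x).shape)  # torch.Size([1, 1, 32, 32]) -- same resolution, larger receptive field
```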

2. Reasons for Dilated Convolution

  • It is found that the small output feature maps obtained at the end of the network reduce accuracy in semantic segmentation.
  • The FCN paper also shows that when 32× upsampling is needed, we can only get very rough segmentation results. Thus, a larger output feature map is desired.
  • A naive approach is to simply remove the subsampling (striding) steps in the network in order to increase the resolution of the feature maps. However, this also reduces the receptive field, which severely reduces the amount of context. Such a reduction in receptive field is an unacceptable price to pay for higher resolution.
  • For this reason, dilated convolutions are used to increase the receptive field of the higher layers, compensating for the reduction in receptive field induced by removing subsampling.
  • And this paper finds that using dilated convolution also helps the image classification task.
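
To see this trade-off numerically, here is a small helper (my own sketch, not from the paper) based on the standard receptive-field recurrence: removing a stride shrinks the receptive field of all the layers above it, and dilating those layers restores it.

```python
# Receptive field of a stack of conv layers, using the recurrence
#   r <- r + (k_eff - 1) * jump,   jump <- jump * stride,
# where the effective kernel size is k_eff = dilation * (k - 1) + 1.
def receptive_field(layers):
    r, jump = 1, 1
    for k, stride, dilation in layers:
        k_eff = dilation * (k - 1) + 1
        r += (k_eff - 1) * jump
        jump *= stride
    return r

# Two 3x3 convs, the first with stride 2 (subsampling kept):
print(receptive_field([(3, 2, 1), (3, 1, 1)]))  # 7

# Subsampling removed, no dilation: the receptive field shrinks.
print(receptive_field([(3, 1, 1), (3, 1, 1)]))  # 5

# Subsampling removed, second conv dilated by 2: receptive field restored.
print(receptive_field([(3, 1, 1), (3, 1, 2)]))  # 7
```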

3. Dilated Residual Networks (DRN)

  • In the paper, d is used as the dilation factor.
  • When d=1, it is standard convolution.
  • When d>1, it is dilated convolution.

Original ResNet

  • In the original ResNet, the final two groups of convolutional layers, G4 and G5, use 3×3 standard convolutions (d=1):
  • The feature maps keep getting smaller due to the striding (subsampling) at the start of each group.
  • The output feature map has a size of only 7×7, which is undesirable for the reasons mentioned in the previous section.

DRN

  • In DRN, at G4, d=2 is used:
  • At G5, for the first convolution (i=1), d=2 is still used:
  • At G5, for the remaining convolutions (i>1), d=4 is used:
  • Finally, the output of G5 in DRN is 28×28 (for a 224×224 input), which is much larger than the 7×7 output of the original ResNet.
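
This transformation can be sketched in PyTorch. As a convenience (not the authors’ code; their official implementation is at https://github.com/fyu/drn), torchvision’s replace_stride_with_dilation option performs exactly this change: it removes the striding in G4 and G5 and dilates them by 2 and 4, with the first block of G5 using d=2 and the remaining blocks d=4, as described above.

```python
import torch
from torchvision.models import resnet50

# DRN-A-style backbone sketch: keep G4 and G5 at stride 1 and dilate
# them by 2 and 4 instead of subsampling.
model = resnet50(replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 224, 224)
backbone = torch.nn.Sequential(*list(model.children())[:-2])  # drop avgpool & fc
print(backbone(x).shape)  # torch.Size([1, 2048, 28, 28]) -- 28x28 instead of 7x7
```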

4. Localization

  • For the image classification task, the network ends with global average pooling, followed by a 1×1 convolution and a softmax.
  • To configure the network for localization, the global average pooling is simply removed. No training or parameter tuning is involved: an accurate classification DRN can be used for localization directly.
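
As a minimal sketch (my own, assuming a 28×28, 2048-channel DRN backbone output as in the previous section): the same 1×1 classifier produces one score per class when applied after pooling, and C full response maps when the pooling is dropped.

```python
import torch
import torch.nn as nn

C, channels = 1000, 2048
features = torch.randn(1, channels, 28, 28)      # backbone output
classifier = nn.Conv2d(channels, C, kernel_size=1)

# Classification: pool first, then classify -> one score per class.
scores = classifier(features.mean(dim=(2, 3), keepdim=True)).flatten(1)

# Localization: classify every location -> C response maps of size 28x28.
response_maps = classifier(features)
print(scores.shape, response_maps.shape)  # [1, 1000] and [1, 1000, 28, 28]
```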

5. Degridding

A Gridding Artifact
  • Gridding artifacts occur when a feature map has higher-frequency content than the sampling rate of the dilated convolution, as shown above.
DRN-A (Top) DRN-B (Middle) DRN-C (Bottom)
  • DRN-A: the direct transformation with dilated convolutions only; it exhibits gridding artifacts.
  • DRN-B: it is found that the first max pooling operation leads to high-amplitude high-frequency activations. Thus, the first max pooling layer is replaced by 2 residual blocks (four 3×3 convolution layers) to reduce the gridding artifacts. 2 more residual blocks are also added at the end of the network.
  • DRN-C: at the end of the network, the dilation is progressively lowered to remove the aliasing artifacts, i.e. a 2-dilated convolution followed by a 1-dilated convolution. However, the artifacts can still pass through the residual connections, so the corresponding residual connections are removed (see the sketch below).
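
Here is a minimal sketch of such a degridding tail (my reading of the paper, with hypothetical channel widths, not the official implementation): dilation is stepped down after the 4-dilated stage, and the blocks deliberately have no residual connections, so artifacts cannot bypass them.

```python
import torch.nn as nn

# A plain (non-residual) conv block; without a skip connection, gridding
# artifacts cannot be passed through unchanged.
def conv_block(cin, cout, dilation):
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=dilation,
                  dilation=dilation, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

# DRN-C-style tail after the 4-dilated stage: step the dilation down
# (2, then 1) to progressively remove the aliasing artifacts.
degrid = nn.Sequential(
    conv_block(512, 512, dilation=2),
    conv_block(512, 512, dilation=1),
)
```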
Activation Maps of ResNet-18 and Corresponding DRNs
  • The above shows a visualization.
  • DRN-A-18: with dilated convolution, there are gridding artifacts.
  • DRN-B-26: with convolutions replacing max pooling, the feature map has fewer artifacts.
  • DRN-C-26: with progressively smaller dilations and removed residual connections, the artifacts are further reduced.
Feature Map Visualization at Different Levels in DRN-C-26 (The highest average activation at each level is shown)

6. Results

6.1. Image Classification on ImageNet

Top-1 & Top-5 Error Rates on ImageNet Validation Set
  • DRN-A-18 and DRN-A-34 outperform ResNet-18 and ResNet-34 in 1-crop top-1 accuracy by 2.43 and 2.92 percentage points, respectively (a 10.5% relative error reduction in the case of ResNet-34 to DRN-A-34).
  • DRN-A-50 outperforms ResNet-50 in 1-crop top-1 accuracy by more than a percentage point.
  • The direct transformation of a ResNet into a DRN-A, which does not change the depth or capacity of the model at all, significantly improves classification accuracy.
  • Each DRN-C significantly outperforms the corresponding DRN-A.
  • DRN-C-26, which is derived from DRN-A-18, matches the accuracy of the deeper DRN-A-34.
  • DRN-C-42, which is derived from DRN-A-34, matches the accuracy of the deeper DRN-A-50.
  • DRN-C-42 approaches the accuracy of ResNet-101, although the latter is deeper by a factor of 2.4.

6.2. Object Localization on ImageNet

  • Here, a weakly-supervised object localization is performed based on the feature map activation values.
  • C = 1000, since it is the 1000-class ImageNet dataset.
  • With C response maps of resolution W×H, let f(c, w, h) be the response of class c at location (w, h). The dominant class at each location is g(w, h) = argmax_c f(c, w, h). For each class c_i, B_i is the set of bounding boxes in which every enclosed location (w, h) satisfies g(w, h) = c_i and f(c_i, w, h) ≥ t, where t is an activation threshold. The minimal bounding box b_i, i.e. the one with the smallest area, is chosen from B_i (a simplified sketch follows below).
  • A predicted box is considered accurate if its IoU with the ground-truth box is larger than 0.5.
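
A simplified sketch of this rule (my own, in PyTorch): instead of searching the full set B_i, it takes the tight box around all thresholded locations whose dominant class is the requested one, which conveys the idea.

```python
import torch

def localize(response_maps, class_idx, t=0.25):
    """Tight box around thresholded locations dominated by class_idx."""
    f = response_maps[0]                      # (C, W, H) response maps
    g = f.argmax(dim=0)                       # dominant class at each location
    mask = (g == class_idx) & (f[class_idx] >= t)
    if not mask.any():
        return None                           # class not localized anywhere
    ws, hs = torch.nonzero(mask, as_tuple=True)
    return (ws.min().item(), hs.min().item(), ws.max().item(), hs.max().item())

# Usage with the response maps from the previous section:
# box = localize(response_maps, class_idx=281)   # (w1, h1, w2, h2) or None
```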
Top-1 & Top-5 Localization Error Rates on ImageNet Validation Set
  • DRNs outperform the corresponding ResNet models, which illustrates the benefits of the basic DRN construction.
  • DRN-C-26 significantly outperforms DRN-A-50, despite having much lower depth. This shows that the degridding scheme brings particularly significant benefits for applications that require more detailed spatial image analysis.
  • DRN-C-26 also outperforms ResNet-101.

6.3. Semantic Segmentation on Cityscapes

  • The ResNet-101 baseline achieves 66.6% mean IoU.
  • DRN-C-26 outperforms the ResNet-101 baseline by more than a percentage point, despite having 4 times lower depth.
  • The DRN-C-42 model outperforms the ResNet-101 baseline by more than 4 percentage points, despite 2.4 times lower depth.
  • Both DRN-C-26 and DRN-C-42 outperform DRN-A-50, suggesting that the degridding construction is particularly beneficial for dense prediction tasks.
Cityscapes Dataset
  • As shown above, the predictions of DRN-A-50 are marred by gridding artifacts even though the model was trained with dense pixel-level supervision.
  • In contrast, the predictions of DRN-C-26 are not only more accurate, but also visibly cleaner.

6.4. More Results Using DRN-D

  • There is also a DRN-D in the authors’ GitHub, which is a simplified version of DRN-C.
Classification error rate on ImageNet validation set and number of parameters

All DRNs can also obtain lower error rates while having fewer parameters (smaller models).

Segmentation mIoU and number of parameters (*trained with poly learning rate, random scaling and rotations)
  • DRN-D-22, with fewer parameters, achieves 68% mIoU, which is the same as that of DRN-C-26 and higher than that of DRN-A-50.

Rather than progressively reducing the resolution of internal representations until the spatial structure of the scene is no longer discernible, DRN keeps high spatial resolution all the way through the final output layers. This improves image classification accuracy, and DRN ultimately outperforms the state-of-the-art ResNet.
