Review: GCN — Global Convolutional Network, Large Kernel Matters (Semantic Segmentation)

Outperforms FCN-8s, CRF-RNN, DilatedNet, and DeepLabv1 & DeepLabv2

Sik-Ho Tsang
Towards Data Science

Large Kernel Matters

In this paper, Global Convolutional Network (GCN), by Tsinghua University and Megvii Inc. (Face++), is reviewed. Conventionally, as in VGGNet, stacks of small 3×3 kernels are used to obtain a large effective receptive field. However, the authors find that large kernels play an important role as well. In the figure above: in A, the receptive field is large enough to cover the bird for segmentation; when the image is zoomed in as in B, the receptive field is no longer large enough; in C, with the proposed GCN, the receptive field is enlarged. Finally:

  • GCN is proposed to address both the classification and localization issues for semantic segmentation.
  • Boundary refinement (BR) is proposed as well to further refine the object boundaries.

It was published in 2017 CVPR with more than 100 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. The Contradiction Between Classification and Localization
  2. Global Convolutional Network (GCN) & Boundary Refinement (BR)
  3. Ablation Study
  4. Comparison with State-of-the-art Approaches

1. The Contradiction Between Classification and Localization

Classification (Left), Segmentation/Localization (Middle), GCN (Right)
  • For the classification task, the models are required to be invariant to various transformations like translation and rotation.
  • But for the localization task, models should be transformation-sensitive, i.e., precisely locate every pixel for each semantic category.

Global Convolutional Network (GCN)

  • To deal with these two challenges simultaneously, the authors follow two design principles:
  • 1) From the localization view, the model structure should be fully convolutional to retain the localization performance, and no fully-connected or global pooling layers should be used, as these layers discard the localization information;
  • 2) From the classification view, a large kernel size should be adopted in the network architecture to enable dense connections between feature maps and per-pixel classifiers, which enhances the capability to handle different transformations.

2. Global Convolutional Network (GCN) & Boundary Refinement (BR)

Overall Architecture (GCN), GCN Module (Top Right), and BR module (Bottom Right)
  • As shown above, ResNet is used as the backbone. In particular, ResNet-152 pretrained on ImageNet is used for the state-of-the-art comparison.
  • The GCN module is inserted as shown above, followed by a BR module.
  • Score maps of lower resolution are upsampled with a deconvolution layer, then added to the higher-resolution ones to generate new score maps.
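A minimal PyTorch sketch of this top-down score-map fusion (the deconvolution hyper-parameters and tensor shapes below are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

# A lower-resolution score map (e.g. 16×16) is upsampled by a stride-2
# deconvolution and added to the score map of the next stage (e.g. 32×32).
# num_classes = 21 for PASCAL VOC (20 object classes + background).
num_classes = 21
deconv = nn.ConvTranspose2d(num_classes, num_classes,
                            kernel_size=4, stride=2, padding=1)

low_res_scores = torch.randn(1, num_classes, 16, 16)   # from a deeper stage
high_res_scores = torch.randn(1, num_classes, 32, 32)  # from a shallower stage

new_scores = deconv(low_res_scores) + high_res_scores  # fused score map
print(new_scores.shape)  # torch.Size([1, 21, 32, 32])
```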

2.1. Global Convolutional Network (GCN) Module

  • As shown at the top right of the figure, instead of directly using a larger kernel or a global convolution, the GCN module employs a combination of 1×k+k×1 and k×1+1×k convolutions, which enables dense connections within a large k×k region of the feature map.
  • Different from the asymmetric kernels used in Inception-v3, there is no nonlinearity after the convolution layers.
  • Compared with a trivial k×k convolution, the GCN structure involves only O(2/k) of the computation cost and number of parameters (a 1×k + k×1 pair has 2k weights per channel pair versus k² for a full k×k kernel), which is more practical for large kernel sizes.
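A minimal PyTorch sketch of the GCN module as described above (the channel counts are illustrative; in the paper the module outputs one channel per class):

```python
import torch.nn as nn

class GCNModule(nn.Module):
    """Two parallel branches of separable convolutions (1×k + k×1 and
    k×1 + 1×k) whose outputs are summed, giving dense connections
    within a k×k region. No nonlinearity follows the convolutions."""
    def __init__(self, in_channels, out_channels, k=15):
        super().__init__()
        pad = k // 2  # keeps the spatial size for odd k
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, (1, k), padding=(0, pad)),
            nn.Conv2d(out_channels, out_channels, (k, 1), padding=(pad, 0)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, (k, 1), padding=(pad, 0)),
            nn.Conv2d(out_channels, out_channels, (1, k), padding=(0, pad)),
        )

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)
```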

2.2. Boundary Refinement (BR) Module

  • As shown at the bottom right of the figure, boundary alignment is modelled as a residual structure, Ŝ = S + R(S), where S is the coarse score map and R(·) is the residual branch.
  • It can be treated as an additional residual block designed by the authors, used after each GCN module and during the deconvolution process.
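A matching sketch of the BR module (the conv–ReLU–conv residual branch follows the paper's figure; the 3×3 kernel size is an assumption here):

```python
import torch.nn as nn

class BRModule(nn.Module):
    """Boundary Refinement: Ŝ = S + R(S), where R is a small
    conv–ReLU–conv branch applied to the score map S."""
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, s):
        return s + self.residual(s)
```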

3. Ablation Study

  • PASCAL VOC 2012 has 1,464 images for training, 1,449 images for validation, and 1,456 images for testing, covering 20 object classes plus one background class.
  • The Semantic Boundaries Dataset (SBD) is also used as an auxiliary dataset, resulting in 10,582 images for training.
  • The PASCAL VOC 2012 validation set is used for evaluation.
GCN (Leftmost), 1×1 Conv (2nd Left), Trivial k×k Conv (2nd Right), Stacks of 3×3 Conv (Rightmost)

3.1. Large Kernel Matters

Different k values for GCN on PASCAL VOC 2012 validation set
  • Base: A naive baseline using simple 1×1 Conv.
  • With k = 15, the kernel roughly covers the whole 16×16 feature map, i.e., it approaches a global convolution.
  • The performance consistently increases with the kernel size k.
  • In particular, GCN (k = 15) surpasses the smallest kernel (the baseline) by a significant margin of 5.5%.

3.2. Are more parameters helpful?

GCN vs Trivial k×k Conv on PASCAL VOC 2012 validation set
  • For the trivial k×k Conv, a larger kernel results in better performance when k ≤ 5, yet for k ≥ 7 the performance drops.
  • One hypothesis is that too many parameters make the training suffer from overfitting, which weakens the benefits of larger kernels.
  • The authors also found that trivial large kernels in fact make the network difficult to converge, while the proposed GCN does not have this problem.

3.3. GCN vs Stacks of Small 3×3 Convolutions

GCN vs Stacks of Small 3×3 Convolutions on PASCAL VOC 2012 validation set
  • Here, the non-linearity is removed from the stacks of small 3×3 convolutions in order to have a fair comparison with GCN.
  • Again, stacks of small 3×3 convolutions bring many more parameters than GCN, which results in overfitting as the receptive field is increased.
Different Number of Feature Maps (m) on PASCAL VOC 2012 validation set
  • Different numbers of feature maps (m) are also tested in order to reduce the number of parameters of the stacked 3×3 convolutions.
  • However, its performance degrades as the number of parameters is reduced.
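To make the parameter comparison concrete, here is a back-of-the-envelope weight count for the three designs (the channel counts are illustrative assumptions, e.g. 2048-channel ResNet stage-5 features mapped to 21 classes):

```python
def params_trivial(cin, cout, k):
    # One k×k convolution, ignoring biases.
    return k * k * cin * cout

def params_gcn(cin, cout, k):
    # Two branches, each a 1×k conv (cin -> cout) then a k×1 conv (cout -> cout).
    return 2 * (k * cin * cout + k * cout * cout)

def params_stacked_3x3(cin, cout, k, m):
    # (k - 1) // 2 stacked 3×3 convs reach a k×k receptive field,
    # with m intermediate feature maps.
    n = (k - 1) // 2
    return 9 * cin * m + (n - 2) * 9 * m * m + 9 * m * cout

cin, cout, k = 2048, 21, 15
print(f"trivial {k}x{k}:  {params_trivial(cin, cout, k):>12,}")  # ~9.7M
print(f"GCN (k={k}):      {params_gcn(cin, cout, k):>12,}")      # ~1.3M
for m in (2048, 210, 21):
    print(f"stacked 3x3 (m={m}): {params_stacked_3x3(cin, cout, k, m):>12,}")
# With m = 2048 the stacked design has ~227M weights, far more than GCN;
# shrinking m cuts the parameters but, per the table above, hurts accuracy.
```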

3.4. How GCN & BR contribute to the segmentation results?

GCN & BR on PASCAL VOC 2012 validation set
  • Pixels lying at the center of a large object may benefit more from GCN because, for such pixels, segmentation is very close to a "pure" classification problem.
  • As for the boundary pixels of objects, however, the performance is mainly affected by the localization ability.
  • To verify this inference, the segmentation score map is divided into two parts: a) the boundary region, whose pixels lie close to an object boundary (distance ≤ 7), and b) the internal region, containing the remaining pixels.
  • As shown above, BR mainly improves the accuracy in boundary regions while GCN helps to improve the accuracy in internal regions.
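A minimal NumPy/SciPy sketch of how such a boundary/internal split can be computed from a label map (the paper does not specify the distance metric, so Euclidean distance is assumed here):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_internal_masks(label_map, max_dist=7):
    """Split pixels into boundary / internal regions: a pixel is
    'boundary' if its distance to the nearest label change <= max_dist."""
    # Mark pixels adjacent to a label change (4-neighbourhood).
    edges = np.zeros(label_map.shape, dtype=bool)
    edges[:-1, :] |= label_map[:-1, :] != label_map[1:, :]
    edges[1:, :] |= label_map[:-1, :] != label_map[1:, :]
    edges[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]
    edges[:, 1:] |= label_map[:, :-1] != label_map[:, 1:]
    # Euclidean distance from every pixel to the nearest edge pixel.
    dist = distance_transform_edt(~edges)
    boundary = dist <= max_dist
    return boundary, ~boundary
```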

3.5. GCN vs ResNet

Original ResNet Bottleneck Module (Left), and ResNet-GCN Module (Right)
Detailed Architecture of ResNet50 and ResNet50-GCN
Original ResNet vs ResNet-GCN on PASCAL VOC 2012 validation set
  • One might wonder why not replace the original ResNet block (left) in the backbone with the GCN block (right) to improve accuracy. The authors studied the two structures using ResNet-50.
  • The GCN-based ResNet is slightly poorer than the original ResNet as an ImageNet classification model.
  • With GCN and BR added, the gain from the ResNet-GCN backbone is still minor.

4. Comparison with State-of-the-art Approaches

4.1. PASCAL VOC 2012

  • The MS COCO pretrained model is used.
  • The training phase is split into three stages:
  • (1) In Stage-1, all the images from COCO, SBD, and the standard PASCAL VOC 2012 are used, resulting in 109,892 training images.
  • (2) During Stage-2, only the SBD and standard PASCAL VOC 2012 images are used, the same as in the previous section.
  • (3) For Stage-3, only the standard PASCAL VOC 2012 dataset is used.
PASCAL VOC 2012 validation set
  • At Stage-3, with GCN + BR, 80.3% mean IoU is obtained.
  • With Multi-Scale (MS) inputs and a Conditional Random Field (CRF), 81.0% mean IoU is obtained.
PASCAL VOC 2012 test set
Qualitative Results

4.2. Cityscapes

  • It contains 24,998 images from 50 cities under different conditions, covering 30 classes with no background class.
  • The images are split into two sets according to their labeling quality: 5,000 of them are finely annotated while the other 19,998 are coarsely annotated. The 5,000 finely annotated images are further grouped into 2,975 training images, 500 validation images, and 1,525 testing images.
  • The images in Cityscapes have a fixed size of 1024×2048, which is too large for the network architecture, so they are randomly cropped to 800×800 during the training phase. The k of GCN is also increased from 15 to 25, as the final feature map is 25×25.
  • The training phase is split into two stages:
  • (1) In Stage-1, the coarse annotated images and the training set are mixed, resulting in 22,973 images.
  • (2) For Stage-2, the network is fine-tuned only on the training set.
  • During the evaluation phase, the images are split into four 1024×1024 crops and their score maps are fused, as sketched below.
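A sketch of this crop-and-fuse evaluation (the exact crop positions are not stated, so evenly spaced overlapping crops and score averaging are assumed):

```python
import numpy as np

def fuse_crop_scores(score_fn, image, crop=1024, n_crops=4):
    """Evaluate a 1024x2048 image as four overlapping 1024x1024 crops
    and average their score maps.
    score_fn: maps an H×crop×3 crop to an H×crop×C score map."""
    h, w = image.shape[:2]
    offsets = np.linspace(0, w - crop, n_crops).round().astype(int)
    scores, counts = None, np.zeros((h, w, 1))
    for x in offsets:
        s = score_fn(image[:, x:x + crop])
        if scores is None:
            scores = np.zeros((h, w, s.shape[-1]))
        scores[:, x:x + crop] += s
        counts[:, x:x + crop] += 1
    return scores / counts  # every column is covered by >= 1 crop
```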
Cityscapes validation set
  • With Multi-Scale (MS) inputs and CRF post-processing, 77.4% mean IoU is obtained.
Cityscapes test set
Qualitative Results
