Review: InstanceFCN — Instance-Sensitive Score Maps (Instance Segmentation)

Fully Convolutional Network (FCN), With Instance-Sensitive Score Maps, Better than DeepMask, Competitive with MNC

Sik-Ho Tsang
Towards Data Science


In this story, InstanceFCN (Instance-sensitive Fully Convolutional Networks), by Microsoft Research, Tsinghua University, and University of Science and Technology of China, is briefly reviewed.

Built on a Fully Convolutional Network (FCN), InstanceFCN introduces instance-sensitive score maps and removes all Fully Connected (FC) layers. It obtains competitive instance segment proposal results on both PASCAL VOC and MS COCO. It was published in ECCV 2016 and has more than 100 citations. (Sik-Ho Tsang @ Medium)

What Is Covered

  1. Network Structure
  2. Instance-Sensitive Score Maps
  3. Ablation Study
  4. Results

1. Network Structure

Network Structure
  • VGG-16 pretrained on ImageNet is used as the feature extractor. Max pooling layer pool4 is modified from stride 2 to stride 1 in order to decrease the output stride, i.e. increase the output feature map size. Accordingly, conv5_1 to conv5_3 are adjusted by the “hole algorithm”, which was used by DeepLab & DilatedNet before.
  • On top of the feature map, there are two fully convolutional branches, one for estimating segment instances and the other for scoring the instances.
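The effect of the pool4 change on the output stride can be checked with a few lines of arithmetic. This is a minimal sketch, not the authors' code: it assumes the features are taken at conv5_3, so that only pool1 to pool4 contribute to the stride, and the 320-pixel input side is a hypothetical value chosen for illustration.

```python
def output_stride(pool_strides):
    """Product of the pooling strides that precede the feature map."""
    stride = 1
    for s in pool_strides:
        stride *= s
    return stride

# Assumption: features are read at conv5_3, i.e. after pool1..pool4.
vgg16_pools = [2, 2, 2, 2]        # original VGG-16
instancefcn_pools = [2, 2, 2, 1]  # pool4 changed from stride 2 to stride 1

input_side = 320  # hypothetical input size, for illustration only
for name, pools in [("VGG-16", vgg16_pools), ("InstanceFCN", instancefcn_pools)]:
    s = output_stride(pools)
    print(name, "output stride:", s, "feature map side:", input_side // s)
```

Halving the stride doubles the feature map side, which is why the conv5 filters then need the hole algorithm to see the same image region as before.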

Instance-sensitive score maps branch

  • For the first branch (top path), a 1×1 512-d convolutional layer is used to transform the features, followed by a 3×3 convolutional layer that generates a set of k² instance-sensitive score maps, i.e. k² output channels. (k=5 is used in the end.)
  • An assembling module is used to generate object instances in a sliding window of a resolution m×m. (m=21 here.)
  • The idea is very similar to that of position-sensitive score maps in R-FCN. But R-FCN uses position-sensitive score maps for object detection while InstanceFCN uses instance-sensitive score maps for generating proposals.
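The assembling module has no learned parameters: for each sliding window it only copies values from the k² score maps. The following NumPy sketch illustrates the idea under some assumptions. The function name `assemble_instance`, the channel ordering (row-major over the k×k grid) and the rounding used to split the window when m is not divisible by k are illustrative choices, not details taken from the paper.

```python
import numpy as np

def assemble_instance(score_maps, top, left, m, k):
    """Assemble one m x m instance mask from k*k instance-sensitive score maps.

    score_maps: array of shape (k*k, H, W); channel i*k + j is assumed to
    score the relative position (row i, column j) of the k x k grid.
    (top, left): top-left corner of the sliding window on the score maps.
    """
    assert score_maps.shape[0] == k * k
    # Split the window into k x k cells; rounding handles m not divisible by k.
    edges = np.round(np.linspace(0, m, k + 1)).astype(int)
    out = np.empty((m, m), dtype=score_maps.dtype)
    for i in range(k):
        for j in range(k):
            r0, r1 = edges[i], edges[i + 1]
            c0, c1 = edges[j], edges[j + 1]
            # Cell (i, j) of the window is copied from the same location
            # in the score map responsible for that relative position.
            out[r0:r1, c0:c1] = score_maps[i * k + j,
                                           top + r0:top + r1,
                                           left + c0:left + c1]
    return out

# Toy check: channel c is filled with the constant c, so the assembled
# window shows which score map each cell was copied from.
k, m = 3, 21
toy_maps = np.stack([np.full((40, 40), c, dtype=float) for c in range(k * k)])
window = assemble_instance(toy_maps, top=5, left=8, m=m, k=k)
print(window[0, 0], window[7, 14], window[20, 20])
```

Because the copy is repeated for every window position, two neighbouring windows read different pixels of the same score maps, which is what lets overlapping instances receive different masks.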

Objectness score map branch

  • For the second branch of scoring instances (bottom path), a 3×3 512-d convolutional layer is used, followed by a 1×1 convolutional layer. This 1×1 layer is a per-pixel logistic regression for classifying instance/not-instance of the sliding window centered at this pixel. Thus, its output is an objectness score map.
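To see why a 1×1 convolution with a single output channel amounts to per-pixel logistic regression, consider the sketch below. The function name `objectness_map` and the random weights are placeholders for illustration; the preceding 3×3 512-d layer is omitted.

```python
import numpy as np

def objectness_map(features, weight, bias):
    """1x1 convolution to a single channel followed by a sigmoid.

    features: (C, H, W) feature map; weight: (C,); bias: scalar.
    Every pixel gets the same linear classifier applied to its C-d feature.
    """
    logits = np.tensordot(weight, features, axes=([0], [0])) + bias  # (H, W)
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feats = rng.standard_normal((512, 6, 8))   # placeholder features
w = rng.standard_normal(512) * 0.01        # placeholder weights
scores = objectness_map(feats, w, bias=0.0)
print(scores.shape)  # one score per sliding-window center
```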

Loss function

  • Here i is the index of a sampled window, p_i is the predicted objectness score of the instance in this window, and p_i* is 1 if this window is a positive sample and 0 if a negative sample. S_i is the assembled segment instance in this window, S_i* is the ground truth segment instance, and j is the pixel index in the window. L is the logistic regression loss.
  • 256 sampled windows have a positive/negative sampling ratio of 1:1.
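The loss equation itself was a figure in the original post and is not reproduced in this text. From the symbol definitions above, it can be written roughly as follows, where starred symbols denote ground truth and the per-pixel indexing S_{i,j} is assumed from the description of j as the pixel index:

```latex
\sum_i \Big( \mathcal{L}(p_i, p_i^*) + \sum_j \mathcal{L}(S_{i,j}, S_{i,j}^*) \Big)
```

The first term scores each sampled window as instance or not-instance; the second term compares the assembled mask with the ground truth mask pixel by pixel.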

2. Instance-Sensitive Score Maps

2.1. Compared with FCN

Top: FCN, Bottom: InstanceFCN (k=3)
  • In FCN (Top), when two persons are too close, it is difficult to separate them using the generated score map.
  • However, using InstanceFCN (Bottom), each score map is responsible for capturing one relative position of an object instance. For example, the top-left score map is responsible for capturing the top-left part of an object instance. After assembling, a separate mask can be generated for each person.
  • Some examples of instance masks with k=3 are shown below:
Some examples of instance masks with k=3

2.2. Compared with DeepMask

DeepMask
  • In DeepMask, FC layers are used, which makes the model large.
  • In InstanceFCN, there are no FC layers, which makes the model more compact.

3. Ablation Study

Average Recall with Different k
  • Average Recall (AR) is measured under 10, 100, 1000 proposals.
  • k=5 and k=7 give comparable results, so k=5 is used in the following experiments.
Train and Test Image Sizes
  • ~DeepMask: DeepMask as implemented by the InstanceFCN authors. Using 2 FC layers requires 53M parameters. (512 × 14 × 14 × 512 + 512 × 56² = 53M)
  • It is found that training on full-size images gives much higher AR. In addition, the last k²-d convolutional layer has only 0.1M parameters. (512 × 3 × 3 × 25 = 0.1M)
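The two parameter counts quoted above can be verified directly. This sketch only reproduces the arithmetic in the text (biases are ignored, as in the quoted formulas); the interpretation of each factor in the comments is my reading of those formulas.

```python
# ~DeepMask head: two FC layers, as in the formula quoted above.
fc1 = 512 * 14 * 14 * 512   # read as: 14x14x512 feature map -> 512-d vector
fc2 = 512 * 56 ** 2         # read as: 512-d vector -> 56x56 mask
deepmask_head = fc1 + fc2

# InstanceFCN: last 3x3 conv layer mapping 512 channels to k^2 = 25 score maps.
k = 5
instancefcn_last = 512 * 3 * 3 * k ** 2

print(f"~DeepMask FC layers: {deepmask_head / 1e6:.1f}M parameters")
print(f"InstanceFCN last layer: {instancefcn_last / 1e6:.2f}M parameters")
```

The exact values are 52,985,856 and 115,200, so the mask-generating layer of InstanceFCN is about 460 times smaller than the two FC layers.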

4. Results

4.1. PASCAL VOC 2012

Segment Proposals on PASCAL VOC 2012 Validation Set
Recall vs IoU
  • InstanceFCN is much better than SS (Selective Search) and DeepMask.
  • InstanceFCN has higher AR@10 than MNC, and is comparable with MNC at AR@100 and AR@1000.
Instance Segmentation on PASCAL VOC 2012 Validation Set (N = 300 Proposals)

When InstanceFCN is used to generate the proposals for MNC, the resulting mAP is comparable with that of MNC. (MNC was a concurrent work at that time.)

4.2. MS COCO

Segment Proposals on the First 5k Images of MS COCO Validation Set
Recall vs IoU
Comparisons with DeepMask on MS COCO Validation Set
More Examples on MS COCO validation set
