Review: SharpMask — 1st Runner Up in COCO Segmentation 2015 (Instance Segmentation)

A Refinement Module, An Encoder Decoder Architecture by Facebook AI Research (FAIR)

Sik-Ho Tsang
Towards Data Science

--

In this story, SharpMask, by Facebook AI Research (FAIR), is reviewed. The encoder-decoder architecture started to become common around 2016. By concatenating the feature maps from the top-down pass with the feature maps from the bottom-up pass, performance can be boosted further.

  • Object Detection: Identify the object category and locate the position using a bounding box for every known object within an image.
  • Semantic Segmentation: Identify the object category of each pixel for every known object within an image. Labels are class-aware.
  • Instance Segmentation: Identify the object instance of each pixel for every known object within an image. Labels are instance-aware.
Object Detection (Left), Semantic Segmentation (Middle), Instance Segmentation (Right)

SharpMask obtained 2nd place in the MS COCO Segmentation challenge and 2nd place in the MS COCO Detection challenge. It was published in 2016 ECCV with over 200 citations. (Sik-Ho Tsang @ Medium)

  • Average recall (AR) on MS COCO improves by 10–20%.
  • By optimizing the architecture, speed is improved by 50% compared with DeepMask.
  • By using additional image scales, small-object recall is roughly doubled.
  • By applying SharpMask to Fast R-CNN, object detection results are also improved.

What Are Covered

  1. Encoder Decoder Architecture
  2. Some Details
  3. Architecture Optimization
  4. Results

1. Encoder Decoder Architecture

Architectures for Instance Segmentation

(a) The Conventional Feedforward Network

  • The network contains a series of convolutional layers interleaved with pooling stages that reduce the spatial dimensions of the feature maps, followed by a fully connected layer to generate the object mask. Hence, each pixel prediction is based on a complete view of the object; however, its input feature resolution is low due to the multiple pooling stages.
  • This network architecture is similar to the DeepMask approach.
  • DeepMask masks only coarsely align with the object boundaries.
  • SharpMask produces sharper, pixel-accurate object masks.

(b) Multiscale Network

  • This architecture is equivalent to making independent predictions from each network layer, then upsampling and averaging the results.
  • This network architecture is similar to the FCN and CUMedVision1 approaches (Note: they are not for instance segmentation).

(c) Encoder Decoder Network & (d) Refinement Module

  • After a series of convolutions at the bottom-up pass (left side of the network), the feature maps are very small.
  • These feature maps are 3×3 convolved and gradually upsampled at the top-down pass (right side of the network) using 2× bilinear interpolation.
  • In addition, the same-size feature maps F from the bottom-up pass are concatenated with the mask-encoding feature maps M in the top-down pass before each upsampling.
  • Before each concatenation, a 3×3 convolution is performed on F to reduce the number of feature maps, since direct concatenation is computationally expensive.
  • The concatenation has been used in many deep learning approaches as well such as the famous U-Net.
  • The authors also refactored the refinement module into an equivalent form that leads to a more efficient implementation:
(a) Original (b) Refactored but equivalent model that leads to a more efficient implementation
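The refinement step above can be sketched in PyTorch. This is a minimal illustration, not the authors' code: the class name, channel sizes, and activation choices are my own assumptions; only the structure (3×3 conv on the skip features F, concatenation with the mask encoding M, a merging 3×3 conv, then 2× bilinear upsampling) follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

class RefinementModule(nn.Module):
    """Hypothetical sketch of one SharpMask refinement module."""
    def __init__(self, skip_channels, mask_channels, out_channels):
        super().__init__()
        # 3x3 conv on the skip features F to reduce their channel count
        # before concatenation (direct concatenation is expensive).
        self.reduce_skip = nn.Conv2d(skip_channels, out_channels, 3, padding=1)
        # 3x3 conv after concatenation produces the new mask encoding M.
        self.merge = nn.Conv2d(out_channels + mask_channels, out_channels, 3, padding=1)

    def forward(self, skip, mask):
        s = torch.relu(self.reduce_skip(skip))
        m = torch.relu(self.merge(torch.cat([s, mask], dim=1)))
        # 2x bilinear upsampling moves the encoding one level up the top-down path.
        return Fn.interpolate(m, scale_factor=2, mode="bilinear", align_corners=False)

module = RefinementModule(skip_channels=256, mask_channels=64, out_channels=64)
out = module(torch.randn(1, 256, 14, 14), torch.randn(1, 64, 14, 14))
print(out.shape)  # torch.Size([1, 64, 28, 28])
```

In the full network, one such module sits at each level of the top-down path, consuming the bottom-up feature map of matching spatial size.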

2. Some Details

ImageNet-Pretrained 50-layer ResNet is used.

Two-stage Training

First, the model is trained to jointly infer a coarse pixel-wise segmentation mask and an object score using the feedforward path. Second, the feedforward path is 'frozen' and only the refinement modules are trained.

  • Faster convergence can be obtained.
  • We can obtain a coarse mask using the feedforward path only, or a sharp mask using both the bottom-up and top-down paths.
  • The gain from fine-tuning the whole network is minimal once the feedforward branch has converged.
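The freezing step in stage two can be sketched as follows. This is a generic PyTorch pattern, not the authors' training script; the two `nn.Sequential` stand-ins for the feedforward and refinement paths are placeholders of my own.

```python
import torch.nn as nn

# Placeholder modules standing in for the two paths of the network.
feedforward = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
refinement = nn.Sequential(nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())

# Stage 1: train `feedforward` jointly on the coarse-mask and score losses.
# Stage 2: freeze the feedforward path, then train only the refinement modules.
for p in feedforward.parameters():
    p.requires_grad = False

# Only refinement parameters are passed to the stage-2 optimizer.
trainable = [p for p in refinement.parameters() if p.requires_grad]
print(len(trainable))  # weights + biases of the refinement conv
```

Freezing the shared trunk is what makes the coarse mask available "for free" at inference: the feedforward output is unchanged by stage-2 training.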

During Full-image Inference

  • Only the most promising locations are refined: the top N scoring proposal windows.

3. Architecture Optimization

It is required to reduce the complexity of the network. It is found that DeepMask spends 40% of its time on feature extraction, 40% on mask prediction, and 20% on score prediction.

3.1. Trunk Architecture

  • Input Size W: Reducing W also decreases the stride density S, which further harms accuracy.
  • Pooling Layers P: More pooling P results in faster computation, but it also results in loss of feature resolution.
  • Stride Density S: Doubling the stride while keeping W constant greatly reduces performance.
  • Depth D: Increasing D reduces spatial resolution, which hurts performance in the context of instance segmentation.
  • Feature Channels F: A 1×1 convolution is adopted to reduce F, showing that large speedups can be achieved in this manner.
Results for Various W, P, D, S, F
  • W160-P4-D39-F128: obtains the tradeoff between speed and accuracy.
  • The top and last rows show the timings for DeepMask and SharpMask (i.e. W160-P4-D39-F128) respectively, using multiscale inference and excluding score prediction time.
  • The total time per image is 1.59s for DeepMask and 0.76s for SharpMask, i.e. about 0.63 FPS and 1.32 FPS respectively.
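The per-image times above convert to frames per second simply as FPS = 1 / time:

```python
# Convert the reported per-image inference times (seconds) to FPS.
times = {"DeepMask": 1.59, "SharpMask": 0.76}
fps = {name: round(1.0 / seconds, 2) for name, seconds in times.items()}
print(fps)  # {'DeepMask': 0.63, 'SharpMask': 1.32}
```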

3.2. Head Architecture

The head architecture also accounts for a certain share of the model's complexity.

Various Head Architecture
  • (a): Original DeepMask head architecture to obtain the mask and score.
  • (b) to (d): Variants that share convolutional and fully connected layers to obtain the mask and score.
Results for Various Head Architectures
  • Head C is chosen due to its simplicity and speed.

3.3. Number of Feature Maps in Different Conv

  • (a) The number of feature maps is the same for all convolutions.
  • (b) The number of feature maps is reduced along the bottom-up path and increased back along the top-down path.
  • (b) has lower inference time and similar AUC (AR averaged over 10, 100 and 1000 proposals).

4. Results

4.1. MS COCO Segmentation

Results on MS COCO Segmentation
  • DeepMask-ours: DeepMask with the optimized trunk and head, which is better than the original DeepMask.
  • SharpMask: Better than previous state-of-the-art approaches.
  • SharpMaskZoom & SharpMaskZoom²: With one or two additional smaller scales, achieving a large boost in AR for small objects.

4.2. Object Detection & Results in MS COCO Challenges 2015

Results on MS COCO

Top

  • By applying SharpMask to Fast R-CNN with VGGNet as the backbone for feature extraction (the third row, SharpMask+VGG), it is better than Selective Search (i.e. the original Fast R-CNN) and RPN (Region Proposal Network, i.e. Faster R-CNN).

Middle

  • SharpMask+MPN (with another backbone called MultiPathNet) obtains 2nd place in the MS COCO Segmentation challenge.

Bottom

  • SharpMask+MPN obtains 2nd place in the MS COCO Detection challenge, better than ION.

But at that time, SharpMask only used VGGNet as the backbone; thus, the results were inferior.

4.3. Qualitative Results

SharpMask proposals with the highest IoU to the ground truth on selected COCO images. Missed objects (no matching proposals with IoU > 0.5) are marked in red. The last row shows a number of failure cases.

By gradually upsampling, with early feature maps concatenated with late feature maps, SharpMask outperforms DeepMask.
