Review: R-FCN — Position-Sensitive Score Maps (Object Detection)

Sik-Ho Tsang
Towards Data Science
6 min read · Oct 20, 2018


In this story, R-FCN (Region-based Fully Convolutional Network), by Microsoft and Tsinghua University, is briefly reviewed. With position-sensitive score maps, its inference time is much faster than Faster R-CNN while still maintaining competitive accuracy.

From R-CNN to R-FCN

This is a 2016 NIPS paper with more than 700 citations at the time of writing. Since knowing the development of object detection approaches tells us much about the reasons behind each innovation, I hope to cover more object detection approaches in the future. (Sik-Ho Tsang @ Medium)

R-FCN Demonstration

What Are Covered

  1. Advantages of R-FCN Over R-CNN
  2. Position-Sensitive Score Maps & ROI Pooling
  3. Other Details
  4. Results

1. Advantages of R-FCN Over R-CNN

R-CNN series

For region-based approaches such as R-CNN, Fast R-CNN and Faster R-CNN, region proposals are generated first (by selective search in R-CNN and Fast R-CNN, and by a region proposal network (RPN) in Faster R-CNN). Then ROI pooling is applied, and each ROI goes through fully connected (FC) layers for classification and bounding box regression.

The computation after ROI pooling (the FC layers) is not shared among ROIs and takes time, which makes these region-based approaches slow. The FC layers also increase the number of connections (parameters), which increases the complexity.

R-FCN

In R-FCN, we still have an RPN to obtain region proposals, but unlike the R-CNN series, the FC layers after ROI pooling are removed. Instead, almost all of the computation is moved before ROI pooling to generate the position-sensitive score maps. After ROI pooling, all region proposals make use of the same set of score maps and perform average voting, which is a simple calculation. Thus, there is no learnable layer after the ROI layer, so the per-ROI computation is nearly cost-free. As a result, R-FCN is even faster than Faster R-CNN while achieving competitive mAP.

2. Position-Sensitive Score Maps & ROI Pooling

Position-Sensitive Score Maps & Position-Sensitive ROI Pooling (k=3 in this figure) (Colors are important in this diagram)

2.1 Position-Sensitive Score Maps

Let us leave out the RPN for simplicity.

Suppose there are C object classes to be detected; (C+1) denotes the C object classes plus the background class.

After the backbone convolutions, a convolutional layer with k²(C+1) output channels produces the position-sensitive score maps. For each class, there are k² score maps, and these k² maps correspond to the relative positions {top-left (TL), top-center (TC), ..., bottom-right (BR)} of the object we want to detect.
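To make this concrete, here is a minimal sketch (PyTorch) of a 1×1 convolution producing the k²(C+1) score maps. The 1024-channel backbone output, the feature-map size and the variable names are illustrative assumptions, not taken from the official code:

```python
import torch
import torch.nn as nn

C, k = 20, 3                 # e.g. 20 VOC object classes, a 3x3 grid of relative positions
in_channels = 1024           # assumed number of backbone output channels

# One score map per (relative position, class) pair: k*k*(C+1) output channels
ps_score_head = nn.Conv2d(in_channels, k * k * (C + 1), kernel_size=1)

features = torch.randn(1, in_channels, 38, 63)   # dummy backbone feature map
score_maps = ps_score_head(features)
print(score_maps.shape)                          # torch.Size([1, 189, 38, 63]); 189 = 9 * 21
```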

2.2 Position-Sensitive ROI Pooling

An Example of Position-Sensitive ROI Pooling

During position-sensitive ROI pooling, each ROI is divided into k×k bins, and each bin is pooled only from its own bank of score maps (the same area and the same color in the figure), producing (C+1) maps of size k×k, i.e. k²(C+1) values per ROI. Average voting over the k² bins then generates a (C+1)-d vector, and finally softmax is applied to this vector to obtain class probabilities.
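Below is a minimal NumPy sketch of position-sensitive ROI pooling followed by average voting. It assumes score maps of shape (k²(C+1), H, W) and an ROI in feature-map coordinates; the channel grouping and names are illustrative assumptions, not from the official implementation:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k, num_classes):
    """Pool each of the k x k ROI bins from its own dedicated bank of (C+1) score maps."""
    C1 = num_classes + 1
    x0, y0, x1, y1 = roi
    bin_w, bin_h = (x1 - x0) / k, (y1 - y0) / k
    pooled = np.zeros((C1, k, k), dtype=np.float32)
    for i in range(k):                      # row of the bin grid
        for j in range(k):                  # column of the bin grid
            ys = int(np.floor(y0 + i * bin_h))
            ye = max(ys + 1, int(np.ceil(y0 + (i + 1) * bin_h)))
            xs = int(np.floor(x0 + j * bin_w))
            xe = max(xs + 1, int(np.ceil(x0 + (j + 1) * bin_w)))
            # bin (i, j) only reads the (C+1) maps assigned to that relative position
            bank = score_maps[(i * k + j) * C1 : (i * k + j + 1) * C1]
            pooled[:, i, j] = bank[:, ys:ye, xs:xe].mean(axis=(1, 2))
    return pooled                           # shape: (C+1, k, k)

def vote_and_softmax(pooled):
    scores = pooled.mean(axis=(1, 2))       # average voting over the k x k bins -> (C+1,)
    e = np.exp(scores - scores.max())
    return e / e.sum()                      # per-class probabilities

# Usage with dummy data: C = 20 classes, k = 3
C, k = 20, 3
score_maps = np.random.randn(k * k * (C + 1), 38, 63).astype(np.float32)
probs = vote_and_softmax(ps_roi_pool(score_maps, roi=(10, 5, 31, 29), k=k, num_classes=C))
print(probs.shape)                          # (21,)
```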

When the region proposal does not overlap the object well, as shown below, most bins vote "no":

When the region proposal does not overlap the object well.

2.3 Bounding Box Regression

Class-agnostic bounding box regression is performed, which means the regression is shared among all classes.

Alongside the k²(C+1)-channel convolutional layer, a sibling convolutional layer with 4k² channels is appended. Position-sensitive ROI pooling is performed on this bank of 4k² maps, producing a 4k²-d vector for each ROI. This vector is then aggregated into a 4-d vector by average voting, representing {tx, ty, tw, th} (position and size) of the bounding box, with the same parameterization as in Fast R-CNN.
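A minimal PyTorch sketch of this box-regression branch, under the same assumptions as above (1024 backbone channels, k = 3); the pooled tensor here is only a stand-in for the output of position-sensitive ROI pooling over the 4k² maps:

```python
import torch
import torch.nn as nn

k, in_channels = 3, 1024
bbox_head = nn.Conv2d(in_channels, 4 * k * k, kernel_size=1)   # sibling 4k^2-channel layer

features = torch.randn(1, in_channels, 38, 63)                 # dummy backbone feature map
bbox_maps = bbox_head(features)                                # (1, 36, 38, 63) for k = 3

# Stand-in for position-sensitive ROI pooling on the 4k^2 maps for a single ROI
pooled = torch.randn(4, k, k)
t = pooled.mean(dim=(1, 2))                                    # average voting -> (tx, ty, tw, th)
print(t.shape)                                                 # torch.Size([4])
```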

3. Other Details

3.1 Backbone Architecture

The first 100 convolutional layers of ResNet-101, pretrained on ImageNet, are used to compute the feature maps right before the position-sensitive score maps.

3.2 Training

The loss follows Fast R-CNN:
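For reference, the per-ROI loss has the same form as in Fast R-CNN (a reconstruction, with c* the ground-truth label, c* = 0 for background, [·] the indicator function, and λ = 1 in the paper):

$$ L(s, t_{x,y,w,h}) = L_{cls}(s_{c^{*}}) + \lambda\,[c^{*} > 0]\,L_{reg}(t, t^{*}) $$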

Lcls is the classification loss (cross-entropy over the C+1 categories) and Lreg is the (smooth L1) bounding box regression loss.

Online Hard Example Mining (OHEM) is used for training. Among the N proposals, only the top B ROIs with the highest loss are used for backpropagation.
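A minimal sketch of this OHEM selection step (PyTorch; it assumes a per-ROI loss vector has already been computed in a forward pass, and the names are illustrative):

```python
import torch

def select_hard_rois(roi_losses: torch.Tensor, B: int) -> torch.Tensor:
    """Return the indices of the B ROIs with the highest loss."""
    return torch.topk(roi_losses, k=min(B, roi_losses.numel())).indices

# Usage with dummy data: N = 300 proposals, keep the B = 128 hardest
roi_losses = torch.rand(300)                  # per-ROI losses from the forward pass
hard_idx = select_hard_rois(roi_losses, B=128)
loss = roi_losses[hard_idx].mean()            # only these ROIs contribute to backpropagation
```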

4-step alternating training, the same as in Faster R-CNN, is used to train the RPN and R-FCN.

3.3 Inference

Non-maximum suppression (NMS) is performed with a 0.3 IoU threshold for post-processing.
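A minimal NumPy sketch of greedy NMS at the 0.3 IoU threshold, with boxes given as (x1, y1, x2, y2); this is a standalone illustration, not the paper's code:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap too much
    return keep

# Usage with dummy boxes
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 160, 160]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]: the second box is suppressed by the first
```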

4. Results

4.1 VOC 2007 Dataset

Study of k values

R-FCN with a 7×7 bin size (k = 7) obtains 76.6% mAP, which is better than Faster R-CNN.

The use of OHEM

With OHEM, 79.5% mAP is obtained.

Multi-Scale Training

Using different image scales for training, R-FCN reaches 83.6% mAP, which is a bit worse than the 85.6% mAP of Faster R-CNN+++. But the test time of R-FCN is 0.17 sec per image, which is much faster than that of Faster R-CNN+++ (3.36 sec/image). This is because there are no FC layers after ROI pooling.

We can also see that the training details are crucial, increasing the mAP substantially from 76.6% to 83.6%.

Different Backbones on VOC 2007

Using ResNet-152 gives a similar mAP to ResNet-101. This is a limitation of the original ResNet design: if identity mappings are used in ResNet, it can go beyond 1000 layers without saturating at 152 layers. (If interested, please also read my review on ResNet with Identity Mapping.)

Some Amazing Results on VOC 2007 Dataset

4.2 VOC 2012 & MS COCO Datasets

VOC 2012 Dataset
MS COCO Dataset

Similar to the results on the VOC 2007 dataset, R-FCN has a competitive but lower mAP than Faster R-CNN+++, while its test time is much faster.

Object detection approaches can be divided into two-stage approaches (the R-CNN series with region proposals) and one-stage approaches (YOLO, SSD). R-FCN can be treated as a fast method in the two-stage category.

Since knowing the development of object detection approaches tells us much about the reasons behind each innovation, I hope to cover more object detection approaches in the future.
