Review: CRAFT — Cascade Region-proposal-network And FasT r-cnn (Object Detection)

Better Object Proposals, More Accurate Object Classification, Outperforms Faster R-CNN

Sik-Ho Tsang
Towards Data Science

--

In this story, CRAFT, by the Chinese Academy of Sciences and Tsinghua University, is reviewed. In Faster R-CNN, a region proposal network (RPN) is used to generate proposals. These proposals, after ROI pooling, go through a network for classification. However, a core problem is found in Faster R-CNN:

  • In proposal generation, there is still a large proportion of background regions. The existence of many background samples causes many false positives.

In CRAFT, as shown above, another CNN is added after the RPN to generate fewer proposals (i.e. 300 here). Then, classification is performed on these 300 proposals, which outputs about 20 primitive detection results. For each primitive result, refined object detection is performed using one-vs-rest classification. It is published in 2016 CVPR with over 50 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Cascade Proposal Generation
  2. Cascade Object Classification
  3. Ablation Study
  4. Results

1. Cascade Proposal Generation

1.1. Baseline RPN

  • An ideal proposal generator should generate as few proposals as possible while covering almost all object instances. Due to the resolution loss caused by the CNN pooling operations and the fixed aspect ratios of the sliding window, RPN is weak at covering objects with extreme scales or shapes.
Recall Rates (%) per category (overall: 94.87%; categories below 94.87% are shown in bold in the original table)
  • The above results are for the baseline RPN based on VGG_M, trained on PASCAL VOC 2007 train+val and tested on the test set.
  • The recall rate varies a lot across object categories. Objects with extreme aspect ratios or scales, such as boat and bottle, are hard to detect.
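
To make the recall numbers above concrete, here is a minimal NumPy sketch of how per-category proposal recall can be computed: a ground-truth box counts as recalled if at least one proposal overlaps it with IoU at or above a threshold. The function names and the 0.5 IoU threshold are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one ground-truth box and an array of proposals, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_box + area_boxes - inter)

def category_recall(gt_boxes_per_image, proposals_per_image, thr=0.5):
    """Recall for one category: fraction of its ground-truth boxes covered by any proposal at IoU >= thr."""
    covered, total = 0, 0
    for gt_boxes, proposals in zip(gt_boxes_per_image, proposals_per_image):
        for gt in gt_boxes:
            total += 1
            if len(proposals) and iou(gt, proposals).max() >= thr:
                covered += 1
    return covered / max(total, 1)
```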

1.2. Proposed Cascade Structure

The additional classification network after the RPN is denoted as FRCN net here.
  • An additional classification network comes after the RPN.
  • The additional network is a 2-class detection network, denoted as FRCN net in the above figure. It uses the output of the RPN as training data.
  • After the RPN net is trained, the 2000 primitive proposals of each training image are used as training data for the FRCN net.
  • During training, positives are sampled at above 0.7 IoU and negatives at below 0.3 IoU (see the label-assignment sketch after this list).
  • There are two advantages:
  • 1) First, the additional FRCN net further improves the quality of the object proposals and shrinks more background regions, making the proposals fit better with the task requirement.
  • 2) Second, proposals from multiple sources can be merged as the input of FRCN net so that complementary information can be used.
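
The sampling rule above can be illustrated with a minimal NumPy sketch (not the authors' code); boxes are assumed to be [x1, y1, x2, y2] arrays, and proposals whose IoU falls between the two thresholds are simply ignored:

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between two sets of boxes in [x1, y1, x2, y2] format."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def assign_frcn_labels(proposals, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label RPN proposals for the 2-class FRCN net: 1 = object, 0 = background, -1 = ignored."""
    labels = np.full(len(proposals), -1, dtype=np.int64)
    if len(gt_boxes) == 0:
        labels[:] = 0                      # no objects in the image: every proposal is background
        return labels
    max_iou = iou_matrix(proposals, gt_boxes).max(axis=1)
    labels[max_iou >= pos_thr] = 1         # above 0.7 IoU with some ground truth: positive
    labels[max_iou < neg_thr] = 0          # below 0.3 IoU with every ground truth: negative
    return labels
```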

2. Cascade Object Classification

2.1. Baseline Fast R-CNN

Fast R-CNN Results (Orange: Train, Red: Boat, Blue: Potted Plant)
  • Fast R-CNN is weak at capturing intra-category variance, as the “background” class usually occupies a large proportion of the training samples.
  • As shown in the figure above, mis-classification error is a major problem in the final detections.

2.2. Proposed Cascade Structure

Cascade Object Classification
  • To ameliorate the problem of too many false positives caused by mis-classification, a one-vs-rest classifier is used as an additional 2-class cross-entropy loss for each object category, as shown above.
  • Each one-vs-rest classifier sees proposals specific to one particular object category (also containing some false positives), making it focus on capturing intra-category variance.
  • A standard FRCN net (FRCN-1) is first trained using object proposals from the cascade proposal structure.
  • Then, another FRCN net (FRCN-2) is trained on the output of FRCN-1, i.e. the primitive detections.
  • Primitive detections classified as “background” are discarded.
  • The sum of N 2-class cross-entropy losses is used where N equals the number of object categories.
  • The convolution weights of FRCN-1 and FRCN-2 are shared so that the full-image feature maps need only be computed once.
  • The new layers that produce the 2N scores and 4N bounding box regression targets are initialized from a Gaussian distribution.
  • Therefore, at test time, with 300 object proposals as input, FRCN-1 outputs around 20 primitive detections, each with N primitive scores.
  • Then each primitive detection is classified again by FRCN-2, and the output scores (N categories) are multiplied with the primitive scores (N categories) category by category to get the final N scores for this detection.
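
A compact NumPy sketch of the loss and scoring described above, under the assumption that the 2N FRCN-2 outputs are N independent 2-way softmaxes; all function names are illustrative rather than taken from the paper:

```python
import numpy as np

def one_vs_rest_probs(logits):
    """Per-category 2-way softmax over the 2N FRCN-2 scores; returns P(category) of shape (M, N)."""
    e = np.exp(logits - logits.max(axis=2, keepdims=True))   # logits: (M, N, 2)
    p = e / e.sum(axis=2, keepdims=True)
    return p[..., 1]                                          # index 1 = "this category", index 0 = background

def fuse_scores(primitive_scores, ovr_probs):
    """Final scores: FRCN-1 primitive scores multiplied with FRCN-2 scores, category by category."""
    return primitive_scores * ovr_probs                       # both of shape (M, N)

def one_vs_rest_loss(logits, labels, eps=1e-7):
    """Sum of N 2-class cross-entropy losses, averaged over the M primitive detections.
    labels: (M, N) with 1 where a primitive detection belongs to that category, else 0."""
    p_cat = one_vs_rest_probs(logits)                         # P(category), shape (M, N)
    p_true = np.where(labels == 1, p_cat, 1.0 - p_cat)        # probability of the true 2-way label
    return -np.log(p_true + eps).sum(axis=1).mean()
```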

3. Ablation Study

3.1. Proposal Generation

Recall Rates (%)
  • VGG-19 pretrained on ILSVRC DET train+val1 is used, and testing is done on val2.
  • The proposed FRCN, with positive and negative sampling at above 0.7 IoU and below 0.3 IoU respectively, has the highest recall rate of 92.37%, more than 2% higher than RPN.
  • The proposed FRCN also uses only 300 proposals, yet it is better than Selective Search (SS), which uses 2000 proposals.
Recall Rates (%) and mAP (%) on PASCAL VOC 2007 Test Set
  • RPN proposals are not as well localized as those from bottom-up methods (low recall rates at high IoU thresholds).
  • Using a larger network (RPN_L) does not help, because the problem is caused by the fixed anchors.
  • “Ours” keeps a fixed number of proposals per image (the same as RPN), while “Ours_S” keeps proposals whose scores (the output of the cascaded FRCN classifier) are above a fixed threshold (see the sketch after this list).
  • The cascaded proposal generator not only further eliminates background proposals but also brings better localization, both of which help detection AP.
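
For reference, the two keep strategies (“Ours” vs “Ours_S”) boil down to a top-k selection versus a score threshold. A minimal sketch follows; the 0.5 threshold is only a placeholder, since the review does not state the actual value:

```python
import numpy as np

def keep_fixed_number(proposals, scores, k=300):
    """'Ours': keep a fixed number of top-scoring proposals per image (same budget as RPN)."""
    return proposals[np.argsort(-scores)[:k]]

def keep_above_threshold(proposals, scores, thr=0.5):
    """'Ours_S': keep every proposal whose cascade FRCN score is above a fixed threshold."""
    return proposals[scores > thr]
```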

3.2. Object Classification

mAP (%) on PASCAL VOC 2007 Test Set
  • “the same”: no fine-tuning. The mAP is similar to that of the model without the cascade classification structure; it is effectively running FRCN-1 twice, i.e. iterative bounding box regression.
  • “clf”: only the additional one-vs-rest classification weights are fine-tuned. mAP is improved to 66.3%.
  • “fc+clf”: all layers after the last convolutional layer are fine-tuned. mAP is 68.0%, the best result.
  • “conv+fc+clf”: this amounts to training a totally new feature representation and learning another classifier.
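
The three fine-tuning settings correspond to unfreezing different parameter groups of FRCN-2. Below is a hypothetical PyTorch sketch: the module is a toy stand-in, not the paper's VGG-based network, and only shows how such groups could be frozen or unfrozen by name.

```python
import torch.nn as nn

# Toy stand-in for the FRCN-2 head. Group names mirror the ablation settings:
# "conv" = shared convolutional layers, "fc" = fully connected layers after the
# last conv, "clf_" = the new layers producing the 2N scores and 4N regression targets.
class FRCN2Head(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.conv = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.fc = nn.Linear(512 * 7 * 7, 4096)
        self.clf_score = nn.Linear(4096, 2 * num_classes)   # 2N one-vs-rest scores
        self.clf_bbox = nn.Linear(4096, 4 * num_classes)    # 4N box regression targets

def set_trainable(model, setting):
    """Unfreeze the parameter groups named by the ablation setting; freeze everything else."""
    prefixes = {
        "clf": ("clf_",),
        "fc+clf": ("fc", "clf_"),
        "conv+fc+clf": ("conv", "fc", "clf_"),
    }[setting]
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)

model = FRCN2Head()
set_trainable(model, "fc+clf")   # the best setting in the ablation (68.0% mAP)
```
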
mAP (%) on PASCAL VOC 2007 Test Set
  • If one-vs-rest classification replaces the original classification entirely, mAP becomes much worse, only 46.1%.
  • If cascade classification is used, mAP is improved to 68.0%.

4. Results

4.1. PASCAL VOC 2007 & 2012

mAP (%) on PASCAL VOC 2007 and 2012
  • FRCN: Fast R-CNN.
  • RPN_un: Faster R-CNN with unshared CNNs between proposal network and classifier network.
  • RPN: Faster R-CNN.
  • CRAFT: with only the cascade proposal network, it is better than RPN_un on VOC 2007 but worse than RPN. With the cascade classifier network as well, it is better than Faster R-CNN on both VOC 2007 and VOC 2012.
CRAFT on PASCAL VOC 2007 Test Set

4.2. ILSVRC Object Detection Task

Recall Rate (%) on ILSVRC val2 Set
  • 0.6 NMS: a stricter NMS with a 0.6 IoU threshold, better than the basic one (see the NMS sketch after this list).
  • re-score: re-scoring each proposal by considering the scores from both stages of the cascade structure also helps.
  • +DeepBox: fusing DeepBox proposals with the RPN proposals as input to the FRCN net boosts the recall rate to over 94%, which is better than +SS.
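
For reference, a minimal NumPy sketch of greedy NMS with the 0.6 IoU threshold from the table, plus one plausible way to re-score a proposal from the two cascade stages; the geometric-mean combination is an assumption here, since the review does not spell out the exact rule:

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.6):
    """Greedy NMS over [x1, y1, x2, y2] boxes; iou_thr=0.6 matches the setting in the table."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]            # drop proposals overlapping the kept one too much
    return np.array(keep, dtype=np.int64)

def rescore(rpn_scores, frcn_scores):
    """Combine the scores from the two cascade stages (geometric mean; one plausible choice)."""
    return np.sqrt(rpn_scores * frcn_scores)
```
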
mAP (%) on ILSVRC val2 Set
  • Here, GoogLeNet model with batch normalization is used.
  • ILSVRC 2013 train + 2014 train + val1 are used as the training set.
  • With the cascade proposal network, 47.0% mAP is achieved, which already surpasses the ensemble results of previous state-of-the-art systems such as Superpixel Labeling and DeepID-Net.
  • With the cascade classifier network as well, 48.5% mAP is achieved, an additional 1.5% absolute gain.

With the cascade structure applied to both the region proposal network and the classifier network, detection accuracy is improved.
