Understanding Fast-RCNN for Object Detection

The Fast-RCNN paper highlights the drawbacks of SPPNet and R-CNN, and builds a relatively fast and accurate model

Review of Fast-RCNN

Bern, Switzerland – Image by author

The Fast-RCNN model was built by overcoming the drawbacks of SPPNet and R-CNN. I have written about both, and you should have a look at them before proceeding:

Understanding Regions with CNN features (R-CNN)

Understanding SPPNet for Object Detection and Classification

This blog has the same structure as the two above, i.e., a conversation between a student and a teacher.


Teacher

We have previously seen R-CNN and SPPNet. Though these models have performed very well, there are some drawbacks to each of them. The following are the drawbacks common to both architectures:

  • Multi-stage training: A classification model is first trained on ImageNet, and these pre-trained weights are then fine-tuned on the detection dataset. After fine-tuning, the softmax layer is replaced by one-vs-rest SVM classifiers for the object detection task (trained on hard-mined data). Performance is further boosted by training a bounding box regressor on the features of the last pooling layer. This is a multi-stage pipeline, and training happens step by step.
  • High Space and Time Complexity: After fine-tuning the network, and before training the SVMs and the bounding box regressor, the features are cached to disk to avoid repeated computation. Generating these features takes a lot of time, and storing them takes hundreds of gigabytes.

Drawbacks peculiar to SPPNet:

  • Inefficient fine-tuning of the convolutional layers: Unlike R-CNN, the SPP layer makes it difficult to update the weights of the convolutional layers that precede it. Skipping the fine-tuning of these layers hinders the performance of the model.

All of the above drawbacks are resolved in the Fast-RCNN paper. As the name suggests, it’s a relatively fast version of RCNN and makes use of some of the architectural details of SPPNet.

Student

I was so amazed by the in-depth analysis shown in the R-CNN paper and the novel SPP layer introduced in the SPPNet paper that I did not notice any of these drawbacks. Can you explain the model architecture used in Fast-RCNN?

Teacher

The author provides analysis of three models in the Fast-RCNN paper:

  • Small (S): CaffeNet model
  • VGG_CNN_M_1024 (M): Model similar to CaffeNet, but it’s wider
  • VGG16 (L): Very deep VGG-16

We will restrict our discussion to VGG-16 (pre-trained on ImageNet), which is the deepest of the three networks.

Fast-RCNN architecture – paper

The input image is sent through VGG-16 and processed up to the last convolutional layer (the final pooling layer is dropped). The resulting feature map, together with the region proposals, is then sent to the novel Region of Interest (RoI) pooling layer. This pooling layer always outputs a 7 x 7 grid per channel for each proposed region; it is produced by max-pooling with a window size that adapts to the size of the region. Conveniently, the flattened feature from this map (7 x 7 x 512 = 25088 values) is exactly the size expected by the pre-trained FC-6 layer of VGG-16. The last 1000-way softmax layer is replaced with a 21-way softmax (unlike the SVMs used in R-CNN and SPPNet). Also, the bounding box regressor branches off from the last FC-7 layer instead of from the convolutional feature maps.

Note: The RoI pooling layer is just a special case of the SPP layer in which only one pyramid level is used, in this case 7 x 7. The calculation of each sub-window size and stride is taken from the SPPNet paper.
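To make the adaptive window concrete, here is a minimal NumPy sketch of RoI max pooling. The function name and the simple floor/ceil binning are my own illustration; the actual sub-window arithmetic follows the SPPNet paper:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=7):
    """Pool one RoI on a (C, H, W) feature map down to (C, output_size, output_size).

    feature_map: conv5 features for the whole image
    roi: (x1, y1, x2, y2) in feature-map coordinates
    """
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    c, h, w = region.shape
    out = np.zeros((c, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        # Bin edges scale with the RoI size, so any RoI yields a 7 x 7 grid.
        y_lo = int(np.floor(i * h / output_size))
        y_hi = int(np.ceil((i + 1) * h / output_size))
        for j in range(output_size):
            x_lo = int(np.floor(j * w / output_size))
            x_hi = int(np.ceil((j + 1) * w / output_size))
            out[:, i, j] = region[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out
```

Whatever the size of the region, each output cell max-pools over its own sub-window, which is what lets a single pre-trained FC-6 layer consume proposals of arbitrary shape.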

Student

One of the drawbacks you stated above is that the SPP layer cannot back-propagate efficiently. Can you please explain how this problem is tackled? Also, how did this architecture solve the other issues, like space and time complexity?

Teacher

Back-propagation through the SPP layer becomes inefficient when the region proposals in a mini-batch come from many different images, because each RoI can have a very large receptive field, often spanning the entire image. The paper therefore proposes a more efficient fine-tuning method that samples mini-batches hierarchically: they use N = 2 images per mini-batch and sample R = 128 RoIs in total, i.e., 64 RoIs per image, so RoIs from the same image share computation. Of these RoIs, 25% are foreground: proposals with an IoU of at least 0.5 with a ground-truth bounding box. Proposals whose maximum IoU with the ground truth lies in the interval [0.1, 0.5) are treated as background. The author claims that:

The lower threshold of 0.1 appears to act as a heuristic for hard example mining
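As a rough sketch of that sampling scheme (the function name is mine, and the IoU values are assumed to be precomputed; the paper's actual sampler also handles corner cases such as too few foreground proposals):

```python
import random

def sample_rois(proposals, ious, rois_per_image=64, fg_fraction=0.25):
    """Split proposals into foreground/background by their max IoU with the ground truth.

    proposals: candidate boxes for one image
    ious: max IoU of each proposal with any ground-truth box
    """
    fg = [p for p, iou in zip(proposals, ious) if iou >= 0.5]
    # IoU in [0.1, 0.5) counts as background; the 0.1 floor skips obvious
    # negatives, acting as a heuristic for hard example mining.
    bg = [p for p, iou in zip(proposals, ious) if 0.1 <= iou < 0.5]
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = rois_per_image - n_fg
    return random.sample(fg, n_fg) + random.sample(bg, min(n_bg, len(bg)))

# A mini-batch uses N = 2 images, giving R = 2 * 64 = 128 RoIs in total.
```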

Note: Back-propagation through the RoI pooling layer has a similar implementation to that of any normal max-pooling layer; the author has just described it mathematically. For a simpler understanding, refer to this answer.

Since we have seen how back-propagation happens in this network, let's also see how the issue of multi-stage training was handled. Instead of training the stages separately, the author trained the bounding box regressor and the softmax layer together, using what the paper names the multi-task loss function.

Multi-task loss function – Image by author

In the above image:

  • class prediction (p): a discrete probability distribution per RoI, p = (p0, p1, p2, …, pk), over k + 1 classes, where class 0 is the background
  • class label (u): the correct class of the RoI
  • weight to each loss (λ): set to 1 in all of the paper's experiments
  • Iverson bracket function [u ≥ 1]: evaluates to 1 when the correct class is not the background, and 0 otherwise
  • predicted bounding box label (t): t = (tx, ty, tw, th) is the predicted bounding-box tuple for the correct class u of the RoI
  • ground truth bounding box label (v): v = (vx, vy, vw, vh) is the corresponding ground-truth bounding box for the correct class u
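Putting these pieces together, the loss as defined in the paper is:

```latex
L(p, u, t^u, v) = L_{\text{cls}}(p, u) + \lambda\,[u \ge 1]\,L_{\text{loc}}(t^u, v)

L_{\text{cls}}(p, u) = -\log p_u

L_{\text{loc}}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L_1}(t_i^u - v_i),
\qquad
\text{smooth}_{L_1}(x) =
\begin{cases}
0.5x^2 & \text{if } |x| < 1 \\
|x| - 0.5 & \text{otherwise}
\end{cases}
```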

In words, the cross-entropy loss trains the new 21-way softmax layer, and the smooth L1 loss trains the dense layer with 84 regression units (4 coordinates x 21 classes) that handles the localization of bounding boxes. The sum of these two losses is used to fine-tune the rest of the network, and this happens jointly with the training of the new softmax and regression layers.
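A minimal PyTorch-style sketch of that combined loss (the tensor shapes and names are my own illustration; the 84 regression outputs are assumed to be reshaped to one 4-tuple per class):

```python
import torch
import torch.nn.functional as F

def multi_task_loss(class_scores, box_deltas, labels, targets, lam=1.0):
    """class_scores: (R, 21) raw scores; box_deltas: (R, 21, 4) per-class offsets.

    labels: (R,) correct class u per RoI (0 = background)
    targets: (R, 4) ground-truth regression targets v
    """
    cls_loss = F.cross_entropy(class_scores, labels)
    fg = labels > 0                        # Iverson bracket [u >= 1]
    # Select the predicted box t^u for the correct class of each foreground RoI.
    t_u = box_deltas[fg, labels[fg]]       # (n_fg, 4)
    loc_loss = F.smooth_l1_loss(t_u, targets[fg], reduction='sum') / labels.numel()
    return cls_loss + lam * loc_loss
```

Because background RoIs contribute nothing to the localization term, the regressor is only ever trained on proposals that actually contain an object.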

To confirm that this new method of training does not hinder performance, the paper shows the following analysis.

Multi-task training analysis from the paper, edited by author

The yellow boxes represent training without a bounding box regressor at train or test time, whereas the red boxes show the results after using the multi-task loss function for training and fine-tuning. The remaining two columns are self-explanatory from the image. The key point to note is that results improve when this multi-task loss is used.

Student

The loss function they use does improve the mAP on Pascal VOC 2007. So, is it because of this single-stage, multi-task training that caching the features is avoided, saving us the time of generating them and writing them to disk?

Teacher

Yes, that observation is correct. Moreover, the author provides some more information about increasing speed. It was observed that applying truncated Singular Value Decomposition (SVD) to the FC layers splits each of them into two smaller matrix multiplications and reduces computation time.

Singular Value Decomposition: factorizing a matrix into three matrices, where a diagonal matrix is sandwiched between two orthogonal matrices. The diagonal entries (the singular values) indicate how much variance lies along each axis and appear in descending order. Hence, selecting the top t singular values means keeping the components that contribute the most to the output, as they carry the highest variance. (If you are not familiar with the intuition behind SVD, have a look at the series of videos on linear algebra and then check this video.)
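Here is a minimal NumPy sketch of compressing one FC weight matrix this way (the function and variable names are mine; the paper applies this to FC-6 and FC-7 after training):

```python
import numpy as np

def compress_fc(W, t):
    """Replace one FC weight matrix W (in x out) by two thinner factors.

    y = x @ W becomes y = (x @ W1) @ W2, with W1 (in x t) and W2 (t x out).
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = U[:, :t] * S[:t]   # (in, t): U_t with the singular values folded in
    W2 = Vt[:t, :]          # (t, out)
    return W1, W2

# Small demo; the paper applies this to FC-6 (25088 x 4096, t = 1024)
# and FC-7 (4096 x 4096, t = 256).
W = np.random.randn(512, 256).astype(np.float32)
W1, W2 = compress_fc(W, t=64)
x = np.random.randn(1, 512).astype(np.float32)
y = (x @ W1) @ W2           # approximates x @ W with fewer multiply-adds
```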

SVD timing analysis from the paper

Though the model is already faster than R-CNN and SPPNet, using truncated SVD reduces the detection time further with only a minimal drop in mAP. For the above image, the top 1024 singular values were kept from the 25088 x 4096 FC-6 weight matrix, and the top 256 from the 4096 x 4096 FC-7 matrix.
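A quick parameter count makes the saving concrete:

```latex
\text{FC-6: } 25088 \times 4096 \approx 102.8\text{M}
\;\rightarrow\;
25088 \times 1024 + 1024 \times 4096 \approx 29.9\text{M}
\quad (\approx 3.4\times \text{ fewer multiply-adds})

\text{FC-7: } 4096 \times 4096 \approx 16.8\text{M}
\;\rightarrow\;
4096 \times 256 + 256 \times 4096 \approx 2.1\text{M}
\quad (\approx 8\times \text{ fewer})
```

The image below shows how the model compares with other models in terms of speed.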

Time comparison with other models – paper

The above image can be summarized as follows:

  • Fast-RCNN trains 9 times faster and predicts 213 times faster than R-CNN.
  • Fast-RCNN also trains 3 times faster and predicts 10 times faster than SPPNet, while also improving mAP.

Student

Does the paper provide any further analysis of the architecture?

Teacher

With the sampling strategy (and its hard-mining heuristic) described a little while ago, back-propagation becomes efficient to implement. The paper also explores which layers of the deep VGG-16 network should be fine-tuned, as described below.

The author found that, for VGG-16, fine-tuning all layers from conv3_1 onwards significantly impacted the mAP. Fine-tuning from conv2_1 slowed down training, and fine-tuning from conv1_1 exceeded GPU memory. Still, the results after training the conv layers show a huge jump in mAP (from 61.4% to 66.9%) compared to keeping them frozen. Hence, fine-tuning the convolutional layers is essential, and their immutability was a major drawback of SPPNet. All the results discussed previously are fine-tuned from conv3_1.
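As an illustration, freezing everything below conv3_1 in a torchvision VGG-16 might look like this. Note that the index 10 for conv3_1 follows torchvision's layer ordering, which is my assumption about the tooling, not something specified in the paper:

```python
import torchvision

# Load an ImageNet pre-trained VGG-16; in torchvision's vgg16().features,
# index 10 is conv3_1 (indices 0-9 cover the conv1_*/conv2_* blocks
# with their ReLU and max-pooling layers).
vgg = torchvision.models.vgg16(weights='IMAGENET1K_V1')

for layer in vgg.features[:10]:
    for param in layer.parameters():
        param.requires_grad = False   # conv1 and conv2 blocks stay frozen
```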

The author has compared the model with contemporary models, and Fast-RCNN outperformed them. I have shown the results for Pascal VOC 2010; the results for Pascal VOC 2007 and VOC 2012 can be seen in the paper.

Results on Pascal VOC 2010 – paper

Some other observations shared in the paper are:

  • The author provides an analysis of the effect of varying the number of proposals input to the network. It was observed that increasing the number of region proposals does not necessarily increase the mAP.
  • The author also tried training and testing in a multi-scale setting, where the rules of training remain the same as in SPPNet (each region proposal is assigned the scale at which it comes closest to 224 x 224 pixels). The same set of scales as SPPNet was used, but with the longest side capped at 2000 px. It was observed that, though there is a small increase in accuracy, single-scale processing offers the best trade-off between speed and accuracy.

The most important details from the paper have been discussed, but reading the paper itself is always recommended. Before reading the Fast-RCNN paper, make sure to read the R-CNN paper followed by SPPNet.


References

R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, CVPR, 2014

K. He, X. Zhang, S. Ren, J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, ECCV, 2014

R. Girshick, Fast R-CNN, ICCV, 2015

