
Faster R-CNN: A Step towards Real-Time Object Detection

Faster R-CNN introduces a novel Region Proposal Network which makes it feasible to achieve real-time performance with the existing…

Review of Faster-RCNN

Photo by XPS on Unsplash

In this post, we will review Faster-RCNN, a model built by replacing Selective Search in Fast-RCNN with a novel Region Proposal Network that reuses convolutional features for object detection. This post explains the design and intuition of the Region Proposal Network and then discusses the resulting improvements over previous works. Background knowledge of R-CNN, SPP-Net, and Fast-RCNN is recommended.

Also, the blog is structured as a conversation between a student and a teacher (trying to learn with the Feynman technique 😃). The student's questions are highlighted in bold (just in case you are in a rush).


Student

Till now, we have looked at R-CNN, SPPNet, and Fast-RCNN. Each of these introduced novel ideas and improved both the speed and accuracy of prediction. But on examining them in detail, I found that most of the changes were confined to the deep learning model's pooling layer or to the training process. All of these papers still rely on the same Selective Search algorithm to hint to the deep learning model where objects might be. Is Selective Search the state-of-the-art method for proposing object locations, or are there algorithms that improve on it?

Teacher

A method like Selective Search (SS) takes around 2 seconds on a CPU to propose object boxes. Other methods like Edge Boxes (EB) are relatively faster, taking around 0.2 seconds on a CPU, but degrade accuracy. One of the major contributions of the Faster-RCNN paper is an object proposal network called the Region Proposal Network (RPN). This novel network only proposes regions and sends those proposals on for object detection.

The object detection model they use is similar to Fast-RCNN: the region proposals from the novel RPN are mapped from the image to the feature map and sent through the RoI pooling layer to the fully connected layers. In our discussion, we will call this network the detection network; the authors simply call it Fast-RCNN.

For now, we will discuss the architecture of the RPN, using VGG-16 pre-trained on ImageNet. The feature map from the unpooled conv5_3 layer serves as the image features for the RPN. A sliding window of size n × n (Faster-RCNN uses n = 3) is passed over this feature map to extract features. These extracted features are then fed to two sibling fully connected layers, implemented as convolutions. One of them performs classification, predicting the presence of an object (it does not classify objects, only detects their presence), and the other performs bounding-box regression. Intuitively, this network tells us whether the center of the 3 × 3 window contains an object, and if it does, the regression layer gives us the center, height, and width of the object w.r.t. the input image.

Building block for RPN (Incomplete representation) – Image by Author
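To make this concrete, here is a minimal PyTorch sketch of this (still anchor-free) head; the module name, channel sizes, and the test input are my own illustration, not the authors' code:

```python
import torch
import torch.nn as nn

class NaiveRPNHead(nn.Module):
    """3 x 3 sliding window plus two sibling 1 x 1 convs, one box per location."""
    def __init__(self, in_channels=512):  # 512 channels in VGG-16 conv5_3
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2, kernel_size=1)  # object vs. background scores
        self.reg = nn.Conv2d(512, 4, kernel_size=1)  # center, height, width offsets

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)

# A 600 x 1000 image gives roughly a 37 x 62 conv5_3 map (VGG-16 stride is 16)
scores, boxes = NaiveRPNHead()(torch.randn(1, 512, 37, 62))
```

Once anchors enter the picture below, the two sibling layers simply grow to 2k and 4k output channels.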

Student

I like the idea of this network, but one question arises in my mind. If many objects in the image have close centers but different sizes, the model will fail to predict them, because when we map centers from the image to the feature map, many neighboring pixels share the same receptive field in the output feature map. Do the authors address this issue?

Teacher

A naive solution to this problem would be to use a pyramid of images, but this would require a lot of computation. Another approach could be to use sliding windows of different sizes, n = 3, 5, etc. (a pyramid of features), but again this leads to repetitive computation and is expensive. The authors came up with the idea of anchor boxes to solve the problem you just highlighted.

The paper proposes k anchor boxes with aspect ratios 1:1, 2:1, and 1:2. To detect objects of different scales, they vary the scale of the anchor boxes so that their areas are 128², 256², and 512². With 3 scales and 3 aspect ratios, we get k = 9 anchor boxes. Let's understand the changes made to the layers described above and how these anchor boxes are used. First, the authors change the RPN's classification layer from a 2-class layer (object and background) to a 2k-unit layer, with 2 outputs per anchor box. Similarly, the 4-unit regression layer becomes a 4k-unit layer (4 units per anchor box).

Region Proposal Network (RPN) – from paper
  • Each of the two units per anchor box gives the object probability and the background probability.
  • The regression layer predicts the displacement of the object's center and the change in its height and width; these offsets are defined w.r.t. the anchor box's center, height, and width.
  • Thus, anchor boxes of different scales and aspect ratios make it possible to detect multiple objects at the same location; the sketch below generates the nine shapes used.
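As a quick sanity check on the geometry, here is a small Python snippet of my own, not from the paper, that generates the nine anchor shapes; each has area scale² and the stated height-to-width ratio:

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Return the k = 9 anchor (width, height) pairs:
    area = scale**2 and height / width = ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            # w * h = s**2 and h / w = r  =>  w = s / sqrt(r), h = s * sqrt(r)
            anchors.append((s / np.sqrt(r), s * np.sqrt(r)))
    return np.array(anchors)  # shape (9, 2)

print(make_anchors().round(1))  # (128, 128), (90.5, 181), (181, 90.5), ...
```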

Student

Since we are only changing the scale and aspect ratio of the anchors, how is this different from a pyramid of features? Also, if an object is detected in a smaller-scale anchor, it is also contained in the larger-scale anchor at the same location; which anchor is marked as positive and which as negative?

Teacher

Let me first answer the question about labeling the anchors.

  • The anchor with the highest Intersection over Union (IoU) with a ground-truth box is labeled as positive
  • Anchors with IoU ≥ 0.7 with any ground-truth box are labeled as positive
  • An anchor is labeled as negative if its maximum IoU with every ground-truth box is below 0.3

Almost every time, the second condition suffices, giving us more than one positive anchor box per object. But the authors note that in some cases no anchor meets the 0.7 criterion, and then the first condition is used. Because of this, the same object is sometimes assigned multiple anchor boxes. The authors also apply Non-max Suppression (NMS) on the classification scores with an IoU threshold of 0.7, but more on this later. Let's first have a look at the training.
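A compact sketch of this labeling scheme (my own code; it ignores the paper's additional step of discarding anchors that cross the image boundary):

```python
import numpy as np

def label_anchors(ious):
    """ious: (num_anchors, num_gt) IoU matrix between anchors and ground truths.
    Returns per-anchor labels: 1 = positive, 0 = negative, -1 = ignored."""
    labels = -np.ones(ious.shape[0], dtype=int)
    max_iou = ious.max(axis=1)
    labels[max_iou < 0.3] = 0        # negatives: low IoU with every ground-truth box
    labels[max_iou >= 0.7] = 1       # condition 2: high-IoU positives
    labels[ious.argmax(axis=0)] = 1  # condition 1: best anchor for each ground truth
    return labels
```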

Loss function for RPN – from paper

As we can see, the overall loss function is the sum of a normalized classification loss and a normalized regression loss, with the regression term weighted by a parameter lambda. The paper states that the mini-batch size is 256, hence N_cls = 256, while N_reg ≈ 2400 (the number of anchor locations). Thus, setting lambda = 10 gives roughly equal weight to both losses. Let's see what the p and t terms in the function are; the loss is computed over each anchor i of a particular image.
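Written out in the notation of the figure, the overall loss is:

L({pᵢ}, {tᵢ}) = (1/N_cls) Σᵢ L_cls(pᵢ, pᵢ*) + λ (1/N_reg) Σᵢ pᵢ* L_reg(tᵢ, tᵢ*)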

  • pᵢ: the prediction of the classification layer for anchor i.
  • pᵢ*: the ground-truth label for that anchor, assigned by the labeling method just discussed: 1 if an object is present and 0 otherwise.
  • The classification loss is simply the log-loss between these two quantities.
  • The regression loss is applied only when the ground-truth label for the anchor is positive, i.e., pᵢ* = 1.
  • For the regression loss, the authors use a different formulation, shown below.
Loss calculation for center – from paper

In the above image, x*, xₐ, and x represent the ground-truth, anchor, and predicted center x-coordinates (similarly for y, h, and w). We can see that tx and ty represent the displacement of the center as predicted by the model, while tx* and ty* represent the displacement of the center w.r.t. the ground truth. For the height and width, th and tw represent the predicted values, and th* and tw* the ground-truth targets. It is to the difference between these numbers that the smooth-L1 loss is applied.

Loss calculation for height and width – from paper

Note: The authors use a linear scale for training the center offsets and log-space for training the height and width.
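Putting the note above into code, here is a sketch (helper names are my own) of how the regression targets and the smooth-L1 loss are computed for one box, with boxes given as (center x, center y, width, height):

```python
import numpy as np

def encode(gt, anchor):
    """Targets t* = (tx*, ty*, tw*, th*) following the formulas above."""
    tx = (gt[0] - anchor[0]) / anchor[2]  # linear scale for the center...
    ty = (gt[1] - anchor[1]) / anchor[3]
    tw = np.log(gt[2] / anchor[2])        # ...log-space for width and height
    th = np.log(gt[3] / anchor[3])
    return np.array([tx, ty, tw, th])

def smooth_l1(x):
    """Smooth-L1, applied elementwise to t - t* (see the Fast-RCNN paper)."""
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

t_star = encode(gt=[55, 60, 110, 200], anchor=[50, 50, 100, 180])
t_pred = np.zeros(4)  # stand-in for the regression layer's output
loss = smooth_l1(t_pred - t_star).sum()
```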

Regarding the training of the model, they use a mini-batch of 256 anchors with an equal number of positive and negative samples; when an image has fewer than 128 positives, they pad the batch with extra negatives. The authors fine-tune all layers from conv3 onward and also train the newly added convolutional layers described above.
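That sampling rule is simple enough to sketch directly (assuming labels produced by the scheme above, and enough negatives to fill the batch):

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, rng=np.random.default_rng(0)):
    """Pick up to batch_size/2 positive anchors; pad the rest with negatives."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), batch_size // 2)
    pos = rng.choice(pos, n_pos, replace=False)
    neg = rng.choice(neg, batch_size - n_pos, replace=False)
    return pos, neg
```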

Now, the answer to your question about how this approach differs from a pyramid of features lies in the training. With a pyramid of features, back-propagation flows into the convolutional network once for every feature scale. With anchor boxes, back-propagation reaches the convolutional layers only through the single 3 × 3 window.

Student

The RPN seems to be a pretty robust way to hint at the presence of an object in an image. But for it to be worthwhile, this new region proposal network should be faster than Selective Search and should produce fewer than 2000 region proposals; if these criteria are not met, Selective Search would remain the more appropriate algorithm for proposing regions. Can you explain how this novel RPN satisfies these two conditions?

Teacher

As I said earlier, this new region proposal network only replaces the Selective Search algorithm in the Fast-RCNN pipeline, keeping the detection model the same. However, there is one thing about the training that I wish to bring to your notice: the authors initially train two VGG networks and then merge them so that a single one is shared. Here are the steps they followed:

  • Step 1: Train the Region Proposal Network (RPN) by fine-tuning one VGG-16 model from the conv3 layer onward and training the newly added anchor-based layers.
  • Step 2: Train the detection network. The RPN trained in Step 1 proposes regions over the feature map, which are sent to the RoI pooling layer (refer to Fast-RCNN) and then to the FC layers. However, the VGG-16 used in this step is a second, separate network; all of its layers after conv3 are fine-tuned.
  • Step 3: The VGG network trained in Step 2 now also backs the RPN: its convolutional layers are frozen, and only the newly added RPN layers are fine-tuned.
  • Step 4: With this fine-tuned RPN providing proposals, the FC layers of the detection network are fine-tuned once more.

Note: The reason for training only the layers after conv3 is discussed in the Fast-RCNN paper.

The obvious question is: why such a complex training method? The authors tried various other training schemes and found that this one produced good results with quick convergence. I would suggest having a look at the other methods the authors describe as well.

Now, about the speed of the algorithm: since the convolutional computation is shared between the RPN and the detection network, the extra time taken by the RPN is minimal. Here is what the authors say about the speed of detection.

Our system with VGG-16 takes in total 198ms for both proposal and detection

Compared to the previous pipelines, where Selective Search alone takes around 1–2 seconds, this is a considerable improvement in the speed of the whole detection system, let alone the region proposal step.

During training and testing, the authors re-scale the shorter side of the image to 600px. For a typical 1000×600 input image, this yields roughly 20,000 anchors, and hence candidate proposals. These are reduced by Non-max Suppression (NMS) based on their objectness scores, with an IoU threshold of 0.7. We can further control the number of region proposals by keeping only the highest-scoring ones after NMS. Hence, NMS gives us a way to control the number of region proposals fed to the detection network.
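For completeness, here is a plain-NumPy sketch of that greedy NMS with top-N selection (the standard algorithm, not the authors' code; boxes are given as (x1, y1, x2, y2) corners):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_n=300):
    """boxes: (N, 4) corner coordinates; scores: (N,) objectness scores."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = scores.argsort()[::-1]  # proposals sorted by descending objectness
    keep = []
    while order.size and len(keep) < top_n:
        i, rest = order[0], order[1:]
        keep.append(i)
        # intersection of the kept box with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]  # drop near-duplicates of the kept box
    return keep
```

I hope I have answered your questions!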

Student

The RPN seems to have worked beyond expectations in terms of speed. Can you shed some light on the performance of the model on the Pascal VOC dataset?

Teacher

Till now, our discussion has revolved only around the VGG-16 network, but the authors also experimented with ZF-Net. Most of the training and implementation details remain the same; for more about ZF-Net, have a look at its paper. Comparing the same model trained with Selective Search and Edge Boxes proposals against one trained with RPN proposals, we see an improvement in mAP on Pascal VOC 2007 with the RPN.

The authors took the comparison a step further: they trained the network using SS proposals and tested it using RPN proposals. The model's mAP dropped only by a small amount with ZF-Net and even improved with VGG-16, affirming that the RPN provides rich region proposals. They also removed the classification and the regression layer of the RPN alternately and observed a drop in mAP, confirming that both layers are essential to the RPN's performance. The words "shared" and "unshared" in the image below refer to the training procedure: shared means the model was trained through Step 4, unshared means only through Step 2. One thing to note is that shared training produces much better results than using two separate networks.

Comparing results on Pascal VOC with Selective Search – from paper

Note: At test time, Non-Max Suppression was used to control the number of RPN region proposals, followed by selecting the top-N ranked proposals. In the no-NMS case, the top 6000 ranked proposals were used.

Pascal VOC 2007 results – from paper

The authors report detection results on Pascal VOC 2007 and 2012, but I have shown only 2007 here. The paper also discusses results on the COCO dataset, and the authors additionally try replacing VGG with ResNet and explain the outcome. They also vary the number of anchor boxes by changing the aspect ratios and scales, and report these results in the paper. I would recommend reading the paper, as it discusses several more observations.


References

S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS, 2015

R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Computer Vision and Pattern Recognition, 2014

K. He, X. Zhang, S. Ren, J. Sun, Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, ECCV, 2014

R. Girshick, Fast R-CNN, ICCV, 2015

