
Object Detection Explained: R-CNN

Region-based Convolutional Neural Network

Matt Artz via Unsplash

Object detection combines two tasks: classification and localization. R-CNN stands for Region-based Convolutional Neural Network. The key concept behind the R-CNN family is region proposals, which are used to localize objects within an image. In this series of posts, I will write about the different approaches and architectures used in object detection, and I am happy to start the journey with R-CNN based object detectors.

Working Details

RCNN: Working Details. Source: https://arxiv.org/pdf/1311.2524.pdf.

As the image above shows, before passing an image through the network we need to extract region proposals (regions of interest) using an algorithm such as selective search. We then resize (warp) every extracted crop to a fixed size and pass it through the network.

Finally, the network assigns each crop one of C + 1 categories, where the extra category is the ‘background’ label. It also predicts offsets (deltas) for the crop's centre coordinates, width, and height, which are used to refine the box.
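The crop-and-warp step described above can be sketched in a few lines of NumPy. This is only an illustration, not the code from my notebook: the function name `warp_crop`, the nearest-neighbour sampling, and the 224 x 224 output size are assumptions made for the example, and the proposals are hard-coded rather than produced by selective search.

```python
import numpy as np

def warp_crop(image, box, size=224):
    """Crop box = (x1, y1, x2, y2) from an H x W x C image and warp it to size x size.

    The aspect ratio is NOT preserved, which is what "warping" means here.
    Nearest-neighbour sampling is an assumption to keep the sketch dependency-free.
    """
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Map every output row/column back to a source row/column of the crop.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return crop[rows[:, None], cols]

# Hypothetical image and proposals, for illustration only.
image = np.zeros((100, 200, 3), dtype=np.uint8)
proposals = [(10, 20, 60, 90), (0, 0, 200, 100)]

# Every proposal becomes one fixed-size input for the network.
batch = np.stack([warp_crop(image, p) for p in proposals])
print(batch.shape)  # (2, 224, 224, 3)
```

Because every crop ends up with the same shape, the proposals of one image can be stacked into a single batch, no matter how different the original boxes were.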

Extract region proposals

Selective Search is a region proposal algorithm used for object localization. It starts from an over-segmentation of the image and hierarchically merges neighbouring regions that are similar, using cues such as colour and texture. In the original paper, the authors extract about 2,000 proposals per image.

Positive vs. negative examples

After we extract the region proposals, we also have to label them for training. The authors assign each proposal that has an IoU of at least 0.5 with a ground-truth bounding box the class of that box. Proposals whose IoU with every ground-truth box is below 0.3 are labelled as background. The remaining proposals, with an IoU between 0.3 and 0.5, are simply ignored.
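To make the labelling rule concrete, here is a minimal sketch in plain Python. The names `iou`, `label_proposal`, `BACKGROUND`, and `IGNORE` are hypothetical and chosen for this example. Boxes are assumed to be (x1, y1, x2, y2) tuples, and class 0 is assumed to be the background label.

```python
BACKGROUND = 0   # assumed index of the background class
IGNORE = -1      # marker for proposals that are left out of training

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_proposal(proposal, gt_boxes, gt_classes, pos_thr=0.5, neg_thr=0.3):
    """Return a class id, BACKGROUND, or IGNORE for a single proposal."""
    if not gt_boxes:
        return BACKGROUND
    overlaps = [iou(proposal, g) for g in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
    if overlaps[best] >= pos_thr:
        return gt_classes[best]   # positive: take the class of the best match
    if overlaps[best] < neg_thr:
        return BACKGROUND         # negative example
    return IGNORE                 # in between: not used for training

# One hypothetical ground-truth balloon (class 1) and three proposals.
gt_boxes, gt_classes = [(0, 0, 10, 10)], [1]
print(label_proposal((0, 0, 10, 8), gt_boxes, gt_classes))      # 1  (IoU = 0.8)
print(label_proposal((0, 0, 10, 4), gt_boxes, gt_classes))      # -1 (IoU = 0.4)
print(label_proposal((20, 20, 30, 30), gt_boxes, gt_classes))   # 0  (IoU = 0.0)
```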

Bounding-box regression

Bounding-box regression. Source: https://arxiv.org/pdf/1311.2524.pdf.

The image above shows the deltas that the CNN has to predict. Here, x and y are the centre coordinates, whereas w and h are the width and height, respectively. G and P stand for the ground-truth bounding box and the region proposal, respectively. It is important to note that the bounding-box loss is only calculated for positive samples.
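The regression targets from the paper are t_x = (G_x - P_x) / P_w, t_y = (G_y - P_y) / P_h, t_w = log(G_w / P_w), and t_h = log(G_h / P_h). The sketch below computes them and inverts them again. The helper names `to_center`, `encode`, and `decode` are my own for this example, and boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (centre x, centre y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h

def encode(P, G):
    """Targets (t_x, t_y, t_w, t_h) that move proposal P onto ground truth G."""
    px, py, pw, ph = to_center(P)
    gx, gy, gw, gh = to_center(G)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(P, t):
    """Apply predicted deltas t to proposal P and return the refined box."""
    px, py, pw, ph = to_center(P)
    cx, cy = px + pw * t[0], py + ph * t[1]
    w, h = pw * math.exp(t[2]), ph * math.exp(t[3])
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Hypothetical proposal and ground-truth box.
P, G = (0, 0, 10, 10), (2, 2, 12, 12)
t = encode(P, G)      # shift of 0.2 proposal widths/heights, no rescaling
refined = decode(P, t)  # recovers G up to floating-point error
```

Dividing the centre offsets by the proposal size and taking the logarithm of the size ratios keeps the targets on a similar scale for small and large proposals.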

Loss

The total loss is the sum of the classification and regression losses, with the regression term scaled by a coefficient lambda. The original paper uses lambda = 1,000, although there it is the regularization weight of the ridge regression that fits the box regressors. Note that the regression loss is ignored for negative examples.
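A minimal NumPy sketch of such a combined loss is shown below. The article does not fix the individual loss functions, so softmax cross-entropy for classification and mean squared error for regression are assumptions here, as are the function name `rcnn_loss` and the convention that label 0 means background.

```python
import numpy as np

def rcnn_loss(class_logits, box_deltas, labels, box_targets, lam=1000.0):
    """Classification loss plus lam times the regression loss on positives.

    class_logits: (N, C + 1) raw scores, column 0 assumed to be background
    box_deltas:   (N, 4) predicted (t_x, t_y, t_w, t_h)
    labels:       (N,) integer class ids, 0 = background
    box_targets:  (N, 4) regression targets (unused for background rows)
    """
    # Numerically stable log-softmax, then cross-entropy (assumed loss).
    shifted = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    cls_loss = -log_probs[np.arange(len(labels)), labels].mean()

    # Regression loss only for positive samples; negatives are masked out.
    positive = labels > 0
    if positive.any():
        reg_loss = ((box_deltas[positive] - box_targets[positive]) ** 2).mean()
    else:
        reg_loss = 0.0
    return cls_loss + lam * reg_loss

# Two background proposals: the (large) box error does not affect the loss.
logits = np.zeros((2, 3))
loss = rcnn_loss(logits, np.zeros((2, 4)), np.array([0, 0]), np.full((2, 4), 5.0))
```

With a large lambda the regression term dominates as soon as a batch contains positives, so in practice the coefficient is worth tuning for the dataset at hand.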

Architecture

Typically, we pass the resized crops through a backbone such as VGG-16 or ResNet-50 to extract features. The features are then passed through fully connected layers that output the predictions.
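To show the shapes involved, here is a rough NumPy sketch of the two prediction heads. It is not the model from my notebook: the backbone is replaced by placeholder features, the class `DetectionHeads` is hypothetical, each head is a single fully connected layer with random weights, and the box head is assumed to be class-agnostic (4 outputs per crop). The feature size of 4,096 matches the last hidden fully connected layer of VGG-16.

```python
import numpy as np

class DetectionHeads:
    """Two linear heads on top of backbone features (illustrative only)."""

    def __init__(self, feature_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Classification head: C object classes plus one background class.
        self.w_cls = rng.normal(0.0, 0.01, (feature_dim, num_classes + 1))
        self.b_cls = np.zeros(num_classes + 1)
        # Regression head: deltas (t_x, t_y, t_w, t_h) for each crop.
        self.w_box = rng.normal(0.0, 0.01, (feature_dim, 4))
        self.b_box = np.zeros(4)

    def __call__(self, features):
        class_logits = features @ self.w_cls + self.b_cls
        box_deltas = features @ self.w_box + self.b_box
        return class_logits, box_deltas

# Placeholder features for 5 crops; a real backbone would produce these.
features = np.ones((5, 4096))
heads = DetectionHeads(feature_dim=4096, num_classes=1)  # e.g. one "balloon" class
class_logits, box_deltas = heads(features)
print(class_logits.shape, box_deltas.shape)  # (5, 2) (5, 4)
```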

If you want to see the full code, you can find a Jupyter Notebook on my GitHub.

Some Last Words

Balloon Dataset

I trained the model for only 5 epochs, so, as you can see, it detects only some of the balloons in the image. R-CNN has several drawbacks, which explain why it is not used anymore. The largest is the selective search algorithm used for proposal extraction: because it runs on the CPU, inference is slow. Additionally, every proposal has to be resized and passed through the network separately, which adds further overhead. In upcoming posts, I am going to write about the algorithms that were introduced to overcome these problems.

Paper

Rich feature hierarchies for accurate object detection and semantic segmentation
