Object detection consists of two separate tasks that are classification and localization. R-CNN stands for Region-based Convolutional Neural Network. The key concept behind the R-CNN series is region proposals. Region proposals are used to localize objects within an image. In the following blogs, I decided to write about different approaches and architectures used in Object Detection. Therefore, I am happy to start this journey with R-CNN based object detectors.
Working Details

As can be seen in the image above before passing an image through a network, we need to extract region proposals or regions of interest using an algorithm such as selective search. Then, we need to resize (wrap) all the extracted crops and pass them through a network.
Finally, a network assigns a category from C + 1, including the ‘background’ label, categories for a given crop. Additionally, it predicts delta Xs and Ys to shape a given crop.
Extract region proposals
Selective Search is a region proposal algorithm used for object localization that groups regions together based on their pixel intensities. So, it groups pixels based on the hierarchical grouping of similar pixels. In the original paper, the authors extract about 2,000 proposals.
Positive vs. negative examples
After we extract our region proposal, we also have to label them for training. Therefore, the authors label all the proposals having IOU of at least 0.5 with any of the ground-truth bounding boxes with their corresponding classes. However, all other region proposals that have an IOU of less than 0.3 are labelled as background. Thus, the rest of them are simply ignored.
Bounding-box regression

The image above shows deltas that are to be predicted by CNN. So, x, y are centre coordinates. whereas w, h are width and height respectively. Finally, G and P stand for ground-truth bounding box and region proposal respectively. It is important to note that the bounding box loss is only calculated for positive samples.
Loss
The total loss is calculated as the sum of classification and regression losses. However, there is a coefficient lambda for the latter one, which is 1,000 in the original paper. Note that the regression loss is ignored for negative examples.
Architecture
Typically, we pass the resized crops through VGG 16 or ResNet 50 in order to get features. They are subsequently passed through fully connected layers that output predictions.
if you want to see the full code, you can easily find a Jupiter Notebook on my GitHub.
Some Last Words

I trained it for only 5 epochs, so as you can see it is able to detect some of the balloons in the image. There are several drawbacks to why it is not used anymore. The largest disadvantage is the selective search algorithm used for proposal extraction. Considering that the algorithm is executed on cpu, the inference time becomes slow. Additionally, all the proposals have to be resized and passed through the network, which also adds an overhead. Therefore, I am going to write about other algorithms that were introduced to overcome these problems.
Paper
Rich feature hierarchies for accurate object detection and semantic segmentation
Related Articles
R-CNN, Fast R-CNN, Faster R-CNN, YOLO – Object Detection Algorithms