Detecting Objects in (almost) Real-time: FasterRCNN Explained with Code

Mike Liao
Towards Data Science
8 min read · Mar 1, 2018


FasterRCNN is a network that does object detection. As its name suggests, it’s faster than its predecessors, RCNN and FastRCNN. How fast? Almost real-time fast. This network has use cases in self-driving cars, manufacturing, and security, and is even used at Pinterest.

The paper came out in 2015, and there are many great posts explaining how it works (examples: #1, #2, #3). So I wanted to explore what it was doing with the public implementation provided by this repo: https://github.com/jwyang/faster-rcnn.pytorch

Let’s learn about FasterRCNN through an example. Let’s look at a given image X.

The Slower Process, using FastRCNN, would go like this:
1) Use an algorithm like Selective Search on the image to generate interesting boxes/regions
2) Run the image through a CNN to get a Feature Map
3) For each of the boxes generated in step 1, use the Feature Map and several fully connected layers to output a class + bounding box coordinates (will expand more later)

How FasterRCNN works:
1) Run the image through a CNN to get a Feature Map
2) Run the Activation Map through a separate network, called the Region Proposal Network(RPN), that outputs interesting boxes/regions
3) For the interesting boxes/regions from the RPN, use several fully connected layers to output a class + bounding box coordinates

The difference here is that FasterRCNN removes the bottleneck of having to run Selective Search on each image as the first step. Most implementations of Selective Search run on the CPU (slow), and it’s essentially a separate algorithm, so let’s get rid of it!

The intuition is this: with FastRCNN we’re already computing an Activation Map in the CNN, so why not run the Activation Map through a few more layers to find the interesting regions, and then finish off the forward pass by predicting the classes + bbox coordinates?
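Here is a rough sketch of that forward pass in Python. Every name below is illustrative rather than the repo’s actual API; the real code follows in the rest of the article.

```python
# A high-level sketch of the FasterRCNN forward pass described above.
# All function names are illustrative, not the repo's exact API.
def faster_rcnn_forward(image, rcnn_base, rpn, roi_pool, rcnn_top, output_heads):
    feature_map = rcnn_base(image)          # step 1: shared CNN features
    rois = rpn(feature_map)                 # step 2: RPN proposes interesting regions
    pooled = roi_pool(feature_map, rois)    # fixed-size features for each region
    fc_features = rcnn_top(pooled)          # step 3: fully connected layers
    class_scores, bbox_coords = output_heads(fc_features)
    return rois, class_scores, bbox_coords
```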

Okay, with that, let’s look at some code.

The codebase implements FasterRCNN with both ResNet101 and VGG16. I’ll explain with VGG16 because of the architecture’s simplicity. The first step is to define the network as two pieces, RCNN_base and RCNN_top. RCNN_base does step 1: extract the features from the image. RCNN_top is the rest of the network, which uses the extracted features to classify and predict boxes.

Get all the VGG layers except the last one
RCNN_base is the first part of the VGG
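The code screenshots aren’t reproduced here, but the repo roughly carves torchvision’s VGG16 into those two pieces along these lines (a sketch, not the exact source):

```python
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(pretrained=True)

# RCNN_base: every VGG16 conv layer except the final max-pool, so the output
# is a 512-channel feature map at stride 16 (step 1: feature extraction).
RCNN_base = nn.Sequential(*list(vgg.features.children())[:-1])

# RCNN_top: VGG16's fully connected classifier minus its final 1000-way
# ImageNet layer, leaving the two 4096-d fc layers used later on.
RCNN_top = nn.Sequential(*list(vgg.classifier.children())[:-1])
```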

Now that we have the base feature map, we pass it to the Region Proposal Network, which is supposed to find interesting regions. It returns regions of interest, a loss for whether it found an object, and a loss for the location of the object.

Okay, so what is going on with the RPN?

Before we talk about this, we have to define anchors, which are boxes of different sizes that help detect objects of various scales (e.g. humans, vehicles, etc.). In the default configuration, there are 3 scales and 3 aspect ratios, which make for 9 total anchors. At each spot in our feature map, we’re going to evaluate these 9 anchors (more on that later).
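As a rough illustration (not the repo’s exact anchor code), generating those 9 anchors could look like this:

```python
import numpy as np

# A minimal sketch of anchor generation: 3 scales x 3 aspect ratios = 9 boxes,
# all centered on one feature-map cell. base_size=16 matches the feature stride.
def make_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    anchors = []
    for ratio in ratios:                      # ratio = height / width
        for scale in scales:
            size = base_size * scale          # 128, 256, 512 pixels on a side
            w = size / np.sqrt(ratio)
            h = size * np.sqrt(ratio)
            # (x1, y1, x2, y2) centered at the origin.
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

print(make_anchors().shape)  # (9, 4)
```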

Okay let’s get back to figuring out what RPN does.

The first layer defined in rpn.py takes the base feature map and, whatever its depth, maps it to a fixed depth of 512 using a 3x3 convolution.

The result is then sent to two different convolution layers. Let’s call them the Class Layer and the BBox Coord Layer:
1) The Class Layer predicts 2 class probabilities (object present / no object) for each anchor
2) The BBox Coord Layer predicts 4 coordinates of a bounding box relative to each anchor

Code for Layer 1 above
Code for Layer 2 above
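Those screenshots boil down to three small convolutions. Here is a sketch (the layer names follow rpn.py, but treat the details as an approximation):

```python
import torch.nn as nn

din = 512            # depth of the incoming base feature map (VGG16)
num_anchors = 9      # 3 scales x 3 ratios

# 3x3 conv that mixes in spatial context and fixes the depth at 512.
RPN_Conv = nn.Conv2d(din, 512, kernel_size=3, stride=1, padding=1)

# "Class Layer": 2 scores (object / no object) per anchor, per location.
RPN_cls_score = nn.Conv2d(512, num_anchors * 2, kernel_size=1, stride=1)

# "BBox Coord Layer": 4 regression coefficients per anchor, per location.
RPN_bbox_pred = nn.Conv2d(512, num_anchors * 4, kernel_size=1, stride=1)
```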

Say a given image is 600 x 891. By the time it’s done with RCNN_base, it’ll have shape batch size x 512 x 38 x 56 with the VGG16 base (batch size x 1024 x 38 x 56 with ResNet101). The first RPN conv then brings it to batch size x 512 x 38 x 56 either way.

As you can see above, for each spot in our 38 x 56 feature map:
1) The result of the Class Conv Layer outputs 18 values for each spot. This makes sense because the 18 values correspond to 2 class probabilities for (present / missing) * 9 anchors. The result is batch size x 18 x 38 x 56

2) The result of the BBox Layer outputs 36 values for each spot. This makes sense because the 36 values correspond to 4 bounding box coefficients * 9 anchors. The result is batch size x 36 x 38 x 56
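A quick shape check with hypothetical tensors confirms those numbers:

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 512, 38, 56)                     # after the 3x3 RPN conv + ReLU
cls_scores = nn.Conv2d(512, 18, kernel_size=1)(feat)   # 2 classes * 9 anchors
bbox_coeffs = nn.Conv2d(512, 36, kernel_size=1)(feat)  # 4 coefficients * 9 anchors

print(cls_scores.shape)   # torch.Size([1, 18, 38, 56])
print(bbox_coeffs.shape)  # torch.Size([1, 36, 38, 56])
```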

Note for Class Probabilities:
The class probabilities are between 0 and 1 because a softmax function is applied

Note for BBox Coefficients:
The bounding box outputs are not absolute coordinates; they are coefficients (t_x, t_y, t_w, t_h) defined relative to a specific anchor. x, y, w, and h denote a box’s center coordinates and its width and height, and the starred values (t_x*, etc.) are computed from the ground truth. For example, t_x is the coefficient for x (the center of the box): multiply t_x by w_a and add x_a to get the predicted x. The same idea applies to y, while the width and height use log-space coefficients, so you multiply w_a by exp(t_w) to get the predicted width.
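To make the parameterization concrete, here is a small sketch of decoding one anchor’s coefficients back into a box (the function and variable names are illustrative):

```python
import numpy as np

def decode(anchor, deltas):
    # anchor: (x1, y1, x2, y2); deltas: (t_x, t_y, t_w, t_h) predicted by the RPN.
    x1, y1, x2, y2 = anchor
    w_a, h_a = x2 - x1, y2 - y1
    x_a, y_a = x1 + 0.5 * w_a, y1 + 0.5 * h_a

    t_x, t_y, t_w, t_h = deltas
    x = t_x * w_a + x_a        # predicted center x
    y = t_y * h_a + y_a        # predicted center y
    w = w_a * np.exp(t_w)      # width and height use log-space coefficients
    h = h_a * np.exp(t_h)
    return [x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h]

print(decode([0, 0, 16, 16], [0.1, 0.0, 0.2, 0.0]))
```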

So the RPN calculates the class probabilities and bbox coefficients. Wasn’t it supposed to find regions of interest?

Right, okay. If you have a 38 x 56 feature map (the image sampled at a stride of 16) with 9 anchors at each spot, that’s 38 x 56 x 9 = 19,152 proposals for a single image. That’s a lot! Obviously we can’t keep all of them; there are probably only a few interesting things in an image.

Thus, the RPN sorts the proposals by objectness probability to find the most confident ones. Because we have so many proposals, there are bound to be several that recognize the same object. We apply non-max suppression: keep the most confident proposal and remove everything else that overlaps it with an IOU > 0.7.
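For reference, this step can be reproduced with torchvision’s built-in NMS (the repo ships its own implementation); the boxes and scores below are made up:

```python
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],    # (x1, y1, x2, y2)
                      [12., 12., 102., 102.],    # heavily overlaps the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])           # objectness from the Class Layer

# Keep the most confident box and drop any box that overlaps it with IoU > 0.7.
keep = nms(boxes, scores, iou_threshold=0.7)
print(keep)   # tensor([0, 2]) -- the second box is suppressed
```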

Non-max suppression in action (image: https://www.pyimagesearch.com/wp-content/uploads/2014/10/hog_object_detection_nms.jpg)

Great. So the RPN produces the proposals; how does the rest of the network evaluate them?

Once the RPN produces the regions of interest, these ROIs might be differently sized. We’re planning to feed them into more layers that expect a fixed size, so we have to do Region of Interest Pooling.

By the end of pooling we want to turn each region of the 38 x 56 x 512 feature map into a 7 x 7 x 512 block. There are three approaches to doing this: ROI Crop, ROI Align, and ROI Pool. In the original paper, they used ROI Pooling, which divides each region into a 7 x 7 grid and takes the max value in each cell.

However, as explained in this video, ROI Pooling loses some information due to uneven division. The repo’s actual default implementation is ROI Align. To read more about ROI Align, check out the Mask-RCNN paper, which uses it for the even harder job of detecting objects by labeling their pixels.
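For illustration, here is the same idea using torchvision’s built-in op (the repo has its own ROI layers, so this is a sketch rather than its code):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 512, 38, 56)                  # base feature map (N x C x H x W)
rois = torch.tensor([[0., 32., 32., 224., 224.]])   # (batch_idx, x1, y1, x2, y2) in image coords

# spatial_scale = 1/16 maps image coordinates onto the stride-16 feature map;
# every ROI comes out as a fixed 7 x 7 x 512 block.
pooled = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([1, 512, 7, 7])
```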

Okay, now that we have the 7x7 feature map, called pooled_feat, we pass it to the RCNN_top we defined earlier! As seen below, all the _head_to_tail function does is flatten the 7 x 7 x 512 map into a single vector and pass it to RCNN_top.

For reference, this is what RCNN_top looks like. It’s two fc layers. The first layer takes in 25088 features because our feature map’s shape is 7 x 7 x 512 = 25088.

RCNN top
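The screenshot isn’t reproduced here; a rough sketch of what RCNN_top and _head_to_tail do (dropout omitted for brevity):

```python
import torch
import torch.nn as nn

RCNN_top = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096),   # 25088 -> 4096
    nn.ReLU(inplace=True),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
)

def head_to_tail(pooled_feat):
    # Flatten each 7 x 7 x 512 region into a 25088-d vector, then run RCNN_top.
    flat = pooled_feat.view(pooled_feat.size(0), -1)
    return RCNN_top(flat)

print(head_to_tail(torch.randn(3, 512, 7, 7)).shape)   # torch.Size([3, 4096])
```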

Final Step!

We take the 4096 features and run them through two separate fully connected layers to get the class scores and the bounding box predictions.
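A sketch of those two heads (the layer names and class count are illustrative):

```python
import torch.nn as nn

num_classes = 21   # e.g. 20 Pascal VOC classes + background

RCNN_cls_score = nn.Linear(4096, num_classes)       # class scores per ROI
RCNN_bbox_pred = nn.Linear(4096, num_classes * 4)   # per-class box coefficients per ROI
```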

Training the Network

Faster RCNN is composed of two different networks: the Region Proposal Network, which makes the proposals, and the evaluation network, which takes the proposals and predicts classes/bboxes. These two networks have different objectives, so you have to train them a bit differently.

For the Region Proposal Network, you want to teach it to make better proposals. In the code there is an anchor target class that takes the ground truth boxes and produces the corresponding class labels/bbox coefficients. If an anchor’s IOU overlap with a ground truth box is over 0.7, the anchor gets a “1” (object) label and its bbox targets are the coefficients computed from that ground truth. If the overlap is less than 0.3, it’s a negative example; anchors in between are ignored during training.
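A minimal sketch of that labeling rule (not the repo’s anchor target layer):

```python
import numpy as np

# labels: 1 = positive (object), 0 = negative (background), -1 = ignored in training.
def label_anchors(max_iou_per_anchor, pos_thresh=0.7, neg_thresh=0.3):
    labels = np.full(len(max_iou_per_anchor), -1)
    labels[max_iou_per_anchor >= pos_thresh] = 1
    labels[max_iou_per_anchor < neg_thresh] = 0
    return labels

print(label_anchors(np.array([0.85, 0.5, 0.1])))   # [ 1 -1  0]
```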

For classification we use cross-entropy loss, which measures the performance of a classification model whose output is a probability between 0 and 1.

For bbox regression we use a smooth L1 loss, which behaves like the absolute difference between the prediction and the ground truth for large errors and like a squared difference for small ones. The reasoning is that L1 is less sensitive to outliers than losses like L2, which square the error.
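PyTorch ships this loss as nn.SmoothL1Loss; a tiny example with made-up numbers:

```python
import torch
import torch.nn as nn

loss_fn = nn.SmoothL1Loss()

pred = torch.tensor([0.5, 2.0])      # hypothetical predicted bbox coefficients
target = torch.tensor([0.4, 0.0])    # hypothetical ground-truth coefficients

# Quadratic for the small 0.1 error, linear (L1-like) for the large 2.0 error.
print(loss_fn(pred, target))
```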

Once the RPN loss functions are defined, we define two more loss functions for the evaluation layer. The logic is exactly the same, except the cross-entropy loss is over multiple classes (“cat”, “dog”, “turkey”, …) instead of “present / not present”. The bbox regression is the same.

In PyTorch, you can define one loss as the sum of multiple losses, so here we combine the two losses for the RPN and the two losses for the evaluation layer. Then you can optimize both parts at the same time.
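A sketch of the combined objective, with made-up scalar tensors standing in for the four loss terms:

```python
import torch

rpn_loss_cls = torch.tensor(0.7, requires_grad=True)     # RPN: object / no object
rpn_loss_box = torch.tensor(0.3, requires_grad=True)     # RPN: bbox regression
RCNN_loss_cls = torch.tensor(1.1, requires_grad=True)    # evaluation layer: classes
RCNN_loss_bbox = torch.tensor(0.4, requires_grad=True)   # evaluation layer: bboxes

# One combined loss means a single backward pass trains both parts jointly.
loss = rpn_loss_cls + rpn_loss_box + RCNN_loss_cls + RCNN_loss_bbox
loss.backward()
```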

In the paper, however, they use a training scheme that alternates between training the RPN and the evaluation layer, but that’s an optional detail.

That’s all for FasterRCNN. Thanks for taking the time to read my article!
