Faster R-CNN (object detection) implemented with Keras for custom data from Google’s Open Images Dataset V4
Introduction
After exploring CNNs for a while, I decided to try another crucial area in Computer Vision: object detection. Several methods are popular in this area, including Faster R-CNN, RetinaNet, YOLOv3 and SSD. I tried Faster R-CNN in this article. Here, I want to summarise what I have learned and maybe give you a little inspiration if you are interested in this topic.
The original Keras implementation of Faster R-CNN that I used was written by yhenon (see his GitHub repository). He used the PASCAL VOC 2007, 2012 and MS COCO datasets. I instead extracted just three classes, “Person”, “Car” and “Mobile phone”, from Google’s Open Images Dataset V4. I changed his configs to fit my dataset and removed unused code. Btw, to run this on Google Colab (free GPU computing for up to 12 hours), I compressed all the code into three .ipynb notebooks. Sorry for the messy structure.
To start with, I assume you know the basics of CNNs and what object detection is. The original paper is “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. If you want to train Faster R-CNN on custom data from Google’s Open Images Dataset V4, keep reading.
I read many articles explaining topics related to Faster R-CNN, and their authors understand and explain it better than I can. Btw, if you already know the details of Faster R-CNN and are more curious about the code, you can skip the part below and jump directly to the code explanation part. This is my GitHub link for this project.
Recommendation for reading:
Faster R-CNN: Down the rabbit hole of modern object detection
Faster R-CNN (Brief explanation)
R-CNN (R. Girshick et al., 2014) is the first step towards Faster R-CNN. It uses selective search (J.R.R. Uijlings et al., 2012) to find the regions of interest and passes them to a ConvNet. Selective search tries to find the areas that might contain an object by combining similar pixels and textures into several rectangular boxes. The R-CNN paper uses 2,000 proposed areas (rectangular boxes) from selective search. Then, these 2,000 areas are passed to a pre-trained CNN model. Finally, the outputs (feature maps) are passed to an SVM for classification, and a regression between the predicted bounding boxes (bboxes) and the ground-truth bboxes is computed.
Fast R-CNN (R. Girshick, 2015) moves one step forward. Instead of running the CNN 2,000 times, once per proposed area, it passes the original image through a pre-trained CNN model only once. The regions proposed by selective search are then projected onto the output feature map of that step. Next, an ROI pooling layer ensures a standard, pre-defined output size. These outputs are passed to fully connected layers as inputs. Finally, two output vectors are used to predict the observed object with a softmax classifier and to adjust the bounding box localisation with a linear regressor.
Faster R-CNN (frcnn for short) goes further than Fast R-CNN. The selective search process is replaced by a Region Proposal Network (RPN). As the name suggests, the RPN is a network that proposes regions. For instance, if the input image has 600x800x3 dimensions, the output feature map from a pre-trained model (VGG-16) would have 37x50x512 dimensions.
Each point in 37x50 is considered as an anchor. We need to define specific ratios and sizes for each anchor (1:1, 1:2, 2:1 for three ratios and 128², 256², 512² for three sizes in the original image).
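To make the ratios and sizes concrete, here is a minimal sketch that enumerates the nine anchor shapes. The function name `anchor_shapes` is my own, and I assume each anchor keeps an area of size² while the ratio only redistributes it between width and height; the actual code may scale the sides differently.

```python
import math

def anchor_shapes(sizes=(128, 256, 512), ratios=((1, 1), (1, 2), (2, 1))):
    """Return one (width, height) pair per size/ratio combination.

    Assumption: each anchor keeps an area of size**2, so the ratio only
    changes how that area is split between width and height.
    """
    shapes = []
    for size in sizes:
        for ratio_w, ratio_h in ratios:
            width = size * math.sqrt(ratio_w / ratio_h)
            height = size * math.sqrt(ratio_h / ratio_w)
            shapes.append((width, height))
    return shapes

shapes = anchor_shapes()
print(len(shapes))  # 3 sizes x 3 ratios
for width, height in shapes:
    print(round(width), round(height))
```

Every point of the feature map gets this same set of shapes, centred on the corresponding location in the original image.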
Next, the feature map is fed to the RPN’s first Conv layer, which has 3x3 filters, padding 1 and 512 output channels. Its output is connected to two 1x1 convolutional layers for classification and box regression (note that the classification here only determines whether the box is an object or not).
Javier: For training, we take all the anchors and put them into two different categories. Those that overlap a ground-truth object with an Intersection over Union (IoU) bigger than 0.5 are considered “foreground” and those that don’t overlap any ground truth object or have less than 0.1 IoU with ground-truth objects are considered “background”.
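The Intersection over Union in this rule is simple to compute. The sketch below is a plain-Python illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the helper name `iou` is mine.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped to 0 when the boxes do not overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half of each box overlaps
```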
In this case, every anchor has 3x3 = 9 corresponding boxes in the original image, which means there are 37x50x9 = 16650 boxes in the original image. We choose just 256 of these 16650 boxes as a mini-batch, which contains 128 foregrounds (pos) and 128 backgrounds (neg). At the same time, non-maximum suppression is applied to reduce the overlap among the proposed regions.
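The mini-batch selection can be sketched as below. This is an illustration rather than the notebook code: `sample_minibatch` is a hypothetical helper, and I assume that when fewer than 128 foreground boxes exist, the remaining slots are filled with backgrounds (which is how the paper describes it).

```python
import random

def sample_minibatch(pos_idx, neg_idx, batch_size=256, seed=None):
    """Pick up to batch_size anchor indices, aiming for a 1:1 pos/neg split."""
    rng = random.Random(seed)
    half = batch_size // 2
    pos = rng.sample(pos_idx, min(len(pos_idx), half))
    # Assumption: pad with negatives when there are too few positives
    neg = rng.sample(neg_idx, min(len(neg_idx), batch_size - len(pos)))
    return pos, neg

num_boxes = 37 * 50 * 9            # 16650 boxes for a 600x800 image
pos_idx = list(range(40))          # pretend only 40 boxes are foreground
neg_idx = list(range(40, num_boxes))
pos, neg = sample_minibatch(pos_idx, neg_idx, seed=0)
print(len(pos), len(neg))
```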
The RPN is finished after going through the above steps. Then we go to the second stage of frcnn. Similar to Fast R-CNN, ROI pooling is applied to these proposed regions (ROIs). The output is 7x7x512. Then, we flatten this output and pass it through some fully connected layers. The final step is a softmax function for classification and a linear regression to refine the boxes’ locations.
Code explanation
Part 1: Extract annotation for custom classes from Google’s Open Images Dataset v4 (Bounding Boxes)
Download and load three .csv files
On the official website, you can download class-descriptions-boxable.csv by clicking the red box named Class Names at the bottom of the image below. Then go to Download from Figure Eight and download the other two files.
On the Figure Eight website, I downloaded train-annotations-bbox.csv and train-images-boxable.csv, as in the image below.
After downloading them, let’s look at what’s inside these files. train-images-boxable.csv contains the boxable image names and their URLs. class-descriptions-boxable.csv maps each class LabelName to its class name. train-annotations-bbox.csv has more information. Each row in train-annotations-bbox.csv contains the coordinates of one bounding box (bbox for short) for one image, together with this bbox’s LabelName and the image’s ID (ImageID+’.jpg’=Image_name). XMin, YMin is the top left point of this bbox and XMax, YMax is the bottom right point. Please note that these coordinate values are normalised and must be converted to real (pixel) coordinates if needed.
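Converting the normalised values back to pixel coordinates only needs the image’s width and height. A minimal sketch, where `to_pixel_box` is a hypothetical helper and rounding to the nearest pixel is my own choice:

```python
def to_pixel_box(xmin, ymin, xmax, ymax, img_width, img_height):
    """Scale normalised [0, 1] bbox coordinates to pixel coordinates."""
    x1 = round(xmin * img_width)
    y1 = round(ymin * img_height)
    x2 = round(xmax * img_width)
    y2 = round(ymax * img_height)
    return x1, y1, x2, y2

# A box covering the middle of an 800x600 (width x height) image
print(to_pixel_box(0.25, 0.125, 0.75, 0.5, img_width=800, img_height=600))
```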
Get the subset of the whole dataset
The whole Open Images Dataset V4, which contains 600 classes, is too large for me. So I extracted 1,000 images for each of three classes: ‘Person’, ‘Mobile phone’ and ‘Car’.
After downloading these 3,000 images, I saved the useful annotation info in a .txt file. Each row has the format file_path,x1,y1,x2,y2,class_name (no spaces, just a comma between two values), where file_path is the absolute file path of the image, (x1,y1) and (x2,y2) are the top left and bottom right real coordinates in the original image, and class_name is the class name of the current bounding box. I used 80% of the images for training and 20% for testing. The expected numbers of training and testing images are 3x800 = 2400 and 3x200 = 600. However, some images may appear in two or three classes simultaneously. For instance, an image might show a person walking on a street with several cars in it. In the end, the number of bboxes is 7236 for the training images and 1931 for the testing images.
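As an illustration of this row format, the sketch below writes and reads one annotation line. Both helper names and the example path are made up for the example. Splitting from the right is my own precaution so that a comma inside the file path would not break parsing.

```python
def format_annotation(file_path, box, class_name):
    """Build one row: file_path,x1,y1,x2,y2,class_name (no spaces)."""
    x1, y1, x2, y2 = box
    return f"{file_path},{x1},{y1},{x2},{y2},{class_name}"

def parse_annotation(line):
    """Inverse of format_annotation; the last five fields never contain commas."""
    file_path, x1, y1, x2, y2, class_name = line.strip().rsplit(",", 5)
    return file_path, (int(x1), int(y1), int(x2), int(y2)), class_name

# Hypothetical path, for illustration only
row = format_annotation("/data/train/example.jpg", (200, 75, 600, 300), "Mobile phone")
print(row)
print(parse_annotation(row))
```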
Part 2: Faster R-CNN code
I will explain some of the main functions in the code. The complete comments for each function are written in the .ipynb notebooks. Note that I resize images to 300 for faster training, instead of the 600 used in the brief explanation above.
Rebuild the structure of VGG-16 and load pre-trained model (nn_base)
Prepare training data and training labels (get_anchor_gt)
The input data comes from the annotation.txt file, which contains a bunch of images with their bounding box information. We need to use the RPN method to create proposed bboxes.
- Arguments in this function
all_img_data: list(filepath, width, height, list(bboxes))
C: config
img_length_calc_function: function to calculate final layer’s feature map (of base model) size according to input image size
mode: ‘train’ or ‘test’; ‘train’ mode needs augmentation
- Returns value in this function
x_img: image data after resizing and scaling (smallest size = 300px)
Y: [y_rpn_cls, y_rpn_regr]
img_data_aug: augmented image data (original image with augmentation)
debug_img: show image for debug
num_pos: show number of positive anchors for debug
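The two size calculations behind these arguments can be sketched as follows. The function names are illustrative, and the stride of 16 is an assumption that the VGG-16 base halves the resolution four times; it matches the 18x25 feature map quoted for a 300x400 input.

```python
def get_new_img_size(width, height, img_min_side=300):
    """Resize so that the smallest side equals img_min_side, keeping the aspect ratio."""
    if width <= height:
        scale = img_min_side / width
        return img_min_side, int(scale * height)
    scale = img_min_side / height
    return int(scale * width), img_min_side

def get_img_output_length(width, height, stride=16):
    """Feature-map (width, height) for a base network with the given total stride."""
    return width // stride, height // stride

resized = get_new_img_size(800, 600)       # original image: 800 wide, 600 high
print(resized)
print(get_img_output_length(*resized))
```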
Calculate rpn for each image (calc_rpn)
If the feature map has shape 18x25=450 and there are 9 anchors per point, there are 450x9=4050 potential anchors. The initial status of each anchor is ‘negative’. Then, we set the anchor to positive if the IOU is >0.7. If the IOU is >0.3 and <0.7, it is ambiguous and not included in the objective. One issue is that the RPN has many more negative than positive regions, so we turn off some of the negative regions. We also limit the total number of positive and negative regions to 256. y_is_box_valid represents whether this anchor is valid, i.e. counted in the objective. y_rpn_overlap represents whether this anchor overlaps with a ground-truth bounding box.
For ‘positive’ anchor, y_is_box_valid=1, y_rpn_overlap=1.
For ‘neutral’ anchor, y_is_box_valid=0, y_rpn_overlap=0.
For ‘negative’ anchor, y_is_box_valid=1, y_rpn_overlap=0.
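The three cases can be summarised in a few lines. This is a simplified sketch: `label_anchor` is a hypothetical helper that only looks at an anchor’s best IOU against the thresholds, and it ignores any extra rules the real calc_rpn may apply.

```python
def label_anchor(best_iou, max_overlap=0.7, min_overlap=0.3):
    """Return (status, y_is_box_valid, y_rpn_overlap) for one anchor."""
    if best_iou > max_overlap:
        return "positive", 1, 1
    if best_iou > min_overlap:
        # Ambiguous anchors are excluded from the objective
        return "neutral", 0, 0
    return "negative", 1, 0

for value in (0.85, 0.5, 0.1):
    print(value, label_anchor(value))
```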
- Arguments in this function
C: config
img_data: augmented image data
width: original image width (e.g. 600)
height: original image height (e.g. 800)
resized_width: resized image width according to C.im_size (e.g. 300)
resized_height: resized image height according to C.im_size (e.g. 400)
img_length_calc_function: function to calculate final layer’s feature map (of base model) size according to input image size
- Returns value in this function
y_rpn_cls: list(num_bboxes, y_is_box_valid + y_rpn_overlap)
y_is_box_valid: 0 or 1 (0 means the box is invalid, 1 means the box is valid)
y_rpn_overlap: 0 or 1 (0 means the box is not an object, 1 means the box is an object)
y_rpn_regr: list(num_bboxes, 4*y_rpn_overlap + y_rpn_regr)
y_rpn_regr: x1,y1,x2,y2 bounding boxes coordinates
The shape of y_rpn_cls is (1, 18, 25, 18). 18x25 is the feature map size. Each point in the feature map has 9 anchors, and each anchor has 2 values, for y_is_box_valid and y_rpn_overlap respectively. So the fourth dimension, 18, comes from 9x2.
The shape of y_rpn_regr is (1, 18, 25, 72). 18x25 is the feature map size. Each point in the feature map has 9 anchors, and each anchor has 4 values, for tx, ty, tw and th respectively. Note that each of these 4 values is paired with its own copy of y_rpn_overlap. So the fourth dimension, 72, comes from 9x4x2.
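For reference, tx, ty, tw and th can be computed from an anchor and its ground-truth box as sketched below, following the parameterisation in the Faster R-CNN paper. The helper names are my own, and any extra scaling the notebook code may apply to these values is left out.

```python
import math

def regression_targets(anchor, gt):
    """(tx, ty, tw, th) for an anchor and a ground-truth box, both (x1, y1, x2, y2)."""
    wa, ha = anchor[2] - anchor[0], anchor[3] - anchor[1]
    cxa, cya = anchor[0] + wa / 2, anchor[1] + ha / 2
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    cx, cy = gt[0] + w / 2, gt[1] + h / 2
    # Centre offsets are relative to the anchor size; sizes use a log ratio
    return (cx - cxa) / wa, (cy - cya) / ha, math.log(w / wa), math.log(h / ha)

def rpn_target_channels(num_anchors=9):
    """Channel counts of y_rpn_cls and y_rpn_regr."""
    cls_channels = num_anchors * 2       # y_is_box_valid + y_rpn_overlap
    regr_channels = num_anchors * 4 * 2  # 4 copies of y_rpn_overlap + (tx, ty, tw, th)
    return cls_channels, regr_channels

print(regression_targets((0, 0, 10, 10), (5, 5, 15, 15)))
print(rpn_target_channels())
```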
Calculate region of interest from RPN (rpn_to_roi)
- Arguments in this function (num_anchors = 9)
rpn_layer: output layer for rpn classification
shape (1, feature_map.height, feature_map.width, num_anchors)
Might be (1, 18, 25, 9) if resized image is 400 width and 300 height
regr_layer: output layer for rpn regression
shape (1, feature_map.height, feature_map.width, num_anchors*4)
Might be (1, 18, 25, 36) if resized image is 400 width and 300 height
C: config
use_regr: Whether to use bboxes regression in rpn
max_boxes: max bboxes number for non-max-suppression (NMS)
overlap_thresh: If iou in NMS is larger than this threshold, drop the box
- Returns value in this function
result: boxes from non-max-suppression (shape=(300, 4))
boxes: coordinates for bboxes (on the feature map)
From the 4050 anchors of the above step, we need to extract max_boxes (300 in the code) boxes as the regions of interest and pass them to the classifier layer (the second stage of frcnn). In the function, we first delete the boxes that extend beyond the original image. Then, we use non-max-suppression with a threshold value of 0.7.
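Non-max-suppression itself can be sketched in plain Python as below. This is a simple greedy version for illustration, not the vectorised code in the notebook, and both function names are mine. Boxes are assumed to be (x1, y1, x2, y2) with one objectness score each.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, overlap_thresh=0.7, max_boxes=300):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Drop a box if it overlaps too much with an already kept, higher-scoring box
        if all(iou(boxes[i], boxes[j]) <= overlap_thresh for j in keep):
            keep.append(i)
            if len(keep) == max_boxes:
                break
    return keep

boxes = [(0, 0, 10, 10), (0, 0, 10, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

In this example the second box has an IOU of about 0.91 with the first one, so it is dropped, while the third box does not overlap anything and is kept.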
RoIPooling layer and Classifier layer (RoiPoolingConv, classifier_layer)
The RoIPooling layer processes each roi into an output of a specific size by max pooling. Every input roi is divided into sub-cells, and we apply max pooling to each sub-cell. The number of sub-cells equals the dimension of the output shape.
The Classifier layer is the final layer of the whole model and sits just behind the RoIPooling layer. It’s used to predict the class name for each input roi and the regression of its bounding box.
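The sub-cell idea behind RoI pooling can be shown on a single-channel feature map. The sketch below is a pure-Python illustration of max pooling over sub-cells, not the Keras RoiPoolingConv layer. The name `roi_max_pool` and the way the cell borders are rounded are my own assumptions.

```python
def roi_max_pool(feature_map, roi, pool_size):
    """Max-pool one roi (x, y, w, h) of a 2D feature map into pool_size x pool_size cells."""
    x, y, w, h = roi
    pooled = []
    for py in range(pool_size):
        # Row range of this sub-cell; every cell covers at least one row
        y0 = y + (py * h) // pool_size
        y1 = max(y0 + 1, y + ((py + 1) * h) // pool_size)
        row = []
        for px in range(pool_size):
            x0 = x + (px * w) // pool_size
            x1 = max(x0 + 1, x + ((px + 1) * w) // pool_size)
            row.append(max(feature_map[r][c]
                           for r in range(y0, y1)
                           for c in range(x0, x1)))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(roi_max_pool(feature_map, roi=(0, 0, 4, 4), pool_size=2))
```

Whatever the size of the roi, the output always has pool_size x pool_size values, which is what lets the fully connected layers behind it use a fixed input size.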
- Arguments in this function
base_layers: vgg
input_rois: `(1,num_rois,4)` list of rois, with ordering (x,y,w,h)
num_rois: number of rois to be processed in one time (4 in here)
- Returns value in this function
list(out_class, out_regr)
out_class: classifier layer output
out_regr: regression layer output
First, the pooling layer is flattened.
Then, it’s followed by two fully connected layers with 0.5 dropout.
Finally, there are two output layers.
# out_class: softmax activation function for classifying the class name of the object
# out_regr: linear activation function for bboxes coordinates regression
Dataset
Again, my dataset is extracted from Google’s Open Images Dataset V4. Three classes for ‘Car’, ‘Person’ and ‘Mobile Phone’ are chosen. Every class contains around 1000 images. The number of bounding boxes for ‘Car’, ‘Mobile Phone’ and ‘Person’ is 2383, 1108 and 3745 respectively.
Parameters
- Resized (im_size) value is 300.
- The number of anchors is 9.
- Max number of non-max-suppression is 300.
- Number of RoI to process in the model is 4 (I haven’t tried a larger number, which might speed up the calculation but needs more memory)
- Adam is used for optimisation and the learning rate is 1e-5. It might work differently if we applied the original paper’s solution. They used a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset.
- For image augmentation, I turned on horizontal_flips, vertical_flips and 90-degree rotations.
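When an image is flipped, its bounding boxes have to be flipped with it. A minimal sketch for the two flips, assuming boxes are (x1, y1, x2, y2) in pixels; the helper names are mine and the rotation case is left out:

```python
def horizontal_flip_box(box, img_width):
    """Mirror a box left-right; x1 and x2 swap roles so that x1 < x2 still holds."""
    x1, y1, x2, y2 = box
    return img_width - x2, y1, img_width - x1, y2

def vertical_flip_box(box, img_height):
    """Mirror a box top-bottom; y1 and y2 swap roles so that y1 < y2 still holds."""
    x1, y1, x2, y2 = box
    return x1, img_height - y2, x2, img_height - y1

box = (10, 20, 50, 80)
print(horizontal_flip_box(box, img_width=100))
print(vertical_flip_box(box, img_height=200))
```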
Environment
Google’s Colab with Tesla K80 GPU acceleration for training.
Training time
The length of each epoch that I chose is 1000. Note that every batch processes only one image here. I trained for 114 epochs in total. Every epoch takes around 700 seconds in this environment, which means that the total training time is around 22 hours. If you are using Colab’s GPU like me, you need to reconnect to the server and reload the weights to continue training whenever it disconnects automatically, because every session has a time limit.
Result
We applied two loss functions to both the RPN model and the Classifier model. As mentioned before, the RPN model has two outputs. One classifies whether a box is an object and the other is the regression of the bounding boxes’ coordinates. From the figure below, we can see that it learned very fast during the first 20 epochs. Then, the loss decreased more slowly for the classifier layer while the loss for the regression layer kept going down. The reason might be that the accuracy for objectness is already high at the early stage of training, while the accuracy of the bounding boxes’ coordinates is still low and needs more time to learn.
A similar learning process is shown for the Classifier model. Comparing the two plots for bboxes’ regression, they show a similar tendency and even similar loss values. I think it’s because they are predicting quite similar values, with only a small difference in their layer structure. Comparing the two plots for classification, we can see that predicting objectness is easier than predicting the class name of a bbox.
The total loss is the sum of the four losses above. It has a decreasing tendency. However, the mAP (mean average precision) doesn’t increase as the loss decreases. The mAP is 0.15 at epoch 60, 0.19 at epoch 87 and 0.13 at epoch 114. I think this is because the small number of training images leads to overfitting of the model.
Other things we could tune
- For a shorter training process, I chose 300 as im_size for resizing images instead of 600 in the original code (and original paper). So I chose a smaller anchor_size of [64, 128, 256] instead of [128, 256, 512].
- I chose VGG-16 as my base model because it has a simpler structure. However, a model like ResNet-50 might give a better result because of its better performance on image classification.
- There are many thresholds in the model. I used most of them as the original code did. rpn_max_overlap=0.7 and rpn_min_overlap=0.3 are the bounds that separate ‘positive’, ‘neutral’ and ‘negative’ for each anchor. overlap_thresh=0.7 is the threshold for non-max-suppression.
Test on images
In the notebooks, I split the training process and the testing process into two parts. Please reset all runtimes as below before running the test .ipynb notebook. You may also need to close the training notebook when running the test notebook, because the memory usage is almost at the limit.
At Last
To have fun, you can create your own dataset with objects that are not included in Google’s Open Images Dataset V4 and train on it. The cover image of this article shows three porcelain monks made in China. I just named them according to their facial expressions (not sure about the sleepy one). They are not included in the Open Images Dataset V4, so I used RectLabel to annotate them myself. I spent around 3 hours dragging the ground-truth boxes for 6 classes across 465 images (including ‘Apple Pen’, ‘Lipbalm’, ‘Scissor’, ‘Sleepy Monk’, ‘Upset Monk’ and ‘Happy Monk’). For the anchor_scaling_size, I chose [32, 64, 128, 256] because the Lipbalm is usually small in the image. To find these small square lip balms, I added a smaller anchor size for a stronger model. Considering that the Apple Pen is long and thin, the anchor_ratio could use 1:3 and 3:1 or even 1:4 and 4:1, but I haven’t tried that. The training time was not long, and the performance was not bad. I guess it’s because of the relatively simple background and plain scene. Actually, I found that the harder part is not annotating this dataset but thinking about how to photograph the objects to make the dataset more robust.
Alright, that’s all for this article. Thanks for reading. This is my GitHub link for this project. Go ahead and train your own object detector. If you have any problems, please leave a comment. I’m glad to hear from you :)