Training Object Detection (YOLOv2) from scratch using Cyclic Learning Rates

Santosh GSK
Towards Data Science
6 min read · Mar 19, 2018


Object detection is the task of identifying all objects in an image along with their class labels and bounding boxes. It is a challenging computer vision task that has lately been taken over by deep learning algorithms like Faster-RCNN, SSD, and Yolo. This post focuses on the latest Yolo v2 algorithm, which is reported to be the fastest (approx. 90 FPS on low-resolution images when run on a Titan X) and more accurate than SSD and Faster-RCNN on a few datasets. I will discuss how Yolo v2 works and the steps to train it. If you would like to dig deeper into the landscape of object detection algorithms, you can refer here and here.

This post assumes that you have a basic understanding of convolutional layers, max pooling, and batch norm. If not, I would suggest getting a brief idea of these topics from the links attached.

Yolo v2: You Only Look Once

In the image shown below, we need to identify the bounding boxes for one instance each of Person, TV Monitor, and Bicycle.

As per the Yolo algorithm, we divide the input image into N x N (here 13x13) squares.

At each square, the Yolo network (discussed below) predicts 5 bounding boxes with different aspect ratios.

An example of 5 such boxes is shown for a square positioned at (7, 9) from the top left.

For each bounding box, the Yolo network predicts its central location within the square, the width and height of the box relative to the image width and height, and the confidence score of having any object in that box, along with the probabilities of that object belonging to each of the M classes.
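
To make the output structure concrete, here is a minimal sketch of how such predictions could be decoded in PyTorch, assuming M = 20 classes (PASCAL VOC), 5 boxes per square, and a box-major channel layout; the tensor names are illustrative, not from the actual implementation:

```python
import torch

N, B, M = 13, 5, 20                         # grid size, boxes per square, classes
out = torch.randn(1, B * (5 + M), N, N)     # raw network output: 125x13x13
out = out.view(1, B, 5 + M, N, N)           # split the channels per box

xy   = torch.sigmoid(out[:, :, 0:2])        # box center, relative to its square
wh   = out[:, :, 2:4].exp()                 # width/height, scaled by anchor priors
conf = torch.sigmoid(out[:, :, 4:5])        # confidence of containing any object
cls  = torch.softmax(out[:, :, 5:], dim=2)  # probabilities over the M classes
```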

However, not every bounding box would have an object. Given these predictions, to find the final bounding boxes, we need to do the following two steps:

  1. Remove the bounding boxes that have no object, i.e., those whose predicted confidence score falls below a threshold of 0.24.
  2. Among the bounding boxes that claim to have an object, remove redundant detections of the same object using Non-Max Suppression (NMS) based on Intersection over Union (IoU). A sketch of this post-processing follows below.
Predicted bounding boxes
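
Here is a minimal sketch of these two steps using torchvision's nms; the 0.24 confidence threshold is from above, while the 0.45 IoU threshold is an assumption (a common YOLO default):

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, conf_thresh=0.24, iou_thresh=0.45):
    # boxes: (K, 4) in (x1, y1, x2, y2) format; scores: (K,) confidence per box
    # Step 1: drop boxes whose confidence is below the threshold
    keep = scores > conf_thresh
    boxes, scores = boxes[keep], scores[keep]
    # Step 2: Non-Max Suppression removes overlapping duplicates by IoU
    idx = nms(boxes, scores, iou_thresh)
    return boxes[idx], scores[idx]
```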

YOLOv2 Network:

The above steps are the post-processing needed to get the final bounding boxes after the image is passed through the Yolo network. However, we haven't yet discussed how the network produces that output, so let us look at that now.

The architecture of YOLOv2 can be visualized here. The details of each block in the visualization can be seen by hovering over it. Each Convolution block has BatchNorm normalization followed by a Leaky ReLU activation, except for the last Convolution block.
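
In PyTorch terms, each such block could be sketched as follows (the padding scheme and the 0.1 leaky slope are assumptions based on the usual Darknet settings):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, k):
    # Convolution -> BatchNorm -> Leaky ReLU, as in every block but the last.
    # bias=False since BatchNorm provides the shift; padding keeps spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )
```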

The Reorg layer after Conv13_512 (refer to the visualization) is a reorganization layer. If the input image has dimensions 3x416x416 (Channels x Height x Width, i.e., CHW), then Conv13_512 has an output size of 512x26x26 (CHW). The reorganization layer takes every alternate pixel and puts it into a different channel. Take the example of a single channel with 4x4 pixels, as shown below: the layer halves the spatial size and creates 4 channels, placing adjacent pixels in different channels. Hence, the output of the Reorg layer applied to Conv13_512 is 2048x13x13.

Reorg layer in YOLO v2

The concat layer takes the output of the Reorg layer (size: 2048x13x13) and the output of Conv20_1024 (size: 1024x13x13) and produces a concatenated output of size 3072x13x13.
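
A minimal sketch of the Reorg and concat steps is shown below; note that Darknet's reorg uses a slightly different element ordering, but the shape transformation is the same:

```python
import torch

def reorg(x, stride=2):
    # Space-to-depth: move each stride x stride spatial block into channels,
    # turning (N, C, H, W) into (N, C*stride^2, H/stride, W/stride).
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

conv13 = torch.randn(1, 512, 26, 26)    # output of Conv13_512
conv20 = torch.randn(1, 1024, 13, 13)   # output of Conv20_1024
route = torch.cat([reorg(conv13), conv20], dim=1)
print(route.shape)                      # torch.Size([1, 3072, 13, 13])
```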

Loss Function:

The objective function is a multi-part function, shown below.

YOLO v2 Loss function

The above function defines the loss for an iteration t. If a bounding box doesn't contain any object, its objectness confidence needs to be reduced, which is represented by the first loss term. Since the bounding box coordinate predictions need to align with the prior (anchor) information early in training, a loss term penalizing the difference between the priors and the predictions is added for the first few iterations (t < 12800). If a bounding box k is responsible for a truth box, then its predictions need to be aligned with the truth values, which is represented by the third loss term. The 𝞴 values are pre-defined weights for each of the loss terms.
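
Written out, the three terms described above take roughly the following form; this is a reconstruction (the paper does not state the loss explicitly), with b denoting predictions, prior the anchor boxes, and 1 the indicator functions:

$$
\mathcal{L}_t \;=\; \lambda_{\text{noobj}} \sum_{i,j,k} \mathbb{1}_{\text{MaxIoU}_{ijk} < \text{Thresh}} \, \big(b^{o}_{ijk}\big)^2
\;+\; \lambda_{\text{prior}} \, \mathbb{1}_{t < 12800} \sum_{i,j,k} \sum_{r \in \{x,y,w,h\}} \big(\text{prior}^{r}_{k} - b^{r}_{ijk}\big)^2
\;+\; \sum_{i,j,k} \mathbb{1}^{\text{truth}}_{ijk} \Big( \lambda_{\text{coord}} \sum_{r \in \{x,y,w,h\}} \big(\text{truth}^{r} - b^{r}_{ijk}\big)^2
\;+\; \lambda_{\text{obj}} \big(\text{IoU}(\text{truth}, b_{ijk}) - b^{o}_{ijk}\big)^2
\;+\; \lambda_{\text{class}} \sum_{c=1}^{M} \big(\text{truth}^{c} - b^{c}_{ijk}\big)^2 \Big)
$$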

Training YOLOv2:

Before training YOLOv2, the authors defined an architecture, referred to as Darknet-19, to train on the ImageNet dataset. Darknet-19 shares its top 19 layers with the YOLOv2 network (up to Conv18_1024), appended with a 1x1 Convolution of 1000 filters followed by Global AvgPool and Softmax layers. Darknet-19 is trained on ImageNet, reaching 91.2% top-5 accuracy, and the trained weights up to the layer Conv18_1024 are later reused when training the YOLOv2 network.
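
A sketch of the appended classification head, assuming the standard 1000 ImageNet classes:

```python
import torch.nn as nn

# Hypothetical sketch of the head appended to the shared backbone
# (everything up to Conv18_1024) to form Darknet-19.
head = nn.Sequential(
    nn.Conv2d(1024, 1000, kernel_size=1),  # 1x1 conv -> one map per class
    nn.AdaptiveAvgPool2d(1),               # global average pooling
    nn.Flatten(),
    nn.Softmax(dim=1),                     # class probabilities
)
```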

I performed several experiments with SGD and Adam, including the momentum and weight decay settings mentioned in the paper. I couldn't get above 65.7 mAP on the test set using SGD. However, Adam performed better, reaching 68.2 mAP.

Later, I tried Cyclic Learning Rates with Restarts, as explained in the wonderful Fast AI lectures. This is a very interesting technique: after starting the training with a learning rate found using lr_find(), the test accuracy started improving over my previous results in as few as 4 epochs. The strategy I followed:

  1. Found the learning rate to be 0.00025 using the lr_find technique, which I reimplemented in PyTorch.
  2. Trained the last layers for 5 cycles with n_epochs doubling every cycle, resulting in 31 epochs. I implemented the cyclic learning rates using the CosineAnnealingLR scheduler in PyTorch (see the sketch after this list).
  3. Trained all layers with differential learning rates for 3 cycles, with epochs doubling every cycle.
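
A minimal sketch of step 2, assuming an Adam optimizer and a placeholder training loop; the eta_min floor and the exact restart mechanics are my assumptions:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 16, 3)        # stand-in for the layers being trained
base_lr = 0.00025                        # found via the lr_find-style range test
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

cycle_len = 1                            # epochs in the first cycle
for cycle in range(5):                   # 1 + 2 + 4 + 8 + 16 = 31 epochs total
    scheduler = CosineAnnealingLR(optimizer, T_max=cycle_len, eta_min=1e-6)
    for epoch in range(cycle_len):
        # train_one_epoch(model, optimizer)   # placeholder training loop
        scheduler.step()                 # cosine-anneal the LR within the cycle
    for group in optimizer.param_groups:
        group['lr'] = base_lr            # restart: jump back to the initial LR
    cycle_len *= 2                       # double the cycle length
```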

With the above strategy, I was able to achieve 71.8 mAP on the test set, which is better than what I obtained with the strategy mentioned in the paper, and in fewer epochs as well. However, it is still quite far from the accuracy reported in the paper (76.8 mAP). I believe there are a few things I still need to try to get closer to the reported numbers:

  1. Multi-scale training, as I was not able to replicate resizing the inputs at regular intervals during training in PyTorch.
  2. Training for more epochs.
  3. Handling the difficult samples (width or height of the bounding box < 0.02) either by ignoring them or by adding additional constraints to the loss function.

Hit the clap if you like the post, and please do let me know your thoughts about these findings.

I would like to thank Jeremy Howard and Rachel Thomas for their incredible effort of sharing useful tips on training neural networks.

Please find the code here: https://github.com/santoshgsk/yolov2-pytorch

