Review: YOLOv1 — You Only Look Once (Object Detection)

Sik-Ho Tsang
Towards Data Science
6 min read · Oct 17, 2018


In this story, YOLOv1, by Redmon et al. from the University of Washington, the Allen Institute for AI, and FAIR (Facebook AI Research), is reviewed. The network only looks at the image once to detect multiple objects. Thus, it is called YOLO: You Only Look Once.

YOLOv1 without Region Proposals Generation Steps

By looking at the image just once, detection runs in real time (45 FPS), and Fast YOLOv1 achieves 155 FPS. This is another state-of-the-art deep learning object detection approach, published in 2016 CVPR with more than 2000 citations at the time of writing. (Sik-Ho Tsang @ Medium)

Below is the YOLOv1 example provided by authors:

YOLO Watches Sports

If interested, the authors also provide other YOLOv1 examples.

What Is Covered

  1. Unified Detection
  2. Network Architecture
  3. Loss Function
  4. Results

1. Unified Detection

1.1 Prior Art: R-CNN

Prior approaches like R-CNN first generate about 2,000 region proposals (bounding box candidates), then detect objects within each region proposal, as below:

R-CNN (first, extract region proposals; then, detect objects for each region proposal)

As two steps are required, these approaches are much slower.

1.2 YOLO

Instead, YOLO uses a single unified network to perform everything at once, which also makes end-to-end training possible.

YOLO Unified Detection

The input image is divided into an S×S grid (S=7). If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts B bounding boxes (B=2) and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object, i.e. that there is any object in the box, P(Object).

Each bounding box consists of 5 predictions: x, y, w, h, and confidence.

  • The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell.
  • The width w and height h are predicted relative to the whole image.
  • The confidence represents the Intersection Over Union (IOU) between the predicted box and any ground truth box.

Each grid cell also predicts conditional class probabilities, P(Classᵢ|Object). (Total number of classes = 20)

The figure below illustrates the output of the network:

The output from YOLO

The output size becomes: 7×7×(2×5+20) = 7×7×30 = 1470 values in total.
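
To make this layout concrete, here is a minimal NumPy sketch of decoding one 7×7×30 prediction into detections. The channel ordering (B boxes of 5 values each, then 20 class probabilities) and the `conf_thresh` value are assumptions for illustration:

```python
import numpy as np

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes

def decode(output, conf_thresh=0.2):
    """output: (S, S, B*5 + C) network prediction for one image."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = output[row, col]
            class_probs = cell[B * 5:]      # P(Class_i | Object), 20 values
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                cx = (col + x) / S          # (x, y) are offsets within the cell
                cy = (row + y) / S          # (w, h) are relative to the whole image
                scores = conf * class_probs # class-specific confidence scores
                c = int(np.argmax(scores))
                if scores[c] > conf_thresh:
                    detections.append((cx, cy, w, h, c, float(scores[c])))
    return detections

dets = decode(np.random.rand(S, S, B * 5 + C))   # toy example on random values
```

At test time, multiplying the conditional class probabilities by the box confidence gives class-specific confidence scores for each box, as done in the sketch above.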

2. Network Architecture

YOLO model network architecture

The model consists of 24 convolutional layers followed by 2 fully connected layers. Alternating 1×1 convolutional layers reduce the feature space from the preceding layers. (1×1 convolutions were used in GoogLeNet to reduce the number of parameters.)

Fast YOLO uses fewer convolutional layers (9 instead of 24) and fewer filters in those layers. The whole network pipeline is summarized below:

Whole Network Pipeline

Therefore, we can see that the input image goes through the network only once, and then objects are detected. And we can have end-to-end learning.
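
To make the head concrete, here is a rough PyTorch sketch of the final layers that produce the 7×7×30 output. The layer sizes (a 7×7×1024 feature map, a 4096-unit fully connected layer, dropout, and a 1470-unit linear output) follow the paper, but the 24-layer convolutional backbone is abbreviated to a random stand-in feature map, so this is illustrative rather than the full model:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

# Head only: the real model feeds in a 7x7x1024 feature map from 24 conv layers.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(1024 * S * S, 4096),
    nn.LeakyReLU(0.1),                      # leaky ReLU everywhere except the output
    nn.Dropout(0.5),                        # dropout after the first fully connected layer
    nn.Linear(4096, S * S * (B * 5 + C)),   # linear output: 7*7*30 = 1470 values
)

features = torch.randn(1, 1024, S, S)       # stand-in for the conv backbone's output
pred = head(features).view(-1, S, S, B * 5 + C)
print(pred.shape)                           # torch.Size([1, 7, 7, 30])
```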

3. Loss Function

3.1 Loss Function Explanations

Loss Function

There are 5 terms in the loss function as shown above.
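
Written out (reconstructed from the paper's formulation), the loss is a sum-squared error over these five terms, where 1_ij^obj is 1 when the j-th box predictor in cell i is responsible for an object, and 1_i^obj is 1 when any object appears in cell i:

```latex
\begin{aligned}
\mathcal{L}
&= \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
   \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
   \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2
       + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
   \left(C_i-\hat{C}_i\right)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}}
   \left(C_i-\hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
   \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
```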

  1. 1st term (x, y): The bounding box x and y coordinates are parameterized as offsets of a particular grid cell location, so they are bounded between 0 and 1. The sum-squared error (SSE) is computed only when an object is present, i.e. only for the predictor responsible for that object.
  2. 2nd term (w, h): The bounding box width and height are normalized by the image width and height so that they fall between 0 and 1. SSE is again computed only when an object is present. Since small deviations matter less in large boxes than in small boxes, the square roots of the width w and height h are predicted instead of the width and height directly, which partially addresses this problem.
  3. 3rd and 4th terms (the confidence, i.e. the IOU between the predicted box and any ground truth box): In every image, many grid cells do not contain any object. This pushes the confidence scores of those cells towards zero, often overpowering the gradient from cells that do contain objects, and makes the model unstable. Thus, the loss from confidence predictions for boxes that don't contain objects is down-weighted, i.e. λnoobj = 0.5.
  4. 5th term (class probabilities): SSE of the class probabilities for cells where an object is present.
  5. λcoord: For the same reason mentioned in the 3rd and 4th terms, λcoord = 5 increases the loss from bounding box coordinate predictions.
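
Below is a minimal NumPy sketch of these five terms. It assumes the responsibility masks have already been computed (in the paper, the responsible predictor in a cell is the one with the highest current IOU with the ground truth box); all array names here are illustrative:

```python
import numpy as np

def yolo_loss(pred_xywh, true_xywh, pred_conf, true_conf, pred_cls, true_cls,
              obj_ij, obj_i, lambda_coord=5.0, lambda_noobj=0.5):
    # pred_xywh, true_xywh: (S*S, B, 4)  box centers and sizes
    # pred_conf, true_conf: (S*S, B)     confidences (target = IOU where an object exists)
    # pred_cls, true_cls:   (S*S, 20)    class probabilities per cell
    # obj_ij: (S*S, B) 0/1 mask of responsible predictors; obj_i: (S*S,) per-cell mask
    noobj_ij = 1.0 - obj_ij

    # 1st term: (x, y) SSE for responsible predictors only
    xy = np.sum(obj_ij * np.sum((pred_xywh[..., :2] - true_xywh[..., :2]) ** 2, axis=-1))
    # 2nd term: sqrt(w), sqrt(h) SSE, damping the dominance of large boxes
    wh = np.sum(obj_ij * np.sum((np.sqrt(pred_xywh[..., 2:]) -
                                 np.sqrt(true_xywh[..., 2:])) ** 2, axis=-1))
    # 3rd and 4th terms: confidence SSE, down-weighted where no object exists
    conf_obj = np.sum(obj_ij * (pred_conf - true_conf) ** 2)
    conf_noobj = np.sum(noobj_ij * (pred_conf - true_conf) ** 2)
    # 5th term: class-probability SSE for cells containing an object
    cls = np.sum(obj_i * np.sum((pred_cls - true_cls) ** 2, axis=-1))

    return lambda_coord * (xy + wh) + conf_obj + lambda_noobj * conf_noobj + cls
```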

3.2 Other Details

Except for the final layer, all layers use the leaky ReLU activation function. The first 20 convolutional layers are pretrained on ImageNet for about a week, reaching 88% top-5 accuracy. The network is then trained for about 135 epochs on the training and validation sets of PASCAL VOC 2007 and 2012; when testing on VOC 2012, the VOC 2007 test data is also included for training. A batch size of 64 is used, along with dropout at the first fully connected layer and data augmentation.

On PASCAL VOC, 98 bounding boxes per image are predicted (7×7 grid cells × 2 boxes per cell).

Some large objects, or objects near the border of multiple cells, can be localized by multiple cells at the same time. Non-maximum suppression (NMS) is used to remove these duplicate detections.
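
As a sketch, the standard greedy NMS used here (not specific to YOLO) keeps the highest-scoring box, drops boxes that overlap it above an IOU threshold, and repeats. Boxes are assumed to be (x1, y1, x2, y2) corners:

```python
import numpy as np

def iou(box, boxes):
    """IOU of one box against an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```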

4. Results

4.1 VOC 2007

mAP and FPS Results (VOC 2007)
  • YOLO: 63.4% mAP (mean average precision) at 45 FPS. Compared with DPM, R-CNN, Fast R-CNN and Faster R-CNN, YOLO obtains real-time performance with comparable mAP.
  • Fast YOLO: 52.7% mAP at 155 FPS. Compared with the 100Hz DPM running at a similar frame rate, it has a much higher mAP.
  • YOLO VGG-16: YOLO using the VGG-16 backbone. Without the 1×1 convolutions that reduce the model size, it is slower, reaching only 21 FPS even though it achieves 66.4% mAP.
Error Analysis (VOC 2007)

Object Localization: YOLO struggles to localize objects correctly compared with Fast R-CNN.

Background Error: YOLO makes fewer background errors. For Fast R-CNN, 13.6% of its top detections are false positives on background.

As YOLO and Fast R-CNN have complementary pros and cons, they can be combined for higher accuracy.

Model combination with Fast R-CNN (VOC 2007)

After combining the models, 75.0% mAP is achieved, which is higher than the other combinations.

4.2 VOC 2012

Fast R-CNN + YOLO (VOC 2012)

Fast R-CNN + YOLO achieves 70.7% mAP, making it one of the highest performing detection methods.

4.3 Generalizability

Person detection on artwork is also tried, on the Picasso dataset and the People-Art dataset. For Picasso, models are trained on VOC 2012, while for People-Art they are trained on VOC 2010.

Generalization Results
  • R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals, which is tuned for natural images.
  • DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects.
  • YOLO has good performance on VOC 2007, and its AP degrades less than the other methods' when applied to artwork. Artwork and natural images are very different at the pixel level, but they are similar in terms of the size and shape of objects, so YOLO can still predict good bounding boxes and detections.

Some visualization results, which are quite interesting, are shown below:

Some detection results

I will talk about YOLOv2 and YOLOv3 in the future.
