Evolution of YOLO — YOLO version 1

The genesis of YOLO — “You Only Look Once” object detection

Abhijit V Thatte
Towards Data Science


Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.

YOLO (You Only Look Once) is one of the most popular object detector convolutional neural networks (CNNs). After Joseph Redmon et al. published their first YOLO paper in 2015, they published subsequent versions in 2016 and 2017, and Alexey Bochkovskiy published another in 2020. This article is the first in a series that provides an overview of how the YOLO CNN has evolved from the first version to the latest.

1. YOLO v1 — Motivation:

Before the invention of YOLO, object detector CNNs such as R-CNN first used a region proposal method (e.g., Selective Search) to generate bounding box proposals on the input image, then ran a classifier on the proposed boxes, and finally applied post-processing to eliminate duplicate detections and refine the bounding boxes. The individual stages of the R-CNN pipeline had to be trained separately, which made the network hard to optimize as well as slow.

The creators of YOLO were motivated to design a single-stage CNN that could be trained end to end, was easy to optimize, and ran in real time.

2. YOLO v1 — Conceptual design:

Figure 1: YOLO version 1 conceptual design (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

As shown in the left image of figure 1, YOLO divides the input image into S x S grid cells. As shown in the middle top image of figure 1, each grid cell predicts B bounding boxes and an “objectness” score P(Object) indicating whether the grid cell contains an object or not. As shown in the middle bottom image of figure 1, each grid cell also predicts the conditional probability P(Class | Object) of the class of the object contained by the grid cell.

For each bounding box, YOLO predicts five parameters — x, y, w, h and a confidence score. The center of the bounding box with respect to the grid cell is denoted by the coordinates (x,y). The values of x and y are bounded between 0 and 1. The width w and height h of the bounding box are predicted as a fraction of the width and height of the whole image. So their values are between 0 and 1. The confidence score indicates whether the bounding box has an object and how accurate the bounding box is. If the bounding box does not have an object, then the confidence score is zero. If the bounding box has an object, then the confidence score equals Intersection Over Union (IoU) of the predicted bounding box and the ground truth. Thus for each grid cell, YOLO predicts B x 5 parameters.

For each grid cell, YOLO predicts C class probabilities. These class probabilities are conditional based on an object existing in the grid cell. YOLO only predicts one set of C class probabilities per grid cell even though the grid cell has B bounding boxes. Thus for each grid cell, YOLO predicts C + B x 5 parameters.

The total prediction tensor for an image therefore has shape S x S x (C + B x 5). For the PASCAL VOC dataset, YOLO uses S = 7, B = 2 and C = 20, so the final YOLO prediction for PASCAL VOC is a 7 x 7 x (20 + 5 x 2) = 7 x 7 x 30 tensor.
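As a concrete illustration, the short sketch below builds a tensor of that shape and splits it into class probabilities and box parameters. The variable names and memory layout (classes first, then the B boxes) are assumptions made for illustration; the paper does not prescribe an exact layout.

```python
import numpy as np

# Minimal sketch: decode a YOLO v1-style output tensor for PASCAL VOC
# (S = 7, B = 2, C = 20). Layout and names are illustrative assumptions.
S, B, C = 7, 2, 20
pred = np.random.rand(S, S, C + B * 5)        # stand-in for a network output

class_probs = pred[..., :C]                   # P(Class | Object), one set per cell
boxes = pred[..., C:].reshape(S, S, B, 5)     # (x, y, w, h, confidence) per box

# Class-specific confidence per box, as used at test time:
# P(Class | Object) * confidence
scores = class_probs[..., None, :] * boxes[..., 4:5]
print(pred.shape, scores.shape)               # (7, 7, 30) (7, 7, 2, 20)
```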

Finally, YOLO version 1 applies Non-Maximum Suppression (NMS) and thresholding to report the final predictions, as shown in the right image of figure 1.
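For readers unfamiliar with NMS, here is a minimal greedy sketch. The threshold values and the corner-format (x1, y1, x2, y2) box representation are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5, score_thresh=0.2):
    """Greedy NMS: drop low-scoring boxes, then suppress overlapping ones."""
    keep_mask = scores >= score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)               # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return boxes[keep], scores[keep]
```

In YOLO v1 this kind of suppression is applied per class to the 7 x 7 x 2 = 98 predicted boxes.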

3. YOLO v1 — CNN design:

Figure 2: YOLO version 1 CNN (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

The YOLO version 1 CNN is depicted in figure 2. It has 24 convolution layers that act as a feature extractor, followed by 2 fully connected layers that are responsible for classification of objects and regression of bounding boxes. The final output is a 7 x 7 x 30 tensor. The YOLO CNN is a simple single-path CNN, similar in spirit to VGG19. YOLO uses 1x1 convolutions followed by 3x3 convolutions, taking inspiration from Google's Inception version 1 (GoogLeNet). Leaky ReLU activation is used for all layers except the final layer, which uses a linear activation function.
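To make the output shape concrete, here is a minimal PyTorch sketch of just the two fully connected layers at the end of the network. The 24-layer convolutional backbone is abbreviated to a placeholder feature map, and the 1024-channel, 7 x 7 feature size is an assumption based on the paper's figure.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

class YoloV1Head(nn.Module):
    """Sketch of the YOLO v1 detection head: two fully connected layers,
    Leaky ReLU in between, linear (no) activation on the output."""
    def __init__(self, feat_channels=1024, feat_size=7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_channels * feat_size * feat_size, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (C + B * 5)),  # final layer stays linear
        )

    def forward(self, features):
        return self.fc(features).view(-1, S, S, C + B * 5)

head = YoloV1Head()
features = torch.randn(1, 1024, 7, 7)   # stand-in for the backbone output
print(head(features).shape)             # torch.Size([1, 7, 7, 30])
```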

4. YOLO v1 — Loss design:

Sum-squared error is the backbone of YOLO’s loss design. Since most grid cells do not contain any objects, their confidence scores are pushed towards zero, and the gradients from these cells overpower the gradients from the few cells that do contain objects. To prevent this imbalance from causing training divergence and model instability, YOLO increases the weight on bounding box coordinate predictions (λcoord = 5) and reduces the weight on confidence predictions for boxes that do not contain any objects (λnoobj = 0.5).

Figure 3: YOLO v1 loss part 1 — bounding box center coordinates (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

Figure 3 shows the first part of the YOLO loss, which calculates the error in the prediction of the bounding box center coordinates. The loss function penalizes bounding box center coordinate errors only if that predictor is responsible for the ground truth box.
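Written out in the paper's notation, where 1^obj_ij indicates that the j-th box predictor in grid cell i is responsible for the object, this term is:

$$
\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]
$$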

Figure 4: YOLO v1 loss part 2- bounding box width and height (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

Figure 4 shows the second part of the YOLO loss, which calculates the error in the prediction of bounding box width and height. Under plain sum-squared error, the same magnitude of error produces the same loss for a small bounding box as for a large one, even though that error is more “wrong” for the small box. Hence, the square roots of width and height are used to calculate the loss: as both values lie between 0 and 1, taking square roots amplifies differences for small values more than for large ones. The loss function penalizes bounding box width and height errors only if that predictor is responsible for the ground truth box.
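In the same notation, this term is:

$$
\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]
$$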

Figure 5: YOLO v1 loss part 3- object confidence score (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

Figure 5 shows the third part of the YOLO loss, which calculates the error in the prediction of the object confidence score for bounding boxes that contain an object. The loss function penalizes this confidence error only if that predictor is responsible for the ground truth box.
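This term is:

$$
\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
$$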

Figure 6: YOLO v1 loss part 4- no object confidence score. (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

Figure 6 shows the fourth part of the YOLO loss, which calculates the error in the prediction of the object confidence score for bounding boxes that do not contain an object. This term applies to predictors that are not responsible for any ground truth box, pushing their confidence scores towards zero, and it is down-weighted by λnoobj.
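Using 1^noobj_ij to indicate predictors that are not responsible for any object, this term is:

$$
\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2
$$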

Figure 7: YOLO v1 loss part 5- class probability (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

Figure 7 shows the fifth part of the YOLO loss, which calculates the error in the prediction of class probabilities for grid cells that contain an object. The loss function penalizes class probability errors only if an object is present in that grid cell.
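Using 1^obj_i to indicate that an object appears in cell i, this term is:

$$
\sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
$$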

5. YOLO v1 — Results:

Figure 8: YOLO v1 — Results (Source: You Only Look Once: Unified, Real-Time Object Detection by Joseph Redmon et al.)

YOLO v1 results on the PASCAL VOC 2007 dataset are listed in figure 8. YOLO achieves 45 FPS and 63.4% mAP, both significantly higher than DPM, another real-time object detector. Though Faster R-CNN VGG-16 has a much higher mAP of 73.2%, it is considerably slower at 7 FPS.

6. YOLO v1 — Limitations:

  1. YOLO has difficulties in detecting small objects that appear in groups.
  2. YOLO has difficulties in detecting objects having unusual aspect ratios.
  3. YOLO makes more localization errors compared to Fast R-CNN.

7. References:

[1] J. Redmon, S. Divvala, R. Girshick and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection (2015), arxiv.org

[2] R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation (2013), arxiv.org

[3] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition (2014), arxiv.org

[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, Going Deeper with Convolutions (2014), arxiv.org

As the SVP of AI & Software at AEye, Abhijit is involved in research & development of highly accurate 3D and 2D perception for autonomous vehicles.