YOLO — You Only Look Once

A State-of-the-Art Algorithm for Real-Time Object Detection

Manishgupta
Towards Data Science


Introduction

We rely on our eyes to see: they capture the information in a frame and send it to the brain, which decodes it and draws meaningful inferences from it. It sounds pretty simple, right? We just look, and we understand what objects we are looking at, how they are placed, and plenty more about them. But the processing our brain does to achieve this is beyond comparison.

Image credit: Victor Freitas, Unsplash.com

This remarkable ability of the brain led researchers to ask: what if we could give the same ability to a machine? Once a machine can recognize the objects in its surroundings, it can interact with them far better, and that is the whole aim of improving machines: to make them more human-friendly, more human-like.

In that pursuit, there is one big hurdle: how do we make a machine identify an object? That question gave rise to the field of Computer Vision that we call “Object Detection”. Object detection is a field of computer vision and image processing that deals with detecting instances of various classes of objects (such as a person, book, chair, car, or bus) in a digital image or video.

The domain is further divided into sub-domains such as face detection, activity recognition, and image annotation. Object detection has found applications in important areas like self-driving cars, robotics, video surveillance, and object tracking.

Challenges

1. Variable Number of Objects

Object detection is the problem of locating and classifying a variable number of objects in an image. The key word is “variable”: the number of objects to be detected can differ from image to image. The main difficulty this creates is that machine learning models usually need data represented in fixed-size vectors. Since the number of objects in an image is unknown beforehand, we do not know the correct number of outputs, so some post-processing is usually required, which adds complexity.

2. Multiple Spatial Scales and Aspect Ratios

Objects in images come at multiple spatial scales and aspect ratios. Some objects cover most of the image, while others we may want to find are as small as a dozen pixels (a tiny fraction of the image). The same object can also appear at different scales in different images. These varying dimensions make objects hard to detect. Some algorithms use sliding windows for this purpose, but that approach is very inefficient, as the sketch below illustrates.
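
To see why sliding windows are so wasteful, consider a rough count of the windows a detector must classify. The numbers below are illustrative assumptions for a 640×480 image, not taken from any particular paper:

# Rough count of the windows a sliding-window detector must classify.
img_w, img_h = 640, 480
total = 0
for win in (32, 64, 128, 256):   # square window sizes (an assumption)
    stride = win // 4            # 75% overlap between neighbouring windows
    nx = (img_w - win) // stride + 1
    ny = (img_h - win) // stride + 1
    total += nx * ny
print(total)  # several thousand windows, each needing a full classifier pass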

3. Modeling

Object detection requires solving two tasks at once: object classification and object localization. Not only do we want to classify the object, we also want to locate it inside the image. To address both, most research uses multi-task loss functions that penalize misclassification errors and localization errors together. Because the loss must serve two objectives, it often ends up performing poorly on both.
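
A common way to set this up (a minimal PyTorch sketch under assumed shapes, not any specific paper's formulation) is to sum a classification term and a localization term, with a weight that trades them off:

import torch.nn.functional as F

def multitask_loss(class_logits, box_preds, class_targets, box_targets, box_weight=1.0):
    # Classification term: penalizes predicting the wrong class.
    cls_loss = F.cross_entropy(class_logits, class_targets)
    # Localization term: penalizes badly placed boxes (smooth L1 is a common choice).
    loc_loss = F.smooth_l1_loss(box_preds, box_targets)
    # box_weight balances the two objectives; tuning it is part of the difficulty.
    return cls_loss + box_weight * loc_loss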

4. Limited Data

The limited amount of annotated data currently available for object detection is another hurdle. Object detection datasets typically contain annotated examples for a dozen to a hundred classes, while image classification datasets can include up to 100,000 classes. Gathering ground-truth labels along with bounding boxes for each class remains a very tedious task.

5. Speed for Real-Time detection

Object detection algorithms not only need to predict an object's class and location accurately, they also need to be incredibly fast to keep up with the demands of real-time video processing. Video is usually shot at around 24 fps, which leaves the model only about 42 ms per frame; building an algorithm that can sustain that frame rate is quite a difficult task.

Various Approaches

We have considered some of the chief challenges in object detection, what they mean, and how they affect the process. Now we will look at some models that have tried to solve these challenges before unveiling the best of them: the YOLO algorithm.

1. Fast R-CNN

Fast R-CNN is an improved version of R-CNN, which suffered from a multi-stage training pipeline, high cost in space and time, and slow detection. To remove these drawbacks, Fast R-CNN introduced a new architecture.

Fast R-CNN Architecture, source: Fast R-CNN

It takes the entire image as input along with a set of object proposals. The algorithm first runs a CNN over the input image, producing a feature map through a series of convolutional and max-pooling layers. Then, for each object proposal, a Region of Interest (RoI) pooling layer extracts a fixed-length feature vector and feeds it into fully connected (FC) layers. These branch into two output layers: one produces softmax probabilities over the classes plus a “background” class; the other outputs four real numbers per class, defining the bounding box for that class.
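
The RoI pooling step can be sketched with torchvision's built-in operator; the shapes and the spatial_scale value below are illustrative assumptions:

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)  # CNN feature map for one image
# Proposals as (batch_index, x1, y1, x2, y2) in input-image coordinates.
proposals = torch.tensor([[0., 30., 40., 300., 360.],
                          [0., 100., 80., 250., 200.]])
# Each proposal is pooled to a fixed 7x7 grid regardless of its size,
# producing the fixed-length vector the FC layers require.
rois = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=50 / 800)
print(rois.shape)  # torch.Size([2, 256, 7, 7])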

2. Single Shot MultiBox Detector (SSD)

SSD is built on a feed-forward convolutional network that outputs a fixed-size collection of bounding boxes, together with scores for the presence of each object class in those boxes. It then applies non-max suppression to produce the final detections.

SSD Architecture, source: SSD: Single Shot MultiBox Detector

The architecture of SSD is quite simple. The initial layers are standard ConvNet layers as used for image classification, called the base network in the paper's terminology. On top of this base network, auxiliary layers are added to produce detections from multi-scale feature maps, using default boxes at several aspect ratios. As worked out below, this design yields thousands of default boxes per image.
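
The number of default boxes follows directly from the feature-map sizes. For the SSD300 configuration in the paper (six feature maps with 4 or 6 default boxes per location), the total is 8,732:

# (feature-map side, default boxes per location) for SSD300.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(side * side * k for side, k in feature_maps)
print(total)  # 8732 default boxes, each scored for every class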

3. RetinaNet

RetinaNet is a single, unified network composed of a backbone network and two task-specific subnetworks. The backbone, an off-the-shelf convolutional network, computes a convolutional feature map over the entire input image. The first subnet performs classification on the backbone's output; the second performs convolutional bounding-box regression.

RetinaNet Architecture, source: Focal Loss for Dense Object Detection

It uses a Feature Pyramid Network (FPN) backbone on top of a feed-forward ResNet to generate a rich, multi-scale convolutional feature pyramid, which is then fed to the two subnets: one classifies the anchor boxes, and the other regresses from the anchor boxes to the ground-truth object boxes.
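
The key idea of the cited paper is the focal loss, which down-weights easy (mostly background) anchors so they do not swamp the training signal. A minimal sketch of its binary form, using the paper's default alpha and gamma values:

import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Ordinary binary cross-entropy, one value per anchor.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the model's predicted probability for the true class.
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma shrinks the loss for well-classified, easy anchors.
    return (alpha_t * (1 - p_t) ** gamma * bce).sum()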

YOLO Algorithm

So far we have seen some famous, well-performing architectures for object detection. All of these algorithms address some of the problems mentioned at the beginning of this article, but they fall short on the most important one: speed for real-time detection.

The YOLO algorithm performs much better on all the parameters we discussed, with a frame rate high enough for real-time use. YOLO is a regression-based algorithm: instead of selecting interesting parts of an image, it predicts classes and bounding boxes for the whole image in a single run.

To understand the YOLO algorithm, we first need to understand what is actually being predicted. Ultimately, we aim to predict the class of an object and the bounding box specifying its location. Each bounding box can be described using four descriptors:

  1. Center of the box (bx, by)
  2. Width (bw)
  3. Height (bh)
  4. Value c corresponding to the class of an object

Along with these, we predict a real number pc, which is the probability that there is an object in the bounding box.
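
Concretely, one box prediction can be read out of a flat vector. The sketch below assumes 80 classes (the setting used in the figure that follows) and one common ordering; real implementations vary:

import numpy as np

# Hypothetical raw prediction for one box: [pc, bx, by, bh, bw, c1..c80].
pred = np.random.rand(85)

pc = pred[0]                 # probability that the box contains an object
bx, by, bh, bw = pred[1:5]   # box center, height, and width
class_probs = pred[5:]       # conditional probabilities for the 80 classes

best_class = int(np.argmax(class_probs))
score = pc * class_probs[best_class]   # class-specific confidence score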

YOLO does not search for regions of interest in the input image that could contain an object. Instead, it splits the image into cells, typically a 19×19 grid. Each cell is then responsible for predicting K bounding boxes.

Here we take K = 5 and predict probabilities for 80 classes
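
With these settings, the size of the network's output tensor follows directly:

S, K, C = 19, 5, 80          # grid size, boxes per cell, number of classes
per_box = 5 + C              # pc, bx, by, bh, bw plus the class probabilities
print((S, S, K * per_box))   # (19, 19, 425): 1805 boxes predicted in one pass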

An object is considered to lie in a specific cell only if the center coordinates of its anchor box lie in that cell. Because of this, the center coordinates are always calculated relative to the cell, whereas the height and width are calculated relative to the whole image.

During a single forward pass, YOLO determines the probability that a cell contains an object of a certain class. With the definitions above, the class-specific score for a box is

Pr(Class_c) = pc × Pr(Class_c | Object)

that is, the box confidence pc scaled by the conditional class probability, giving the probability that an object of class c is present in that box.

The class with the maximum probability is chosen and assigned to that grid cell. A similar process happens for all the grid cells in the image.

After computing the above class probabilities, the image may look like this:

This shows the image before and after the class probabilities are predicted for each grid cell. The next step is non-max suppression, which helps the algorithm discard unnecessary anchor boxes. As you can see in the figure below, numerous anchor boxes are produced based on the class probabilities.

Anchor boxes

To resolve this, non-max suppression eliminates bounding boxes that are very close together by computing their IoU (Intersection over Union) with the box that has the highest class probability among them.

IoU operation

It calculates the IoU of every remaining bounding box with the one having the highest class probability, then rejects any box whose IoU exceeds a threshold. A high IoU means the two boxes cover the same object, and since the other box has a lower probability for it, it is eliminated.
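
IoU itself is simple to compute: the area of overlap between two boxes divided by the area of their union. A minimal sketch, with boxes given as (x1, y1, x2, y2) corners:

def iou(box_a, box_b):
    # Intersection rectangle (empty if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)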

Once done, the algorithm finds the bounding box with the next-highest class probability and repeats the process, until every remaining bounding box covers a different object. The sketch below puts the whole procedure together.
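
Greedy non-max suppression can be sketched in a few lines, reusing the iou helper above (the 0.5 threshold is a typical but arbitrary choice):

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Visit boxes from highest to lowest class probability.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep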

Before and After of Non-max suppression

After this, almost all of our work is done: the algorithm finally outputs a vector describing the bounding box of each detected class. The overall architecture can be viewed below:

YOLO Architecture, source: You Only Look Once: Unified, Real-Time Object Detection

Also important is the algorithm's loss function, shown below. YOLO simultaneously learns all the quantities it predicts (the box coordinates, the confidence, and the class probabilities discussed above).

Loss function for YOLO, source: You Only Look Once: Unified, Real-Time Object Detection
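
For reference, the loss from the YOLO v1 paper is a weighted sum-squared error over coordinates, confidences, and class probabilities (the paper sets λ_coord = 5 and λ_noobj = 0.5); in LaTeX form:

\lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]
+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\left[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right]
+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
+ \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2
+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} (p_i(c)-\hat{p}_i(c))^2

The indicator 1_ij^obj selects the box predictor responsible for each object, so coordinate and class errors are penalized only where an object actually appears, and the square roots on w and h make errors on small boxes count more than the same absolute error on large ones.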

That is the YOLO algorithm in a nutshell. We discussed the main aspects of object detection along with the challenges faced in the domain. We then saw some algorithms that address some of these challenges but fail on the most crucial one: real-time detection speed (in fps). Finally, we studied the YOLO algorithm, which follows a regression approach and outperforms the other models on the challenges discussed; it is fast enough to work well for real-time object detection.

Improvements to the algorithm are still being made. There are currently four generations of YOLO, from v1 to v4, along with a smaller version, YOLO-tiny, specifically designed to achieve an incredibly high speed of 220 fps.

I hope I was able to improve your understanding of the algorithm and the concepts related to object detection. If you found this article informative, you can follow me for more such articles in the future. Happy Learning!
