YOLO (You Only Look Once)

A simple explanation of a real-time object detection algorithm!

Kevin Rexis Velasco
Towards Data Science


King Aragorn before a charge against the Orcs of Mordor

Seattleites, imagine this scenario:

It is a warm summer afternoon at Greenlake Park. The park is full of people trying to soak up all the sun they can during our relatively short summers. You are lying out on the grass, trying to even out the sock tan from the long, arduous hike the weekend before. While reading your favorite book, you hear collective shouts of “HEADS UP!!!!” from a group near you. Out of the corner of your eye, you see a little ball coming your way, growing bigger and bigger with each millisecond that passes. Instinctively, you drop the book, locate the ball, and catch it before it hits you! Superb!

Teaching a machine how to ‘see’ the world is not an easy task. It’s not as simple as connecting a camera to a computer. It takes us a fraction of a second to see, identify, analyze, classify, and finally act upon our vision (in this case, raising your hands to intercept the path of the rogue ball). But for a machine, recreating human vision isn’t just one action; it is actually a set of them. Computer vision (CV) is a growing and ever-changing interdisciplinary field of computer science focused on how computers can be made to perceive the world the way a living organism would.

Both hardware and software have advanced considerably over the last few years. Image segmentation, facial recognition, and pattern detection are some examples of computer vision in action. In this blog post, I hope to give a simple overview of YOLO (You Only Look Once), a fast real-time multi-object detection algorithm first outlined in this 2015 paper by Redmon et al. from the University of Washington (go DAWGS!), which has had many improvements proposed since its inception.

First, below is a YouTube video of YOLO v2 in action.

From the YOLO GitHub README:

You only look once (YOLO) is a system for detecting objects on the Pascal VOC 2012 dataset. It can detect the 20 Pascal object classes:

person

bird, cat, cow, dog, horse, sheep

aeroplane, bicycle, boat, bus, car, motorbike, train

bottle, chair, dining table, potted plant, sofa, tv/monitor

Cool, right? But what is it? How does it work?

You Only Look Once is an algorithm that utilizes a single convolutional network for object detection. Unlike other object detection algorithms that sweep the image region by region, YOLO takes the whole image at once and

…reframe(s) object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities.

To put it simply, without diving into the nitty-gritty details: you take an image as input, split it into an S×S grid, and pass it through a neural network that produces bounding boxes and class predictions, which are then combined into the final detection output. The network is first trained over an entire dataset before being tested on real-life images and video.
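To make that output shape concrete, here is a minimal Python sketch, assuming the settings from the original paper (S = 7 grid cells per side, B = 2 boxes per cell, C = 20 Pascal VOC classes); the random tensor simply stands in for a real network’s final layer:

```python
import numpy as np

# Assumed settings from the original YOLO paper:
# S = 7 grid cells per side, B = 2 boxes per cell, C = 20 classes.
S, B, C = 7, 2, 20

# Stand-in for the network's output: an S x S x (B*5 + C) tensor,
# produced in a single forward pass over the whole image.
output = np.random.rand(S, S, B * 5 + C)

# Each grid cell predicts B boxes of (x, y, w, h, confidence)
# plus one set of C class probabilities shared by the cell.
cell = output[3, 4]                    # the cell at grid row 3, column 4
boxes = cell[:B * 5].reshape(B, 5)     # B boxes of (x, y, w, h, conf)
class_probs = cell[B * 5:]             # C conditional class scores

print(boxes.shape, class_probs.shape)  # (2, 5) (20,)
```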

To arrive at the final bounding boxes, YOLO relies on two key post-processing concepts: IoU (Intersection over Union) and NMS (Non-Maximum Suppression).

The IoU measures how well the machine’s predicted bounding box matches up with the actual object’s bounding box. Take for example the image of the car below:


The purple box is what the computer thinks is the car, while the red box is the actual (ground-truth) bounding box of the car. The area of overlap between the two boxes, divided by the area of their union, gives us our IoU.

The shaded yellow is our IoU
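For the curious, here is a minimal Python sketch of that computation: the intersection area of the two boxes divided by the area of their union (the coordinates below are made up for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes in (x1, y1, x2, y2) form."""
    # Coordinates of the overlapping rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area is zero when the boxes don't intersect at all.
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    return inter / (area_a + area_b - inter)

# The predicted (purple) box vs. the ground-truth (red) box.
print(iou((50, 50, 200, 180), (60, 40, 210, 190)))  # ~0.76
```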

Below is a simple example of how NMS operates. Object detection algorithms often over-identify a certain object; in this case, Audrey Hepburn’s face. Non-maximum suppression (NMS) has long been an integral part of many detection approaches, be it edge, corner, or object detection [1–6]. Its necessity stems from the imperfect ability of detection algorithms to localize the concept of interest, which results in groups of several detections near the real location. NMS ensures we identify the single best box among all candidates for where the face belongs. Rather than concluding that there are multiple instances of her face in the image, NMS keeps only the highest-probability box among those detecting the same object.

An example of NMS
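Below is a simple Python sketch of the greedy NMS procedure just described, reusing the iou helper from the earlier snippet (the boxes and scores are invented for illustration):

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping rivals."""
    # Consider candidates in order of decreasing confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard any remaining box that overlaps the winner too much,
        # since it is likely a duplicate detection of the same object.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Three overlapping detections of the same face; NMS keeps only the best.
boxes = [(10, 10, 60, 60), (12, 8, 62, 58), (11, 12, 58, 61)]
scores = [0.9, 0.75, 0.6]
print(non_max_suppression(boxes, scores))  # [0]
```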

Utilizing both IoU and NMS, YOLO produces predictions of the various objects in an image extremely fast. Because it sees the entire image during training and test time, it implicitly encodes contextual information about classes as well as their appearance. However, one drawback of YOLO is its inability to detect multiple objects that are either too close together or too small, as in the example below, where the groups of people under the building structures go undetected by YOLO.

via TechnoStacks

One great example of how this technology can be applied in real life is automobile vision! As a vehicle travels down a street, what it ‘sees’ is in constant flux, and thanks to the speed of the YOLO algorithm, the car can quickly identify the cyclist below. Combined with other sensors that measure how far away that cyclist is, the car can take the necessary action, stopping or steering, to avoid a collision!

I hope this high-level overview of the YOLO algorithm has sparked your interest in the current state of computer vision. If you are interested in learning more, please check out Joseph Redmon and his continued work on YOLO and other computer vision projects!
