Real-Time Object Detection with YOLO

Published in

Towards Data Science

4 min readFeb 14, 2020

This is Tesla’s self-driving Model S.

Now besides the fact that even the idea of a self-driving car was unheard of until very recently, one has to wonder how does a self-driving car actually work? I mean we are trusting our lives — and the lives of our friends and family with what is essentially a bundle of metal and circuits. The answer, unfortunately, is very complicated and includes a variety of components such as radar, LiDAR, GPS localization and more. Nevertheless, at the core of the self-driving car’s brain is YOLO Object Detection.

Now in my last article, I talked about convolutional neural networks and how they allow computers to classify objects in an image. However, to have something like Tesla’s autopilot work, the car needs to not only recognize what images it’s seeing but also know exactly where those objects are. For this, classifiers aren’t enough — and it’s where YOLO comes in.

Ultra-Fast, Real-Time Object Detection

To clarify, YOLO Object Detection does not stand for “You Only Live Once” Object Detection (I would not want to get in a car using that software), but rather “You Only Look Once” Object Detection. Whereas traditional classifiers need to slide windows over the image multiple times to detect different classes, YOLO — as expected — need only look once. It is an extremely fast, real-time method of not only identifying but localizing objects up to an astounding 155 frames per second.

Bounding Boxes and Grids

YOLO works by applying a single neural network to the full image input. The network divides each input image into an S by S grid and each grid cell predicts a predetermined number of bounding boxes. These boxes predict the x coordinate, y coordinate, width, and height of the object. The boxes also include a confidence score of how sure they are the object is of the class identified.

Intersection over Union Model Evaluation

To evaluate the model’s performance, YOLO uses something called Intersection over Union (IoU), calculated by divided the area of overlap between the predicted bounding box and the ground truth by the area of union between the two. IoU is a measure of how accurate the YOLO network is in identifying the objects as it compares the output of the network to hand-labelled data.

Non-Maximal Suppression

Ultimately, each grid cell ends up giving us a set number of bounding boxes with a confidence score and classification for each. Most of these bounding boxes tend to be useless or irrelevant, so only those above a certain confidence threshold are kept as the final outputs of the YOLO network.

Benefits of YOLO Objection Detection

Although there have been many predecessors before it, no computer vision algorithm has been as fast or effective as YOLO in real-time object detection. To recap, YOLO is:

1. Extremely Fast

Now at this point, you’re probably tired of me hitting you over the head with the fact that YOLO is really really fast, but it is a point that I cannot stress enough. One of the primary obstacles to applying artificial intelligence to real-life situations is that computers for the most part, just take too long in the real world. We simply cannot have self-driving cars running around rogue if every time, they need to stop and think for a few seconds whether it is a child in front of them or a plastic bag! Computer vision was — for the most part — always accurate, but YOLO allows computer vision to be much more applicable and practical in real-life scenarios due to its speed.

2. Contextually Aware

YOLO is also trained on entire images during training, meaning that it doesn’t look at specific objects in isolation. It not only encodes information about the appearance of a class but contextual information about it as well. In this way, YOLO networks are barely bothered by potential background noise being mistaken for an object. The network knows that cars are more often than not surrounded by other cars and pavement — so that shadow that kind of looks like a car in the forest? It’s probably not a car.

3. A Generalized Network

Finally, YOLO is a highly generalized network because of its architecture and the way it is trained. YOLO doesn’t just learn how to identify objects in isolation but learns generalized representations of them. When applied to unexpected inputs and unfamiliar situations, it is much less likely to break down or fail — something quite desirable if I must say, to have in a self-driving car.