Artificial Intelligence

YOLO: You Only Look Once | Object Detection

A deep learning Convolutional Neural Network (CNN)

Ronit Taleti
Towards Data Science
5 min read · Feb 13, 2020


Photo by Clément H on Unsplash

“Hey, our friends are cliff jumping, wanna come? You only live once man!”

“Nah, I only have to look once to know I’m not doing that.”

You are probably used to the saying “YOLO: You only LIVE once”. But, that’s not actually the subject of this article. I’m talking about “YOLO: You only LOOK once”.

It is a very clever algorithm that is really good at figuring out what objects are in a given image or video, and where they are. In fact, it only has to look once.

How does it work?

Neural Networks

YOLO is powered by neural networks, a special type of computer algorithm. They are so named because they are modeled after our brains, and like our brains, they are engineered to detect patterns.

YOLO itself is a Convolutional Neural Network (CNN), a type of neural network that is very good at detecting patterns (and by extension objects and the like) in images.

Neural Network Layers

Neural networks are made up of layers, and CNNs are mostly made of convolutional layers (hence the name). A convolutional layer works by sliding a filter over an image, and each successive layer detects progressively more complex patterns. Here is a visualization:

Image from Stanford (not hosted there anymore)

The filter here is a small grid of numbers. Basically, we look at one patch of the image, multiply each pixel by the matching filter number, and sum the results to get one value of the convolved feature; then we slide the filter to the next patch and repeat. If we think of the image as grayscale, each number is how bright that pixel is.
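To make that concrete, here is a minimal sketch of the multiply-and-sum operation in plain Python with NumPy (the 3×3 edge-detection filter below is an illustrative choice, not one of YOLO’s actual learned filters):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` across `image`, multiplying and summing at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]      # one patch of the image
            output[i, j] = np.sum(patch * kernel)  # multiply and sum
    return output

# A classic edge-detection kernel (real CNN filters are learned, not hand-picked).
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# A tiny grayscale "image": each number is how bright that pixel is.
image = np.random.rand(8, 8)
feature_map = convolve2d(image, kernel)
print(feature_map.shape)  # (6, 6)
```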

The early layers may apply filters designed to find edges, corners, and simple shapes, while later layers may detect more complex features like limbs, faces, and entire objects. Here is an edge-detection convolution:

Photo from Denny Britz at KDnuggets

And the input and output images:

Photos from Denny Britz at KDnuggets

Pooling Layers

Pooling is much simpler, and works similarly to convolution. Instead of detecting features, pooling is meant to lighten the load on the network by reducing the spatial size of the data flowing through it.

Image from Xinhui Wang

Basically, it reduces the size of the feature maps coming out of the convolutional layers, normally with one of two methods (both are sketched in code after this list):

Max Pooling
Max pooling is the first type, wherein you take the biggest number in each window of the feature map.

Average Pooling
The second type of pooling, wherein you take the average of all the numbers in each window of the feature map.
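Here is a minimal sketch of both methods with NumPy, assuming the common setup of a 2×2 window that moves 2 pixels at a time:

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Shrink a feature map by taking the max (or average) of each window."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            window = feature_map[i:i + size, j:j + size]
            out[i // size, j // size] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1., 3., 2., 4.],
               [5., 6., 1., 2.],
               [7., 2., 9., 1.],
               [3., 4., 5., 6.]])
print(pool2d(fm, mode="max"))      # -> [[6. 4.] [7. 9.]]
print(pool2d(fm, mode="average"))  # -> [[3.75 2.25] [4.   5.25]]
```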

Activation Layers

Activation layers are key for a neural network. At each step, the activation function decides what a layer outputs, which becomes the input for the next step (or the actual output at the very end).

Image from GumGum

YOLO uses ReLU (Rectified Linear Unit) activations. ReLU is applied after each convolutional step, but not after the pooling ones, since those are just meant to lighten the load. Basically, ReLU works by clamping anything less than 0 to zero and passing everything else through unchanged, i.e. f(x) = max(0, x). (Strictly speaking, the original YOLO uses a “leaky” variant of ReLU that lets a small fraction of negative values through, but the idea is the same.)

A visualization of ReLU on a graph. “x” is ReLU’s input, while “y” is ReLU’s output. (Source: Fernando de Meer Pardo)
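As a quick sketch, ReLU is a one-liner with NumPy:

```python
import numpy as np

def relu(x):
    """Anything below 0 becomes 0; everything else passes through unchanged."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # [0.  0.  0.  1.5 3. ]
```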

And the whole YOLO network looks like this:

Don’t worry, you don’t need to fully understand this! (Source: MDPI)

And this is why YOLO is a CNN. CNNs are very good at image classification (figuring out what an object is), but that’s only half of the problem.

What about figuring out where objects are?

Although we have covered how to classify objects, we also want to be able to know where those objects are in an image.

YOLO first divides a given image (or frame of a video) into an S × S grid of cells. Each grid cell then predicts a number of bounding boxes, along with a confidence score for each box (a bounding box is just a rectangle the network draws around where it thinks an object is).

After that, it keeps only the boxes whose confidence scores are high enough to likely contain an object, and predicts what each of those objects is, based on its training data. This means you can only detect an object with YOLO after giving it a dataset containing that kind of object. (A rough sketch of the filtering step follows below.)

A visual representation. (The bolder the bounding box, the more confident the prediction.) (Source: Joseph Redmon, pjreddie.com)

This allows you to detect objects, and predict what they are!
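Here is a rough sketch of that confidence-filtering step, assuming we already have the network’s predicted boxes and scores as plain arrays (all the numbers below are made up for illustration):

```python
import numpy as np

# Hypothetical predictions: each row is one bounding box as
# [x, y, width, height] in pixels, with a matching confidence score.
boxes = np.array([[ 50,  40, 120,  80],
                  [200, 150,  60,  60],
                  [210, 155,  55,  58]])
confidences = np.array([0.92, 0.15, 0.81])

threshold = 0.5  # only keep boxes likely to contain an object
keep = confidences > threshold
for box, score in zip(boxes[keep], confidences[keep]):
    print(f"object at {box} with confidence {score:.2f}")
```

Real implementations also merge overlapping boxes that cover the same object (a step called non-maximum suppression), which is why only the boldest boxes survive in the visualization above.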

My Implementation

I decided to create an implementation of YOLO, which can predict different objects in an airport scene!

I found out about the COCO dataset, a pre-made dataset good for detecting general objects, like suitcases, people, cars, skateboards, etc., and about YOLO weights already trained on it, which made things a lot easier for me.

After programming everything, and after YOLO had learned from the dataset, it was able to produce this image.

Image by me
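For anyone curious, here is a minimal sketch of how a setup like mine can run pre-trained YOLO through OpenCV’s dnn module. The file names (yolov3.cfg, yolov3.weights, coco.names, airport.jpg) are assumptions about where the config, weights, class list, and test image were saved, not fixed requirements:

```python
import cv2
import numpy as np

# Assumed file names: the config/weights come from the official YOLO release,
# coco.names lists the 80 COCO class labels, airport.jpg is any test image.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
classes = open("coco.names").read().strip().split("\n")

image = cv2.imread("airport.jpg")
h, w = image.shape[:2]

# YOLO expects a fixed-size, normalized input "blob".
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

# Each detection row holds [center x, center y, width, height,
# objectness, then one score per class].
for output in outputs:
    for det in output:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        confidence = scores[class_id]
        if confidence > 0.5:  # keep confident detections only
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            x, y = int(cx - bw / 2), int(cy - bh / 2)
            cv2.rectangle(image, (x, y), (x + int(bw), y + int(bh)), (0, 255, 0), 2)
            cv2.putText(image, classes[class_id], (x, y - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)

cv2.imwrite("output.jpg", image)
```

A fuller version would also run non-maximum suppression (cv2.dnn.NMSBoxes) to drop duplicate boxes around the same object.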

Use Cases

There are plenty of use cases where YOLO can be really helpful! Here are some:

Recycling

YOLO can be used in recycling plants to help control the robots that sort the waste. Since YOLO is so good at detecting objects, we can train it to tell the different kinds of waste apart in a recycling facility.

Self-Driving Cars

YOLO can be especially useful in self-driving cars, and in fact is already used today to detect cars, people, and traffic lights! Detection like this is one of the key building blocks of autonomous driving.

Suspect Detection

This one is more far-fetched, but YOLO could assist in police investigations by helping figure out who a potential suspect could be, which would be greatly helpful with low-quality recordings and the like!

Image by me

So, that’s YOLO object detection! It is one of the fastest and most efficient algorithms of its kind, and there are many opportunities for it. This is just a surface-level explanation, and I hope that I’ve taught you enough for your curiosity to propel you ever further!

If you enjoyed reading this article or have any suggestions or questions, let me know by commenting. You can find me on LinkedIn, or on my website for my latest work and updates, or reach out directly on my email!

