
In this post we’ll discuss YOLO, the landmark paper that laid the groundwork for modern real-time computer vision. We’ll start with a brief chronology of some relevant concepts, then go through YOLO step by step to build a thorough understanding of how it works.
Who is this useful for? Anyone interested in computer vision or cutting-edge AI advancements.
How advanced is this post? This article should be accessible to technology enthusiasts, and interesting to even the most skilled data scientists.
Pre-requisites: A good working understanding of a standard neural network. Some cursory experience with convolutional networks may also be useful.
A Brief Chronology of Computer Vision Before YOLO
The following sections contain useful concepts and technologies to know before getting into YOLO. Feel free to skip ahead if you feel confident.
Types of Computer Vision Problems
Computer vision is a class of several problem types, all of which relate to somehow enabling computers to "see" things. Typically, computer vision is broken up into the following:
- Image Classification: the task of trying to classify an entire image. For instance, one might classify an entire image as containing a cat or a dog.
- Object Detection: the task of finding instances of an object within an image and determining where those instances are.
- Image Segmentation: the task of identifying the individual pixels within an image that correspond to a specific object. So, for instance, identify all the pixels within an image that correspond to dogs.

Convolutional Neural Networks
YOLO employs a form of model called a "Convolutional Neural Network". A convolutional neural network (CNN for short) is a style of neural network that applies a filter, called a "kernel", over an image.

These "kernels" are simply a block of numbers. If the numbers in the kernel change, the result of the filtering process changes.

The actual filtering process consists of the kernel being swept across various parts of an image. At a given location, the kernel's values are multiplied by the values in the image, then added together to produce a new output value. This process of "sweeping" is how CNNs get their name: in math, sweeping a filter over a signal in this way is called "convolving".

For computer vision tasks, CNNs typically apply convolution and information compression over successive steps to break down an image into some dense and meaningful representation. This representation is then used by a classic neural network to achieve some final task.
The most common way a CNN compresses an image down into a meaningful representation is by employing "max pooling". Basically, you break the image up into N by N squares, then from each of those squares you keep only the highest value.

After a model has filtered (convolved) and down sampled (max pool) an image over numerous iterations, the result is a compressed representation that contains key information about the image. This is often passed through a dense network (a classic neural network) to produce the final output.
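
As a concrete sketch of the convolve-then-pool pattern (using PyTorch; the channel counts and sizes here are illustrative, not YOLO's):

```python
import torch
import torch.nn as nn

# A toy "image": batch of 1, 3 color channels, 64 x 64 pixels
image = torch.randn(1, 3, 64, 64)

# Convolution: sweep 16 learnable 3x3 kernels across the image
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

# Max pooling: keep only the largest value in each 2x2 square
pool = nn.MaxPool2d(kernel_size=2)

features = pool(conv(image))
print(features.shape)  # torch.Size([1, 16, 32, 32]) - spatially smaller, more channels
```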

If you want to learn more about convolutional networks, I wrote a whole article on the topic:
Convolutional Networks – Intuitively and Exhaustively Explained
If you’re interested in the structure of CNNs, and how backbones and heads can be used in advanced training processes, you might be interested in this article:
Early Object Detection with Sliding Window
Before approaches like YOLO, "sliding window" was the go-to strategy in object detection. Recall that the goal of object detection is to detect instances of some object within an image.

In sliding window, the idea is to sweep a window across an image and classify the content of the window with a classification model.

Once classifications have been calculated, a final bounding box can be defined by simply combining all the classified windows.
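
To make the idea concrete, here's a minimal sketch of a sliding-window loop. The `classify` function, window size, and stride are hypothetical placeholders, not anything from the original papers:

```python
def sliding_window_detect(image, classify, window=64, stride=32, threshold=0.9):
    """Sweep a square window across an image and classify every crop.

    `image` is an array of shape (channels, height, width) and `classify` is
    any function mapping a crop to the probability that it contains the object.
    Returns (top, left, height, width) for every window that clears the threshold.
    """
    _, height, width = image.shape
    hits = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            crop = image[:, top:top + window, left:left + window]
            if classify(crop) > threshold:
                hits.append((top, left, window, window))
    return hits
```

Note that `classify` is a full forward pass of a model and is called once per window position, which is exactly where the computational cost discussed below comes from.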

There are a few tricks one can use to get this process working better. However, the sliding window strategy of object detection still suffers from two key problems:
- It’s very computationally intensive (you may have to run a model tens, hundreds, or even thousands of times per image)
- The bounding boxes are inaccurate
Selective Search and R-CNN
Instead of arbitrarily sweeping some window through an image, the idea of selective search is to find better windows based on the content of the image itself. In selective search, first small regions within an image that contain a lot of similar pixels are found, then similar neighboring regions are merged together over successive iterations to build larger regions. These large regions can be used to recommend bounding boxes.

With selective search, instead of finding random windows based on sweeping, bounding boxes are suggested by the image itself. Several approaches have used selective search to drastically improve object detection.
One of the most famous models to use this trick is R-CNN, which trained a tailored convolutional network based on proposed regions in order to enable high quality object detection.

R-CNN was a mainstay in computer vision for a while, and spawned many derivative ideas. However, it’s still very computationally intensive.
YOLO blew the paradigm of R-CNN out of the water, and inspired a fundamentally new way of thinking about image processing that remains relevant to this day. Let’s get into it.
YOLO: You Only Look Once
The idea of YOLO is to do everything in one pass of a CNN, hence the name "You Only Look Once". That means a single CNN, in a single pass, has to somehow find numerous different instances of objects, correctly classify them, and draw bounding boxes around them.

To achieve this, the authors of YOLO broke down the task of object detection into two sub-tasks, and built a model to do those sub-tasks simultaneously.
Subtask 1) Regionalized Classification
YOLO breaks images up into some arbitrary number of regions, and then classifies all those regions at the same time. It does this by modifying the output structure of a traditional CNN.
Normally a CNN compresses an image into a dense 2D representation, then a process called flattening is applied to that representation to turn it into a single vector, which can in turn be fed into a dense network to generate a classification.

Unlike this traditional approach, YOLO predicts classes for sub-regions of the image rather than the entire image.
In YOLO, the convolution output is flattened like normal, but then the output is converted back into a grid-shaped representation of shape S x S x C, where S represents how finely the image is subdivided into regions and C represents the number of classes being predicted. Both S and C are configurable parameters which can be used to apply YOLO to different tasks.

Provided this model is trained correctly (we’ll cover that later), a model with this structure could classify numerous regions within an image in a single pass.

Not only is this more efficient than R-CNN (a single inference generates classes for an entire image), it's also more performant. When R-CNN makes a prediction it only has access to the region proposed by selective search, which can make R-CNN prone to mistaking irrelevant background patches for objects.

CNNs have something called a "receptive field": a particular spot within a CNN's output can only see a small subset of the image, which means CNNs by themselves suffer a similar issue. In theory this might cause YOLO to also make poor predictions.

However, YOLO passes the dense 2D representation through a fully connected neural network, allowing information from each receptive field to interact before the final prediction is made.

This allows YOLO to reason about the entire image before making the final prediction, a key difference that makes YOLO more robust than R-CNN in terms of contextual awareness.
Subtask 2) Bounding Box Prediction
In theory we could use the predicted regions from the previous step to draw bounding boxes, but the results wouldn’t be very good. Also, what if there were two dogs next to each other? It would be impossible to distinguish two dogs from one wide dog, because both would just look like a bunch of squares labeled dog.

To alleviate these issues YOLO predicts bounding boxes as well as class predictions.
YOLO assigns a "responsibility" to each square in the S x S grid. Basically, if a square contains the center of an object, then that square is responsible for creating the bounding box for that object.

The responsible square for a given object is in charge of drawing the bounding box for that object.
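
As a small sketch of that assignment (normalizing the object's center to the 0-1 range is an assumption about one common way to implement this):

```python
def responsible_cell(x_center, y_center, S=7):
    """Given an object's center in normalized image coordinates (0 to 1),
    return the (row, col) of the grid cell responsible for that object."""
    row = min(int(y_center * S), S - 1)
    col = min(int(x_center * S), S - 1)
    return row, col

# An object centered at (x=0.52, y=0.31) falls in cell (row=2, col=3) of a 7x7 grid
print(responsible_cell(0.52, 0.31))  # (2, 3)
```
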
On top of the S x S x C tensor for class prediction (which we covered in the previous section), YOLO also predicts an S x S x B x 5 tensor for bounding box prediction. In this tensor S represents the divisions of the image (as before), and B represents the number of bounding boxes each grid square can create. The 5 values per box represent:
- Bounding box width
- Bounding box height
- Bounding box horizontal offset
- Bounding box vertical offset
- Bounding box confidence
So, in essence, YOLO creates a bunch of bounding boxes for each square in the S x S grid. Specifically, YOLO creates B bounding boxes per square.

If we only look at bounding boxes with high confidence scores, and the classes of the grid square those bounding boxes correspond to, we get the final output of YOLO.

We’ll re-visit the idea of "confidence" when we explore how YOLO is trained. For now, let’s take a step back and look at YOLO from a higher level.
The Architecture of YOLO
The cool thing about YOLO is that it does object detection in "one look". In one pass of the model both subtasks of class prediction and bounding box prediction are done simultaneously.
We unify the separate components of object detection into a single neural network. – The YOLO paper
Essentially, this is done by YOLO outputting the S x S x C class predictions and the S x S x B x 5 bounding box predictions all in one shot, meaning YOLO outputs S x S x (B x 5 + C) values.
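
As a sketch of what this output looks like in practice, using the values from the paper (S=7, B=2, C=20 on PASCAL VOC); the exact ordering of the chunks within each cell is illustrative:

```python
import torch

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes - the values used in the paper

# The final layer emits S*S*(B*5 + C) = 7*7*30 = 1470 numbers
flat_output = torch.randn(S * S * (B * 5 + C))

# Fold the flat vector back into the S x S grid, with B*5 + C numbers per cell
grid = flat_output.view(S, S, B * 5 + C)

# Split each cell's numbers into bounding box predictions and class predictions
boxes = grid[..., :B * 5].reshape(S, S, B, 5)  # per box: x, y, w, h, confidence
class_probs = grid[..., B * 5:]                # C class scores per cell

print(boxes.shape, class_probs.shape)  # torch.Size([7, 7, 2, 5]) torch.Size([7, 7, 20])
```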

Now that we understand the subtasks YOLO solves, and how it formats an output to solve those problems, we can start making sense of the actual architecture of YOLO.
As we’ve discussed, YOLO is a convolutional network that distills an image into a dense representation, then uses a fully connected network to construct the output.

In reality, the diagram above is somewhat of a simplification of the actual architecture, which is written out below the diagram. Let’s go through a few layers of YOLO to build a more thorough understanding.
First of all, the input image is an RGB image, meaning it has some width and height and three color channels. It looks like YOLO is designed to receive a square RGB image of width 448 and height 448. If you want to do YOLO on a smaller or larger image, you can just resize the image into 448 x 448.

The first layer of YOLO is listed as 7x7x64-s-2, which means we have a convolutional layer with 64 kernels of size 7×7 that have a stride of 2.
When a convolutional model has multiple kernels, each of those kernels consists of different learned parameters, and they work together to make the final output.

In this particular layer, the kernels have a width and height of 7, and instead of moving by one pixel at a time, they move by two.

So, the first convolutional layer consists of 64 kernels of size 7×7 and a stride of 2.

After the first convolutional layer, the data is passed through a max pool of size 2×2 with a stride of 2.

The stride of 2 in the convolution halves the spatial dimensions of the input, the max pool of stride 2 halves them again, and the 64 kernels convert our 3 color channels into 64 channels. This results in a tensor of shape 112 x 112 x 64.
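
Here's a minimal sketch of that first block in PyTorch; the padding value is my assumption, chosen so the spatial arithmetic above works out:

```python
import torch
import torch.nn as nn

first_block = nn.Sequential(
    # "7x7x64-s-2": 64 kernels of size 7x7, swept with a stride of 2
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=3),
    # 2x2 max pool with a stride of 2
    nn.MaxPool2d(kernel_size=2, stride=2),
)

image = torch.randn(1, 3, 448, 448)  # a batch containing one 448 x 448 RGB image
print(first_block(image).shape)      # torch.Size([1, 64, 112, 112])
```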

The YOLO architecture has many layers, many of which behave fundamentally similarly. I won’t bore you with an exploration of every single layer, but there are a few design details which are worth highlighting.
The idea of a 1 x 1 convolution is interesting, and kind of flies in the face of a normal intuition around convolution. If a kernel is only looking at one pixel, what's the point?
Recall that a convolution applies a kernel not only to some n x n region of the image, but also across all input channels.

So a 1 x 1 convolution is essentially a filter that operates over only the channel dimension, and not the spatial dimension.
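
A quick sketch of what a 1 x 1 convolution does; the channel counts are just illustrative:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 28, 28)  # 512 channels over a 28 x 28 spatial grid

# A 1x1 convolution mixes the 512 channel values at each spatial location,
# compressing them down to 256 channels without looking at any neighboring pixels
squeeze = nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1)

print(squeeze(features).shape)  # torch.Size([1, 256, 28, 28])
```
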
You may also wonder "why did the researchers who made YOLO settle on all of these numbers? Why 192 filters vs 200 here? Why a stride of 2 here and a stride of 1 there?"

The honest truth is that researchers usually use a combination of what others have done combined with concepts that seem cool to them. YOLO could probably have some of its network details changed without a major impact on performance. If you choose to build a model like this yourself, you often start with a baseline model and play around with different parameters to see if you can get something better.
One design constraint that YOLO does inherit from many CNNs is the concept of an information bottleneck. Throughout successive layers, the total amount of information is reduced. This is a pervasive concept in Machine Learning: by passing data through a bottleneck, you force a model to trim away irrelevant information and distill an input into its essence.

YOLO very heavily reduces the spatial dimensions while expanding the channel dimension. In effect, YOLO describes each region of the image with fewer spatial locations, but with a much richer set of features for each of those locations.
Training YOLO
We’ve covered the nature of the output, as well as the structure of the model. Now let’s explore how the model is trained.

YOLO is trained, like many AI models, via "back-propagation". Back propagation generally consists of:
- Calculating how wrong the prediction is
- Calculating the influence each parameter had on the current output
- Using the model's wrongness, together with each parameter's influence, to update the parameters.

Going over back propagation in depth is out of scope for this article, and also isn’t necessary. Many machine learning engineers and researchers rely on frameworks like PyTorch to do a lot of this stuff for them. The only critical thing we need to understand for our purposes is something called a "Loss Function".
The loss function governs how "wrongness" in a model is defined, which governs the entire training process. The whole point of training is to optimize the model to get better results from the loss function.
Because YOLO has such a complex output, the loss function is pretty complicated.

I know the math might seem a bit daunting, but we’ll take it step by step. As we explore the loss function, keep in mind that we’re training the model on an annotated dataset: we know all the classes and bounding boxes ahead of time. The loss function compares our model’s predictions against these known values, so reducing the loss means our model is getting better.

The loss can be conceptualized as the sum of two things: how bad the model is at predicting classes, and how bad the model is at predicting bounding boxes.

Let’s unpack each of these losses, one at a time.
Class Loss
The expression for class loss is the following:
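
Written out in the paper's notation, that term is:

$$\sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \,\in\, \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2$$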

Let’s work through this loss function element by element. Σ (capital sigma) means summation. The first Σ means we’ll be iterating through all of the S x S grid cells in the image and adding up a value from each grid square.


Not every square has an object in it though. The 1_i^obj term is an indicator that equals 1 if an object appears in square i and 0 otherwise, so YOLO only learns class predictions from squares that have an object inside of them.

So, together, you can think of this expression as summing the loss from all the squares that have an object inside of them.

The loss from all the squares that have an object inside of them is this expression:
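
In the paper's notation:

$$\sum_{c \,\in\, \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2$$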

This essentially says "for all classes, subtract the actual probability from the predicted probability, then square it". For example, if the predicted classes for a given square were this:
dog = 0.9
cat = 0.05
zebra = 0.05
But the square actually contains a zebra
dog = 0
cat = 0
zebra = 1
Then the class loss for that particular square would be
(0.9-0)^2 + (0.05-0)^2 + (0.05-1)^2 = 0.81 + 0.0025 + 0.9025 = 1.715
Looking at it from a high level, the class loss looks at all the squares that contain an object, and adds up how wrong all the class predictions were for all of those squares.

Bounding Box Loss
The bounding box loss can be further broken down into two distinct loss functions: one for the "confidence" of a bounding box, and one for the bounding box coordinates.

Recall that each bounding box has an x and y coordinate, a width and a height, and a confidence.

The coordinate loss is pretty straightforward if you understand the class loss.
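
Written out in the paper's notation (these are the two long terms, both scaled by λ_coord, one for the box center and one for its width and height):

$$\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]$$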


First of all, the only reason this loss function is broken up into two parts is because it’s long, not because it’s especially complicated.
Immediately we see a very similar expression to one we discussed recently.

This sums the loss over all squares in the image (S) and all bounding boxes in each of those squares (B). But it only adds the loss if an object exists in that square and that particular bounding box is responsible for that object.
The coordinate loss for a particular bounding box is very straightforward: it’s just like the difference of class probabilities discussed in the previous section, but for the bounding box dimensions (width, height, x, and y).

So that covers the actual bounding box coordinates. Recall that the bounding box loss comprises both the coordinates and the confidence, where the confidence is some value that allows a bounding box to say "I should be responsible for this particular object".

The confidence loss is calculated with the following expression:
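
In the paper's notation:

$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2$$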

We have the same Σ expressions we’ve seen previously, where we’re adding the loss across all cells and all bounding boxes in each cell.

Except, this time, we have two separate filters: one for when the bounding box is supposed to be responsible for an object, and one for when it is not.

A question might arise at this point. It’s clear what the class probabilities should be: whatever the dataset says the class is for a given cell in an image. It’s also clear what the bounding box dimensions should be: whatever the dataset says they should be. But how do we compute what the confidence score of a bounding box should be? For a given training example, how do we decide which bounding box should have been responsible and which should not?
This is done by assigning whichever bounding box happens to have the highest "Intersection over Union" (IoU) with the ground truth box as responsible. IoU is the area of overlap between two boxes divided by the area of their union, and is essentially a measure of how well two bounding boxes overlap.
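
A minimal sketch of computing IoU for two axis-aligned boxes (boxes here are assumed to be in (x1, y1, x2, y2) corner format):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    # Overlap area (zero if the boxes don't intersect at all)
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```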

So, during training, the bounding box that should be responsible for detecting an object is whichever one already happens to be best at detecting that object. All other bounding boxes in that cell should not be responsible for that object.
In YOLO, the confidence should be 0 if a bounding box is not responsible for an object, otherwise the confidence should be the IOU of the bounding box with the ground truth bounding box.

It’s useful to reflect on bounding box dimension loss at this point. While, at any given step, the confidence should be whatever IOU the bounding box has vs the ground truth, over successive training steps the bounding boxes should be getting better. So, one would expect the model to learn to increase the confidence scores of bounding boxes as the model learns to predict better bounding boxes.
There’s a fascinating quirk of this system. Because YOLO gets to choose which bounding box is responsible, the B predictors in each cell usually end up specializing, with each one becoming better at objects of certain sizes and aspect ratios.
Anyway, that’s the loss function. Take a look at it, in all its glory, and reflect on how this one function is used to guide the training of the entire model.
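
For reference, here it is written out in the paper's notation, with all five terms together:

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(x_i - \hat{x}_i\right)^2 + \left(y_i - \hat{y}_i\right)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \,\in\, \text{classes}} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
$$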

Intricacies of Training YOLO
While discussing the loss function I glossed over two intricacies: the λs (lambdas) throughout the loss function, and the fact that the widths and heights of bounding boxes are square rooted.

The λs in the loss function are scaling parameters that tune the loss function to better train the model. If we think about some image that might appear in the dataset, most grid cells probably aren’t responsible for an object.

Two quirks might arise from this reality:
- Instead of learning the difficult task of Object Detection, the model might just learn that most bounding boxes should have a confidence of zero, and would thus set all confidences to zero.
- There are (S x S x B) predicted bounding boxes, but only a few of them matter. The model is trying to minimize the entire loss function, so if the loss from bounding box dimensions is a small number, it’s possible the model will hardly try to improve bounding box predictions throughout the training process.
The λ parameters solve both of these problems. By making λ_noobj smaller and λ_coord larger, the loss can be tuned so that squares containing objects matter more and correct bounding box predictions become more important. The authors of YOLO settled on the following:
λ_noobj = 0.5
λ_coord = 5
Another topic we glossed over is the square root of the width and height of the bounding boxes. Refer to the following examples of ground truth and predicted bounding boxes:

Most people would consider the left example to be worse than the right, because the predicted width is proportionally more wrong than on the right. However, in absolute terms both bounding boxes are equally inaccurate: the difference between the ground truth width and the predicted width is around 180 pixels in both images.
As humans, we don’t care about the absolute difference in bounding box size; we care about the relative difference. The bounding box on the left is smaller, so the acceptable error in width and height is smaller. The YOLO loss function takes the square root of the width and height to dampen the penalty for errors on large bounding boxes, thereby making it relatively more important to accurately predict small bounding box sizes.
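
To make this concrete with illustrative numbers (not the exact widths from the example above): suppose both predictions are off by 180 pixels in width, one on a large box (true width 400, predicted 580) and one on a small box (true width 50, predicted 230). Without the square root, both mistakes contribute the same penalty of 180² = 32,400. With the square root:

$$\left(\sqrt{580} - \sqrt{400}\right)^2 \approx 16.7 \qquad\qquad \left(\sqrt{230} - \sqrt{50}\right)^2 \approx 65.5$$

The same absolute mistake is penalized roughly four times more heavily on the small box, which matches the intuition that relative error is what matters.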

Inference
So, we’ve trained YOLO. We gave it a bunch of images with bounding boxes, and computed the loss to inform the model how to update its parameters via back propagation.
Now that we have a trained YOLO model, let’s explore how to use it.
After we run an image through YOLO, we get back a tensor like this:

First of all, we can figure out what classes YOLO thinks corresponds to each grid cell by simply finding the maximum class prediction in each grid cell.

Each of those grid cells has a bunch of bounding boxes with confidence scores.

We can simply set some confidence threshold, like 0.8, to only preserve highly confident bounding boxes. And thus we have our output.
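
A sketch of that decoding step, reusing the illustrative S=7, B=2, C=20 layout from earlier (the tensor values here are random placeholders for a real model's output):

```python
import torch

S, B, C = 7, 2, 20
boxes = torch.rand(S, S, B, 5)      # per box: x, y, w, h, confidence
class_probs = torch.rand(S, S, C)   # class scores per grid cell

# The most likely class for each grid cell
cell_classes = class_probs.argmax(dim=-1)   # shape (S, S)

# Keep only the boxes whose confidence clears the threshold
confident = boxes[..., 4] > 0.8             # shape (S, S, B)
rows, cols, box_idx = confident.nonzero(as_tuple=True)
for r, c, b in zip(rows.tolist(), cols.tolist(), box_idx.tolist()):
    print(f"cell ({r},{c}) box {b}: class {cell_classes[r, c].item()}, "
          f"confidence {boxes[r, c, b, 4].item():.2f}")
```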

There are some quirks to this strategy, however. YOLO can output multiple bounding boxes per square, which facilitates the detection of small and close-by objects.

Unfortunately, a common issue is that multiple bounding boxes might end up very confident about the same object.

To deal with this, we can first define a threshold for bounding boxes that overlap too much. If the IoU between two bounding boxes is above 70%, for instance, then we can say they’re trying to predict the same object.
Once we’ve identified a group of bounding boxes as attempting to predict the same thing, we can apply "non-max suppression". Basically, we just disregard all but the bounding box with the highest confidence.
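
A minimal sketch of non-max suppression, reusing the `iou` helper from the earlier sketch; the 0.7 overlap threshold mirrors the example above:

```python
def non_max_suppression(detections, iou_threshold=0.7):
    """`detections` is a list of (box, confidence) tuples, where each box
    is (x1, y1, x2, y2). Keeps the most confident box out of any group of
    heavily-overlapping boxes."""
    # Consider the most confident detections first
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, confidence in detections:
        # Discard this box if it overlaps too much with one we've already kept
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, confidence))
    return kept
```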

Conclusion
And that’s it. In this article we covered YOLO, an object detection model that unified object detection into a single pass of a single network, allowing it to be both computationally efficient and highly performant. In understanding how YOLO achieves this, we explored the architecture of the model (including its large output tensor), the loss function, and some realities about training the model and running inference with it.
This article focuses on the first YOLO paper. There have been many improvements made to YOLO since its publication; I’m planning on covering those in another IAEE article.
If you’re interested in YOLO, and want to learn more about some of the ethical concerns it raises, I have an opinion piece on that.
Join IAEE
At IAEE you can find:
- Long form content, like the article you just read
- Thought pieces, based on my experience as a data scientist, engineering director, and entrepreneur
- A discord community focused on learning AI
- Lectures, by me, every week
