
YOLO – Intuitively and Exhaustively Explained

The genesis of the most widely used object detection models.

Machine Learning | Computer Vision | Object Detection

"Look Once" by Daniel Warfield using MidJourney. All images by the author unless otherwise specified.
"Look Once" by Daniel Warfield using MidJourney. All images by the author unless otherwise specified.

In this post we’ll discuss YOLO, the landmark paper that laid the groundwork for modern real-time computer vision. We’ll start with a brief chronology of some relevant concepts, then go through YOLO step by step to build a thorough understanding of how it works.

Who is this useful for? Anyone interested in computer vision or cutting-edge AI advancements.

How advanced is this post? This article should be accessible to technology enthusiasts, and interesting to even the most skilled data scientists.

Pre-requisites: A good working understanding of a standard neural network. Some cursory experience with convolutional networks may also be useful.

A Brief Chronology of Computer Vision Before YOLO

The following sections contain useful concepts and technologies to know before getting into YOLO. Feel free to skip ahead if you feel confident.

Types of Computer Vision Problems

Computer vision is a class of several problem types, all of which relate to somehow enabling computers to "see" things. Typically, computer vision is broken up into the following:

  1. Image Classification: the task of trying to classify an entire image. For instance, one might classify an entire image as containing a cat or a dog.
  2. Object Detection: the task of finding instances of an object within an image, and where those instances are.
  3. Image Segmentation: the task of identifying the individual pixels within an image that correspond to a specific object. So, for instance, identify all the pixels within an image that correspond to dogs.
The three major sub-problems of computer vision. This is somewhat of a simplification; in reality there are sub-problems within these sub-problems, but that's out of scope for this article. YOLO, the topic of this article, was a breakthrough object detection model. Source.

Convolutional Neural Networks

YOLO employs a form of model called a "convolutional neural network". A convolutional neural network (CNN for short) is a style of neural network that applies a filter, called a "kernel", over an image.

A conceptual diagram of a convolutional network working over an image. From my article on CNNs

These "kernels" are simply a block of numbers. If the numbers in the kernel change, the result of the filtering process changes.

A Kernel, applied over an image, acts as a filter which modifies that image. CNNs learn to change the values in the kernel to improve whatever task they're being trained on. From my article on CNNs. Source

The actual filtering process consists of the kernel being swept across various parts of an image. At a given location, the kernel's values are multiplied by the values in the image, then added together to produce a new output. This process of "sweeping" is how CNNs get their name: in math, sweeping in this way is called "convolving".

The process of convolving a kernel over an image. From my article on CNNs.
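To make the sweeping idea concrete, here is a minimal NumPy sketch of the process for a single-channel image, with an arbitrary 3×3 kernel and no padding (real libraries implement this far more efficiently):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Sweep a kernel over a single-channel image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel with the patch under it, then sum the result
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

image = np.random.rand(8, 8)            # a toy 8x8 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])      # arbitrary kernel values
print(convolve2d(image, kernel).shape)  # (6, 6)
```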

For computer vision tasks, CNNs typically apply convolution and information compression over successive steps to break down an image into some dense and meaningful representation. This representation is then used by a classic neural network to achieve some final task.

The most common way a CNN compresses an image down into a meaningful representation is by employing "max pooling". Basically, you break an image up into N by N squares, then from each square you preserve only the highest value.

A conceptual diagram of max pooling, from my article on CNNs.

After a model has filtered (convolved) and down sampled (max pool) an image over numerous iterations, the result is a compressed representation that contains key information about the image. This is often passed through a dense network (a classic neural network) to produce the final output.

An example of how a convolutional network actually solves problems. The model passes an image through convolution and down sampling (max pooling) until it creates an abstract and dense representation of the image. This is passed through a neural network to produce the final output. The convolutional section of this form of model is often called a "backbone", and the neural network at the end is often called a "Head". From my article on CNNs.
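Here is a minimal PyTorch sketch of that backbone-plus-head pattern. This is a toy classifier with arbitrary layer sizes, not YOLO's actual architecture:

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A toy CNN: convolution + max pooling (backbone), then a dense head."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # downsample by 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # downsample by 2 again
        )
        self.head = nn.Sequential(
            nn.Flatten(),                             # dense representation -> vector
            nn.Linear(32 * 16 * 16, num_classes),     # assumes a 64x64 input image
        )

    def forward(self, x):
        return self.head(self.backbone(x))

logits = TinyClassifier()(torch.randn(1, 3, 64, 64))  # one fake RGB image
print(logits.shape)                                   # torch.Size([1, 2])
```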

If you want to learn more about convolutional networks, I wrote a whole article on the topic:

Convolutional Networks – Intuitively and Exhaustively Explained

If you’re interested in the structure of CNNs, and how backbones and heads can be used in advanced training processes, you might be interested in this article:

Self-Supervised Learning Using Projection Heads

Early Object Detection with Sliding Window

Before approaches like YOLO, "sliding window" was the go-to strategy in object detection. Recall that the goal of object detection is to detect instances of some object within an image.

The three major sub-problems of computer vision.

In sliding window, the idea is to sweep a window across an image and classify the content of the window with a classification model.

A conceptual demonstration of the process of classifying different windows as containing a dog or not. A classification model, which has been trained to identify if an image contains a dog or not, is shown several "windows" within an image. We can record the windows the model thinks contain a dog, and thus create a family of windows that likely contain a dog.

Once classifications have been calculated, a final bounding box can be defined by simply combining all the classified windows.

Once several classifications have been made via sliding window, an overall bounding box can be calculated. In this example, we simply found the bounding box which includes all windows where the model predicted a presence of a dog.

There are a few tricks one can use to get this process working better. However, the sliding window strategy of object detection still suffers from two key problems:

  • It’s very computationally intensive (you may have to run a model tens, hundreds, or even thousands of times per image)
  • The bounding boxes are inaccurate
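A rough sketch of the process described above, where `contains_dog` is a stand-in for any pretrained classifier (the window size, stride, and threshold are arbitrary). Note how many classifier calls a single image requires:

```python
import numpy as np

def contains_dog(crop: np.ndarray) -> float:
    """Stand-in for a real pretrained classifier; returns a fake probability."""
    return float(crop.mean())

def sliding_window_detect(image: np.ndarray, window=64, stride=32, threshold=0.5):
    """Classify every window of an image and merge the positive windows."""
    height, width = image.shape[:2]
    positive = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            crop = image[top:top + window, left:left + window]
            if contains_dog(crop) > threshold:   # one model call per window
                positive.append((left, top, left + window, top + window))
    if not positive:
        return None
    # A crude overall bounding box: the extremes of all positive windows
    x1s, y1s, x2s, y2s = zip(*positive)
    return min(x1s), min(y1s), max(x2s), max(y2s)

print(sliding_window_detect(np.random.rand(256, 256)))
```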

Selective Search and R-CNN

Instead of arbitrarily sweeping some window through an image, the idea of selective search is to find better windows based on the content of the image itself. In selective search, first small regions within an image that contain a lot of similar pixels are found, then similar neighboring regions are merged together over successive iterations to build larger regions. These large regions can be used to recommend bounding boxes.

Selective search, which creates fine regions within an image, then combines those regions iteratively to construct larger region proposals. Source.

With selective search, instead of finding random windows based on sweeping, bounding boxes are suggested by the image itself. Several approaches have used selective search to drastically improve object detection.

One of the most famous models to use this trick is R-CNN, which trained a tailored convolutional network based on proposed regions in order to enable high quality object detection.

A diagram from the R-CNN paper, which uses selective search to propose regions and a tailored CNN to predict based on those regions. Source

R-CNN was a mainstay in computer vision for a while, and spawned many derivative ideas. However, it’s still very computationally intensive.

YOLO blew the paradigm of R-CNN out of the water, and inspired a fundamentally new way of thinking about image processing that remains relevant to this day. Let’s get into it.

YOLO: You Only Look Once

The idea of YOLO is to do everything in one pass of a CNN, hence the name "You Only Look Once". That means a single CNN, in a single pass, has to somehow find numerous instances of objects, correctly classify them, and draw bounding boxes around them.

An example of YOLO in action, from the YOLO paper.

To achieve this, the authors of YOLO broke down the task of object detection into two sub-tasks, and built a model to do those sub-tasks simultaneously.

Subtask 1) Regionalized Classification

YOLO breaks images up into some arbitrary number of regions, and then classifies all those regions at the same time. It does this by modifying the output structure of a traditional CNN.

Normally a CNN compresses an image into a dense 2D representation, then a process called flattening is applied to that representation to turn it into a single vector, which can in turn be fed into a dense network to generate a classification.

A conceptual diagram of how a conventional CNN might predict if there is a dog in an image, vs nothing in an image. Convolution and max pooling create a dense representation of the image which is flattened and passed through a neural network.

Unlike this traditional approach, YOLO predicts classes for sub-regions of the image rather than the entire image.

In YOLO, the convolution output is flattened like normal, but then the output is converted back into a 2D representation of shape S x S x C where S represents how finely the image is subdivided into regions and C represents the number of classes being predicted. Both S and C are configurable parameters which can be used to apply YOLO to different tasks.

A conceptual diagram of YOLO predicting classes within subregions of an image. If there are two classes (C=2, "dog" and "no object") and the image is divided 7 ways in both length and width (S=7), the output of YOLO's class prediction looks like the one in the image above, with all classes being predicted for each grid space. In reality, the output doesn't need to literally have this shape; it just needs a spot for each value. So the final output for this model would be a vector of length S x S x C.
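A minimal sketch of that reshaping step in PyTorch, with random numbers standing in for real predictions:

```python
import torch

S, C = 7, 2                              # 7x7 grid, 2 classes ("dog", "no object")
flat_output = torch.randn(S * S * C)     # what the dense head actually emits

grid_output = flat_output.view(S, S, C)  # reinterpret as per-region class scores
print(grid_output.shape)                 # torch.Size([7, 7, 2])
print(grid_output[0, 0])                 # class scores for the top-left region
```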

Provided this model is trained correctly (we’ll cover that later), a model with this structure could classify numerous regions within an image in a single pass.

An example of YOLO predicting various classes within sub-regions of an image. From the YOLO paper.

Not only is this more efficient than R-CNN (a single inference generates classes for the entire image), it's also more performant. When R-CNN makes a prediction it only has access to the region proposed by selective search, which makes it prone to mistaking irrelevant background patches for objects.

An example of what a bad background detection might look like. Because R-CNN can't reason about the entire image (it's only given small regions of an image), it may create random bounding boxes that don't make sense in the context of the image. Source.

CNNs have something called a "receptive field", which means that CNNs by themselves suffer a similar issue: a particular spot within a CNN's output can only see a small subset of the image. In theory this might cause YOLO to also make poor predictions.

Because of the way CNNs shake out, the final dense representation contains region-specific information. This concept is typically referred to as a "receptive field", as the top left of the dense representation can only "perceive" (contain information based on) the top left of the input image.

However, YOLO passes the dense 2D representation through a fully connected neural network, allowing information from each receptive field to interact with the others before the final prediction is made.

While the dense representation from the CNN is region specific, because the result is passed through a dense network, the final output of any given region is based on information from all regions in the image.

This allows YOLO to reason about the entire image before making the final prediction, a key difference that makes YOLO more robust than R-CNN in terms of contextual awareness.

Subtask 2) Bounding Box Prediction

In theory we could use the predicted regions from the previous step to draw bounding boxes, but the results wouldn’t be very good. Also, what if there were two dogs next to each other? It would be impossible to distinguish two dogs from one wide dog, because both would just look like a bunch of squares labeled dog.

Imagine using these predictions to draw bounding boxes. While it would be better than nothing, the bounding boxes would all be much larger than the objects they represent, and would fail to capture individual instances of the same thing that are next to each other. From the YOLO paper.

To alleviate these issues YOLO predicts bounding boxes as well as class predictions.

YOLO assigns a "responsibility" to each square in the S x S grid. Basically, if a square contains the center of an object, then that square is responsible for creating the bounding box for that object.

A conceptual diagram of "responsibility" in YOLO. The SxS Square that contains the center of the dog is "responsible" for the dog, and likewise for the bike and the car. From the YOLO paper.

The responsible square for a given object is in charge of drawing the bounding box for that object.

On top of the S x S x C tensor for class prediction (which we covered in the previous section), YOLO also predicts an S x S x B x 5 tensor for bounding box prediction. In this tensor S represents the divisions of the image (as before), and B represents the number of bounding boxes each S x S square can create. The 5 represents:

  1. Bounding box width
  2. Bounding box height
  3. Bounding box horizontal offset
  4. Bounding box vertical offset
  5. Bounding box confidence

So, in essence, YOLO creates a bunch of bounding boxes for each square in the S x S grid. Specifically, YOLO creates B bounding boxes per square.

All the bounding boxes predicted by YOLO, where the thickness of each bounding box corresponds with that bounding box's "confidence" output. YOLO predicts numerous bounding boxes per square, as specified by the "B" parameter. From the YOLO paper.

If we only look at bounding boxes with high confidence scores, and the classes of the grid square those bounding boxes correspond to, we get the final output of YOLO.

We can use the bounding box predictions, and the regional class probabilities, to create our final bounding boxes.

We’ll re-visit the idea of "confidence" when we explore how YOLO is trained. For now, let’s take a step back and look at YOLO from a higher level.

The Architecture of YOLO

The cool thing about YOLO is that it does object detection in "one look". In one pass of the model both subtasks of class prediction and bounding box prediction are done simultaneously.

We unify the separate components of object detection into a single neural network. – The YOLO paper

Essentially, this is done by YOLO outputting the S x S x C class predictions and the S x S x B x 5 bounding box predictions all in one shot, meaning YOLO outputs S x S x (B x 5 + C) things.

An example of YOLO's outputs if S=3 (the division of the image), C=3 (the number of classes, for instance "Dog", "Cat", and "background"), and B=2 (how many bounding boxes of "x", "y", "width", "height" and "confidence" exist within each division). Keep in mind that this data shape is chiefly for demonstrative purposes. The output of YOLO can be three dimensional (S, S, B x 5 + C), or it could just as well be a vector of length S x S x (B x 5 + C). Either approach functions equivalently, and is ultimately an implementation detail.
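As a sketch of how such a flat output could be split back into its two pieces, here it is in PyTorch with the values the paper used for Pascal VOC (S=7, B=2, C=20) rather than the toy numbers above. Whether the boxes or the classes come first in memory is an implementation choice:

```python
import torch

S, B, C = 7, 2, 20                             # values used in the YOLO paper
pred = torch.randn(S * S * (B * 5 + C))        # one flat prediction vector

pred = pred.reshape(S, S, B * 5 + C)           # back to a grid
boxes = pred[..., :B * 5].reshape(S, S, B, 5)  # x, y, w, h, confidence per box
classes = pred[..., B * 5:]                    # C class scores per grid cell
print(boxes.shape, classes.shape)              # (7, 7, 2, 5) and (7, 7, 20)
```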

Now that we understand the subtasks YOLO solves, and how it formats an output to solve those problems, we can start making sense of the actual architecture of YOLO.

As we’ve discussed, YOLO is a convolutional network that distills an image into a dense representation, then uses a fully connected network to construct the output.

The architecture diagram of YOLO. Source

In reality, the diagram above is somewhat of a simplification of the actual architecture, which is written out below the diagram. Let’s go through a few layers of YOLO to build a more thorough understanding.

First of all, the input image is an RGB image, meaning it has some width and height and three color channels. YOLO is designed to receive a square RGB image of width 448 and height 448. If you want to run YOLO on a smaller or larger image, you can simply resize the image to 448 x 448.

A conceptual diagram of turning an image into the input of YOLO. The image gets squashed into 448x448, then each color channel represents some depth of the input.

The first layer of YOLO is listed as 7x7x64-s-2, which means we have a convolutional layer with 64 kernels of size 7×7 that have a stride of 2.

When a convolutional model has multiple kernels, each of those kernels consists of different learned parameters, and they work together to make the final output.

A conceptual diagram of multiple kernels working together to construct an output. From my article on CNNs

In this particular layer, the kernels have a width and height of 7, and instead of moving by one space at a time, they move by two.

Conceptual diagram of strides of length one (left), two (middle), and three (right), all for a kernel of size two. From my article on convolutional networks.

So, the first convolutional layer consists of 64 kernels of size 7×7 and a stride of 2.

The first layer of YOLO, which we just discussed. Source.

After the first convolutional layer, the data is passed through a max pool of size 2×2 with a stride of 2.

Recall that max pooling takes some window of data and only preserves the maximum value, effectively downsampling. In the layer we're discussing these windows would be 2x2, instead of 3x3 as shown in the image. From my article on CNNs.

The stride-2 convolution halves the spatial dimensions, the stride-2 max pool halves them again, and the 64 kernels convert our 3 color channels into 64 feature channels. This results in a tensor of shape 112 x 112 x 64.

The first two layers of YOLO acting on the input. Both the convolution and the max pooling reduce the spatial dimension by half, and applying 64 kernels changes the depth of the data to a depth of 64.
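We can verify those shapes with a quick PyTorch trace. The weights are random (this is just a shape check, not the trained model), and the padding of 3 is an assumption chosen so that the spatial size halves exactly:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 448, 448)                              # one 448x448 RGB image
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # the 7x7x64-s-2 layer
pool = nn.MaxPool2d(kernel_size=2, stride=2)                 # 2x2 max pool, stride 2

x = conv(x)
print(x.shape)  # torch.Size([1, 64, 224, 224])
x = pool(x)
print(x.shape)  # torch.Size([1, 64, 112, 112])
```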

The YOLO architecture has many layers, many of which behave fundamentally similarly. I won’t bore you with an exploration of every single layer, but there are a few design details which are worth highlighting.

The idea of a 1 x 1 convolution is interesting, and kind of flies in the face of a normal intuition around convolution. If a kernel is only looking at one pixel, what’s the point?

Recall that a convolution applies a kernel not only to some n x n region of the image, but also to all input channels.

While convolution is typically drawn as a 2D matrix that propagates over an input, in reality kernels are 3D, and apply over all channels in the input. I cover the dimensionality of kernels extensively in my article on CNNs. This animation shows a 3x3 kernel applied over the three input channels. A 1x1 kernel in this context would be a vector that is applied to all input channels.

So a 1 x 1 convolution is essentially a filter that operates over only the channel dimension, and not the spatial dimension.
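A quick sketch showing that a 1×1 convolution changes only the channel dimension and leaves the spatial grid untouched (the channel counts here are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)           # 256 channels, 28x28 spatial grid
mix = nn.Conv2d(256, 128, kernel_size=1)  # 1x1 kernel: a per-pixel channel mix
print(mix(x).shape)                       # torch.Size([1, 128, 28, 28])
```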

You may also wonder "why did the researchers who made YOLO settle on all of these numbers? Why 192 filters vs 200 here? Why a stride of 2 here and a stride of 1 there?"

Recall the architecture diagram of YOLO. A natural question might be "how did the researchers come up with all these numbers?" Source

The honest truth is that researchers usually use a combination of what others have done combined with concepts that seem cool to them. YOLO could probably have some of its network details changed without a major impact on performance. If you choose to build a model like this yourself, you often start with a baseline model and play around with different parameters to see if you can get something better.

One design constraint that YOLO does inherit from many CNNs is the concept of an information bottleneck. Throughout successive layers, the total amount of information is reduced. This is a pervasive concept in machine learning: by passing data through a bottleneck, you force a model to trim away irrelevant information and distill an input into its essence.

The total number of values used to describe the image at each layer in YOLO. As can be seen, YOLO is allowed to represent the image in a much larger form initially, but successive layers force the same image to be represented in smaller and smaller representations. This forces the convolutional layers to work together to distill the image into its essence.

YOLO heavily reduces the spatial dimension while expanding the channel dimension, essentially implying that YOLO aggressively breaks down each region of an image while increasing the number of features used to represent that region.

Training YOLO

We’ve covered the nature of the output, as well as the structure of the model. Now let’s explore how the model is trained.


YOLO is trained, like many AI models, via "back-propagation". Back propagation generally consists of:

  • Calculating how wrong the prediction is
  • Calculating the influence each parameter had on the current output
  • Using the wrongness of the model, with the influence of the parameters, to update the parameters.
A conceptual diagram of back propagation. In this diagram the "Gradients" represent the level of impact model parameters have on the output, and the "Loss Function" represents how wrong the model is, and in which way. This particular diagram is based on LLMs (Large language models), but the concept is exactly the same for computer vision models like YOLO. From my article on LoRA

Going over back propagation in depth is out of scope for this article, and also isn’t necessary. Many machine learning engineers and researchers rely on frameworks like PyTorch to do a lot of this stuff for them. The only critical thing we need to understand for our purposes is something called a "Loss Function".

The loss function governs how "wrongness" in a model is defined, which governs the entire training process. The whole point of training is to optimize the model to get better results from the loss function.

Because YOLO has such a complex output, the loss function is pretty complicated.

YOLO's loss function. Source.

I know the math might seem a bit daunting, but we'll take it step by step. As we explore the loss function, keep in mind that we're training the model on an annotated dataset, which means we know all classes and bounding boxes ahead of time. The loss function compares our model's predictions against these known values, so reducing the loss means our model is getting better.

Examples from the Pascal VOC dataset, which YOLO was trained on. Source.

The loss can be conceptualized as the sum of two things: how bad the model is at predicting classes, and how bad the model is at predicting bounding boxes.

The total loss from YOLO consists of two things: The loss from class prediction and the loss from bounding box prediction.

Let’s unpack each of these losses, one at a time.

Class Loss

The expression for class loss is the following:
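For reference, here is that term written out in the paper's notation, where the indicator term is 1 if an object appears in grid cell i (and 0 otherwise), p_i(c) is the true class probability, and the hatted symbol is the model's prediction:

$$\sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} \big( p_i(c) - \hat{p}_i(c) \big)^2$$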

Let's work through this loss function element by element. Σ (capital sigma) denotes summation. The first Σ means we'll be iterating through all of the S x S grid squares in the image and adding up a value from each square.

Recall that YOLO divides images into an S x S grid. The class loss is calculated by adding up the loss from each square. From the YOLO paper.

Not every square has an object in it though. YOLO only learns class predictions from squares that have an object inside of them.

So, together, you can think of this expression as summing the loss from all the squares that have an object inside of them.

This expression adds the class loss from all of the squares which have a class in it.

The loss from each of the squares that have an object inside of them is this expression:

This essentially says "for each class, subtract the actual probability from the predicted probability, then square the result". For example, suppose the predicted classes for a given square were the following:

dog = 0.9
cat = 0.05
zebra = 0.05

But the square actually contains a zebra

dog = 0
cat = 0
zebra = 1

Then the class loss for that particular square would be

(0.9-0)^2 + (0.05-0)^2 + (0.05-1)^2 = 0.81 + 0.0025 + 0.9025 = 1.715

Looking at it from a high level, the class loss looks at all the squares that contain an object, and adds up how wrong all the class predictions were for all of those squares.

Bounding Box Loss

The bounding box loss can be further broken down into two distinct loss functions: one for the "confidence" of a bounding box, and one for the bounding box coordinates.

Recall that the bounding boxes have an x and y coordinate, a width and a height, and a confidence.

Various bounding boxes, where the most confident bounding boxes are bolder. From the YOLO paper.

The coordinate loss is pretty straightforward if you understand the class loss.

First of all, the only reason this loss function is broken up into two parts is that it's long, not that it's especially complicated.

Immediately we see a very similar expression to one we discussed recently.

This sums the loss over all squares in the image (S x S) and all bounding boxes in each of those squares (B). However, it only adds the loss if an object exists in that square and that particular bounding box is responsible for that object.

The coordinate loss for a particular bounding box is very straightforward. It's just like the difference of class probabilities discussed in the previous section, but for the bounding box dimensions (width, height, x, and y).
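For reference, here is the coordinate portion of the loss as written in the paper, where the indicator term is 1 only when the j-th bounding box in cell i is responsible for an object:

$$\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \Big[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \Big] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \Big[ \big(\sqrt{w_i} - \sqrt{\hat{w}_i}\big)^2 + \big(\sqrt{h_i} - \sqrt{\hat{h}_i}\big)^2 \Big]$$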

We'll discuss why the width and height are square rooted, and why the loss is scaled by λ, in the next section.

So that covers the actual bounding box coordinates. Recall that bounding box loss comprises both the coordinates and the confidence, where the confidence is some value that allows a bounding box to say "I should be responsible for this particular object".

The confidence loss is calculated with the following expression:
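For reference, here is that term as written in the paper, where C is the confidence and the two indicator terms select responsible and non-responsible bounding boxes respectively:

$$\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \big( C_i - \hat{C}_i \big)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \big( C_i - \hat{C}_i \big)^2$$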

We have the same Σ expressions we’ve seen previously, where we’re adding the loss across all cells and all bounding boxes in each cell.

Except this time we have two separate filters: one for when the bounding box is supposed to be responsible for an object, and one for when it is not.

A question might arise at this point. What the class probabilities should be is simple: whatever the dataset says the class is for a given cell in an image. What the bounding box dimensions should be is also simple: whatever the dataset says they should be. But how do we compute what the confidence score of a bounding box should be? How do we say this bounding box should have been responsible, and that one should not have been, for a given training example?

This is done by assigning responsibility to whichever bounding box happens to have the highest "Intersection Over Union" (IOU) with the ground truth, where the IOU is essentially a measure of how well two bounding boxes overlap.

For two bounding boxes, the IOU is the area of their intersection divided by the area of their union.
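A minimal IOU implementation, assuming boxes are given as (x1, y1, x2, y2) corners. YOLO itself parameterizes boxes by center, width, and height, so a real implementation would convert first:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.14
```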

So, during training, the bounding box that should be responsible for detecting an object is whichever one already happens to be best at detecting that object. All other bounding boxes in that cell should not be responsible for that object.

In YOLO, the confidence should be 0 if a bounding box is not responsible for an object, otherwise the confidence should be the IOU of the bounding box with the ground truth bounding box.

Imagine a ground truth bounding box from the training dataset (black), and the model predicts two bounding boxes in the corresponding grid cell (blue and red). Because the blue box has the greatest IOU, the blue box should be responsible for this particular object. Thus, the blue bounding box's confidence should be the IOU between itself and the ground truth bounding box. The red bounding box is not responsible for any object, so its confidence should be zero.

It’s useful to reflect on bounding box dimension loss at this point. While, at any given step, the confidence should be whatever IOU the bounding box has vs the ground truth, over successive training steps the bounding boxes should be getting better. So, one would expect the model to learn to increase the confidence scores of bounding boxes as the model learns to predict better bounding boxes.

There's a fascinating quirk of this system. Because YOLO is free to choose which bounding box is responsible, the bounding box predictors in each cell usually end up specializing, each getting better at objects of certain sizes.

Anyway, that's the loss function. Take a look at it, in all its glory, and reflect on how this one function is used to guide the training of the entire model.

The full loss function from the YOLO paper. x and y are bounding box location, w and h are bounding box size, C is bounding box confidence, and p(c) is the predicted probability of a given class. We'll cover λ, as well as some other nuances, in the next section. From the YOLO paper.

Intricacies of Training YOLO

While discussing the loss function I glossed over two intricacies; λs (lambdas) throughout the loss function, and the fact that the width and height of bounding boxes are square rooted.

λs and square roots within the loss function. Source

The λs in the loss function are scaling parameters that tune the loss function to better train the model. If we think about some image that might appear in the dataset, most grid cells probably aren’t responsible for an object.

In this image, out of all grid squares, only two are responsible for an object.

Two quirks might arise from this reality:

  • Instead of learning the difficult task of Object Detection, the model might just learn that most bounding boxes should have a confidence of zero, and would thus set all confidences to zero.
  • There are (S x S x B) predicted bounding boxes, but only a few of them matter. The model is trying to minimize the entire loss function, so if the loss contributed by bounding box dimensions is small relative to the rest, the model may hardly try to improve bounding box predictions throughout the training process.

The λ parameters solve both of these problems. By making λ_noobj smaller and λ_coord larger, the loss function can be tuned so that squares containing objects matter more, and so that correct bounding box predictions are weighted more heavily. The authors of YOLO settled on the following:

λ_noobj = 0.5
λ_coord = 5

Another topic we glossed over is the square root of the width and height of the bounding boxes. Refer to the following examples of ground truth and predicted bounding boxes:

Two examples of imperfect predicted bounding boxes (red) vs ground truth bounding boxes (blue). Source1, source2

Most people would consider the left example to be worse than the right one, because the error in width is larger relative to the size of the object. However, in absolute terms both predictions are equally inaccurate: the difference between the ground truth width and the predicted width is around 180 pixels in both images.

As humans, we don't care about the absolute difference in bounding box size; we care about the relative difference. The bounding box on the left is smaller, so the acceptable error in width and height is smaller. The YOLO loss function takes the square root of the width and height to reduce the impact of larger bounding boxes, thereby making it more important to accurately predict small bounding box sizes.
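To make that concrete with some made-up numbers: suppose the predicted width is off by 20 pixels in both cases, once on a roughly 40-pixel-wide object and once on a roughly 400-pixel-wide object. Without the square root, both errors would contribute $20^2 = 400$ to the loss. With it, the error on the small box counts far more:

$$\big(\sqrt{40} - \sqrt{20}\big)^2 \approx 3.43 \qquad \text{vs.} \qquad \big(\sqrt{400} - \sqrt{380}\big)^2 \approx 0.26$$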

Inference

So, we’ve trained YOLO. We gave it a bunch of images with bounding boxes, and computed the loss to inform the model how to update its parameters via back propagation.

Now that we have a trained YOLO model, let’s explore how to use it.

After we run an image through YOLO, we get back a tensor like this:

The output of YOLO.

First of all, we can figure out which class YOLO thinks corresponds to each grid cell by simply finding the maximum class prediction in each grid cell.

Class probabilities in an example. From the YOLO paper.

Each of those grid cells has a bunch of bounding boxes with confidence scores.

Various bounding boxes, where the most confident bounding boxes are bolder. From the YOLO paper.

We can simply set some confidence threshold, like 0.8, to only preserve highly confident bounding boxes. And thus we have our output.

High confidence bounding boxes. From the YOLO paper.

There are some quirks to this strategy, however. YOLO can output multiple bounding boxes per square, which facilitates the detection of small and close-by objects.

A conceptual diagram of YOLO predicting bounding boxes for multiple items per SxS grid cell. Source.

Unfortunately, a common issue is that multiple bounding boxes might be very confident about the same object.

Multiple bounding boxes confident on the same object. Source.

First of all, we can define a threshold for bounding boxes that overlap too much. If the IOU between two bounding boxes is above 70%, for instance, then we can say they’re trying to predict the same object.

Once we’ve identified a group of bounding boxes as attempting to predict the same thing, we can apply "non-max suppression". Basically, we just disregard all but the bounding box with the highest confidence.

All three bounding boxes have an IOU over 70% with each other, and are thus identified as predicting the same thing. We keep only the bounding box with the greatest confidence (blue) and disregard the bounding boxes with less confidence (red).
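A minimal sketch of that greedy procedure, assuming an `iou` helper like the one sketched earlier and detections given as (box, confidence) pairs:

```python
def non_max_suppression(detections, iou_threshold=0.7):
    """Greedy NMS: keep the most confident box, drop overlapping rivals.

    `detections` is a list of (box, confidence) pairs; `iou` is assumed to be
    a helper like the one sketched earlier in this article.
    """
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while detections:
        best = detections.pop(0)  # the most confident remaining box
        kept.append(best)
        # Discard every remaining box that overlaps the kept box too much
        detections = [d for d in detections if iou(best[0], d[0]) < iou_threshold]
    return kept
```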

Conclusion

And that's it. In this article we covered YOLO, a model that unified object detection into a single pass of a network, allowing it to be both computationally efficient and highly performant. In understanding how YOLO achieved this, we explored the architecture of the model (including its large output tensor), the loss function, and some realities about training and running inference with the model.

This article focused on the first YOLO paper. There have been many improvements made to YOLO since its publication; I'm planning on covering those in another IAEE article.

If you’re interested in YOLO, and want to learn more about some of the ethical concerns it raises, I have an opinion piece on that.

Responsibility in Artificial Intelligence

Join IAEE

At IAEE you can find:

  • Long form content, like the article you just read
  • Thought pieces, based on my experience as a data scientist, engineering director, and entrepreneur
  • A discord community focused on learning AI
  • Lectures, by me, every week
Join IAEE
