Object Detection Basics — A Comprehensive Beginner’s Guide (Part 1)

Learn the basics of object detection, an advanced computer vision task, in this easy-to-understand multi-part beginner's guide

Raghav Bali
Towards Data Science
9 min read · Feb 5, 2024

--

Driving a car nowadays with the latest drive-assist technologies for lane detection, blind spots, traffic signals and so on is pretty common. If we take a step back for a minute to appreciate what is happening behind the scenes, the Data Scientist in us soon realises that the system is not just classifying objects but also locating them in the scene (in real time).

Such capabilities are prime examples of an object detection system in action. Drive-assist technologies, industrial robots and security systems all make use of object detection models to detect objects of interest. Object detection is an advanced computer vision task which involves both localisation and classification of objects.

In this article, we will dive deeper into the details of the object detection task. We will learn about various concepts associated with it to help us understand novel architectures (covered in subsequent articles). We will cover key aspects and concepts required to understand object detection models from a Transfer Learning standpoint.

Key Concepts and Building Blocks

Object detection consists of two main sub-tasks, localization and classification. Classification of identified objects is straightforward to understand. But how do we define localization of objects? Let us cover some key concepts:

Bounding Boxes

For the task of object detection, we identify a given object’s location using a rectangular box. This rectangular box is termed a bounding box and is used for localization of objects. Typically, the top-left corner of the input image is set as the origin, or (0, 0). A rectangular bounding box is defined with the help of its x and y coordinates for the top-left and bottom-right vertices. Let us understand this visually. Figure 1(a) depicts a sample image with its origin set at its top-left corner.

Figure 1: (a) A sample image with different objects; (b) bounding boxes for each of the objects with top-left and bottom-right vertices annotated; (c) an alternate way of identifying a bounding box is to use its top-left coordinates along with width and height parameters. Source: Author

Figure 1(b) shows each of the identified objects with their corresponding bounding boxes. It is important to note that a bounding box is annotated with its top-left and bottom-right coordinates, which are relative to the image’s origin. With these 4 values, we can identify a bounding box uniquely. An alternate method is to use the top-left coordinates along with the box’s width and height. Figure 1(c) shows this alternate way of identifying a bounding box. Different solutions may use different formats, and it is mostly a matter of preference; converting between the two is straightforward, as the sketch below shows.
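
To make the two formats concrete, here is a minimal sketch of the conversion between corner coordinates and the width/height representation (the helper names are illustrative, not taken from any particular library):

```python
# Minimal sketch: converting between the two common bounding box formats.
# Helper names are illustrative, not from any specific library.

def corners_to_xywh(x1, y1, x2, y2):
    """(top-left, bottom-right) corners -> (top-left x, top-left y, width, height)."""
    return x1, y1, x2 - x1, y2 - y1

def xywh_to_corners(x, y, w, h):
    """(top-left x, top-left y, width, height) -> (top-left, bottom-right) corners."""
    return x, y, x + w, y + h

# Example: a box with top-left (50, 80) and bottom-right (200, 260)
print(corners_to_xywh(50, 80, 200, 260))   # (50, 80, 150, 180)
print(xywh_to_corners(50, 80, 150, 180))   # (50, 80, 200, 260)
```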

Apart from the class label, object detection models require bounding box coordinates for each object in every training sample. Similarly, during inference, an object detection model generates bounding box coordinates along with a class label for every identified object.

Anchor Boxes

Every object detection model scans through a large number of possible regions to identify/locate objects in any given image. During the course of training, the model learns to determine which of the scanned regions are of interest and to adjust the coordinates of these regions to match the ground truth bounding boxes. Different models may generate these regions of interest differently, yet the most popular and widely used method is based on anchor boxes. For every pixel in the given image, multiple bounding boxes of different sizes and aspect ratios (ratio of width to height) are generated. These bounding boxes are termed anchor boxes. Figure 2 illustrates different anchor boxes for a particular pixel in the given image.

Figure 2: Different anchor boxes for a specific pixel (highlighted in red) for the given image. Source: Author

Anchor box dimensions are controlled using two parameters: scale, denoted as s ∈ (0, 1], and aspect ratio, denoted as r > 0. As shown in figure 2, for an image of height h and width w and specific values of s and r, multiple anchor boxes can be generated. Typically, we use the following formulae to get the dimensions of the anchor boxes:

wₐ = w · s · √r

hₐ = h · s / √r

where wₐ and hₐ are the width and height of the anchor box respectively. The number and dimensions of anchor boxes are either predefined or picked up by the model during the course of training itself. To put things in perspective, a model generates a number of anchor boxes per pixel and learns to adjust/match them with the ground truth bounding boxes as the training progresses.
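
As a rough illustration of these formulae, the sketch below generates anchor boxes centred on a single pixel for a few assumed scales and aspect ratios (the specific values, and the simple cross-product of scales and ratios, are illustrative choices; real models combine these differently):

```python
import numpy as np

def anchor_boxes_at_pixel(cx, cy, img_w, img_h, scales, ratios):
    """Generate (x1, y1, x2, y2) anchor boxes centred on pixel (cx, cy),
    using w_a = w * s * sqrt(r) and h_a = h * s / sqrt(r)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w_a = img_w * s * np.sqrt(r)
            h_a = img_h * s / np.sqrt(r)
            boxes.append((cx - w_a / 2, cy - h_a / 2,
                          cx + w_a / 2, cy + h_a / 2))
    return np.array(boxes)

# Example: 9 anchors (3 scales x 3 aspect ratios) around pixel (300, 200)
anchors = anchor_boxes_at_pixel(300, 200, img_w=640, img_h=480,
                                scales=[0.2, 0.4, 0.6], ratios=[0.5, 1.0, 2.0])
print(anchors.shape)  # (9, 4)
```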

Bounding boxes and anchor boxes are key concepts for understanding the overall object detection task. Before we get into the specifics of how such architectures work, let us first understand the way we evaluate the performance of such models. The following are some of the important evaluation metrics used:

Intersection Over Union (IOU)

An object detection model typically generates a number of anchor boxes which are then adjusted to match the ground truth bounding box. But how do we know when a match has happened, or how good the match is?

The Jaccard Index is a measure used to determine the similarity between two sets. In the case of object detection, the Jaccard Index is also termed Intersection Over Union, or IOU. It is given as:

IOU = | Bₜ ∩ Bₚ | / | Bₜ ∪ Bₚ |

where Bₜ is the ground truth bounding box and Bₚ is the predicted bounding box. In simple terms, it is a score between 0 and 1, determined as the ratio of the area of overlap to the area of union between the predicted and ground truth bounding boxes. The higher the overlap, the better the score; a score close to 1 depicts a near-perfect match. Figure 3 showcases different scenarios of overlaps between predicted and ground truth bounding boxes for a sample image.

Figure 3: Intersection Over Union (IOU) is a measure of match between the predicted and ground-truth bounding box. The higher the overlap, the better the score. Source: Author

Depending upon the problem statement and the complexity of the dataset, different IOU thresholds are set to determine which predicted bounding boxes should be considered. For instance, an object detection challenge based on MS-COCO uses an IOU threshold of 0.5 to consider a predicted bounding box as a true positive.
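
The IOU computation itself is only a few lines. Here is a minimal sketch for two boxes given in (x1, y1, x2, y2) corner format (the example coordinates are made up):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ground truth vs. prediction (made-up coordinates)
print(iou((50, 80, 200, 260), (70, 100, 210, 280)))  # ~0.66
```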

Mean Average Precision (mAP)

Precision and Recall are typical metrics used to understand the performance of classifiers in a machine learning context. The following formulae define these metrics:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

where TP, FP and FN stand for True Positive, False Positive and False Negative outcomes respectively. Precision and Recall are typically used together to generate a Precision-Recall (PR) curve, which gives a robust quantification of performance. This is required due to the opposing nature of precision and recall, i.e. as a model’s recall increases, its precision starts decreasing. PR curves are used to calculate the F1 score, Area Under the Curve (AUC) or Average Precision (AP) metrics. Average Precision is calculated as the average of precision values at different recall thresholds. Figure 4(a) shows a typical PR curve and figure 4(b) depicts how AP is calculated.

Figure 4: (a) A typical PR curve shows the model’s precision at different recall values. This is a downward-sloping graph due to the opposing nature of the precision and recall metrics; (b) the PR curve is used to calculate aggregated/combined scores such as F1 score, Area Under the Curve (AUC) and Average Precision (AP); (c) mean Average Precision (mAP) is a robust combined metric to understand model performance across all classes at different thresholds. Each colored line depicts a different PR curve based on a specific IOU threshold for each class. Source: Author

Figure 4(c) depicts how the average precision metric is extended to the object detection task. As shown, we calculate the PR curve at different thresholds of IOU (this is done for each class). We then take a mean across all average precision values (one per class) to get the final mAP metric. This combined metric is a robust quantification of a given model’s performance. Narrowing performance down to a single quantifiable metric makes it easy to compare different models on the same test dataset.
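
For intuition, here is a sketch of how a single AP value can be computed from a precision-recall curve using the common "all-point interpolation" approach; the PR points are purely illustrative, and the actual evaluation code for benchmarks like MS-COCO is more involved:

```python
import numpy as np

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (all-point interpolation).
    `recalls` must be sorted in increasing order; both arrays come from
    sweeping the confidence threshold over the model's predictions."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Replace each precision with the max precision to its right
    # (the standard interpolated precision envelope)
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum precision * recall-step over the points where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

# Illustrative PR points for one class at one IOU threshold
recalls = np.array([0.2, 0.4, 0.6, 0.8])
precisions = np.array([1.0, 0.9, 0.7, 0.5])
print(average_precision(recalls, precisions))  # ~0.62
# mAP is then the mean of per-class AP values (often also averaged over IOU thresholds)
```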

Another metric used to benchmark object detection models is frames per second (FPS). This metric measures the number of input images or frames the model can analyze for objects per second. It is an important metric for real-time use-cases such as security video surveillance, face detection, etc.

Equipped with these concepts, we are now ready to understand the general framework for object detection.

Object Detection Framework

Object detection is an important and active area of research. Over the years, a number of different yet effective architectures have been developed and used in real-world settings. The task of object detection requires all such architectures to tackle a common list of sub-tasks. Let us develop an understanding of the general framework before we get to the details of how specific models handle these sub-tasks. The general framework comprises the following steps:

  • Region Proposal
  • Localization and Class Predictions
  • Output Optimization

Let us now go through each of these steps in some detail.

Region Proposal

As the name suggests, the first and foremost step in the object detection framework is to propose regions of interest (ROIs). ROIs are the regions of the input image for which the model believes there is a high likelihood of an object’s presence. The likelihood of an object’s presence or absence is captured by a score called the objectness score. Regions with an objectness score greater than a certain threshold are passed on to the next stage, while the others are rejected.

For example, take a look at figure 5 for different ROIs proposed by the model. It is important to note that a large number of ROIs are generated at this step. Based on the objectness score threshold, the model classifies ROIs as foreground or background and only passes the foreground regions on for further analysis (a minimal filtering sketch follows figure 5).

Figure 5: Region proposal is the first step in the object detection framework. Regions of interest are highlighted as red rectangular boxes. The model marks regions with a high likelihood of containing an object (high objectness score) as foreground regions and the rest as background regions. Source: Author
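
Conceptually, the foreground/background split is just a thresholding operation on the objectness scores. A minimal sketch (the threshold value, boxes and scores are illustrative):

```python
import numpy as np

def filter_proposals(boxes, objectness_scores, threshold=0.7):
    """Keep only region proposals whose objectness score exceeds the threshold."""
    keep = objectness_scores >= threshold
    return boxes[keep], objectness_scores[keep]

# Three made-up proposals in (x1, y1, x2, y2) format with their objectness scores
boxes = np.array([[10, 10, 60, 80], [200, 40, 300, 180], [5, 5, 630, 470]])
scores = np.array([0.92, 0.35, 0.78])
fg_boxes, fg_scores = filter_proposals(boxes, scores)
print(len(fg_boxes))  # 2 proposals survive as foreground
```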

There are a number of different ways of generating regions of interest. Earlier models made use of selective search and related algorithms to generate ROIs, while newer and more complex models make use of deep learning to do so. We will cover these when we discuss specific architectures in the upcoming articles.

Localization And Class Predictions

Object detection models are different from the classification models we typically work with. An object detection model generates two outputs for every foreground region from the previous step:

  • Object Class: This is the typical classification objective of assigning a class label to every proposed foreground region. Typically, pre-trained networks are used to extract features from the proposed region, and those features are then used to predict the class. State-of-the-art models, such as the ones trained on ImageNet or MS-COCO with a large number of classes, are widely adapted via transfer learning. It is important to note that we generate a class label for every proposed region and not just a single label for the whole image (as in a typical classification task).
  • Bounding Box Coordinates: A bounding box is defined as a tuple with 4 values for x, y, width and height. At this stage, the model generates such a tuple for every proposed foreground region as well (along with the object class); a toy two-branch head is sketched below.
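
To make the two outputs concrete, here is a toy two-branch head in PyTorch (the feature dimension, number of classes and the single linear layer per branch are illustrative simplifications, not the design of any specific architecture):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Toy per-region head: one branch predicts class scores, the other
    regresses the 4 bounding box values (x, y, width, height)."""

    def __init__(self, feature_dim=1024, num_classes=80):
        super().__init__()
        self.cls_branch = nn.Linear(feature_dim, num_classes + 1)  # +1 for background
        self.box_branch = nn.Linear(feature_dim, 4)

    def forward(self, region_features):
        # region_features: (num_regions, feature_dim), pooled from a backbone
        class_logits = self.cls_branch(region_features)
        box_preds = self.box_branch(region_features)
        return class_logits, box_preds

head = DetectionHead()
features = torch.randn(8, 1024)   # features for 8 proposed regions
logits, boxes = head(features)
print(logits.shape, boxes.shape)  # torch.Size([8, 81]) torch.Size([8, 4])
```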

Output Optimization

As mentioned earlier, an object detection model proposes a large number of ROIs in step one, followed by bounding box and class predictions in step two. While there is some level of filtering of ROIs in step one (foreground vs. background regions based on objectness score), a large number of regions are still used for predictions in step two. Generating predictions for such a large number of proposed regions ensures good coverage of the various objects in the image. Yet, many of these regions overlap heavily for the same object. For example, look at the 6 bounding boxes predicted for the same object in figure 6(a). This can make it difficult to get an exact count of the different objects in the input image.

Figure 6: (a) An object detection model generating 6 heavily overlapping bounding boxes for the same object; (b) the output optimized using NMS. Source: Author

Hence, there is a third step in this framework which concerns the optimization of the output. This optimization step ensures there is only one bounding box and class prediction per object in the input image. There are different ways of performing this optimization. By far, the most popular method is called Non-Maximum Suppression (NMS). As the name suggests, NMS analyzes all the bounding boxes for each object, keeps the one with the maximum probability and suppresses the rest (see figure 6(b) for the optimized output after applying NMS).
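
A minimal greedy NMS sketch, using the same IOU helper introduced earlier (the boxes, scores and the 0.5 overlap threshold are illustrative):

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two (x1, y1, x2, y2) boxes (same helper as in the IOU section)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    too much, then repeat with the remaining boxes."""
    order = np.argsort(scores)[::-1]  # indices sorted by score, highest first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if order.size == 1:
            break
        rest = order[1:]
        overlaps = np.array([iou(boxes[i], boxes[best]) for i in rest])
        order = rest[overlaps < iou_threshold]  # drop heavily overlapping boxes
    return keep

# Two near-duplicate boxes for one object plus one box for another object
boxes = np.array([[50, 80, 200, 260], [55, 85, 205, 265], [300, 300, 400, 420]])
scores = np.array([0.90, 0.75, 0.80])
print(non_max_suppression(boxes, scores))  # [0, 2] -- the duplicate is suppressed
```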

This concludes a high-level understanding of a general object detection framework. We discussed the three major steps involved in the localization and classification of objects in a given image. In the next article, we will use this understanding to explore specific implementations and their key contributions.
