We learned what image classification is and how to create image classification models in my previous post.
Now it’s time to go one step further and learn about Object Detection.
Object Detection (Object Recognition)
- While the image classification problem focuses on assigning a single class to an image, an image may contain more than one object we are searching for. In Object Recognition, our task is to find all of them and enclose each one in the most appropriate box.
Bounding Box – ROI (Region of Interest): a new term we need to meet for object recognition. While we try to recognize objects, we use bounding boxes in which an object will possibly be detected. We will learn later how to obtain boxes as close as possible to the detected object.

- As you may notice, object recognition is a bit more complex than image classification, since we try to both localize and recognize the objects in an image.

To summarize these 3 different tasks:
Image Classification: Predict the class of an object in an image.
Object Localization: Locate the presence of objects in an image and indicate their location with a bounding box.
Object Detection: Locate the presence of objects with bounding boxes and detect the classes of the located objects in these boxes.
The Object Recognition neural network architectures created so far are divided into 2 main groups: Multi-Stage vs Single-Stage Detectors.
Multi-Stage Detectors
- RCNN 2014
- Fast RCNN 2015
- Faster RCNN 2015
Single-Stage Detectors
- SSD 2016
- YOLO 2016
- YOLOv2 2016, YOLOv3 2018, YOLOv4 2020, YOLOv5 2020
Let’s start by examining Multi-Stage Detectors 🔮
Multi-Stage Detectors
As we understand from the name, these are object detectors with 2 separate stages. In general, they start by extracting some regions of interest (bounding boxes), then apply classification on these boxes to obtain the final result. This is the reason they are called Region-Based Convolutional Neural Networks (RCNN).
- RCNN
As the first member of this family, RCNN shows us the base methodology; we will see how its disadvantages are addressed in later versions. The methodology of this model is as follows:
- Region Proposal Extraction from Input Image using Selective Search
The selective search algorithm works by applying a segmentation algorithm to find blobs in an image and figure out what could be an object. Selective search then recursively combines these groups of regions into larger ones to create about 2,000 areas to be investigated.

Selective Search For Segmentation
In the following post, I mentioned different algorithms to apply image segmentation without using AI-based methods but using Classical Computer Vision-Based methods.
Image Segmentation with Classical Computer Vision-Based Approaches
Selective Search being in the same category, it applies the following steps to segment the image (a runnable sketch follows the list):
- First, the similarities between all neighboring regions are calculated.
- The two most similar regions are grouped together, and new similarities are calculated between the resulting region and its neighbors.
- This process is then repeated until the whole image is covered in a single region.
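If you want to see these steps in action, OpenCV’s contrib package ships a ready-made implementation of Selective Search. A minimal sketch, assuming opencv-contrib-python is installed and "image.jpg" is a placeholder input path:

```python
import cv2

img = cv2.imread("image.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # trades some quality for speed
rects = ss.process()               # array of (x, y, w, h) region proposals
print(len(rects))                  # typically on the order of 1,000-2,000 boxes
```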
- Feature Extraction using a CNN on each ROI coming from the previous step
After extracting the roughly 2,000 boxes that may contain an object according to the segmentation, a CNN is applied to all these boxes one by one to extract the features used for classification in the next step.
- Classification with SVM and Bounding Box Prediction
Finally, using an SVM (support vector machine) for classification and a bounding box regressor, the model gives us the final bounding boxes along with the detected classes. The bounding box regressor’s task is simply to refine the proposed box so that it encloses the object better.

This is the main methodology of RCNN. Now let’s take a look at its weak points:
- Selective Search is already a complex algorithm, and using it just for the first step increases the model’s computational cost enormously. -> SLOW
- Obtaining 2,000 regions and applying feature extraction to them 1 by 1 is again computationally expensive! -> SLOW
All of this adds up to about 47 seconds to detect objects in 1 image, so it’s not possible to use this model for real-time object detection tasks!
- It’s not an end-to-end trainable model: the Selective Search algorithm is not a trainable method, which makes it impossible to improve the region proposals by training RCNN. -> NOT TRAINABLE in some parts
- Using an SVM is another reason for not having an end-to-end architecture; we need to train the SVM and the CNN separately, which makes training more difficult.
As a result, although it was a state-of-the-art architecture at the time, with better accuracy than previous models, it’s clear that the model needed to be improved, especially for speed. For this reason, we find ourselves examining the Fast RCNN model, which is an improved version of RCNN.
- Fast RCNN
- It swaps the order of the region proposal step and feature extraction: we first apply the CNN to the input image, then extract the ROIs. This way, we don’t apply the CNN to 2,000 different regions but only once, which increases the speed of the model. -> NOT SO SLOW ANYMORE
- Another change that comes with Fast RCNN is using a fully connected layer with a softmax output activation instead of the SVM, which makes the model more integrated, a one-piece model. -> TRAINING IS SINGLE-STEP
- To adapt the size of each region coming from the region proposals to the fully connected layer, ROI max pooling is applied (see the sketch after this list).
- Both the bounding box regression and classification tasks are implemented with fully connected layers, so a "multi-task loss" is applied, where the costs of these 2 different outputs are combined to update the model. -> TRAINING IS SINGLE-STAGE
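To make ROI max pooling and the single-step training idea more concrete, here is a minimal PyTorch sketch; the tensor shapes, the 512-pixel image size, and the loss weight are illustrative assumptions, not values from the paper:

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 32, 32)          # CNN applied once to the whole image
rois = torch.tensor([[0, 40., 40., 200., 160.]])   # (batch_idx, x1, y1, x2, y2) in image coords
# Pool each variable-sized ROI to a fixed 7x7 grid so it fits the FC layers;
# spatial_scale maps image coordinates onto the 32x32 feature map (512px image assumed).
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=32 / 512)
print(pooled.shape)                                # torch.Size([1, 512, 7, 7])

# Multi-task loss: the classification and box regression costs are simply
# combined, so the whole network can be updated in a single training step.
# loss = loss_cls + lam * loss_box
```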
As a result, we obtain training 9× faster than RCNN and 0.32 seconds for detection on 1 image.
Although this is good news, Fast RCNN still carried one drawback of RCNN! Even though we don’t apply the CNN to each proposed region in Fast RCNN, extracting 2,000 region proposals using selective search is still a problem that makes the model unnecessarily complex.
- Faster RCNN
As you can imagine, the main change in Faster RCNN is to use a method other than the Selective Search algorithm to propose regions: a separate neural network that learns the region proposals.
Therefore, the model first uses 1 CNN to obtain region proposals, then follows exactly the same logic as Fast RCNN to detect objects: it uses these proposals to extract features with the CNN, classifies them with fully connected layers, and refines the ROIs using the bounding box regressor.
For your information, Faster RCNN is called RPN + Fast RCNN in some resources, since the only update from Fast to Faster RCNN is the region proposal method.
RPN: Region Proposal Network. In this network, used in Faster RCNN, the main goal is to predict the offsets of anchor boxes to obtain the final bounding boxes, where anchor boxes are predefined boxes with default sizes and shapes.
With these step-by-step improvements, we see the speed of the model increase a lot, and the accuracy too, especially in Faster RCNN.

mAP: mean Average Precision is the most common accuracy metric for object detection models: the average precision is calculated for each class, then the mean over all classes is taken. For more information about the mAP calculation, you can take a look at this great post.
Single-Stage Detectors
Now it’s time to examine single-stage detectors, where the box prediction and classification are carried out at the same time, in contrast to multi-stage detectors.
SSD: Single Shot MultiBox Detector
- A standard pre-trained neural network, like VGG16, VGG19, or ResNet, is used as a feature extractor.
- After this CNN, some additional convolutional layers are applied to obtain feature maps of different sizes. Some of these feature maps, along with the output maps of the feature extraction network, are sent to the classification part once the bounding boxes are obtained; in the SSD paper this prediction is done with small convolutional filters rather than fully connected layers.
- For each cell in these feature maps, we extract 4 or 6 bounding boxes and send them directly to the prediction layers. In contrast to multi-stage detectors, we don’t run a separate region proposal step for the bounding boxes: we just take the feature maps that come from different convolutional layers and transform them into grids of cells, where each pixel is 1 cell. Using these cells as center points, we produce 4 or 6 different bounding boxes.
- To produce the bounding boxes, we have anchor boxes: predefined boxes with fixed sizes and shapes.

- A non-maximum suppression (NMS) mechanism is applied as the last step: if there is more than 1 box pointing to the same object, it keeps only the best fit.

- To learn more about the Non-Maximum suppression mechanism and how to choose the best-fit bounding box, you can read this nice post:
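Meanwhile, here is a minimal NumPy sketch of the NMS idea; the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are common conventions I’m assuming here:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
    return keep  # indices of the surviving boxes
```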
Let’s visualize SSD architecture:

- We see that 8,732 bounding boxes come from the whole architecture, and we need to decide for each box whether there is a class object in it or not. The reason we have 8,732 bounding boxes is:
Conv4_3: 38×38×4 = 5,776 boxes (4 boxes for each cell)
Conv7: 19×19×6 = 2,166 boxes (6 boxes for each cell)
Conv8_2: 10×10×6 = 600 boxes (6 boxes for each cell)
Conv9_2: 5×5×6 = 150 boxes (6 boxes for each cell)
Conv10_2: 3×3×4 = 36 boxes (4 boxes for each cell)
Conv11_2: 1×1×4 = 4 boxes (4 boxes for each cell)
5,776 + 2,166 + 600 + 150 + 36 + 4 = 8,732
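The arithmetic is easy to double-check in a couple of lines of Python:

```python
# (grid size, boxes per cell) for each SSD feature map listed above
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
total = sum(size * size * boxes for size, boxes in feature_maps)
print(total)  # 8732
```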
- SSD achieves 74.3% mAP on VOC2007 at 59 FPS, whereas Faster RCNN had 10 FPS. This is very impressive progress for speed, right?!
FPS vs Seconds per Image: We compared the RCNN family using the "second" unit for their speed, since their papers mostly use this one. Seconds per image tells how many seconds it takes to predict 1 image, so the lower the value, the better the performance. On the other hand, FPS means frames per second, and this unit tells how many frames (images) we can predict in 1 second, so the higher the value, the better the performance.
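Since the two units are just reciprocals of each other, converting between them is a one-liner (using numbers from this post):

```python
def to_fps(seconds_per_image):
    return 1.0 / seconds_per_image

print(to_fps(0.32))  # Fast RCNN: 0.32 s/image is roughly 3 FPS
print(1.0 / 59)      # SSD: 59 FPS is roughly 0.017 s/image
```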
YOLO: You Only Look Once
The basic idea is very similar to SSD; here are some differences that YOLO has:
- It uses a custom network based on the GoogLeNet architecture, named Darknet. This Darknet architecture consists of 24 convolutional and 2 fully connected layers.
- The convolutional part is pre-trained on the ImageNet-1000 dataset, and the feature maps that come from the CNN are transformed into grid cells just like in SSD. Fully connected layers are used to predict the bounding box coordinates from these feature map grids.
- In the end, we have 98 box predictions per class (a 7×7 grid with 2 boxes per cell gives 7×7×2 = 98), whereas we had 8,732 per class in SSD. But YOLO has 45 FPS and 63.4 mAP, so both speed and accuracy are lower than SSD’s.
YOLO was published before SSD, so we can say that it was the first single-stage detector, and while SSD doesn’t have any updated version, YOLO has had 5 different versions so far. Let’s check how this model got improved over time.
YOLO had some weak points:
- Localization (so there was a problem with the bounding boxes…)
- Scale (objects whose sizes differed from those in the training dataset were an issue)
YOLOv2
67 FPS and 76.8 mAP were obtained! A nice improvement, but how?
- Batch Normalization is added to convolutional layers.
- High-Resolution Classifier: the original YOLO trains the classifier network at 224 × 224 and increases the resolution to 448 for detection, which means the network has to simultaneously switch to learning object detection and adjust to the new input resolution. For YOLOv2, the classifier network is instead fine-tuned at the full 448 × 448 resolution first.
- Anchor Boxes: instead of predicting bounding box coordinates directly with a fully connected layer, anchor boxes are used and the network predicts offsets from them. But not exactly like other anchor box approaches: instead of choosing the priors by hand, they run k-means clustering on the training set bounding boxes to automatically find good priors for the anchor boxes (a minimal sketch of this clustering follows the list).
- The base architecture changed a little and is now named Darknet-19, with 19 convolutional layers and 5 max-pooling layers; the fully connected layers are removed, as I mentioned above 🐟
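Here is a minimal NumPy sketch of the anchor clustering idea from YOLOv2, using 1 − IoU between (width, height) pairs as the distance; the initialization, iteration count, and helper names are illustrative assumptions:

```python
import numpy as np

def iou_wh(whs, centroids):
    """IoU between (w, h) pairs, assuming all boxes share the same center."""
    inter = np.minimum(whs[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(whs[:, None, 1], centroids[None, :, 1])
    union = whs[:, 0] * whs[:, 1]
    union = union[:, None] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def anchor_kmeans(whs, k=5, iters=50):
    """whs: float array (N, 2) of ground-truth box widths and heights."""
    rng = np.random.default_rng(0)
    centroids = whs[rng.choice(len(whs), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IoU distance.
        assign = np.argmin(1 - iou_wh(whs, centroids), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = whs[assign == j].mean(axis=0)
    return centroids  # k anchor (width, height) priors
```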
YOLOv3
- It uses independent logistic classifiers (sigmoid!) instead of the softmax of previous versions, making it possible for 1 object to belong to more than 1 class. For example, 1 detected object can be both "dog" and "animal", right? Every class prediction higher than a given threshold passes to the output, so you can see 1 bounding box representing 2 classes at the same time! (See the tiny example after this list.)
- Like YOLOv2, YOLOv3 assigns only 1 bounding box prior (the best one) to each ground-truth object, and only that box enters the loss calculation for bounding box prediction, objectness, and class prediction. A prior that is not the best but still overlaps the ground truth above a given threshold (0.5 is used) is simply ignored, while all remaining priors incur only the objectness loss. (Take a look at the Bounding Box Prediction subtitle in the paper!)
- As the feature extractor, Darknet-53 is used instead of Darknet-19; it is a somewhat deeper model, more accurate but less fast.
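Here is the tiny example promised above, showing why independent sigmoids allow multi-label outputs; the logits and class names are made up for illustration:

```python
import numpy as np

logits = np.array([2.0, 1.2, -3.0])  # scores for one box: dog, animal, cat
probs = 1 / (1 + np.exp(-logits))    # independent sigmoid per class
print(probs > 0.5)                   # [ True  True False]
# One box can be labeled both "dog" and "animal" at the same time,
# which a softmax over the classes would not allow.
```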
In the paper, I couldn’t find accuracy and speed metrics for the VOC dataset, which I have used to compare the previous models until now, so I can only share some results on the COCO dataset:
- YOLOv3-320 gives 51.5 mAP with an inference time of 22 ms
- YOLOv3-416 gives 55.3 mAP with an inference time of 29 ms
- YOLOv3-608 gives 57.9 mAP with an inference time of 51 ms
whereas
- Faster RCNN gives 59.1 mAP with an inference time of 172 ms
- YOLOv2-608 gives 48 mAP with an inference time of 40 ms
Note that -320, -416, and -608 refer to the input image resolution.
⏰ 👽 A little pause before going further! The more I examine the papers, the more I realize that some of the "accuracy" metrics may be confusing, since some papers give VOC results and some give COCO results, and still with different architectures (small version, bigger version, experiments with different backbones, etc.). I can say that Faster RCNN usually has better accuracy than the YOLO models, while the YOLO models are almost always faster, at least up to this point!
Backbone: the feature extractor part of object detection models. Usually one of the image classification architectures we saw in the previous post, like VGG, ResNet, etc.
YOLOv4
- YOLOv4 improves on the YOLOv3 model by using BoF (bag of freebies) and several BoS (bag of specials). The BoF improve the accuracy of the detector without increasing the inference time; they only increase the training cost. On the other hand, the BoS increase the inference cost by a small amount while significantly improving the accuracy of object detection.
- The model uses a Cross Stage Partial Network (CSPNet) in Darknet, creating a new feature extractor backbone called CSPDarknet53. Its convolutional architecture is based on a modified DenseNet.
- As a result, YOLOv4 is 10% more accurate and 12% faster than YOLOv3 in terms of FPS.
Bag of Freebies: "We call these methods that only change the training strategy or only increase the training cost as 'bag of freebies'." [1] For more detailed research about the effects of the bag of freebies on object detection model training, please refer to this paper: https://arxiv.org/pdf/1902.04103.pdf
YOLOv5
This one is maybe the most discussed and notable YOLO model, since it doesn’t have an official paper but has very impressive results, obtained by the start-up company Roboflow, which compared YOLOv5 with YOLOv4.
It is implemented in PyTorch instead of Darknet, which is written in C.
According to their results:
- YOLOv5 is almost 3x faster than YOLOv4!
- YOLOv5 is nearly 90% smaller than YOLOv4!
and some key features of YOLOv5 are as follows:
- Uses CSPDarknet53, like YOLOv4
- Mosaic Data Augmentation is added to the model (a simplified sketch follows this list)
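Here is a simplified sketch of the mosaic idea: stitching 4 training images into one 2×2 grid. Real implementations also pick a random center point and remap the bounding box labels; both details are omitted in this toy version:

```python
import numpy as np
import cv2

def simple_mosaic(images, out_size=640):
    """images: list of 4 HxWx3 arrays -> one (out_size x out_size x 3) mosaic."""
    half = out_size // 2
    tiles = [cv2.resize(img, (half, half)) for img in images]
    top = np.hstack([tiles[0], tiles[1]])      # top-left | top-right
    bottom = np.hstack([tiles[2], tiles[3]])   # bottom-left | bottom-right
    return np.vstack([top, bottom])
```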
Before finishing the theoretical part, I would like to add that YOLO has some "tiny" versions, especially used when you want super quick models. Almost every version (v2, v3, v4, v5) of YOLO has its tiny-YOLO variant, with less accuracy but almost 4× the speed. Worth trying for real-time object detection projects!
The theoretical part is done! It’s time to learn how to use these models for real, to do object detection on your own dataset.
For this part, I will share two very good repositories, for Faster RCNN and YOLOv4.
Object Detection with Faster RCNN
I used a custom-implemented repository, since I couldn’t find any official implementation of Faster RCNN. It’s a very simple repo to download and use for training and testing, implemented with Keras.
In the following GitHub link, you will find my repo, forked from the base source, where I added some more properties and a detailed explanation of how to build and use this implementation.
GitHub – YCAyca/Faster_RCNN_for_Open_Images_Dataset_Keras: Faster R-CNN for Open Images Dataset by…
Object Detection with YOLOv4
For YOLOv4, I used the official implementation, which is based on C, and forked it into my GitHub repo, where I added instructions on how to build and use YOLOv4. This implementation was pretty difficult to understand and build, so I strongly recommend taking a look at the repo and following the steps 1 by 1 to be able to train and test on your own dataset.
GitHub – YCAyca/darknet: YOLOv4 / Scaled-YOLOv4 / YOLO – Neural Networks for Object Detection…
Thank you for your attention and I hope it was helpful to start using Object Detection models by yourself!
Selective Search Algorithm official paper: https://ivi.fnwi.uva.nl/isis/publications/2013/UijlingsIJCV2013/UijlingsIJCV2013.pdf
RCNN official paper: https://arxiv.org/pdf/1311.2524.pdf
Fast RCNN official paper: https://arxiv.org/pdf/1504.08083.pdf
Faster RCNN official paper: https://arxiv.org/pdf/1506.01497.pdf
SSD official paper: https://arxiv.org/abs/1512.02325
YOLO official paper: https://arxiv.org/pdf/1506.02640.pdf
YOLOv2 official paper: https://arxiv.org/pdf/1612.08242.pdf
YOLOv3 official paper: https://arxiv.org/pdf/1804.02767.pdf
[1] YOLOv4 official paper: https://arxiv.org/pdf/2004.10934.pdf
YOLOv5 doesn’t have an official paper yet!