YOLOv4–5D: An Enhancement of YOLOv4 for Autonomous Driving

You only look once, but at five scales

LA Tran
Towards Data Science

--

Object detection has been one of the hottest topics in deep learning and pattern recognition research over the last few years, and it is a problem every computer vision researcher has to know. If the post title caught your attention, you probably already have some background in object detection, so I will skip the fundamentals such as what object detection is and how the main types of detectors differ; those answers are a simple web search away. Instead, let me start by summarizing the YOLO family of algorithms, which has become an icon of object detection and one of the most popular baselines that other approaches build upon.

The first version of YOLO was introduced by Joseph Redmon and his co-authors in 2015 and made a breakthrough in real-time object detection. YOLOv1 is a one-stage object detector with fast inference speed and acceptable accuracy compared with the two-stage methods of that time. YOLOv2, also referred to as YOLO9000, was proposed one year later and improved detection accuracy by adopting the concept of anchor boxes. In 2018, YOLOv3 brought further improvements with a new backbone network, Darknet53, and the ability to detect objects at three different scales using a Feature Pyramid Network (FPN) as the model neck. Joseph Redmon then announced that he was stepping away from the project for personal reasons and handed its development over to Alexey Bochkovskiy, who introduced YOLOv4 in 2020. YOLOv4 improves on its predecessor YOLOv3 by using a new backbone, CSPDarknet53 (CSP stands for Cross Stage Partial), adding Spatial Pyramid Pooling (SPP) and a Path Aggregation Network (PAN), and introducing the mosaic data augmentation method. You can learn more about the YOLO project via the official website or the github repo darknet.

Network Architecture of YOLOv4 (figure in paper)

At the time of writing, YOLOv4 is the state-of-the-art model in the YOLO series (a version named YOLOv5 does exist, but it has not been recognized as an official successor for reasons that can be found in this article). However, YOLOv4 is still not optimized for every scenario: in scenes with numerous small objects it still struggles and loses accuracy, for instance in autonomous driving, where the road contains many small and distant objects such as pedestrians, vehicles, and traffic signs. As promised in the title, this post introduces YOLOv4–5D, an improvement of YOLOv4 for autonomous driving scenarios.

What’s new in YOLOv4–5D:

  • Backbone: CSPDarknet53_dcn
  • Neck: PAN++
  • Head: 2 large-scale layers are added
  • Network pruning

Network Architecture of YOLOv4–5D (figure in paper)

1. Backbone: CSPDarknet53_dcn

CSPDarknet53, the backbone of YOLOv4, was the first model to integrate the Cross Stage Partial (CSP) structure into the backbone, or feature extractor. The modified backbone introduced in YOLOv4–5D, denoted CSPDarknet53_dcn, is re-designed by replacing conventional convolutions in several layers with deformable convolutions (DCN). Specifically, to balance efficiency and effectiveness, only the 3x3 convolution layers in the last stage are replaced with DCN. The notable feature of DCN is that it learns offsets describing where each kernel position should sample, so the receptive field is no longer limited to a fixed grid and can adapt to the geometric variation of the target. DCN also adds only a marginal number of parameters to the model. With these characteristics, DCN is integrated into the backbone of YOLOv4–5D to form CSPDarknet53_dcn.

Deformable Convolution (figure in paper)
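
To make the idea concrete, here is a minimal PyTorch sketch of swapping a regular 3x3 convolution for a deformable one using torchvision.ops.DeformConv2d. The block name, channel sizes, and placement are illustrative only and are not taken from the paper; the key point is that a small auxiliary convolution predicts the sampling offsets that the deformable convolution consumes.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """Drop-in replacement for a regular 3x3 conv: a small conv predicts
    per-position sampling offsets, and DeformConv2d samples with them."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # 2 offsets (x, y) for each of the 3x3 = 9 kernel positions
        self.offset_conv = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3,
                                     stride=stride, padding=1)
        nn.init.zeros_(self.offset_conv.weight)   # start as a plain conv
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=3,
                                        stride=stride, padding=1)

    def forward(self, x):
        offsets = self.offset_conv(x)          # (N, 18, H, W) learnable offsets
        return self.deform_conv(x, offsets)    # receptive field adapts to geometry

x = torch.randn(1, 256, 19, 19)                # e.g. a last-stage feature map
y = DeformableConvBlock(256, 512)(x)
print(y.shape)                                 # torch.Size([1, 512, 19, 19])
```

Initializing the offset branch to zero makes the layer behave like a standard convolution at the start of training, which is a common trick when fine-tuning from a pre-trained backbone.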

2. Neck: PAN++

Unlike YOLOv4, which uses PAN as part of the model neck (along with SPP), YOLOv4–5D uses PAN++ as its feature fusion module. PAN++ is applied to leverage both the location information of the low-level backbone features and the semantic information of the high-level ones. The whole network is designed to output detections at 5 different scales, which benefits small object detection.

PAN++ in YOLOv4–5D (figure is adapted from paper)
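
The paper's exact PAN++ wiring is more involved, but the following toy sketch (my own simplification, not the authors' code) shows how a top-down pass followed by a bottom-up pass over five backbone levels yields five fused feature maps, one per detection scale. Channel widths, strides, and the input size are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(in_ch, out_ch, k=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

class FiveScaleFusion(nn.Module):
    """Toy PAN-style neck: a top-down pass adds semantics to shallow maps,
    a bottom-up pass adds localisation back to deep maps, 5 outputs total."""
    def __init__(self, channels=(64, 128, 256, 512, 1024), width=128):
        super().__init__()
        self.lateral = nn.ModuleList(conv_bn_act(c, width) for c in channels)
        self.down = nn.ModuleList(conv_bn_act(width, width, k=3)
                                  for _ in channels[:-1])

    def forward(self, feats):                 # feats: shallow -> deep, 5 levels
        lat = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down: upsample deep features and merge into shallower levels
        for i in range(len(lat) - 2, -1, -1):
            lat[i] = lat[i] + F.interpolate(lat[i + 1], size=lat[i].shape[-2:],
                                            mode="nearest")
        # bottom-up: push refined shallow features back up the pyramid
        outs = [lat[0]]
        for i in range(1, len(lat)):
            outs.append(lat[i] + F.max_pool2d(self.down[i - 1](outs[-1]), 2))
        return outs                           # 5 fused maps, one per detection scale

# fake backbone maps for a 640x640 input at strides 2, 4, 8, 16, 32 (illustrative)
feats = [torch.randn(1, c, 640 // s, 640 // s)
         for c, s in zip((64, 128, 256, 512, 1024), (2, 4, 8, 16, 32))]
print([o.shape[-1] for o in FiveScaleFusion()(feats)])   # [320, 160, 80, 40, 20]
```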

3. Head: 2 large-scale layers are added

As mentioned above, the purpose of adding two more large-scale detection layers, which operate on higher-resolution feature maps, is to enhance the network's ability to detect small objects.

Two Large-scale Layers for Better Small Object Detection (in red box) are added in YOLOv4–5D (figure is adapted from paper)
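
Going from three to five detection scales mainly changes the head: one prediction convolution per fused feature map. Below is a hedged sketch of what those five outputs look like; the anchor count, class count, and feature sizes are placeholders rather than the paper's settings, and the maps reuse the toy neck from the previous section.

```python
import torch
import torch.nn as nn

class YoloHeads(nn.Module):
    """One 1x1 conv per fused feature map; each predicts, for every anchor,
    4 box offsets + 1 objectness score + C class scores."""
    def __init__(self, width=128, num_anchors=3, num_classes=10, num_scales=5):
        super().__init__()
        out_ch = num_anchors * (5 + num_classes)
        self.heads = nn.ModuleList(nn.Conv2d(width, out_ch, 1)
                                   for _ in range(num_scales))

    def forward(self, fused_maps):
        return [head(f) for head, f in zip(self.heads, fused_maps)]

# with the toy neck above: five maps of 128 channels at decreasing resolution
fused = [torch.randn(1, 128, s, s) for s in (320, 160, 80, 40, 20)]
for p in YoloHeads()(fused):
    print(p.shape)   # (1, 45, 320, 320) ... (1, 45, 20, 20); 45 = 3 * (5 + 10)
```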

4. Network pruning

The sparse scaling factor of batch normalization is used for channel pruning of the YOLOv4–5D backbone. Because the BN scaling parameter γ is learnable, its magnitude can be taken as a measure of channel importance. A small pruning threshold is set, 0.1 in general: any channel whose γ falls below 0.1 is pruned.
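
The selection step can be sketched in a few lines of PyTorch. This is a simplified illustration of the BN-γ criterion only; real pruning must also rebuild the convolutions adjacent to each BatchNorm layer and, as in the network-slimming literature, usually adds an L1 penalty on γ during training to push unimportant channels toward zero.

```python
import torch
import torch.nn as nn

def select_channels_to_keep(model: nn.Module, threshold: float = 0.1):
    """For every BatchNorm2d layer, keep only the channels whose learned
    scaling factor gamma is at least `threshold`."""
    keep = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            gamma = module.weight.detach().abs()         # gamma per channel
            keep[name] = torch.nonzero(gamma >= threshold).flatten()
    return keep

# toy model: a conv block whose BN gammas decide which of its 32 channels survive
block = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())
with torch.no_grad():
    block[1].weight.uniform_(0.0, 0.3)                   # pretend-trained gammas

kept = select_channels_to_keep(block, threshold=0.1)
print({k: v.numel() for k, v in kept.items()})           # e.g. {'1': 21} channels kept
```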

5. Results

The comparison between the performance of YOLOv4–5D and YOLOv4 is shown in the table below:

YOLOv4–5D vs YOLOv4 on BDD and KITTI Datasets (table in paper)

YOLOv4–5D improves on YOLOv4 by a significant margin. On the BDD dataset, the overall mAP at IoU 0.5 rises from 65.90% to 70.13%, an improvement of 4.23 percentage points. On the KITTI dataset, YOLOv4–5D reaches 87.02% mAP compared to 85.34% for the original YOLOv4, a gap of 1.68 percentage points.

Further performance comparisons of YOLOv4–5D with other state-of-the-art methods are shown in the following tables:

YOLOv4–5D vs Other Methods on BDD Validation Data (table in paper)
YOLOv4–5D vs Other Methods on KITTI Validation Data (table in paper)

Finally, by applying model pruning, the inference speed of YOLOv4–5D is improved significantly by 31.3% while the accuracy is maintained.

Pruned YOLOv4–5D Performance (table in paper)

Conclusions

In this post, I have introduced YOLOv4–5D, an improvement of YOLOv4 for object detection in autonomous driving scenarios. YOLOv4–5D outperforms YOLOv4, improving mAP by 4.23 percentage points on the BDD dataset and by 1.68 on the KITTI dataset. Moreover, the pruned version of YOLOv4–5D further improves inference speed by 31.3%, with a memory footprint of only 98.1 MB, while maintaining the same accuracy.

Readers are welcome to visit my Facebook fan page, Diving Into Machine Learning, where I share things about machine learning. Another post of mine, on running YOLOv4 object detection with Darknet and TensorFlow-Keras, can also be found here.

Thanks for spending your time here!
