
I have spent the last 3 months diving deep into object detection. I have tried tons of stuff, ranging from implementing state-of-the-art models like YoloV5, VFNet, and DETR to fusing object detection models with image classification models to boost performance. I struggled a lot during the early stages of the competition to improve on a baseline model's score, and I couldn't find many useful online resources on how to do so, which is why I am writing this article. I want to take you on a journey from start to finish and briefly show you each step I took to almost double my score.
The official competition metric was (mean) Average Precision, one of the most commonly used object detection metrics. To show the improvement each step brought over the previous one, I will list its score next to it.
1. Build a SIMPLE baseline – 0.126 mAP
I am pretty sure this is a trap that tons of data scientists fall into early on. We are always excited to start with the most complex model and every technique we can think of. This is a huge mistake: you will end up getting frustrated and abandoning your ML project, and even if you don't, you will most probably overfit!
I learned the lesson the hard way, but I ended up building an initial model with these specifications:
- YoloV5-XL
- Images resized from 3K resolution to 512
I know this sounds quite simple, and initially I thought the same. In practice, though, building the baseline is perhaps one of the most annoying steps, because it involves several sub-tasks such as post-processing the output into the competition's format (and more that I don't want to dive into).
Also, my actual initial YoloV5-XL model only scored 0.064 (half of the above), and I spent 2 weeks debugging it only to find out that I wasn't normalizing the input data correctly!
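To make the normalization issue concrete, here is a minimal sketch of the label format YoloV5 expects: class plus box coordinates normalized to [0, 1], while the raw annotations come as absolute pixel coordinates. The function name and numbers are my own illustration, not the code I used in the competition.

```python
# Sketch: convert an absolute-pixel box to YoloV5's normalized label format
# (class x_center y_center width height, all in [0, 1]).
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: a box at (100, 150, 300, 350) in a 512x512 resized image
print(to_yolo_label(0, 100, 150, 300, 350, 512, 512))
```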
2. Drop one of the input classes! – 0.143 mAP (+13%)
This trick didn't make much sense to me at the time. There were 14 input classes: 13 different diseases and 1 "No Finding" class. Around 70% of the dataset belonged to the "No Finding" class and only 30% belonged to the other classes. A competitor discovered that you can drop this class and predict it using the "2 class filter" trick (see below, with a small sketch of the filtering step). This makes the dataset much less skewed. Moreover, it makes training significantly faster (since you will be training on fewer images).
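A minimal sketch of how dropping the class could look, assuming a train.csv where class_id 14 marks "No Finding" (the column name and id are my assumptions; adjust them to the actual dataset files):

```python
import pandas as pd

# Load the training annotations and drop the "No Finding" rows before
# exporting labels for the detector.
train_df = pd.read_csv("train.csv")

NO_FINDING_CLASS_ID = 14  # assumed id of the "No Finding" class
train_df = train_df[train_df["class_id"] != NO_FINDING_CLASS_ID].reset_index(drop=True)

# The detector now only sees the 13 disease classes; "No Finding" images are
# handled at inference time by the "2 class filter" described below.
```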
3. Increase training and inference image resolution – 0.169 mAP (+18%)
The next step was to increase the image resolution to 1024 instead of 512. This was quite a trivial improvement, but the point I want to convey is that if I had started with this resolution, I probably wouldn't have improved my score much further, simply because training at this higher resolution meant decreasing the batch size from 16 to 4 (to avoid running out of GPU memory), which slowed down training by A LOT. That meant slower experiments, and you don't want to start a competition with slow experiments…
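For reference, the trade-off shows up directly in the YoloV5 training flags; a rough sketch of the kind of call involved (the dataset yaml name and epoch count are illustrative, not taken from this article):

```python
import subprocess

# Doubling --img from 512 to 1024 forced --batch down from 16 to 4 on the
# same GPU, which is what made the experiments so much slower.
subprocess.run([
    "python", "train.py",
    "--img", "1024",           # training/inference resolution
    "--batch", "4",            # reduced from 16 to fit in GPU memory at 1024px
    "--epochs", "30",          # illustrative value
    "--data", "vinbigdata.yaml",
    "--weights", "yolov5x.pt",
], check=True)
```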
4. Fusing EfficientNet and YoloV5–0.196 mAP (+16%)
This wasn’t my idea, I got it from a public kernel. But, it was one of the best ideas I have encountered during Kaggle competitions. And I wanted to highlight that one of the main benefits of doing competitions on Kaggle is the lessons that you learn from the community.
The main idea was to train an image classification model (EfficientNet) that can achieve a very high AUC (around 0.99) and then figure out a way to fuse it with the object detection model. This was called the "2 class filter", and nearly everyone in the competition adopted the idea because it boosted the score significantly.
If you are interested, you can find out more here:
Fusing EfficientNet & YoloV5 – Advanced Object Detection 2 stage pipeline tutorial
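Here is a minimal sketch of the "2 class filter" logic as I understand it from the public kernel; the thresholds and the submission string format (class_id confidence x_min y_min x_max y_max) are assumptions, not the exact published code.

```python
# Fuse a binary "No Finding" probability from the classifier with the
# detector's output string for one image.
def two_class_filter(det_string, p_no_finding, low_thr=0.05, high_thr=0.95):
    if p_no_finding >= high_thr:
        # Classifier is confident the image is normal: drop all detections.
        return "14 1 0 0 1 1"
    if p_no_finding <= low_thr:
        # Classifier is confident the image is abnormal: keep detections as-is.
        return det_string
    # Uncertain: keep the detections but also append a weighted "No Finding" row.
    return f"{det_string} 14 {p_no_finding:.4f} 0 0 1 1"

# Example: one predicted box plus an uncertain classifier probability
print(two_class_filter("0 0.87 100 150 300 350", p_no_finding=0.50))
```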
5. Weighted boxes fusion (WBF) post-processing – 0.226 mAP (+15%)
This was also quite a new idea to me, and not one that you can easily find online. Weighted boxes fusion is a technique for combining and filtering the boxes that object detection models produce so that the final predictions are more accurate. It surpasses the performance of similar existing methods such as non-maximum suppression (NMS) and soft-NMS.
The results of applying WBF looked something like this:

And if you are interested to find out more, you can check out my article here:
WBF: Optimizing object detection – Fusing & Filtering predicted boxes
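A small usage sketch, assuming the ensemble-boxes package (pip install ensemble-boxes); note that it expects boxes normalized to [0, 1] as [x_min, y_min, x_max, y_max], and the coordinates below are made-up examples.

```python
from ensemble_boxes import weighted_boxes_fusion

# Predictions from two models (or two folds) for the same image.
boxes_list = [
    [[0.10, 0.20, 0.40, 0.50], [0.12, 0.22, 0.42, 0.52]],  # model 1
    [[0.11, 0.21, 0.41, 0.51]],                             # model 2
]
scores_list = [[0.9, 0.6], [0.8]]
labels_list = [[0, 0], [0]]

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=None,        # equal weight per model
    iou_thr=0.55,        # boxes overlapping above this IoU are fused
    skip_box_thr=0.01,   # drop very low-confidence boxes first
)
print(boxes, scores, labels)
```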
- 5-fold cross-validation fused with WBF – 0.256 mAP (+13%)
One of the biggest mistakes I made is that I forgot to do cross-validation, which is why this is a bullet point rather than a numbered one. This is also one of the main reasons I wanted to write this article: to stress the importance of ML basics. I was so focused on applying new techniques and boosting performance that I forgot to apply this basic ML technique.
If you are wondering where the 0.256 comes from, it's from the solutions that were released after the competition ended; this is what most of them got after cross-validation on models similar to mine.
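For completeness, a sketch of the 5-fold setup I assume those solutions used: train one detector per fold, then fuse the five models' test predictions with WBF. The image ids and prediction format below are placeholders, not the actual competition data.

```python
from sklearn.model_selection import KFold
from ensemble_boxes import weighted_boxes_fusion

image_ids = [f"img_{i}" for i in range(100)]  # placeholder training image ids
kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(image_ids))  # one YoloV5 model is trained per split

def fuse_fold_predictions(per_fold_preds):
    """per_fold_preds: one (boxes, scores, labels) tuple per fold's model,
    with boxes normalized to [0, 1] as in the WBF example above."""
    boxes_list = [p[0] for p in per_fold_preds]
    scores_list = [p[1] for p in per_fold_preds]
    labels_list = [p[2] for p in per_fold_preds]
    return weighted_boxes_fusion(boxes_list, scores_list, labels_list, iou_thr=0.55)
```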
The final pipeline can be seen here:

- Stuff that I tried but couldn't figure out
- Training DETR. DETR is an amazing object detection transformer that I have written about before, and I thought this was the moment to put everything I had written into practice and truly test it. However, I didn't find the documentation provided with the code very helpful, and I couldn't find many other useful resources. I spent around 3 weeks (roughly a quarter of the competition's duration) trying to get it to work. The reason I mention this is that, although it can be hard to abandon a solution you have been working on, in the experimental ML world this sometimes has to be done, and honestly I wish I had dropped it earlier. On the bright side, I discovered that another library, MMDetection, also provides DETR and is much easier to work with.
- WBF pre-processing. Although a lot of competitors said this boosted their score, it didn't boost mine. That's the thing about ML: not all techniques benefit different models the same way.
Finally, you can find my code here if you are interested:
https://github.com/mostafaibrahim17/VinBigData-Chest-Xrays-Object-detection-
Final thoughts
I hope you have enjoyed the article and learned from my mistakes. I would love to see more people sharing their competition experiences, because I have found this quite helpful. There are tons of tutorials online about building baselines, but very few about what to do next.