
WBF: Optimizing object detection - Fusing & Filtering predicted boxes

Weighted boxes fusion has become the new SOTA technique for optimizing object detection models


Object detection is one of the most common and most interesting computer vision tasks. Recent SOTA models like YOLOv5 and EfficientDet are quite impressive. This article is about a novel SOTA technique called weighted boxes fusion, which tackles a major object detection problem: redundant and conflicting bounding boxes. It is an advanced object detection technique, and I came across it during the currently running VinBigData Kaggle competition.

If you are familiar with how object detection works, you probably know that there is always a backbone CNN that extracts features. And there is another stage that either generates region proposals (possible bounding boxes) or filters already proposed ones. The main issue is that this is not a straightforward task at all; it is actually quite difficult. This is why object detection models tend to generate either too many bounding boxes or too few, which results in a low mean average precision. Several algorithms have been proposed to deal with this issue, and I will go through them first.

If you aren’t interested in the theoretical explanation, you can skip to the coding tutorial at the end, where I show how to apply this technique to a very challenging dataset from a competition I have been working on (VinBigData).

To give you a bit of context, the competition is about detecting lung diseases on X-rays. Your model has to distinguish between 14 different diseases, and for each disease predict a bounding box showing where it is. It gets even more difficult because an image can contain more than one disease (and thus you have to predict multiple different bounding boxes).

The first issue is that the images were labeled by multiple radiologists, so there can be several different bounding boxes for the same abnormality. We have to filter these (or in this case "fuse" them) to avoid confusing our model.

The second issue is that some dense disease areas contain multiple labels, meaning that a small region can be covered by bounding boxes with different disease labels. This makes our lives a bit difficult if we are using something like NMS, since it filters boxes based on IoU, and high overlap happens quite often here. So a method like NMS is likely to remove useful boxes.

Note that all of these techniques can be used in two ways: either to pre-process your data and clean up imprecisely labeled bounding boxes (which is what I will be doing here), or to filter the bounding boxes predicted by a trained model to improve its accuracy (or both).

I will be discussing each technique and including a visualization of the bounding boxes before and after using the technique.

  1. Non-maximum suppression (NMS)

If you are familiar with object detection, you have probably heard of NMS. Given that each prediction of an object detection model consists of bounding box coordinates, class label, and confidence score, NMS works as follows:

  • Boxes are fused into a single box if their Intersection over Union (IoU) is higher than a specified threshold hyperparameter. IoU is essentially the amount of overlap between two boxes: the area of their intersection divided by the area of their union, so it ranges from 0 (no overlap) to 1 (identical boxes). In standard NMS, the box with the highest confidence score is kept and the overlapping ones are discarded.

The main challenge here is that if the objects are side by side, one of them would be eliminated (since the IoU will be quite high).
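
To make that concrete, here is a minimal NumPy sketch of the filtering step. This is my own simplification, not the implementation of any particular library; the box format [x1, y1, x2, y2] and all names are assumptions for illustration.

import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, all in [x1, y1, x2, y2] format
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thr=0.5):
    # Keep the highest-scoring box, drop every remaining box that overlaps it too much, repeat
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        if len(rest) == 0:
            break
        order = rest[iou(boxes[best], boxes[rest]) < iou_thr]
    return keep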

  2. Soft-NMS

The second method tries to solve the main problem of NMS with a "softer" approach. Instead of completely removing boxes whose IoU exceeds the threshold, it lowers their confidence scores according to the IoU value.
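
As a sketch of the idea, here is the linear variant of Soft-NMS, reusing the iou() helper from the NMS sketch above. Again, this is a simplified illustration under my own naming, not the reference implementation.

import numpy as np

def soft_nms_linear(boxes, scores, iou_thr=0.5, score_thr=0.001):
    # Linear Soft-NMS: decay the scores of overlapping boxes by (1 - IoU) instead of dropping them
    scores = scores.astype(float).copy()
    keep, idxs = [], np.arange(len(scores))
    while len(idxs) > 0:
        best = idxs[np.argmax(scores[idxs])]
        keep.append(best)
        idxs = idxs[idxs != best]
        if len(idxs) == 0:
            break
        overlaps = iou(boxes[best], boxes[idxs])        # iou() helper from the NMS sketch above
        scores[idxs] *= np.where(overlaps > iou_thr, 1.0 - overlaps, 1.0)
        idxs = idxs[scores[idxs] > score_thr]           # a box is only dropped once its score becomes negligible
    return keep, scores[keep]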

  3. Weighted Boxes Fusion (WBF)

The WBF algorithm works in a different way than NMS. It’s a bit long and it does involve a lot of math equations, but I will do my best to give you a simple overview without boring you with the details.

First, it sorts all of the bounding boxes in decreasing order of confidence score. It then builds a second list of possible "fusions" (combinations) of boxes and, for each original box, checks whether it matches one of those fused boxes, i.e. whether their IoU is above a specified threshold (a hyperparameter).

It then uses a formula to recompute the coordinates and confidence score of every box in the fused boxes list. The new confidence score is simply the average of the confidence scores of all the boxes it was fused from. The new coordinates are fused in a similar fashion (averaging), except that the average is weighted, meaning that not every box contributes equally to the resulting fused box. Each box's weight is its confidence score, which makes sense since a low confidence score probably indicates an incorrect prediction.
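
Here is a minimal sketch of that fusion step. It is my own simplification: it only fuses a single cluster of boxes that have already been matched by IoU, and it skips the final score rescaling described in the paper.

import numpy as np

def fuse_cluster(boxes, scores):
    # Fuse one cluster of matched boxes ([x1, y1, x2, y2] each) into a single box:
    # coordinates are a confidence-weighted average, the new score is the plain average
    boxes, scores = np.asarray(boxes, dtype=float), np.asarray(scores, dtype=float)
    fused_box = (boxes * scores[:, None]).sum(axis=0) / scores.sum()
    return fused_box, scores.mean()

# Three overlapping predictions of the same object; the low-confidence outlier barely moves the result
boxes = [[0.10, 0.10, 0.50, 0.50],
         [0.12, 0.08, 0.52, 0.48],
         [0.30, 0.30, 0.70, 0.70]]
scores = [0.9, 0.8, 0.1]
print(fuse_cluster(boxes, scores))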

And that’s it. Of course, this is a high-level view; if you want to dive deep into the math and low-level details, I suggest checking out the paper here. In all fairness, though, I usually get the most value when I understand how something works at a high level, implement it, test it, and only then go back to the low-level details if needed. If you always dive straight into the low-level details, you end up learning how the theory works without actually implementing much of it.

It’s also worth mentioning that there is a fourth method called non-maximum weighted (NMW) fusion, which works in a similar way but doesn’t perform quite as well as WBF. This is because it doesn’t alter the confidence scores of the boxes, and it weighs the boxes by IoU rather than by the more informative confidence score. Their performance is quite close, though.

The coding part

Okay, enough with the theoretical part, let’s start coding! One of the best ways to judge whether a paper is actually good is to see whether the authors have released good-quality code, and in this case, they did. You can check it out here on GitHub.

They provide an easy-to-use library. Here is an example:

from ensemble_boxes import weighted_boxes_fusion

boxes, scores, labels = weighted_boxes_fusion(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr, skip_box_thr=skip_box_thr)

You can replace "weighted_boxes_fusion" with "nms", "soft_nms", or "non_maximum_weighted" if you want to try the other methods, and it will work just fine.
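
For reference, the equivalent calls for two of the other methods look roughly like this. The argument names are based on the library's README (plain NMS does not take skip_box_thr, and soft_nms additionally takes a sigma parameter), so double-check them against the version you install:

from ensemble_boxes import nms, non_maximum_weighted

boxes, scores, labels = nms(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr)
boxes, scores, labels = non_maximum_weighted(boxes_list, scores_list, labels_list, weights=weights, iou_thr=iou_thr, skip_box_thr=skip_box_thr)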

The original reason I came across this library/technique is that in Kaggle’s VinBigData competition, there are two major issues with the dataset (described above) that lead object detection models to underperform.

The original data frame given by the competition already includes the bounding boxes and labels for each image, so these will be part of the input to the WBF function above.

Since we are using it here for pre-processing, we can just set the "weights" and "scores" to 1 so that every box is treated equally (since we have no predictions yet). And that’s it! The library is really simple: it’s just one line of code, you pass in a list of bounding boxes and scores, and you get a tidier list back.
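
Here is roughly what that looks like for a single image. This is a sketch under a few assumptions on my part: the column names (x_min, y_min, x_max, y_max, class_id) mirror the competition's annotation format but may differ in your data frame, the iou_thr value of 0.4 is just an example, and the library expects box coordinates normalized to [0, 1].

import numpy as np
from ensemble_boxes import weighted_boxes_fusion

def fuse_annotations(df_image, width, height, iou_thr=0.4):
    # Fuse the (possibly duplicated) radiologist annotations for one image
    boxes = df_image[["x_min", "y_min", "x_max", "y_max"]].values / [width, height, width, height]
    labels = df_image["class_id"].values
    scores = np.ones(len(boxes))          # no model predictions yet, so every box counts equally

    boxes, scores, labels = weighted_boxes_fusion(
        [boxes.tolist()], [scores.tolist()], [labels.tolist()],
        weights=[1], iou_thr=iou_thr, skip_box_thr=0.0,
    )
    # Scale back to pixel coordinates before writing the cleaned labels out
    return boxes * [width, height, width, height], labels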

Final Thoughts

I will be releasing quite a few articles about the lessons I learned from this competition soon, mostly on advanced object detection techniques, so follow me if you are interested.

