
Advanced YoloV5 tutorial – Enhancing YoloV5 with Weighted Boxes Fusion

An in-depth tutorial on using YoloV5 and boosting its performance with WBF

Photo by Eric Karim Cornelis on Unsplash

There are tons of YoloV5 tutorials out there; the aim of this article is not to duplicate that content but to extend it. I have recently been competing in a data science object detection competition, and while I found plenty of tutorials on creating baselines, I didn’t find any suggestions on how to go beyond them. Furthermore, I want to highlight the parts of the YoloV5 configuration that most affect performance, because, after all, data science is mostly about experiments and hyperparameter tweaking.

Before we begin, I just want to say that using object detection models is different from using image classification models in terms of how the frameworks and libraries work. This is something I noticed, and it took me a while to wrap my head around. Most of the popular object detection models, like YoloV5 and EfficientDet, use a command-line interface for training and evaluation rather than a coding approach. This means that all you need to do is get the data into a specific format (either COCO or VOC) and point the command line at it. This is different from image classification models, where you would train and evaluate the model using code.

Data preprocessing

YoloV5 expects you to have two directories, one for training and one for validation. Inside each of those, you need two more directories, "images" and "labels". "images" contains the actual images, and "labels" contains one .txt file per image with that image’s annotations; each text file must have the same name as its corresponding image.
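Concretely, with the paths used in the code below, the layout looks roughly like this (the file names are just placeholders):

images/
    train/
        image_1.png
        ...
    val/
        image_2.png
        ...
labels/
    train/
        image_1.txt
        ...
    val/
        image_2.txt
        ...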

The annotation format is as follows:

<class_id> <x_center> <y_center> <width> <height>
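For example, a (hypothetical) image containing two objects might have a label file like this, with all coordinates already normalized to the range 0 to 1:

0 0.5 0.5 0.25 0.3
3 0.1 0.2 0.05 0.1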

To do this in code, you will probably need a function similar to this, where the original data frame has entries of the images, their class id, and their bounding boxes:

import os
import shutil
from tqdm import tqdm

SIZE = 512  # image size, 512 here

def create_file(df, split_df, train_file, train_folder, fold):

    os.makedirs('labels/train/', exist_ok=True)
    os.makedirs('images/train/', exist_ok=True)
    os.makedirs('labels/val/', exist_ok=True)
    os.makedirs('images/val/', exist_ok=True)

    list_image_train = split_df[split_df[f'fold_{fold}']==0]['image_id']
    train_df = df[df['image_id'].isin(list_image_train)].reset_index(drop=True)
    val_df = df[~df['image_id'].isin(list_image_train)].reset_index(drop=True)

    for train_img in tqdm(train_df.image_id.unique()):
        with open(f'labels/train/{train_img}.txt', 'w+') as f:
            row = train_df[train_df['image_id']==train_img][
                ['class_id', 'x_center', 'y_center', 'width', 'height']].values
            row[:, 1:] /= SIZE  # normalize the coordinates
            row = row.astype('str')
            for box in range(len(row)):
                text = ' '.join(row[box])
                f.write(text)
                f.write('\n')
        shutil.copy(f'{train_img}.png',
                    f'images/train/{train_img}.png')

    for val_img in tqdm(val_df.image_id.unique()):
        with open(f'labels/val/{val_img}.txt', 'w+') as f:
            row = val_df[val_df['image_id']==val_img][
                ['class_id', 'x_center', 'y_center', 'width', 'height']].values
            row[:, 1:] /= SIZE
            row = row.astype('str')
            for box in range(len(row)):
                text = ' '.join(row[box])
                f.write(text)
                f.write('\n')
        shutil.copy(f'{val_img}.png',
                    f'images/val/{val_img}.png')

Source: Kaggle

Note: Don’t forget that the bounding box coordinates saved in the label text files must be normalized (from 0 to 1). This is very important. Also, if an image has more than one annotation, each annotation (class id + bounding box) goes on a separate line in the text file.
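As a quick worked example (with made-up numbers): for a 512 x 512 image with a box centered at pixel (256, 128) that is 100 pixels wide and 50 pixels tall, the normalized values would be:

x_center = 256 / 512   # 0.5
y_center = 128 / 512   # 0.25
width = 100 / 512      # 0.1953125
height = 50 / 512      # 0.09765625

So the label line for class 0 would be "0 0.5 0.25 0.1953125 0.09765625".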

After that, you need a config file with the names of the labels, the number of classes, and the training & validation paths.

import yaml

classes = ['Aortic enlargement',
           'Atelectasis',
           'Calcification',
           'Cardiomegaly',
           'Consolidation',
           'ILD',
           'Infiltration',
           'Lung Opacity',
           'Nodule/Mass',
           'Other lesion',
           'Pleural effusion',
           'Pleural thickening',
           'Pneumothorax',
           'Pulmonary fibrosis']

data = dict(
    train = '../vinbigdata/images/train', # training images path
    val = '../vinbigdata/images/val', # validation images path
    nc = 14, # number of classes
    names = classes
)

with open('./yolov5/vinbigdata.yaml', 'w') as outfile:
    yaml.dump(data, outfile, default_flow_style=False)
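For reference, the generated vinbigdata.yaml should look roughly like this (yaml.dump sorts the keys alphabetically by default):

names:
- Aortic enlargement
- Atelectasis
- Calcification
- Cardiomegaly
- Consolidation
- ILD
- Infiltration
- Lung Opacity
- Nodule/Mass
- Other lesion
- Pleural effusion
- Pleural thickening
- Pneumothorax
- Pulmonary fibrosis
nc: 14
train: ../vinbigdata/images/train
val: ../vinbigdata/images/val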

Now, all that you need to do is to run this command:

python train.py --img 640 --batch 16 --epochs 30 --data ./vinbigdata.yaml --cfg models/yolov5x.yaml --weights yolov5x.pt

Things to watch out for from experience:

Okay, now that we have skimmed through the basics, let’s go over the important stuff:

  1. Don’t forget to normalize the coordinates
  2. If your initial performance is much worse than expected, the most likely reason for this happening (and I have seen this with tons of other competitors) is that you have done something wrong with your preprocessing. It seems quite trivial, but there are a lot of details that you have to watch out for, especially if it is your first time.
  3. There are multiple YoloV5 models (yolov5s, yolov5m, yolov5l, yolov5x); don’t just pick the biggest one, because it might overfit. Start with a baseline like the medium one and try to improve it.
  4. Although I was training on 512-pixel images, I found that passing the --img flag as 640 improves performance.
  5. Don’t forget to load the pre-trained weights (--weights flag). Transfer learning will improve your performance greatly and will save you a lot of training time (around 50 epochs in my case, where each epoch takes around 20 minutes!).
  6. Yolov5x requires a huge amount of memory; when trained on 512-pixel images with a batch size of 4, it needed around 14GB of GPU memory (most GPUs have around 8GB).
  7. YoloV5 already uses augmentations, and you can select which ones you want and which you don’t; all you need to do is tweak yolov5/data/hyp.scratch.yml.
  8. The default YoloV5 training script uses Weights & Biases, which, to be honest, was quite impressive: it saves all of your metrics while the model is training. However, if you want to turn it off, just set WANDB_MODE="dryrun" in the environment before running the training script (see the example after this list).
  9. One of the things I wish I had found out about earlier is that YoloV5 saves tons of useful metrics to the directory yolov5/runs/train/exp/. After training, you can find "confusion_matrix.png" and "results.png", where results.png should look something like this:
Image reproduced by the author. Probably the 2 most important metrics are mAP@0.5 and mAP@0.5:0.95
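For point 8, a minimal example of turning the logging off (note that WANDB_MODE is an environment variable, so it is prefixed to the command rather than passed as a train.py flag):

WANDB_MODE="dryrun" python train.py --img 640 --batch 16 --epochs 30 --data ./vinbigdata.yaml --cfg models/yolov5x.yaml --weights yolov5x.pt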

Preprocessing with WBF

Photo by Siebe Warmoeskerken on Unsplash

Okay, now that you have tweaked the hyperparameters, upgraded your model, and tested with multiple image sizes and cross-validation, it’s time to introduce some more tricks to boost performance.

Weighted Boxes Fusion (WBF) is a method that dynamically fuses boxes, either before training (which cleans up the dataset) or after training (which makes the predictions more accurate). If you want to know more, you can check out my article here:

WBF: Optimizing object detection – Fusing & Filtering predicted boxes

To use it for preprocessing the dataset, which improved the performance for most competitors by roughly 10–20%, you can use something like this:

from ensemble_boxes import weighted_boxes_fusion
import numpy as np
import pandas as pd
from tqdm import tqdm

size = 512           # size the boxes were scaled to, 512 here
iou_thr = 0.5        # example thresholds; tune these for your data
skip_box_thr = 0.0001

list_image, list_boxes, list_cls, list_h, list_w = [], [], [], [], []

for image_id in tqdm(df['image_id'].unique(), leave=False):
    image_df = df[df['image_id']==image_id].reset_index(drop=True)
    h, w = image_df.loc[0, ['height', 'width']].values
    boxes = image_df[['x_min', 'y_min',
                      'x_max', 'y_max']].values.tolist()
    # Normalize all the bounding boxes (by dividing them by size - 1)
    boxes = [[j/(size-1) for j in i] for i in boxes]
    # Set all the scores to 1 since we only have 1 model here
    scores = [1.0]*len(boxes)
    labels = [float(i) for i in image_df['class_id'].values]
    boxes, scores, labels = weighted_boxes_fusion([boxes], [scores], [labels],
                                                  weights=None, iou_thr=iou_thr,
                                                  skip_box_thr=skip_box_thr)
    list_image.extend([image_id]*len(boxes))
    list_h.extend([h]*len(boxes))
    list_w.extend([w]*len(boxes))
    list_boxes.extend(boxes)
    list_cls.extend(labels.tolist())

# Bring the bounding boxes back to their original size (by multiplying by size - 1)
list_boxes = [[int(j*(size-1)) for j in i] for i in list_boxes]
new_df = pd.DataFrame()
new_df['image_id'] = list_image
new_df['class_id'] = list_cls
new_df['h'] = list_h
new_df['w'] = list_w
# Unpack the coordinates from the bounding boxes
new_df['x_min'], new_df['y_min'], new_df['x_max'], new_df['y_max'] = np.transpose(list_boxes)

This is meant to be done before you save the bounding box coordinates to the annotation files. You can also try to use it after predicting the bounding boxes with YoloV5 in the same way.

First, after training YoloV5, run:

!python detect.py --weights /runs/train/exp/weights \
  --img 640 \
  --conf 0.005 \
  --iou 0.45 \
  --source $test_dir \
  --save-txt --save-conf --exist-ok

Then extract the boxes, scores, and labels from:

runs/detect/exp/labels

And pass them to:

boxes, scores, labels = weighted_boxes_fusion([boxes], [scores], [labels],
                                              weights=None, iou_thr=iou_thr,
                                              skip_box_thr=skip_box_thr)
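To give a rough idea of that extraction step: with --save-txt --save-conf, YoloV5 writes one .txt file per image, one line per detection, in the format class x_center y_center width height confidence (normalized xywh). Here is a minimal sketch of reading those files and fusing each image’s boxes; the thresholds are just example values, and what you do with the fused output is up to you:

import glob
import numpy as np
from ensemble_boxes import weighted_boxes_fusion

for path in glob.glob('runs/detect/exp/labels/*.txt'):
    preds = np.loadtxt(path, ndmin=2)  # rows: class, xc, yc, w, h, conf
    cls, xc, yc, w, h, conf = preds.T
    # Convert normalized xywh to the normalized corner format WBF expects
    boxes = np.stack([xc - w/2, yc - h/2, xc + w/2, yc + h/2], axis=1).tolist()
    boxes, scores, labels = weighted_boxes_fusion([boxes], [conf.tolist()], [cls.tolist()],
                                                  weights=None, iou_thr=0.45,
                                                  skip_box_thr=0.005)
    # ... convert back to pixel coordinates and format for your submission ...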

I didn’t want to include all of the post-processing code because it involves a lot of details that are quite specific to the competition I was doing.

Final thoughts

I hope you have learned a thing or two about extending your baseline YoloV5. I think the most important things to keep in mind are transfer learning, image augmentation, model complexity, and pre- & post-processing techniques. Those are most of the aspects that you can easily control and use to boost your performance with YoloV5.

