The Practical Guide for Object Detection with the YOLOv5 Algorithm

Detailed tutorial explaining how to efficiently train the object detection algorithm YOLOv5 on your own custom dataset.

Lihi Gur Arie, PhD
Towards Data Science

--

Labels by Author, Image by National Science Foundation, http://www.nsf.gov/

Introduction

Identification of objects in an image is considered a common assignment for the human brain, though not so trivial for a machine. Identification and localization of objects in photos is a computer vision task called ‘object detection’, and several algorithms have emerged in the past few years to tackle the problem. One of the most popular algorithms to date for real-time object detection is YOLO (You Only Look Once), initially proposed by Redmon et al. [1].

In this tutorial you will learn to perform an end-to-end object detection project on a custom dataset, using the latest YOLOv5 implementation developed by Ultralytics [2]. We will use transfer-learning techniques to train our own model, evaluate its performance, use it for inference and even convert it to other file formats such as ONNX and TensorRT.

The tutorial is oriented to people with a theoretical background in object detection algorithms, who seek practical implementation guidance. An easy-to-use Jupyter notebook with the full code is provided below for your convenience.

Data handling

Dataset creation

For this tutorial I generated my own penguins dataset, by manually tagging ~250 images and video frames of penguins from the web. It took me a few hours using the Roboflow platform, which is friendly and free for public users [3]. To achieve a robust YOLOv5 model, it is recommended to train with over 1,500 images per class, and more than 10,000 instances per class. It is also recommended to add up to 10% background images, to reduce false-positive errors. Since my dataset is quite small, I will compensate by using transfer learning techniques during training.

YOLO labeling format

Most annotation platforms support export in the YOLO labeling format, providing one annotation text file per image. Each text file contains one bounding-box (BBox) annotation for each of the objects in the image. The annotations are normalized to the image size, and lie within the range of 0 to 1. They are represented in the following format:

<object-class-ID> <X center> <Y center> <Box width> <Box height>

If there are two objects in the image, the content of the YOLO annotations text file might look like this:
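(Illustrative values; both objects belong to class index 0, our single ‘Penguin’ class.)

0 0.487 0.610 0.246 0.358
0 0.721 0.524 0.193 0.312

Each line describes one object: the class index, followed by the normalized center coordinates, width and height of its bounding box.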

Data directories structure

To comply with the Ultralytics directory structure, the data is organized as follows:
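(Folder names here are illustrative; by default the ‘datasets’ folder sits next to the cloned ‘yolov5’ repository.)

parent_folder
├── yolov5                  (the cloned Ultralytics repository)
└── datasets
    └── penguins
        ├── images
        │   ├── train
        │   ├── valid
        │   └── test
        └── labels
            ├── train
            ├── valid
            └── test

YOLOv5 locates the label files automatically by replacing ‘images’ with ‘labels’ in each image path, so the two sub-trees must mirror each other.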

For convenience, my notebook includes a function that automatically creates these directories; just copy your data into the right folders.

Configuration files

The configurations for the training are divided into three YAML files, which are provided with the repo itself. We will customize these files to fit our task and desired needs.

1. The data-configurations file describes the dataset parameters. Since we are training on our custom penguins dataset, we will edit this file and provide: the paths to the train, validation and test (optional) datasets; the number of classes (nc); and the names of the classes in the same order as their index. In this tutorial we only have one class, named ‘Penguin’. We named our custom data-configurations file ‘penguins_data.yaml’ and placed it under the ‘data’ directory. The content of this YAML file is as follows:
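(A minimal sketch; the dataset paths below are placeholders, so point them at your own train/validation/test image folders.)

train: ../datasets/penguins/images/train    # path to training images
val: ../datasets/penguins/images/valid      # path to validation images
test: ../datasets/penguins/images/test      # path to test images (optional)

nc: 1                                       # number of classes
names: ['Penguin']                          # class names, ordered by class index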

2. The model-configurations file dictates the model architecture. Ultralytics supports several YOLOv5 architectures, named P5 models, which vary mainly in their parameter count: YOLOv5n (nano), YOLOv5s (small), YOLOv5m (medium), YOLOv5l (large), YOLOv5x (extra large). These architectures are suitable for training with an image size of 640×640 pixels. An additional series, optimized for training with a larger image size of 1280×1280, is called P6 (YOLOv5n6, YOLOv5s6, YOLOv5m6, YOLOv5l6, YOLOv5x6). P6 models include an extra output layer for the detection of larger objects. They benefit the most from training at higher resolution, and produce better results [4].

Ultralytics provides built-in model-configuration files for each of the above architectures, placed under the ‘models’ directory. If you’re training from scratch, choose the model-configurations YAML file with the desired architecture (‘yolov5s6.yaml’ in this tutorial), and just edit the number of classes (nc) parameter to the correct number of classes in your custom data.

When training is initialized from pre-trained weights, as in this tutorial, there is no need to edit the model-configurations file, since the model architecture will be extracted from the pretrained weights.

3. The hyperparameters-configurations file defines the hyperparameters for the training, including the learning rate, momentum, losses, augmentations, etc. Ultralytics provides a default hyperparameters file under the ‘data/hyps’ directory (‘hyp.scratch.yaml’). It is generally recommended to start training with the default hyperparameters, to establish a performance baseline, as we’ll do in this tutorial.

The YAML configuration files are nested in the following directories:
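(Only the files used in this tutorial are shown; exact locations may vary slightly between repo versions.)

yolov5
├── data
│   ├── penguins_data.yaml      (our custom data-configurations file)
│   └── hyps
│       ├── hyp.scratch.yaml    (default hyperparameters)
│       └── hyp.finetune.yaml   (hyperparameters used later for fine-tuning)
└── models
    └── yolov5s6.yaml           (model-configurations file for the P6 small model)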

Training

For the simplicity of this tutorial, we will train the small YOLOv5s6 model, though bigger models can be used for improved results. Different training approaches might be considered for different situations, and here we will cover the most commonly used techniques.

Training from scratch

When a large enough dataset is available, the model will benefit most from training from scratch. The weights are randomly initialized by passing an empty string ('') to the weights argument. Training is initiated by the following command:
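(A sketch; the epoch count and image size below are illustrative, and the data and model YAML files are the ones described above.)

python train.py --batch -1 --epochs 300 --data 'data/penguins_data.yaml' --cfg 'yolov5s6.yaml' --weights '' --cache --img 640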

  • batch — batch size (-1 for auto batch size). Use the largest batch size that your hardware allows for.
  • epochs — number of epochs.
  • data — path to the data-configurations file.
  • cfg — path to the model-configurations file.
  • weights — path to initial weights.
  • cache — cache images for faster training.
  • img — image size in pixels (default — 640).

Transfer learning

Hot start from pretrained model:

Since my penguins dataset is relatively small (~250 images), transfer learning is expected to produce better results than training from scratch. Ultralytics’ default model was pre-trained on the COCO dataset, though there is support for other pre-trained models as well (VOC, Argoverse, VisDrone, GlobalWheat, xView, Objects365, SKU-110K). COCO is an object detection dataset with images from everyday scenes. It contains 80 classes, including the related ‘bird’ class, but not a ‘penguin’ class. Our model will be initialized with weights from a pre-trained COCO model, by passing the name of the model to the ‘weights’ argument. The pre-trained model will be downloaded automatically.
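A sketch of the hot-start command (the batch size and epoch count below are illustrative):

python train.py --batch 16 --epochs 100 --data 'data/penguins_data.yaml' --weights 'yolov5s6.pt' --cache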

Feature extraction

Models are composed of two main parts: the backbone layers, which serve as a feature extractor, and the head layers, which compute the output predictions. To further compensate for a small dataset size, we’ll use the same backbone as the pretrained COCO model, and only train the model’s head. The YOLOv5s6 backbone consists of 12 layers, which will be fixed with the ‘freeze’ argument.
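A sketch of the feature-extraction command (the ‘project’ and ‘name’ values below match the weights path used later in the fine-tuning step):

python train.py --batch 16 --epochs 100 --data 'data/penguins_data.yaml' --weights 'yolov5s6.pt' --freeze 12 --project 'runs_penguins' --name 'feature_extraction' --cache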

  • weights — path to initial weights. COCO model will be downloaded automatically.
  • freeze — number of layers to freeze
  • project— name of the project
  • name — name of the run

If the ‘project’ and ‘name’ arguments are supplied, the results are automatically saved there. Otherwise, they are saved to the ‘runs/train’ directory. We can view the metrics and losses saved to the results.png file:

Results of ‘feature extraction’ training | image by author

To better understand the results, let’s summarize YOLOv5 losses and metrics. YOLO loss function is composed of three parts:

  1. box_loss — the bounding box regression loss (CIoU loss).
  2. obj_loss — the objectness loss, i.e. the confidence that an object is present.
  3. cls_loss — the classification loss (Cross Entropy).

Since our data has one class only, there are no class mis-identifications, and the classification error is constantly zero.

Precision measures what fraction of the bbox predictions are correct (True positives / (True positives + False positives)), and Recall measures what fraction of the ground-truth bboxes were correctly predicted (True positives / (True positives + False negatives)). ‘mAP_0.5’ is the mean Average Precision (mAP) at an IoU (Intersection over Union) threshold of 0.5. ‘mAP_0.5:0.95’ is the average mAP over different IoU thresholds, ranging from 0.5 to 0.95. You can read more about it at reference [5].

Fine Tuning

The final, optional step of training is fine-tuning, which consists of un-freezing the entire model we obtained above, and re-training it on our data with a very low learning rate. This can potentially achieve meaningful improvements, by incrementally adapting the pretrained features to the new data. The learning rate can be adjusted in the hyperparameters-configurations file. For this tutorial’s demonstration, we’ll adopt the hyperparameters defined in the built-in ‘hyp.finetune.yaml’ file, which has a much smaller learning rate than the default. The weights will be initialized with the weights saved in the previous step.

python train.py --hyp 'hyp.finetune.yaml' --batch 16 --epochs 100 --data 'data/penguins_data.yaml' --weights 'runs_penguins/feature_extraction/weights/best.pt' --project 'runs_penguins' --name 'fine-tuning' --cache
  • hyp — path to the hyperparameters-configurations file

As we can see below, during the fine-tuning stage the metrics and losses are still improving.

Results of ‘fine tuning’ training | image by author

Validation

To evaluate our model we’ll utilize the validation script. Performance can be evaluated over the train, validation or test dataset splits, controlled by the ‘task’ argument. Here, the test dataset split is evaluated:
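(A sketch; the weights path assumes the fine-tuned model saved in the previous step.)

python val.py --weights 'runs_penguins/fine-tuning/weights/best.pt' --data 'data/penguins_data.yaml' --task 'test'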

We can also obtain the Precision-Recall curve, which is automatically saved at each validation run.

Precision — Recall Curve of the test data split | Image by author

Inference

Once we have obtained satisfying training performance, our model is ready for inference. At inference time, we can further boost the prediction accuracy by applying test-time augmentation (TTA): each image is augmented (horizontal flip and 3 different resolutions), and the final prediction is an ensemble of all these augmentations. If we’re tight on the Frames-Per-Second (FPS) rate, we’ll have to ditch the TTA, since inference with it takes 2–3 times longer.

The input for inference can be an image, a video, a directory, a webcam, a stream or even a YouTube link. In the following detection command, the test data is used for inference.
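(A sketch; the source path and the confidence/IoU thresholds below are illustrative.)

python detect.py --source '../datasets/penguins/images/test' --weights 'runs_penguins/fine-tuning/weights/best.pt' --img 640 --conf 0.6 --iou 0.65 --augment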

  • source — input path (0 for webcam)
  • weights — weights path
  • img — image size for inference, in pixels
  • conf — confidence threshold
  • iou — IoU threshold for NMS (Non-Maximum Suppression)
  • augment — augmented inference (TTA)

Inference results are automatically saved to the defined folder. Let’s review a sample of the test predictions:

Inference results | Image by author

Export to other file formats

By now, our model is complete, and saved in the common PyTorch convention, with a ‘.pt’ file extension. The model can be exported to other file formats such as ONNX and TensorRT. ONNX is an intermediary machine learning file format used to convert between different machine learning frameworks [6]. TensorRT is a library developed by NVIDIA for the optimization of machine learning models, to achieve faster inference on NVIDIA graphics processing units (GPUs) [7].

The ‘export.py’ script is used to convert PyTorch models to ONNX, a TensorRT engine or other formats, by passing the desired format to the ‘include’ argument. The following command is used to export our penguins model to ONNX and TensorRT. These new file formats are saved under the same ‘weights’ folder as the PyTorch model.
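A sketch of the export command (TensorRT export requires running on a GPU, hence the ‘--device 0’ flag; the exact ‘include’ keywords may vary between repo versions):

python export.py --weights 'runs_penguins/fine-tuning/weights/best.pt' --include onnx engine --device 0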

_________________________________________________________________

Thank you for reading!

Want to learn more?

The full code for the tutorial is provided in the first reference [0]:

References

[0] Full code: https://gist.github.com/Lihi-Gur-Arie/41f014bcfbe8b8e1e965fa11a6251e04

[1] https://arxiv.org/abs/1506.02640

[2] https://github.com/ultralytics/yolov5

[3] https://roboflow.com/

[4] https://zenodo.org/record/4679653#.YfFLX3UzaV4

[5] https://blog.paperspace.com/mean-average-precision/

[6] https://onnx.ai/get-started.html

[7] https://developer.nvidia.com/tensorrt
