Detecting objects in urban scenes using YOLOv5

Jean-Sébastien Grondin
Towards Data Science
9 min read · Apr 2, 2021

As part of my Master’s degree in Machine Learning at MILA (Quebec’s AI Institute), and while working at the City of Montreal, I developed an AI-enabled urban object detection solution for video feeds sourced from Pan-Tilt-Zoom (PTZ) traffic cameras. The prototype can detect five classes of objects: vehicles, pedestrians, buses, cyclists and construction objects. Here is one example of detection for a single frame, taken at the intersection of de la Montagne Street and René-Lévesque Boulevard in downtown Montreal.

Example of object detection showing all five object classes — Image by Author

To show how the model generalizes to another urban context, here is an example of offline detection (i.e. not in real time), where an inference was produced for every frame of this test YouTube video.

Example of offline detection on a test urban scene from this YouTube video — Video by Author

Thanks to the City of Montreal’s open data and open source software policies, I am happy to announce that the project is now open-source! This week, we released the code, the trained models and all annotated images (dataset).

I have also written a detailed technical report which, pending review and approval by MILA, I will share as well (stay tuned!). In the meantime, my goal with this blog post is to give a brief overview of the project and its outcomes, and to raise awareness of this newly opened dataset.

IMPORTANT: Before going further, I want to clarify the motivation behind this project, to reassure those who may have doubts about the use of this technology. The Centre de Gestion de la Mobilité Urbaine (CGMU) is the heart and brain of the City of Montreal’s intelligent transport system. With over 500 traffic cameras (see map) installed throughout the territory, CGMU operators can monitor traffic on the road network and address issues as they arise (e.g. an accident or a broken-down vehicle). To help them detect incidents faster, the City of Montreal wishes to set up an automatic road anomaly detection system. The object detection solution we are now releasing is one important building block towards that longer-term goal.

ALSO IMPORTANT: Respecting citizen privacy was taken very seriously throughout the project, especially now that we are releasing the annotations. It is important to note that these annotations are associated with images that were already publicly available via the city’s open data website. Additionally, while the cameras used to obtain these images were installed to assist CGMU operators in their daily traffic management tasks, they were also tuned to limit the information collected. For example, the camera resolution is set so that neither faces nor license plates can be recognized, and images are kept and published only at 5-minute intervals. Montreal’s digital data charter is a good reference for those who would like to know more about how the city manages and regulates the life cycle of digital data.

Now that we have cleared this up, let’s jump into it!

Key Features

Here are some key features of the final object detection models. They:

  • Were trained on a single GPU using nearly 19k images (7,007 training images from the City of Montreal + 11,877 images from the MIO-TCD dataset).
  • Include both a small and a larger neural network architecture, which run at a latency of 167 ms on CPU and 16.6 ms on GPU, respectively.
  • Are compatible with different camera makes and models, image resolutions, and different zoom and orientation settings.
  • Are robust to visual artifacts (e.g. out-of-focus images, raindrops or dust on the lens, sun glare) and to varying daylight and weather conditions.

Dataset

The City of Montreal dataset is composed of 10,000 images at resolutions of 350x288, 352x240 and 704x480, with their associated object detection annotations for vehicles, buses, pedestrians, cyclists and construction objects in Pascal VOC format. All annotations were generated using CVAT (Computer Vision Annotation Tool). The dataset was then split into four parts, as follows:

  • 70% of images for training,
  • 10% for validation,
  • 10% for in-domain testing,
  • 10% for out-domain testing.
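
Since the annotations follow the Pascal VOC format, they can be read with nothing more than the Python standard library. Here is a minimal parsing sketch; the file path and class names are placeholders, but the XML layout is standard Pascal VOC:

```python
import xml.etree.ElementTree as ET

def parse_voc_annotation(xml_path):
    """Parse one Pascal VOC XML file into a list of (label, box) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        label = obj.find("name").text
        bb = obj.find("bndbox")
        # Pascal VOC boxes are absolute pixel corners: (xmin, ymin, xmax, ymax)
        box = tuple(int(float(bb.find(k).text))
                    for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((label, box))
    return boxes

# Example: print every object annotated in one image
for label, box in parse_voc_annotation("annotations/example.xml"):
    print(label, box)
```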

The training set was used for learning the model weights and biases, while the validation set was used during training to monitor performance and determine when the model started overfitting. It was also used to evaluate different hyperparameter configurations. The test sets were used at the end of the project to report performance.

The training, validation and in-domain test sets all use the same sub-population of cameras (i.e. intersections), while the out-domain test split uses a reserved population of cameras that are never seen during training.
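
The key detail is that whole cameras, not individual images, are held out for the out-domain split, so no frame from a reserved intersection can leak into training. Here is a minimal sketch of such a grouped split (the helper, proportions and camera-ID extraction are illustrative, not the project’s exact code):

```python
import random

def split_dataset(image_paths, camera_id_of, seed=0):
    """Hold out whole cameras for out-domain testing, then split the
    remaining images (all from 'seen' cameras) into train/val/in-domain test."""
    rng = random.Random(seed)

    cameras = sorted({camera_id_of(p) for p in image_paths})
    rng.shuffle(cameras)
    held_out = set(cameras[: max(1, len(cameras) // 10)])  # ~10% of cameras

    test_out = [p for p in image_paths if camera_id_of(p) in held_out]
    seen = [p for p in image_paths if camera_id_of(p) not in held_out]
    rng.shuffle(seen)

    n = len(seen)
    return {
        "train": seen[: int(0.7 * n)],             # learning weights and biases
        "val": seen[int(0.7 * n): int(0.8 * n)],   # monitoring and model selection
        "test_in": seen[int(0.8 * n):],            # same cameras as training
        "test_out": test_out,                      # cameras never seen in training
    }
```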

The table below shows how many instances of each class are included in each set. The class distribution is clearly imbalanced, with far more vehicles than cyclists and buses.

Total number of images and objects from different classes in each data split — City of Montreal Dataset — Image by Author

To provide more training examples of the under-represented classes, an additional dataset was used: the MIOvision Traffic Camera Dataset (MIO-TCD). Since it contains images with very similar viewing angles and resolutions, it was perfect for our needs. It also contains images from winter months, an interesting addition considering that the City of Montreal dataset only includes images from summer months. All images containing cyclists, pedestrians or buses were selected, for a total of 11,877 additional images. As a result, we were able to add 2,260 cyclists and 8,319 buses to the training examples. Note, however, that MIO-TCD does not contain any labeled construction objects.
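
Assuming the MIO-TCD annotations have been converted to the same Pascal VOC layout (the directory name below is illustrative), selecting only the images that contain at least one under-represented class takes a few lines of Python:

```python
import glob
import xml.etree.ElementTree as ET

TARGET_CLASSES = {"cyclist", "pedestrian", "bus"}  # under-represented classes

def labels_in(xml_path):
    """Return the set of class names annotated in one Pascal VOC file."""
    root = ET.parse(xml_path).getroot()
    return {obj.find("name").text for obj in root.findall("object")}

# Keep only annotation files that mention at least one target class
selected = [p for p in glob.glob("mio_tcd_voc/*.xml")
            if labels_in(p) & TARGET_CLASSES]
print(f"{len(selected)} images retained")
```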

Sample images from MIO-TCD dataset — Image by Author

Models

YOLOv5, an object detection model released in May 2020 as a PyTorch implementation on GitHub, was selected as the foundation for this project. At the time we evaluated our options, YOLOv5 was one of the fastest and most accurate object detection models available. In addition, it benefited from a very large community of users, which meant it was under active development, with improvements being made on a weekly basis. For the sake of stability, commit bb8872 (released on September 8th, 2020) was forked and used for project-specific development. YOLOv5 comes in four network architecture sizes: Small (S), Medium (M), Large (L) and X-Large (X).
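
For readers who want to try YOLOv5 quickly, the Ultralytics repository exposes its models through torch.hub. The snippet below is a minimal inference sketch using the COCO-pretrained Small model; the commented line shows the path-based variant for loading custom weights such as ours (the weights filename is a hypothetical placeholder):

```python
import torch

# COCO-pretrained YOLOv5 Small, fetched from the Ultralytics repository
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

# Hypothetical: load locally trained weights instead
# model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

results = model("https://ultralytics.com/images/zidane.jpg")
results.print()          # summary of detections per class
boxes = results.xyxy[0]  # tensor of (x1, y1, x2, y2, confidence, class)
```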

The Single Shot MultiBox Detector (SSD) was selected as a baseline model against which to compare YOLOv5. At the time of its creation in 2015, the SSD was one of the fastest models available, making it well suited to real-time applications. It is no longer considered state-of-the-art but, given its simplicity, remains frequently used as a baseline.

Transfer Learning

Pre-training a model on a very large dataset to learn meaningful representations, and subsequently fine-tuning it on the task of interest, is often beneficial to performance. This strategy was adopted for both YOLOv5 (pre-trained on the MS COCO object detection dataset) and the SSD (pre-trained on the ImageNet image classification dataset).
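
To give a concrete picture of what this looks like with YOLOv5 (a sketch of the torch.hub interface, not the project’s exact training entry point), requesting a different number of classes re-initializes the detection head while the COCO-pretrained weights are transferred for every layer whose shape still matches; the fine-tuning itself is then driven by the repository’s train.py script:

```python
import torch

# Re-initialize the detection head for our 5 classes; backbone layers keep
# their COCO-pretrained weights (autoshape is disabled for non-80-class heads)
model = torch.hub.load("ultralytics/yolov5", "yolov5s",
                       classes=5, pretrained=True, autoshape=False)
print(model)  # inspect the architecture before fine-tuning
```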

Augmentations

Both YOLOv5 and the SSD use data augmentation at training time to obtain a solution that is less prone to overfitting. From epoch to epoch, different augmentations are sampled and applied to the same input images, artificially increasing the dataset size and input image variability. Some examples of photometric and geometric augmentations are shown below:

Examples of photometric augmentations — Image by Author
Additional examples of photometric augmentations — Image by Author
Examples of geometric augmentations — Image by Author
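
As an illustration of what such augmentations look like in code, here is a sketch built from generic torchvision transforms (not the exact implementations inside YOLOv5 or the SSD, and the parameter values are arbitrary). Note that in a detection pipeline the geometric transforms must also be applied to the bounding boxes, which this image-only sketch does not do:

```python
import torchvision.transforms as T

# Photometric: perturb colours without moving any pixels (boxes unchanged)
photometric = T.ColorJitter(brightness=0.4, contrast=0.4,
                            saturation=0.4, hue=0.1)

# Geometric: move pixels around; detection pipelines remap the boxes too
geometric = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomAffine(degrees=5, translate=(0.1, 0.1), scale=(0.8, 1.2)),
])

augment = T.Compose([photometric, geometric])
# augmented = augment(pil_image)  # re-sampled every epoch
```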

Mosaic augmentation is a newer image augmentation method that was introduced in YOLOv4 and retained in YOLOv5. It consists of mixing four different training images, which allows objects to be detected outside of their normal context and hence improves generalization. Another benefit is that it reduces the need for large mini-batch sizes.
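
A mosaic can be sketched in a few lines of PIL. This toy version pastes four images into the quadrants of one canvas at a fixed centre, whereas the real YOLOv5 implementation randomizes the centre point and remaps each bounding box into its tile’s quadrant:

```python
from PIL import Image

def simple_mosaic(paths, size=640):
    """Toy 2x2 mosaic: four training images combined into one canvas."""
    half = size // 2
    canvas = Image.new("RGB", (size, size))
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]
    for path, (x, y) in zip(paths, offsets):
        tile = Image.open(path).resize((half, half))
        canvas.paste(tile, (x, y))
    # In a real pipeline, every bounding box is shifted and scaled into
    # its tile's quadrant so the labels stay aligned with the pixels.
    return canvas
```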

Computational Resources

All experiments and training runs were conducted on a Google Colab Pro instance (which includes either an NVIDIA Tesla P100 or V100), with the exception of the hyperparameter search, which was performed on a GPU instance on the Google Cloud Platform (GCP). Here is a useful tutorial for setting up GCP to train our models.

Results

At first glance, the detection performance, expressed in terms of mean Average Precision (mAP), is far superior for YOLOv5 compared to the baseline SSD model. When trained on the City of Montreal dataset only, the SSD, with its 26.7M parameters, achieves an mAP of 0.466, while the YOLOv5 ‘M’ model achieves an mAP of 0.663 at roughly the same model size. All four YOLOv5 variants are in fact both faster at inference and faster to train.

Influence of model size on training time, performance and latency. Input image size: 320x320. Trained on the City of Montreal dataset only, for 300 epochs — Image by Author

Besides the architecture size, another configuration setting found to significantly influence performance and latency is the input image size. The baseline SSD uses an input size of 300x300, while sizes of 320x320, 512x512 and 704x704 were tested on the YOLOv5 ‘X’ model. The largest image resolution in the City of Montreal dataset is 704x480, so with an input size of 704x704 no pixel information is lost. This can have a big impact on the detection of small objects, such as pedestrians and construction cones, because they tend to be very narrow. The table below shows that a larger input size improves detection performance, but at the cost of increased GPU latency and training time.

Influence of input image size on training time, performance and latency. Batch sizes of 1 and 32 are used for CPU and GPU inference respectively. Dataset: City of Montreal, trained for 300 epochs — Image by Author

At a fixed architecture and input image size, adding the MIO-TCD dataset was found to increase the performance of the YOLOv5 model, but not of the SSD. The table below shows the incremental benefit for each class. All classes see an improvement except construction objects, which lose some performance. This is not surprising, since there are no construction objects in the MIO-TCD dataset. One drawback of combining both datasets is the increase in training time, which is expected considering that the number of training examples more than doubles.

Influence of adding MIO-TCD dataset on performance and training time. Trained for 300 epochs — Image by Author

A rigorous hyperparameter search was undertaken using Orion, an asynchronous framework for black-box function optimization, which is already integrated into the code just released by the City of Montreal. More details on this aspect will be available in the report. An interesting takeaway is that the improvement achieved through the hyperparameter search was minimal compared to the performance already obtained with the baseline hyperparameters from the original YOLOv5 repository. In other words, one can expect very good performance with the baseline hyperparameters and skip the exploration process altogether.
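
Orion hooks into a training script with very little code: the script reports its final validation metric back to the optimizer, and the search itself is launched from the command line with priors attached to the arguments (something like orion hunt -n yolo-search python train.py --lr~'loguniform(1e-5,1e-2)', following the Orion documentation). Below is a minimal sketch of the Python side, with a placeholder training function; it is not the project’s actual integration:

```python
import argparse
from orion.client import report_objective

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.01)
args = parser.parse_args()

def train_and_validate(lr):
    # Placeholder for the actual training run; should return validation mAP
    return 0.0

val_map = train_and_validate(args.lr)

# Orion minimizes the objective, so report 1 - mAP
report_objective(1.0 - val_map)
```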

The best final models were evaluated on the test sets. The table below shows the final performance on the in-domain (intersections seen during training) and the out-domain (new intersections) test images.

Final model performance on in-domain and out-domain test sets — Image by Author

As expected, the performance deteriorates on unseen intersections, but the quality of the detection remains excellent to the human eye, as shown below for the small model.

YOLOv5 Small with input image size of 512x512 : four out-domain test set images — Image by Author

Conclusion

With the future goal of detecting incidents and other anomalies on the road network, a natural next step for the City of Montreal would be to incorporate one of the pre-trained YOLOv5 architectures into a multi-object tracking solution. To help generate the multi-object tracking dataset required to train such a solution, our best model could be used to pre-annotate images and minimize the burden on human annotators.
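
As a sketch of that pre-annotation idea (the weights path, frame directory and output format are illustrative), detections could be dumped for every frame and then imported into an annotation tool such as CVAT for human correction:

```python
import glob
import pandas as pd
import torch

# Hypothetical path to the released Montreal weights
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.conf = 0.5  # keep only confident boxes to reduce correction work

rows = []
for frame in sorted(glob.glob("frames/*.jpg")):
    df = model(frame).pandas().xyxy[0]  # one row per detection
    df["frame"] = frame
    rows.append(df)

# One CSV of pre-annotations for human annotators to verify
pd.concat(rows).to_csv("preannotations.csv", index=False)
```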
