Object Detection On Aerial Imagery Using RetinaNet

ESRI Data Science Challenge 2019 3rd place solution

Kapil Varshney
Towards Data Science


(Left) the original image. (Right) Car detections using RetinaNet, marked in green boxes
Detecting cars and swimming pools using RetinaNet

Introduction

For tax assessment purposes, surveys are usually conducted manually on the ground. These surveys are important for calculating the true value of properties. For example, having a swimming pool can increase a property's price. Similarly, the count of cars in a neighborhood or around a store can indicate the level of economic activity at that place. Being able to do this through aerial imagery and AI can significantly help these processes by removing the inefficiencies, high cost, and time required by human surveyors.

To solve this problem, we'll try to detect cars and swimming pools in 224x224-pixel RGB chips of aerial imagery. The training dataset had 3748 images with bounding box annotations and labels in PASCAL VOC format.

This problem, along with the dataset, was posted by ESRI on HackerEarth as the ESRI Data Science Challenge 2019. I participated and secured 3rd place on the public leaderboard with a mAP (mean Average Precision) of 77.99 at IoU = 0.3, using the state-of-the-art RetinaNet model. In the following post, I'll explain how I approached this problem.

RetinaNet

RetinaNet builds on existing single-stage object detection models (like YOLO and SSD) with two key improvements:

  1. Feature Pyramid Networks for Object Detection
  2. Focal Loss for Dense Object Detection

Feature Pyramid Network

Pyramid networks have been used conventionally to identify objects at different scales. A Feature Pyramid Network (FPN) makes use of the inherent multi-scale pyramidal hierarchy of deep CNNs to create feature pyramids.

The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone, RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which allows the work to focus on a novel focal loss function that closes the accuracy gap between this one-stage detector and state-of-the-art two-stage detectors such as Faster R-CNN with FPN, while running at faster speeds.

Focal Loss

Focal Loss is an improvement on cross-entropy loss that reduces the relative loss for well-classified examples and puts more focus on hard, misclassified examples.

The focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.

Focal Loss Function
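For reference, the (alpha-balanced) focal loss defined in the original paper is

FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)

where p_t is the model's estimated probability for the ground-truth class, gamma >= 0 is the focusing parameter that down-weights easy examples (gamma = 0 recovers standard cross-entropy), and alpha_t is an optional class-balancing weight. The paper reports gamma = 2 with alpha = 0.25 working best.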

If you are further interested in the finer details of the model, I suggest reading the original papers and the very helpful and descriptive blog ‘The intuition behind RetinaNet’.

Now, let’s get started with the actual implementation and get coding. Here’s the GitHub repository you can follow along with:

Installing RetinaNet

We’ll use the awesome Keras implementation of RetinaNet by Fizyr. I am assuming you have your deep learning machine set up. If not, follow my guide here. Also, I would recommend using a virtual environment. The following script will install RetinaNet and the other required packages.

Alternatively, you can use a GPU instance (p2.xlarge) on AWS with the “deep-learning-for-computer-vision-with-python” AMI. This AMI comes pre-installed with keras-retinanet and the other required packages. You can start using the model after activating the RetinaNet virtual environment with the workon retinanet command.

Note: RetinaNet is computationally heavy. It requires at least 7–8 GB of GPU memory for a batch of 4 images (224x224).

Once the RetinaNet is installed, create the following directory structure for this project.

I’ll explain each of these in detail, but here is an overview:
build_dataset.py — Python script to create the train/test set
config/esri_retinanet_config.py — config file to be used by the build script.
dataset/annotations — directory to hold all image annotations
dataset/images — directory to hold all images
dataset/submission_test_data_images — the submission test directory for the Esri Data Science challenge. You can ignore this if you are working on your own dataset and a different project.
snapshots — the directory where all the snapshots of training will be saved after each epoch
models — the directory where snapshots converted for evaluation and testing will be saved
tensorboard — the directory where the training logs will be saved, to be used by TensorBoard
predict.py — script to make predictions on the submission test files

Building the dataset

First, we need to write a config file that will hold the paths to the images, the annotations, the output CSVs (train, test, and classes), and the train-test split value. Having such a config file makes the code versatile for use with different datasets.
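Here is a minimal sketch of what config/esri_retinanet_config.py could look like (apart from TRAIN_TEST_SPLIT, the exact variable names are illustrative):

# config/esri_retinanet_config.py
import os

# base path to the dataset directory
BASE_PATH = "dataset"

# paths to the images and annotations directories
IMAGES_PATH = os.path.sep.join([BASE_PATH, "images"])
ANNOT_PATH = os.path.sep.join([BASE_PATH, "annotations"])

# fraction of the data to use for training; the rest becomes the test set
TRAIN_TEST_SPLIT = 0.75

# output CSVs consumed during training and evaluation
TRAIN_CSV = os.path.sep.join([BASE_PATH, "train.csv"])
TEST_CSV = os.path.sep.join([BASE_PATH, "test.csv"])
CLASSES_CSV = os.path.sep.join([BASE_PATH, "classes.csv"])

# directory where per-image prediction files will be written by predict.py
OUTPUT_DIR = os.path.sep.join([BASE_PATH, "predictions"])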

In this config file, TRAIN_TEST_SPLIT = 0.75. It is standard practice to have a 75–25, a 70–30, or in some cases even an 80–20 split between the training and testing portions of the original dataset. For the purpose of this competition, however, I did not make a testing dataset and used the complete dataset for training. This was done because only a small dataset of 3748 images was provided. Moreover, a separate test set of 2703 images was provided (without annotations), on which the model could be evaluated by submitting the predictions online.

Next, let’s write a Python script that will read all the image paths and annotations and output the three CSVs required for training and evaluating the model:

  1. train.csv — This file will hold all the annotations for training in the following format: <path/to/image>,<xmin>,<ymin>,<xmax>,<ymax>,<label>
    Each row will represent one bounding box; therefore, one image can be present in multiple rows, depending on how many objects have been annotated in that image.
  2. test.csv — Similar to train.csv in format, this file will hold all the annotations for testing the model.
  3. classes.csv — A file with all unique class labels in the dataset with index assignments (starting from 0 and ignoring the background)

Let’s start by creating a build_dataset.py file and importing the required packages. Notice, we import the esri_retinanet_config.py file that we created earlier in the config directory and we give it an alias config.
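A sketch of the start of build_dataset.py (the argument names and defaults are illustrative; an empty config/__init__.py makes the config directory importable):

# build_dataset.py
from config import esri_retinanet_config as config
from bs4 import BeautifulSoup
import argparse
import random
import os

# argument parser -- every option falls back to a default from the config file
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--images", default=config.IMAGES_PATH,
    help="path to the images directory")
ap.add_argument("-a", "--annotations", default=config.ANNOT_PATH,
    help="path to the annotations directory")
ap.add_argument("-t", "--train", default=config.TRAIN_CSV,
    help="path to the output training CSV")
ap.add_argument("-e", "--test", default=config.TEST_CSV,
    help="path to the output test CSV")
ap.add_argument("-c", "--classes", default=config.CLASSES_CSV,
    help="path to the output classes CSV")
ap.add_argument("-s", "--split", type=float, default=config.TRAIN_TEST_SPLIT,
    help="fraction of the data to use for training")
args = vars(ap.parse_args())

# assign easy variable names to each argument
imagesPath = args["images"]
annotPath = args["annotations"]
trainCSV = args["train"]
testCSV = args["test"]
classesCSV = args["classes"]
trainTestSplit = args["split"]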

In the code above, we create an argument parser to optionally take in the image and annotation paths, the output CSV paths, and the train-test split. Yes, I know we already defined these values in the config file. But I also found there were times when I wanted to create a subsample of images for an experiment, use a different train-test split, and so on. At those times, being able to pass these arguments when executing the script, without changing the config file, was quicker. You can see that the default value for each argument comes from the config file itself, so you are not required to provide any of them unless you want to. After parsing the arguments, assign easy variable names to each one.
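Continuing build_dataset.py, a sketch of the splitting logic (details such as the file-extension filter are my own additions):

# gather all image paths, shuffle them, and split into train / test
imagePaths = [os.path.sep.join([imagesPath, f])
    for f in os.listdir(imagesPath)
    if f.lower().endswith((".jpg", ".jpeg", ".png"))]
random.shuffle(imagePaths)

i = int(len(imagePaths) * trainTestSplit)
trainPaths = imagePaths[:i]
testPaths = imagePaths[i:]

# each entry: (<dataset_type>, <list_of_paths>, <output_CSV>)
datasets = [
    ("train", trainPaths, trainCSV),
    ("test", testPaths, testCSV),
]

# set that will collect every unique class label found in the annotations
CLASSES = set()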

In the preceding code, we read the image paths into a list, randomize the list, split it into train and test sets, and store them in a datasets list, where each entry has the format (<dataset_type>, <list_of_paths>, <output_CSV>). We also initialize the CLASSES set to hold all the unique class labels in the dataset.

Next, we loop over each dataset (train and test) and open the output CSV file to be written. For each dataset, we loop over each image path. For each image, extract the filename and build the corresponding annotation path. This works because, usually, the image and annotation files have the same name but different extensions. For example, dataset/images/0000001.jpg has its annotations in dataset/annotations/0000001.xml. Modify this section if your dataset follows a different naming convention. Using BeautifulSoup, parse the annotation (XML) file. We can then find the “width”, the “height”, and the “object(s)” from the parsed XML.

For every image, find all the objects and iterate over each one of them. Then, find the bounding box (xmin, ymin, xmax, ymax) and the class label (name) for each object in the annotation. Clean up by truncating any bounding box coordinate that falls outside the boundaries of the image. Also, do a sanity check: if, by error, any minimum value is larger than the corresponding maximum value, or vice versa, ignore that object and continue to the next one.

Now that we have all the information, we can proceed to write to the output CSV, one row at a time, while adding the labels to the CLASSES set, which will eventually hold all the unique class labels.
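Here is a sketch of the loop described in the last three paragraphs, using the variable names from the earlier snippets (the parsing details are illustrative):

for (dType, paths, outputCSV) in datasets:
    print("[INFO] building '{}' set...".format(dType))
    with open(outputCSV, "w") as csvFile:
        for imagePath in paths:
            # build the matching annotation path, e.g.
            # dataset/images/0000001.jpg -> dataset/annotations/0000001.xml
            filename = os.path.basename(imagePath)
            annotFile = os.path.splitext(filename)[0] + ".xml"
            annotXML = os.path.sep.join([annotPath, annotFile])

            # parse the PASCAL VOC annotation file
            with open(annotXML) as f:
                soup = BeautifulSoup(f.read(), "html.parser")
            w = int(soup.find("width").text)
            h = int(soup.find("height").text)

            # one CSV row per annotated object
            for obj in soup.find_all("object"):
                label = obj.find("name").text

                # truncate coordinates that fall outside the image
                xMin = max(0, int(float(obj.find("xmin").text)))
                yMin = max(0, int(float(obj.find("ymin").text)))
                xMax = min(w, int(float(obj.find("xmax").text)))
                yMax = min(h, int(float(obj.find("ymax").text)))

                # sanity check: skip boxes with inverted coordinates
                if xMin >= xMax or yMin >= yMax:
                    continue

                # <path/to/image>,<xmin>,<ymin>,<xmax>,<ymax>,<label>
                csvFile.write("{},{},{},{},{},{}\n".format(
                    imagePath, xMin, yMin, xMax, yMax, label))
                CLASSES.add(label)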

The last thing left to build the dataset in the required format is to write the class labels, with their respective indexes, to a CSV. In the ESRI dataset, there are only two classes — cars (label: ‘1’, index: 1) and swimming pools (label: ‘2’, index: 0). This is how classes.csv looks for the Esri dataset:

2,0
1,1
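A final snippet in build_dataset.py can write this mapping, one “label,index” row per class (here the index simply follows the iteration order of the CLASSES set):

# write the class label / index mapping required by keras-retinanet
with open(classesCSV, "w") as f:
    for (i, label) in enumerate(CLASSES):
        f.write("{},{}\n".format(label, i))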

Training and Evaluating the model

Now that the dataset is ready and RetinaNet is installed, let’s proceed to train the model on the dataset.

# For a list of all arguments
$ retinanet-train --help

To train the model, I used the following command:

$ retinanet-train --weights resnet50_coco_best_v2.1.0.h5 \
--batch-size 4 --steps 4001 --epochs 20 \
--snapshot-path snapshots --tensorboard-dir tensorboard \
csv dataset/train.csv dataset/classes.csv

It is advisable to load a pre-trained model or weights file instead of training from scratch, to speed up the training (the losses will start to converge earlier). I used the weights from a model with a ResNet50 backbone pre-trained on the COCO dataset. Use the following link to download the file.

https://github.com/fizyr/keras-retinanet/releases/download/0.5.0/resnet50_coco_best_v2.1.0.h5

The batch size and the number of steps will depend on your system configuration (primarily the GPU) and the dataset. I usually start with batch-size = 8 and then increase or decrease it by a factor of 2, depending on whether training starts successfully. If training begins successfully, I terminate it (CTRL+C) and restart with a higher batch size; otherwise, I restart with a lower one.

Once you have decided on the batch size, you’ll need to calculate the number of steps required to cover the entire dataset in each epoch. The following command gives you the count of rows in the train.csv created earlier in the dataset directory.

$ wc -l dataset/train.csv

The calculation for the step size is simple: steps = count of rows in train.csv / batch-size. (In the training command above, --steps 4001 with --batch-size 4 corresponds to roughly 4001 x 4 ≈ 16,000 annotation rows per epoch.) Next, set the number of epochs. In my experience, RetinaNet converges quickly, so a smaller number of epochs usually does the job. If not, you can always pick up the training from the last epoch and train the model further. Therefore, we provide a snapshot-path where the model will be saved after every epoch.

We will also provide a tensorboard-dir where all the logs will be saved, so that TensorBoard can be run to visualize the training as it proceeds. To launch TensorBoard, open a new terminal window and run the command below. Make sure you have TensorBoard installed before running it.

# To launch tensorboard
$ tensorboard --logdir <path/to/logs/dir>

Finally, provide the CSV files with the training annotations and class labels, and execute the training command. Now, go do an Iron Man or sleep or whatever while your model trains. Each epoch with 3748 (224x224) images took a bit over 2 hours on a Tesla K80 GPU on an AWS p2.xlarge instance.

Once the model has trained to your satisfaction, convert it into a format that can be used for evaluation and predictions.

# To convert the model
$ retinanet-convert-model <path/to/desired/snapshot.h5> <path/to/output/model.h5>
# To evaluate the model
$ retinanet-evaluate <path/to/output/model.h5> csv <path/to/train.csv> <path/to/classes.csv>
# Sample evaluation
95 instances of class 2 with average precision: 0.8874
494 instances of class 1 with average precision: 0.7200
mAP: 0.8037

In this sample evaluation on 125 test images, the model achieved 80.37% mAP (mean Average Precision) after training on 375 images for 18 epochs. It’s a good result for such a small dataset.

Predictions

Build a script, predict.py, that will use the trained model to make predictions on the submission images and write them to disk.

A few methods from the keras_retinanet utils are required to pre-process each image before it is fed into the model for prediction. Also, import the config file we created earlier to load a few paths.

Construct the argument parser to accept arguments when executing the script, and then parse the arguments. The model argument takes the path to the trained model file that will be used for making predictions. For the class labels and the predictions output directory, the default values are taken from the config file, so these are not required arguments. The input argument takes the path of the directory containing the images to make predictions on. Finally, the confidence argument is available to filter out weak predictions.
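A sketch of the top of predict.py (the keras_retinanet imports follow the library’s standard usage; the argument names and config.OUTPUT_DIR are assumptions carried over from the config sketch above):

# predict.py
from keras_retinanet.utils.image import read_image_bgr, preprocess_image, resize_image
from keras_retinanet import models
from config import esri_retinanet_config as config
import numpy as np
import argparse
import os

ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True,
    help="path to the trained (converted) model file")
ap.add_argument("-l", "--labels", default=config.CLASSES_CSV,
    help="path to the class labels CSV")
ap.add_argument("-i", "--input", required=True,
    help="path to the directory of images to predict on")
ap.add_argument("-o", "--output", default=config.OUTPUT_DIR,
    help="directory where prediction text files will be written")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
    help="minimum score required to keep a detection")
args = vars(ap.parse_args())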

Next, load the class label mapping from the class label CSV and store it in a dictionary. Load the model to be used for prediction. Use the directory path provided in the input argument to grab and build a list of all the image paths.
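Roughly, continuing the sketch above:

# map class index -> label, e.g. {0: "2", 1: "1"} for the Esri classes.csv
LABELS = {}
with open(args["labels"]) as f:
    for row in f.read().strip().split("\n"):
        (label, idx) = row.split(",")
        LABELS[int(idx)] = label

# load the converted inference model
model = models.load_model(args["model"], backbone_name="resnet50")

# list of all image paths in the input directory
imagePaths = [os.path.sep.join([args["input"], f])
    for f in sorted(os.listdir(args["input"]))]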

Iterate over each image path so that we can make predictions on each image in the provided dataset. For each image, extract the image file name from the image path, then construct and open an output text file where the predictions for that image will be saved. Next, load the image, preprocess it, resize it, and expand its dimensions before passing it to the model. The model returns the predicted boxes (bounding box coordinates), the probability score for each box, and the associated labels. Finally, rescale the bounding box coordinates according to the original image size.

Next, iterate over each detection predicted by the model. Skip the ones whose score is less than the confidence value provided. (If you want to calculate the mAP (mean Average Precision), keep all the predictions by passing 0.0 as the confidence argument.) The bounding box coordinates will be float values, so convert them into ints. Construct the row for each prediction in the required format: <classname> <confidence> <ymin> <xmin> <ymax> <xmax>, and write it to the file. Close the file once all the detections for that image have been written.
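Putting the last two paragraphs together, a sketch of the prediction loop (the pre-processing calls are the standard keras-retinanet inference pattern; variable names continue from the snippets above):

# make sure the output directory exists
os.makedirs(args["output"], exist_ok=True)

for imagePath in imagePaths:
    # one output text file per image, named after the image
    filename = os.path.splitext(os.path.basename(imagePath))[0]
    outputPath = os.path.sep.join([args["output"], filename + ".txt"])

    # load, preprocess and resize the image, then add a batch dimension
    image = read_image_bgr(imagePath)
    image = preprocess_image(image)
    (image, scale) = resize_image(image)
    image = np.expand_dims(image, axis=0)

    # run inference and rescale the boxes back to the original image size
    (boxes, scores, labels) = model.predict_on_batch(image)
    boxes /= scale

    with open(outputPath, "w") as f:
        for (box, score, label) in zip(boxes[0], scores[0], labels[0]):
            # filter weak detections (pass --confidence 0.0 to keep everything)
            if score < args["confidence"]:
                continue

            # box coordinates are floats in (xmin, ymin, xmax, ymax) order
            (xMin, yMin, xMax, yMax) = box.astype("int")

            # <classname> <confidence> <ymin> <xmin> <ymax> <xmax>
            f.write("{} {:.4f} {} {} {} {}\n".format(
                LABELS[label], score, yMin, xMin, yMax, xMax))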

$ python predict.py --model models/output.h5 --input dataset/submission_test_data_images --confidence 0.0

Run the above command to execute the predict.py script. Feel free to change the arguments according to your dataset and project.

Experiments and Results

Initially, I trained the model using only 10% of the data (375 images) for 18 epochs. This model reached a mAP of 71 on the test images with a confidence threshold of 0.5. I then resumed training on the complete dataset of 3748 images for another 10 epochs, which increased the mAP to 74.

Next, I decided to tweak the model a bit and change the anchor boxes. The dataset had only square bounding boxes, so I changed the anchor aspect ratios from [0.5, 1, 2] to just [1]. It seemed like a good experiment to try, but I later realized it wasn't, because the bounding boxes no longer stay square once the images are augmented. It did make the network train much faster on the full dataset, since the network became smaller. The accuracy of predictions also increased a bit at first, but then started to drop. I decided to use the results from the 2nd epoch of this run, with a confidence value of 0.0 to include all predictions. This resulted in a mAP of 77.99, which secured me the 3rd place in the challenge. I also, unsuccessfully, tried a few other experiments with the image scales used for the FPN and with the data augmentation parameters, but stuck with the earlier results for the final submission.

Summary

In this post, we discussed the state-of-the-art RetinaNet model and how I used it in the Esri Data Science Challenge 2019 to detect cars and swimming pools in 224x224 tiles of aerial imagery. We started by structuring the project directory. Next, we built the train/test dataset used by the model. The model was trained with the appropriate arguments, and the trained model was later converted for evaluation and prediction. We then created another script to make detections on the submission test images and to write the predictions to disk. In the end, I briefly described the experiments I tried and the results I achieved.


Thanks for going through this post. I hope it helps you. Feel free to leave a message with comments/suggestions. You can also connect with me on LinkedIn. Here’s the GitHub repository with the code:
