Introduction
Training an object detection model to find small objects can be very difficult, especially when computing resources are limited. Performance can be improved significantly by taking crops, or slices, of the annotated data, as defined below:
Slice: the combined image and annotation file pair for a sub-region of a base annotated image, with the annotation box coordinates normalized to the sub-region's coordinates
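Conceptually, producing a slice amounts to cropping a window from the base image and re-expressing every overlapping annotation box in the window's own coordinates. The sketch below illustrates the definition; the function name and box format are assumptions, not code from any particular tool:
import numpy as np

def make_slice(image, boxes, x0, y0, crop_w, crop_h):
    """Illustrative slice: crop a window and shift annotation boxes into it.

    image  -- H x W x C numpy array
    boxes  -- iterable of (xmin, ymin, xmax, ymax) in base-image pixels
    x0, y0 -- top-left corner of the crop window in base-image pixels
    """
    crop = image[y0:y0 + crop_h, x0:x0 + crop_w]
    sliced_boxes = []
    for xmin, ymin, xmax, ymax in boxes:
        # Translate into the window's frame and clip to its edges.
        nx0 = max(xmin - x0, 0)
        ny0 = max(ymin - y0, 0)
        nx1 = min(xmax - x0, crop_w)
        ny1 = min(ymax - y0, crop_h)
        if nx1 > nx0 and ny1 > ny0:  # keep only boxes that overlap the window
            sliced_boxes.append((nx0, ny0, nx1, ny1))
    return crop, sliced_boxes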
Data Cropping
Taking slices of large datasets is time-consuming and tedious, so a custom Data Cropper Tool was developed to do the work. Individuals interested in using it must contact me directly. Some basic parameters that can be specified are listed below:
- Crop Image Resolution – the nominal width and height of the desired output, e.g. 512 × 512 or 640 × 640
- Tolerance – the allowable deviation, as a ± percentage, from the crop image resolution
- Upscaling – the scale factor by which the slice borders are expanded in the final output
- Offset – the width/height pixel offset added to the slice borders in the final output
- Algorithm type – the grouping algorithm to be used
The Data Cropper Tool takes images similar to the featured image of this article and outputs slices like those shown below:


The dataset used for this demonstration is the Stanford Drone Dataset¹ ². It contains thousands of high-resolution images with thousands of annotated objects across six classes (Bicyclists, Pedestrians, Skateboarders, Carts, Cars, and Buses). Only 3000 annotated frames from the dataset were used for training. The SSD ResNet FPN³ object detection model is used at a resolution of 640 × 640. An FPN model was specifically chosen for its ability to detect smaller objects more accurately. When the base image is resized down to the training resolution, only a few pixels are left to represent each object's features. By taking smaller crops of the base image, fewer of the objects' features get distorted. However, the model's training config parameters must also be tuned differently for the sliced data. Additionally, the Data Cropper prioritizes square or near-square crops so that the image isn't distorted when resized during training. The difference in training input images can be seen below:


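To put numbers on that resizing effect, here is a back-of-envelope calculation; the frame and object sizes below are assumptions for illustration, not measurements from the dataset:
base_width = 1920    # assumed base frame width in pixels
model_input = 640    # SSD ResNet FPN input width
object_width = 30    # assumed pedestrian width in the base frame

scale = model_input / base_width
print(f"Full frame resized to {model_input} px: object is ~{object_width * scale:.0f} px wide")
print(f"Inside a 640 x 640 slice: object keeps its full {object_width} px")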
Results
A few metrics are important in determining the overall performance of training on the base data versus the sliced data. In this example, the metrics used to assess performance are listed below, followed by a short sketch of the IoU computation that underlies mAP:
- Mean Average Precision (mAP)
- Training time
- Inference time
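mAP is built on intersection over union (IoU): a detection counts as a true positive at a given threshold (e.g. 0.50) only if its IoU with a ground-truth box meets that threshold. A minimal IoU function for axis-aligned boxes:
def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix0 = max(box_a[0], box_b[0])
    iy0 = max(box_a[1], box_b[1])
    ix1 = min(box_a[2], box_b[2])
    iy1 = min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0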
The Object-Detection-Metrics⁴ ⁵ repository is used to assess the mAP of all the models. The standard configuration file is used for training the models, with the changes shown in the excerpt below:
ssd {
  inplace_batchnorm_update: true
  freeze_batchnorm: false
  num_classes: 6
  box_coder {
    faster_rcnn_box_coder {
      y_scale: 10.0
      x_scale: 10.0
      height_scale: 5.0
      width_scale: 5.0
    }
  }
  encode_background_as_zeros: true
  anchor_generator {
    multiscale_anchor_generator {
      min_level: 3
      max_level: 7
      anchor_scale: 4.0
      aspect_ratios: [1.0, 2.0, 0.5]
      scales_per_octave: 4
      normalize_coordinates: true
    }
  }
  post_processing {
    batch_non_max_suppression {
      score_threshold: 10e-2
      iou_threshold: 0.6
      max_detections_per_class: 300
      max_total_detections: 400
    }
    score_converter: SIGMOID
  }
  # ... remaining ssd settings unchanged from the standard configuration ...
}

train_config: {
  fine_tune_checkpoint_version: V2
  fine_tune_checkpoint: "C:/Tensorflow/models/research/object_detection/ssd_resnet50_v1_fpn_640x640_coco17_tpu-8/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  batch_size: 4
  sync_replicas: true
  startup_delay_steps: 0
  replicas_to_aggregate: 8
  use_bfloat16: false
  num_steps: 6000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  # ... remaining train_config settings unchanged ...
}
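As a quick sanity check, the TF2 Object Detection API³ can load the pipeline file programmatically. The snippet below assumes the object_detection package is installed and the file is saved as pipeline.config:
from object_detection.utils import config_util

# Parse the pipeline file into its component protos.
configs = config_util.get_configs_from_pipeline_file("pipeline.config")
print(configs["model"].ssd.num_classes)   # expect 6
print(configs["train_config"].num_steps)  # expect 6000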
Three models were trained with the following slicing configurations:
- No slices – base image used
- 480 × 480 slices
- 640 × 640 slices
The mAP for each of these three configurations is shown below:

The models using the sliced data perform 30 to 50 percent better than the model using the base data, depending on the slice resolution chosen. For the tested crop resolution range of 480 to 640, the mAP at every threshold except 0.90 increases with the crop resolution. A possible explanation is that while the crop resolution increased, the tolerance was held constant at 10 %, giving the cropping tool more freedom to generate better crops of the data, and more of them. Additionally, as the slice dimension increases past 640, fewer low-resolution slices that must be upscaled to 640 × 640 are generated for training. The 10 % tolerance allows the slice dimension to deviate in either direction from the specified value, so 432 × 432 slices can be taken when a slice dimension of 480 is specified. This upscaling effect disappears entirely once the specified slice dimension exceeds roughly 710 with a 10 % tolerance, since even the smallest allowed slice then exceeds 640 pixels (see the sketch below).
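A small sketch of the tolerance arithmetic, assuming the ± tolerance applies multiplicatively to the specified slice dimension:
def tolerance_window(nominal, tol=0.10):
    # Smallest and largest slice dimensions allowed under a +/- tolerance.
    return nominal * (1 - tol), nominal * (1 + tol)

for dim in (480, 640, 712):
    low, high = tolerance_window(dim)
    print(f"{dim}: slices allowed from {low:.0f} to {high:.0f} px")
# 480 allows slices as small as 432 px, which are upscaled to 640 x 640.
# Above roughly 710 (640 / 0.9), even the smallest allowed slice already
# exceeds 640 px, so upscaling no longer occurs.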
The computer used for training runs TensorFlow-GPU and has the following specifications:
- NVIDIA RTX 2060 Super (8 GB VRAM)
- AMD Ryzen 5 3600 6-core processor (3.6 GHz)
- G.Skill Ripjaws V Series 32 GB 3200 MHz DDR4 RAM
- Samsung 970 EVO 500 GB SSD (3500/2500 MB/s read/write)
The average number of steps per second, plotted in TensorBoard, is used to compare training time across all three models and is summarized below:

The slicing configurations train at broadly similar rates, with the 480 × 480 configuration training about 13 % slower than the 640 × 640 and base configurations.
Inference time was determined by running each detection model on a video file of drone footage taken by the author. Here, inference speed means the frames per second achieved while detecting objects in the video. Inference speed remains fairly constant across all three models and is summarized below:

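The measurement itself can be as simple as counting processed frames per wall-clock second. A minimal sketch, where detect_fn is an assumed wrapper that runs one of the trained models on a single frame (OpenCV is used only to read the video):
import time
import cv2

def measure_fps(detect_fn, video_path):
    """Average frames per second while running detection over a video."""
    cap = cv2.VideoCapture(video_path)
    frames = 0
    start = time.perf_counter()
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        detect_fn(frame)  # assumed model wrapper: frame -> detections
        frames += 1
    cap.release()
    return frames / (time.perf_counter() - start)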
Conclusion
The key takeaway is that taking data crops significantly increases the performance of an object detection model on small objects, with little cost in inference speed.
All of the results were collected with the same configuration parameters, so further performance gains can likely be made by tuning the training config parameters specifically for the sliced data.
Extra
The same performance metrics were measured for other datasets and yielded similar results. Shown below are the results for a high-resolution (4000 × 3000) aerial image dataset.

Data Cropping increases the mean average precision by 4 to 5 times for this dataset. One reason it is so effective here is that the objects are captured at very high resolution, so their distinctive features are easy to recognize when the crops are close to the training resolution. Conversely, when the full 4000 × 3000 base image is resized down to 640 × 640, the small objects become very difficult to detect.
Similar trends in training time and inference speed were observed, as shown below:


References
[1] Stanford Drone Dataset. Computational Vision and Geometry Lab. https://cvgl.stanford.edu/projects/uav_data/. Accessed 2 Dec 2020.
[2] Robicquet A, Sadeghian A, Alahi A, Savarese S (2016) Learning Social Etiquette: Human Trajectory Prediction in Crowded Scenes. In: European Conference on Computer Vision (ECCV).
[3] TensorFlow (2020) tensorflow/models. GitHub. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md. Accessed 23 Nov 2020.
[4] Padilla R. rafaelpadilla/Object-Detection-Metrics. GitHub. https://github.com/rafaelpadilla/Object-Detection-Metrics. Accessed 23 Nov 2020.
[5] Padilla R, Netto SL, da Silva EAB (2020) A Survey on Performance Metrics for Object-Detection Algorithms. In: 2020 International Conference on Systems, Signals and Image Processing (IWSSIP).