How to Optimize Object Detection Models for Specific Domains

Design better and faster models to solve your specific problem

Alvaro Leandro Cavalcante Carneiro
Towards Data Science

Object detection is widely employed across different domains, from academia to industry, thanks to its ability to deliver strong results at a relatively low computational cost. However, despite the many open-source architectures publicly available, most of these models are designed to address general-purpose problems and may not be a good fit for specific contexts.

As an example, consider the Common Objects in Context (COCO) dataset, which is typically used as a baseline for research in this field and influences the hyperparameters and architectural details of the models. This dataset comprises 90 distinct classes captured under various lighting conditions, backgrounds, and object sizes. It turns out that, sometimes, the detection problem you are facing is relatively simple: you may want to detect just a few distinct objects without much scene or size variation. In this case, if you train your model using a generic set of hyperparameters, you will likely end up with a model that incurs unnecessary computational cost.

With this perspective in mind, the primary goal of this article is to provide guidance on optimizing various object detection models for less complex tasks. I want to help you select a more efficient configuration that reduces computational costs without compromising the mean Average Precision (mAP).

Providing some context

One of the goals of my master’s degree was to develop a sign language recognition system with minimal computational requirements. A crucial component of this system is the preprocessing stage, which involves the detection of the interpreter’s hands and face, as depicted in the figure below:

Samples from the HFSL dataset that were created in this work. Image by author.

As illustrated, this problem is relatively straightforward, involving only two distinct classes and three concurrently appearing objects in the image. For this reason, my aim was to optimize the models' hyperparameters to maintain a high mAP while reducing the computational cost, thus enabling efficient execution on edge devices such as smartphones.

Object detection architectures and setup

In this project, the following object detection architectures were tested: EfficientDet-D0, Faster R-CNN, SSD320, SSD640, and YoloV7. However, the concepts presented here can be applied to adapt various other architectures.

For model development, I primarily utilized Python 3.8 and the TensorFlow framework, with the exception of YoloV7, where PyTorch was employed. While most examples provided here relate to TensorFlow, you can adapt these principles to your preferred framework.

In terms of hardware, the testing was conducted using an RTX 3060 GPU and an Intel Core i5–10400 CPU. All the source code and models are available on GitHub.

Fine-tuning of object detectors

When using TensorFlow for object detection, it’s essential to understand that all the hyperparameters are stored in a file named “pipeline.config”. This protobuf file holds the configurations used to train and evaluate the model, and you’ll find it in any pre-trained model downloaded from TF Model Zoo, for instance. In this context, I will describe the modifications I’ve implemented in the pipeline files to optimize the object detectors.

It’s important to note that the hyperparameters provided here were specifically designed for hand and face detection (2 classes, 3 objects). Be sure to adapt them for your own problem domain.
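
If you prefer to script these edits rather than changing the file by hand, the TensorFlow Object Detection API provides a “config_util” module for reading and writing pipeline protos. The snippet below is a minimal sketch of that workflow; the paths are hypothetical, so adapt them to your own setup.

# Minimal sketch: load, edit, and re-save a pipeline.config programmatically.
# Assumes the TensorFlow Object Detection API is installed and that
# PIPELINE_PATH points to the pipeline.config of a model from the TF Model Zoo.
from object_detection.utils import config_util

PIPELINE_PATH = "pretrained_model/pipeline.config"  # hypothetical path

# Returns a dict with keys such as "model", "train_config", and "eval_config".
configs = config_util.get_configs_from_pipeline_file(PIPELINE_PATH)

# Example edit, matching the change described in the next section.
configs["train_config"].max_number_of_boxes = 4

# Rebuild the proto and write pipeline.config into the target directory
# (the directory is assumed to exist).
pipeline_proto = config_util.create_pipeline_proto_from_configs(configs)
config_util.save_pipeline_config(pipeline_proto, "optimized_model/")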

General simplifications

The first change, which can be applied to all models, is reducing the maximum number of predictions per class and the number of generated bounding boxes from 100 to 2 and 4, respectively. You can achieve this by adjusting the “max_number_of_boxes” property inside the “train_config” object:

...
train_config {
batch_size: 128
sync_replicas: true
optimizer { ... }
fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED"
num_steps: 50000
startup_delay_steps: 0.0
replicas_to_aggregate: 8
max_number_of_boxes: 4 # <------------------ change this line
unpad_groundtruth_tensors: false
fine_tune_checkpoint_type: "classification"
fine_tune_checkpoint_version: V2
}
...

After that, change the “max_total_detections” and “max_detections_per_class” values inside the “post_processing” block of the object detector:

post_processing {
batch_non_max_suppression {
score_threshold: 9.99999993922529e-09
iou_threshold: 0.6000000238418579
max_detections_per_class: 2 # <------------------ change this line
max_total_detections: 4 # <------------------ change this line
use_static_shapes: false
}
score_converter: SIGMOID
}

These changes are important, especially in my case, since only three objects and two classes appear in the image simultaneously. By decreasing the number of predictions, fewer iterations are required to eliminate overlapping bounding boxes through Non-Maximum Suppression (NMS). Therefore, if you have a limited number of classes and few objects appearing in the scene, it may be a good idea to change these hyperparameters.
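
To make the NMS point concrete, here is a toy sketch of greedy NMS in plain NumPy; it is not the implementation TensorFlow uses, but it shows why capping the number of candidate detections reduces the number of IoU comparisons performed.

# Toy sketch of greedy NMS: fewer candidate boxes mean fewer IoU comparisons.
import numpy as np

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_nms(boxes, scores, iou_threshold=0.6):
    """Keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # Every surviving candidate is compared against the kept box,
        # so the work shrinks as the number of detections shrinks.
        remaining = [i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold]
        order = np.array(remaining, dtype=int)
    return keep

# Usage (hypothetical): keep = greedy_nms(np.array(raw_boxes), np.array(raw_scores))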

Additional adjustments were applied individually, taking into account the specific architectural details of each object detection model.

Single Shot Multibox Detector (SSD)

It’s always a good idea to test different resolutions when working with object detection. In this project, I utilized two versions of the model, SSD320 and SSD640, with input image resolutions of 320x320 and 640x640 pixels, respectively.
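
If you want to change this resolution in the pipeline file itself, it lives in the “image_resizer” block; below is a minimal sketch, assuming an SSD-based pipeline and the “configs” dictionary loaded with config_util as shown earlier.

# Sketch: switch an SSD pipeline to a 320x320 input resolution.
# Assumes `configs` was loaded with config_util as in the earlier example.
resizer = configs["model"].ssd.image_resizer.fixed_shape_resizer
resizer.height = 320
resizer.width = 320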

For both models, one of the primary modifications was to reduce the depth of the Feature Pyramid Network (FPN) from 5 to 4 levels by removing the shallowest one. FPN is a powerful feature extraction mechanism that operates on feature maps of multiple sizes. However, the shallowest level works on the highest-resolution feature map and is mainly useful for detecting small objects, so for larger objects it might not be necessary. In other words, if the objects you are trying to detect are not too small, it’s probably a good idea to remove this level. To implement this change, adjust the “min_level” attribute from 3 to 4 within the “fpn” object:

...
feature_extractor {
type: "ssd_mobilenet_v2_fpn_keras"
depth_multiplier: 1.0
min_depth: 16
conv_hyperparams {
regularizer { ... }
initializer { ... }
activation: RELU_6
batch_norm {...}
}
use_depthwise: true
override_base_feature_extractor_hyperparams: true
fpn {
min_level: 4 # <------------------ change this line
max_level: 7
additional_layer_depth: 108 # <------------------ change this line
}
}
...

I also simplified the higher-resolution model (SSD640) by reducing the “additional_layer_depth” from 128 to 108. Likewise, I reduced the “multiscale_anchor_generator” from 5 to 4 levels for both models, as shown below:

...
anchor_generator {
multiscale_anchor_generator {
min_level: 4 # <------------------ change this line
max_level: 7
anchor_scale: 4.0
aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
scales_per_octave: 2
}
}
...

Finally, the network responsible for generating the bounding box predictions (“box_predictor”) had its number of layers reduced from 4 to 3. For SSD640, the box predictor depth was also decreased from 128 to 96, as shown below:

...
box_predictor {
weight_shared_convolutional_box_predictor {
conv_hyperparams {
regularizer { ... }
initializer { ... }
activation: RELU_6
batch_norm { ... }
}
depth: 96 # <------------------ change this line
num_layers_before_predictor: 3 # <------------------ change this line
kernel_size: 3
class_prediction_bias_init: -4.599999904632568
share_prediction_tower: true
use_depthwise: true
}
}
...

These simplifications were driven by the fact that we have a limited number of distinct classes with relatively straightforward patterns to detect. Therefore, it’s possible to reduce the number of layers and the depth of the model, since even with fewer feature maps we can still effectively extract the desired features from the images.

EfficientDet-D0

Concerning EfficientDet-D0, I reduced the depth of the Bidirectional Feature Pyramid Network (Bi-FPN) from 5 to 4 levels. Additionally, I decreased the number of Bi-FPN iterations from 3 to 2 and the number of feature map filters from 64 to 48. Bi-FPN is a sophisticated multi-scale feature fusion technique that can yield excellent results, but it comes at the cost of higher computational demands, which can be a waste of resources for simpler problems. To implement the aforementioned adjustments, simply update the attributes of the “bifpn” object as follows:

...
bifpn {
min_level: 4 # <------------------ change this line
max_level: 7
num_iterations: 2 # <------------------ change this line
num_filters: 48 # <------------------ change this line
}
...

Besides that, it’s also important to reduce the depth of the “multiscale_anchor_generator” in the same manner as we did with SSD. Lastly, I reduced the layers of the box predictor network from 3 to 2:

...
box_predictor {
weight_shared_convolutional_box_predictor {
conv_hyperparams {
regularizer { ... }
initializer { ... }
activation: SWISH
batch_norm { ... }
force_use_bias: true
}
depth: 64
num_layers_before_predictor: 2 # <------------------ change this line
kernel_size: 3
class_prediction_bias_init: -4.599999904632568
use_depthwise: true
}
}
...

Faster R-CNN

The Faster R-CNN model relies on the Region Proposal Network (RPN) and anchor boxes as its primary techniques. Anchors are centered at each position of a sliding window that moves over the last feature map of the backbone CNN. For each position, a classifier estimates the probability of a proposal containing an object, while a regressor adjusts the bounding box coordinates. To keep the detector translation-invariant while covering objects of different sizes and shapes, three scales and three aspect ratios are used for the anchor boxes, which increases the number of proposals evaluated at each position.

Although this is a shallow explanation, it’s apparent that this model is considerably more complex than the others due to its two-stage detection process. However, it’s possible to simplify it and enhance its speed while retaining its high accuracy.

To do so, the first important modification involves reducing the number of generated proposals from 300 to 50. This reduction is feasible because there are only a few objects present in the image simultaneously. You can implement this change by adjusting the “first_stage_max_proposals” property, as demonstrated below:

...
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer { ... }
initializer { ... }
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 50 # <------------------ change this line
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
...

After that, I eliminated the largest anchor box scale (2.0) from the model. This change was made because the hands and face keep a consistent size, since the interpreter stands at a fixed distance from the camera, so large anchor boxes are not useful for proposal generation. Additionally, I removed one of the anchor box aspect ratios, given that my objects have similar shapes with minimal variation across the dataset. These adjustments are shown below:

first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0] # <------------------ change this line
aspect_ratios: [0.5, 1.0] # <------------------ change this line
height_stride: 16
width_stride: 16
}
}

That said, it’s crucial to consider the size and aspect ratios of your target objects. This consideration allows you to eliminate less useful anchor boxes and significantly decrease the computational cost of the model.
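
A simple way to decide which scales and aspect ratios to keep is to inspect the ground-truth boxes of your own dataset. The sketch below is a rough, hypothetical example of that analysis; replace the sample values with the box sizes parsed from your annotations.

# Sketch: summarize ground-truth box shapes to guide anchor pruning.
import numpy as np

# Hypothetical (width, height) pairs in pixels; use your own annotations here.
gt_boxes = np.array([(64, 80), (70, 75), (40, 110), (52, 96)], dtype=float)

aspect_ratios = gt_boxes[:, 0] / gt_boxes[:, 1]        # width / height
box_sides = np.sqrt(gt_boxes[:, 0] * gt_boxes[:, 1])   # geometric-mean side length

print("aspect ratio p5/p50/p95:", np.percentile(aspect_ratios, [5, 50, 95]))
print("box side (px) p5/p50/p95:", np.percentile(box_sides, [5, 50, 95]))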

YoloV7

In contrast, minimal changes were applied to YoloV7 to preserve the architecture’s functionality. The main modification involved simplifying the CNN responsible for feature extraction in both the backbone and the model’s head. To achieve this, I decreased the number of kernels/feature maps in nearly every convolutional layer, resulting in the following model definition:

backbone:
# [from, number, module, args]
[[-1, 1, Conv, [22, 3, 1]], # 0
[-1, 1, Conv, [44, 3, 2]], # 1-P1/2
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [89, 3, 2]], # 3-P2/4
[-1, 1, Conv, [44, 1, 1]],
[-2, 1, Conv, [44, 1, 1]],
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [44, 3, 1]],
[[-1, -3, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [179, 1, 1]], # 11
[-1, 1, MP, []],
[-1, 1, Conv, [89, 1, 1]],
[-3, 1, Conv, [89, 1, 1]],
[-1, 1, Conv, [89, 3, 2]],
[[-1, -3], 1, Concat, [1]], # 16-P3/8
[-1, 1, Conv, [89, 1, 1]],
[-2, 1, Conv, [89, 1, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[[-1, -3, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [512, 1, 1]], # 24
[-1, 1, MP, []],
[-1, 1, Conv, [89, 1, 1]],
[-3, 1, Conv, [89, 1, 1]],
[-1, 1, Conv, [89, 3, 2]],
[[-1, -3], 1, Concat, [1]], # 29-P4/16
[-1, 1, Conv, [89, 1, 1]],
[-2, 1, Conv, [89, 1, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[[-1, -3, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [716, 1, 1]], # 37
[-1, 1, MP, []],
[-1, 1, Conv, [256, 1, 1]],
[-3, 1, Conv, [256, 1, 1]],
[-1, 1, Conv, [256, 3, 2]],
[[-1, -3], 1, Concat, [1]], # 42-P5/32
[-1, 1, Conv, [128, 1, 1]],
[-2, 1, Conv, [128, 1, 1]],
[-1, 1, Conv, [128, 3, 1]],
[-1, 1, Conv, [128, 3, 1]],
[-1, 1, Conv, [128, 3, 1]],
[-1, 1, Conv, [128, 3, 1]],
[[-1, -3, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [716, 1, 1]], # 50
]

# yolov7 head
head:
[[-1, 1, SPPCSPC, [358]], # 51
[-1, 1, Conv, [179, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[37, 1, Conv, [179, 1, 1]], # route backbone P4
[[-1, -2], 1, Concat, [1]],
[-1, 1, Conv, [179, 1, 1]],
[-2, 1, Conv, [179, 1, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [179, 1, 1]], # 63
[-1, 1, Conv, [89, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[24, 1, Conv, [89, 1, 1]], # route backbone P3
[[-1, -2], 1, Concat, [1]],
[-1, 1, Conv, [89, 1, 1]],
[-2, 1, Conv, [89, 1, 1]],
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [44, 3, 1]],
[-1, 1, Conv, [44, 3, 1]],
[[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [89, 1, 1]], # 75
[-1, 1, MP, []],
[-1, 1, Conv, [89, 1, 1]],
[-3, 1, Conv, [89, 1, 1]],
[-1, 1, Conv, [89, 3, 2]],
[[-1, -3, 63], 1, Concat, [1]],
[-1, 1, Conv, [179, 1, 1]],
[-2, 1, Conv, [179, 1, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[-1, 1, Conv, [89, 3, 1]],
[[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [179, 1, 1]], # 88
[-1, 1, MP, []],
[-1, 1, Conv, [179, 1, 1]],
[-3, 1, Conv, [179, 1, 1]],
[-1, 1, Conv, [179, 3, 2]],
[[-1, -3, 51], 1, Concat, [1]],
[-1, 1, Conv, [179, 1, 1]],
[-2, 1, Conv, [179, 1, 1]],
[-1, 1, Conv, [128, 3, 1]],
[-1, 1, Conv, [128, 3, 1]],
[-1, 1, Conv, [128, 3, 1]],
[-1, 1, Conv, [128, 3, 1]],
[[-1, -2, -3, -4, -5, -6], 1, Concat, [1]],
[-1, 1, Conv, [358, 1, 1]], # 101
[75, 1, RepConv, [179, 3, 1]],
[88, 1, RepConv, [358, 3, 1]],
[101, 1, RepConv, [716, 3, 1]],
[[102,103,104], 1, IDetect, [nc, anchors]], # Detect(P3, P4, P5)
]

As discussed earlier, removing some layers and feature maps from the detectors is typically a good approach for simpler problems, since feature extractors are initially designed to detect dozens or even hundreds of classes in diverse scenarios, requiring a more robust model to address these complexities and ensure high accuracy.

With these adjustments, I decreased the number of parameters from 36.4 million to just 14.1 million, a reduction of approximately 61%. Furthermore, I used an input resolution of 512x512 pixels instead of the 640x640 pixels suggested in the original paper.
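
To check the effect of this kind of pruning on your own variant, you can count the trainable parameters of the network built from the modified YAML. Below is a minimal PyTorch sketch; the “model” object in the usage note is assumed to be the instantiated YoloV7 network.

import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Count the trainable parameters of a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Usage (hypothetical): `model` is the YoloV7 network built from the modified YAML.
# print(f"{count_parameters(model) / 1e6:.1f}M trainable parameters")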

Additional tip

Another valuable tip when training object detectors is to use k-means clustering to adjust the anchor box proportions in an unsupervised way, fitting the anchors’ widths and heights to maximize the Intersection over Union (IoU) with the ground-truth boxes of the training set. By doing this, we better adapt the anchors to the given problem domain, which improves model convergence by starting from adequate aspect ratios. The figure below exemplifies this process, comparing three anchor boxes used by default in the SSD algorithm (in red) with three boxes whose proportions were optimized for the hand and face detection task (in green).

Comparing different bounding boxes' aspect ratios. Image by author.
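
A rough sketch of this clustering idea is shown below, using plain NumPy k-means over (width, height) pairs with a 1 - IoU distance, in the spirit of the anchor clustering popularized by the YOLO family; the sample boxes are hypothetical and the loop is simplified for clarity.

import numpy as np

def iou_wh(boxes, anchors):
    """IoU between boxes and anchors described only by (width, height)."""
    inter_w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = inter_w * inter_h
    union = boxes[:, 0:1] * boxes[:, 1:2] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=3, iters=100, seed=0):
    """Cluster (width, height) pairs using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assignment = np.argmax(iou_wh(boxes, anchors), axis=1)  # highest IoU = closest
        for j in range(k):
            members = boxes[assignment == j]
            if len(members) > 0:
                anchors[j] = np.median(members, axis=0)
    return anchors

# Hypothetical ground-truth box sizes (width, height) in pixels.
boxes = np.array([[60, 80], [65, 85], [40, 120], [45, 110], [90, 95]], dtype=float)
print(kmeans_anchors(boxes, k=3))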

Showing the results

I trained and evaluated each detector using my own dataset, called the Hand and Face Sign Language (HFSL) dataset, considering the mAP and the Frames Per Second (FPS) as the main metrics. The table below provides a summary of the results, with values in parentheses representing the FPS of the detector before implementing any of the described optimizations.

Object detection results.

We can observe that most of the models showed a significant reduction in inference time while maintaining a high mAP across various levels of Intersection over Union (IoU). More complex architectures, such as Faster R-CNN and EfficientDet, increased the FPS on GPU by 200.80% and 231.78%, respectively. Even SSD-based architectures showed a huge increase in performance, with 280.23% and 159.59% improvements for the 640 and 320 versions, respectively. Considering YoloV7, although the FPS difference is most noticeable on the CPU, the optimized model has 61% fewer parameters, reducing memory requirements and making it more suitable for edge devices.
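
For reference, FPS figures like these can be estimated with a simple timing loop around the inference call. The sketch below is a generic, hypothetical benchmarking helper, not the exact code used to produce the table above.

import time

def measure_fps(detect, image, warmup=10, runs=100):
    """Average FPS of a `detect(image)` callable (hypothetical helper)."""
    for _ in range(warmup):   # warm-up runs exclude lazy initialization costs
        detect(image)
    start = time.perf_counter()
    for _ in range(runs):
        detect(image)
    return runs / (time.perf_counter() - start)

# Usage (hypothetical): fps = measure_fps(lambda img: model(img), dummy_image)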

Conclusion

There are instances when computational resources are limited, or tasks must be executed quickly. In such scenarios, we can further optimize the open-source object detection models to find a combination of hyperparameters that can reduce the computational requirements without affecting the results, thereby offering a suitable solution for diverse problem domains.

I hope this article has helped you make better choices when training your object detectors, resulting in significant efficiency gains with minimal effort. If some of the concepts discussed here were unclear, I recommend diving deeper into how your object detection architecture works. Additionally, consider experimenting with different hyperparameter values to further streamline your models for the specific problem you are addressing!
