Review on Few-Shot Object Detection
Introduction and overview of few-shot object detection
Deep learning solutions for classification and object detection are state of the art in computer vision, and that's not news anymore. Despite the high accuracy and speed of recent SOTA algorithms, there is one big issue: a well-performing solution needs a huge amount of data. In addition, the data must be annotated, which requires a lot of manual work. This motivated the development of several new paradigms such as self-supervised learning and few-shot learning.
Recent progress in few-shot classification has significantly improved performance on the “learning to learn” problem; few-shot object detection (FSOD), however, still has large room to grow and improve.
A large amount of research has been done on this topic. Yet a huge performance gap remains compared to classic object detection or few-shot classification.
Let’s now dive into FSOD more profoundly and understand:
- The problem definition of few-shot object detection
- A benchmark of 3 recent SOTA algorithms
- A review and comparison of the 3 above-mentioned papers
Few-Shot Object Detection
Few-shot object detection aims to generalize to novel objects using limited supervision and few annotated samples. Let (S1, …, Sn) be a set of support classes and Q be a query image containing multiple instances and background. Given (S1, …, Sn) and Q, the model aims to detect and localize all objects from the support set that appear in Q. During training, most FSOD approaches divide the classes into two non-overlapping sets: base and novel. The training dataset contains only base classes, which are used to train the baseline model. The model is then fine-tuned on a combined dataset of base and novel classes. The last stage is testing on a dataset composed of only novel classes.
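The base/novel protocol above can be sketched in a few lines. This is a generic illustration, not code from any of the reviewed papers; the class count and split size are arbitrary examples.

```python
import random

def split_classes(all_classes, n_novel, seed=0):
    """Partition class IDs into disjoint base and novel sets,
    following the base/novel FSOD protocol described above."""
    rng = random.Random(seed)
    novel = set(rng.sample(sorted(all_classes), n_novel))
    base = set(all_classes) - novel
    return base, novel

# e.g. a PASCAL-VOC-style setup: 20 classes, 5 held out as novel
base, novel = split_classes(range(20), n_novel=5)
```

Base training uses only `base`; fine-tuning sees `base` plus a few shots of `novel`; evaluation is on `novel` only.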
This is the most popular problem definition for few-shot object detection; it may differ slightly from paper to paper. However, the main idea is common to all of this research: create a model which can find objects of new, never-seen classes in a class-agnostic way.
Benchmark
Two popular few-shot object detection tasks are used for the benchmark: MS-COCO 10-shot and MS-COCO 30-shot. Let's look at the top 3 models for each of these tasks:
Depending on the task, these 3 algorithms outperform the others; however, there is a huge accuracy gap between classic object detection and few-shot object detection. Let's dive into each of them to understand their structures and differences.
DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection
As the name of the paper implies, the model modifies Faster R-CNN for few-shot object detection. Faster R-CNN consists of 3 blocks: a “shared convolutional backbone for extracting generalized features, Region Proposal Network (RPN) for generating class-agnostic proposals, and a task-specific RCNN head for performing class-relevant classification and localization” [1].
To adapt Faster R-CNN to the few-shot setting, the authors try to solve two problems:
- The problem of multi-task learning: the RCNN head of the model is responsible for classification, in other words, what to look at, whilst the RPN head aims to understand where to look; it solves the localization problem. “First head needs translation invariant features whereas the localization head needs translation covariant features” [1]. Jointly optimizing these two heads can lead to worse results in the FSOD case, where each individual task has only a small amount of data.
- The problem of the shared backbone: as we can see in the image above, Faster R-CNN has one shared backbone for the 2 heads. This works very well in ordinary object detection, but in few-shot settings accuracy can decrease when fine-tuning on novel classes. Foreground-background confusion can arise: background during base training can become foreground in the novel fine-tuning phase. As a result, gradients from the RPN cause the shared backbone to overfit and the model can't converge.
To solve these problems, the authors suggest changing the model by adding two modules: Gradient Decoupled Layers (GDL) and Prototypical Calibration Block (PCB) to improve object detection in few-shot settings. Let’s look at each of these modules closely.
As we can see in Image 2, two Gradient Decoupled Layers (GDL) are placed after the backbone. “GDL does an affine transformation, which is parameterized by learnable channel-wise weights. During the forward propagation, the features from the shared backbone are transformed into different feature spaces through Aᵣₚₙ and Aᵣ𝒸ₙₙ”[1].
A constant λ ∈ [0, 1] is defined, which sets the decoupling degree between the backbone and the RPN and RCNN during backpropagation. The decoupling degree is adjusted by λᵣₚₙ and λᵣ𝒸ₙₙ, which, roughly speaking, decide how much the gradients affect the preceding layer. To stop updates from the RPN or RCNN we can set λ = 0. Otherwise, we can scale the gradients by setting λ larger than 0. In other words, λᵣₚₙ and λᵣ𝒸ₙₙ decide the individual contribution of the RPN and RCNN to the shared backbone. This solves the second problem, that of the shared backbone.
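The core mechanism (forward affine transform, gradient scaled by λ on the way back) can be sketched with a custom autograd function. This is a minimal PyTorch illustration of the idea, not the paper's implementation; the bias term of the affine transform is omitted and the class names are our own.

```python
import torch

class _ScaleGradient(torch.autograd.Function):
    """Identity-like in the forward pass; scales gradients by lam on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output * ctx.lam, None  # None: no gradient w.r.t. lam

class GDL(torch.nn.Module):
    """Sketch of a Gradient Decoupled Layer: learnable channel-wise
    weights in the forward pass, gradient scaling toward the backbone."""
    def __init__(self, channels, lam):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(channels))
        self.lam = lam

    def forward(self, x):                      # x: (N, C, H, W)
        x = x * self.weight.view(1, -1, 1, 1)  # channel-wise affine (bias omitted)
        return _ScaleGradient.apply(x, self.lam)

gdl = GDL(channels=4, lam=0.1)
x = torch.ones(1, 4, 2, 2, requires_grad=True)
gdl(x).sum().backward()
# x.grad is scaled by lam: setting lam=0 would stop this head's
# updates from reaching the shared backbone entirely
```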
The next contribution of the paper is the Prototypical Calibration Block (PCB). “Classification needs translation invariant features whereas localization needs translation covariant features. Thus the localization branch may force the backbone to gradually learn translation covariant properties, which potentially downgrades the performance of the classifier. PCB consists of a strong classifier from ImageNet pre-trained model, a RoIAlign layer, and a prototype bank”[1].
First, PCB calculates prototypes from the support set. Let's define a few-shot problem with an M-way K-shot setting. PCB extracts feature maps from the support-set images. After aligning these feature maps with the ground-truth boxes using RoIAlign, it produces M×K instance representations. It then condenses these representations into a prototype bank.
For each object proposal from the fine-tuned few-shot detector, PCB first performs RoIAlign on the predicted box and generates object features. It then calculates the cosine similarity sᵢᶜᵒˢ between this object feature and the support-set prototypes. This similarity is used as a score for the predicted category. To calculate the final classification score, the model performs a weighted aggregation between the score given by PCB (sᵢᶜᵒˢ) and the score from the fine-tuned few-shot detector (sᵢ).
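A rough sketch of the prototype bank and the score fusion, assuming class prototypes are the per-class mean of the K support features and a simple convex combination with a hypothetical weight `alpha` (the paper's exact fusion weight may differ):

```python
import numpy as np

def build_prototype_bank(support_feats):
    """support_feats: dict class_id -> (K, C) RoIAligned instance features.
    Condenses the K shots of each class into a single prototype (mean)."""
    return {c: f.mean(axis=0) for c, f in support_feats.items()}

def fused_score(obj_feat, detector_scores, bank, alpha=0.5):
    """Weighted aggregation of the detector score s_i and the PCB cosine
    score s_i_cos. alpha is a hypothetical fusion weight for illustration."""
    out = {}
    for c, proto in bank.items():
        s_cos = obj_feat @ proto / (np.linalg.norm(obj_feat) * np.linalg.norm(proto) + 1e-8)
        out[c] = alpha * detector_scores[c] + (1 - alpha) * s_cos
    return out

# toy example: class 0's prototype matches the object feature, class 1's doesn't
bank = build_prototype_bank({0: np.ones((3, 8)), 1: -np.ones((3, 8))})
scores = fused_score(np.ones(8), {0: 0.9, 1: 0.1}, bank)
```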
There are no shared “parameters between the few-shot detector and PCB module so that the PCB can not only preserve the quality of the classification-aimed translation invariance feature but also better decouple the classification task and regression task within the RCNN. Furthermore, since the PCB module is offline without any further training, it can be plug-and-play and easily equipped to any other architecture to build stronger few-shot detectors”[1].
Dual-Awareness Attention for Few-Shot Object Detection
The authors of this paper address two other problems:
- The problem of support-feature quality: in FSOD we have limited support information about novel objects, so we need high-quality features for good results. To address this, the authors try to suppress the influence of noise.
- The problem of correlation: again, with a limited number of examples, it is hard to obtain a high correlation between the support and query sets. Here, the goal is to improve object-wise correlations.
The authors present a novel mechanism called Dual-Awareness Attention (DAnA) which combines two new modules called Background Attenuation (BA) and Cross-Image Spatial Attention (CISA).
To suppress the influence of noise, the Background Attenuation (BA) block was proposed. The following image shows the structure of this module. First, it reshapes the support feature map and transforms it with a learnable linear matrix Wₑ.
The attention map is calculated from each pixel of the support feature map and Wₑ. Next, the attention map is combined with the support features, which highlights the most important ones. Finally, a leaky ReLU is used for a softer attention strategy. This transformation yields more discriminative support features.
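The flow of the BA block (reshape, linear transform, attention, leaky ReLU, reweighting) might look roughly like this. The exact attention formulation here is our assumption; only the overall flow follows the paper's description.

```python
import numpy as np

def background_attenuation(S, We, slope=0.01):
    """Rough sketch of the BA block.
    S:  (H*W, C) reshaped support feature map
    We: (C, C) learnable linear matrix
    The per-pixel scoring against global context is an assumption."""
    E = S @ We                                     # linear transform of support features
    attn = E @ E.mean(axis=0)                      # per-pixel attention score (assumed form)
    attn = np.where(attn > 0, attn, slope * attn)  # leaky ReLU: softer attention
    return S * attn[:, None]                       # reweight, attenuating background pixels

S = np.random.default_rng(0).standard_normal((6, 4))  # 6 spatial positions, 4 channels
out = background_attenuation(S, np.eye(4))
```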
The next block presented in the paper is called the Cross-Image Spatial Attention Block (CISA). The goal of this block is to help the model focus on the most representative parts of the objects to determine intra-class similarity.
“The core idea of Cross-Image Spatial Attention (CISA) is to adaptively transform each support feature map into query position-aware (QPA) support vectors that represent specific information of a support image”[2].
Let Z denote the support feature map processed by the BA block and X the query feature map. CISA transforms X and Z into query and key embeddings 𝑸 and 𝓚 using the weight matrices W𝓺 and Wₖ.
Then it calculates similarity scores between query and support:
where µ𝑸 and µ𝓚 are the embedding values averaged over all pixels.
For the next step, self-attention is added as authors assume that attention should be based not only on query-support correlations but also on the support image itself.
𝛽 is a constant coefficient.
The final transformation is a calculation of query position-aware vectors by multiplying the results described above with position-aware support feature map Z. The result is ready to be passed to object detection.
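The CISA pipeline described above can be sketched as follows. The mean subtraction mirrors µ𝑸 and µ𝓚 from the paper; the softmax normalization and scaling are our assumptions, and the self-attention term with its β coefficient is omitted for brevity.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cisa(X, Z, Wq, Wk):
    """Sketch of CISA.
    X: (Nq, C) query feature map (flattened), Z: (Ns, C) BA-processed support features.
    Returns one query-position-aware (QPA) support vector per query position."""
    Q = X @ Wq
    K = Z @ Wk
    Q = Q - Q.mean(axis=0)                              # center by mu_Q
    K = K - K.mean(axis=0)                              # center by mu_K
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=1)  # query-support similarity (assumed norm.)
    return A @ Z                                        # QPA support vectors

rng = np.random.default_rng(0)
qpa = cisa(rng.standard_normal((5, 4)), rng.standard_normal((3, 4)), np.eye(4), np.eye(4))
```

The output can then be combined with the query feature map and passed to the detector, as the next paragraph describes.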
DAnA can be combined with standard object detection frameworks: the CISA output is combined with the query feature map and sent to modules such as the region proposal network. The authors experimented with Faster R-CNN and RetinaNet.
Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation
Most few-shot object detection frameworks combine meta-learning techniques with object detection models, and Meta-DETR does the same. Most approaches are based on Faster R-CNN or similar object detection algorithms. Despite the achievements of these models, the authors of this paper see two major issues, which they address with their solution.
- The problem of region proposals: proposal generation works well with a large number of images; however, under a few-shot setup we have only a limited number of examples per class. Moreover, we try to generalize to novel classes, which makes it even harder to get high-quality region proposals.
- The problem of poorly defined meta-learning tasks: each support class is treated independently, which makes it hard to distinguish similar classes such as bikes and motorcycles, or cows and sheep (Image 5).
Image 6 describes the whole Meta-DETR algorithm. First, a feature extractor with shared weights is applied to the query and support images.
To solve the problem of distinguishing highly correlated, similar classes, the authors propose a new module called the Correlational Aggregation Module (CAM). It aggregates query features with support classes for class-agnostic prediction. The main difference from other methods is that it can aggregate multiple support classes simultaneously, which helps capture inter-class correlation and reduce misclassification. CAM first matches the query features against the set of support classes. It then maps the support classes to a set of pre-defined task encodings that differentiate them in a class-agnostic manner.
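A very rough sketch of the CAM idea, matching the query against all support classes at once and re-expressing class identity through pre-defined encodings. The additive aggregation and soft assignment here are our simplification of the paper's design, not its actual architecture.

```python
import numpy as np

def cam(query, prototypes, task_encodings):
    """Sketch of the Correlational Aggregation Module.
    query: (N, C) query features; prototypes: (M, C), one per support class,
    matched simultaneously; task_encodings: (M, C) pre-defined class-agnostic
    codes. The exact aggregation is an assumption for illustration."""
    sim = query @ prototypes.T                     # match against all M classes at once
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))
    sim = sim / sim.sum(axis=1, keepdims=True)     # soft assignment over classes
    return query + sim @ task_encodings            # class identity via task encodings

rng = np.random.default_rng(0)
agg = cam(rng.standard_normal((5, 8)), rng.standard_normal((3, 8)), rng.standard_normal((3, 8)))
```

Because class identity is carried by the interchangeable task encodings rather than by the classes themselves, the downstream detector's prediction stays class-agnostic.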
CAM outputs support-aggregated query features, which then become the input of a transformer-based object detector. The recently proposed Deformable DETR is used for detection. DETR uses transformers for one-stage object detection, so there are no region proposals in the algorithm. This solves the other issue raised in the paper: poor-quality region proposals.
The Hungarian loss, the same as in Deformable DETR, was used for the model. Additionally, a cosine-similarity cross-entropy loss is applied after CAM to classify class prototypes.
The training procedure for the algorithm is the same as described above. First, the base dataset is used for full training, then novel classes are used with base classes for fine-tuning.
Conclusion
The authors of each paper reviewed here try to solve the few-shot object detection problem with a new, creative approach. As mentioned above, there is a huge gap between the accuracies of classic object detection and FSOD. However, this paradigm has huge potential, and hopefully one day we will have FSOD algorithms as effective as classical object detectors.
References
[1] Qiao, L., Zhao, Y., Li, Z., Qiu, X., Wu, J., & Zhang, C. DeFRCN: Decoupled Faster R-CNN for Few-Shot Object Detection (2021). | GitHub
[2] Chen, T., Liu, Y., Su, H., Chang, Y., Lin, Y., Yeh, J., & Hsu, W. H. Dual-Awareness Attention for Few-Shot Object Detection (2021). | GitHub
[3] Zhang, G., Luo, Z., Cui, K., & Lu, S. Meta-DETR: Image-Level Few-Shot Object Detection with Inter-Class Correlation Exploitation (2021). | GitHub