Training DETR on Your Own Dataset

What is DETR? What sets it apart in the world of object detection algorithms? And how do you train it with your own data?

Oliver Gyldenberg Hjermitslev
Towards Data Science


Photo by note thanun on Unsplash

DETR — or DEtection TRansformer — is Facebook’s newest addition to the market of available deep learning-based object detection solutions. Very simply, it utilizes the transformer architecture to generate predictions of objects and their position in an image.

What is DETR?

Source: "End-to-End Object Detection with Transformers" (Carion et al., 2020), https://arxiv.org/abs/2005.12872

DETR is a joint Convolutional Neural Network (CNN) and Transformer with a feed-forward network as a head. This architecture allows the network to reliably reason about object relations in the image using the powerful multi-head attention mechanism inherent in the Transformer, operating on the features extracted by the CNN.

Source: https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_demo.ipynb

Simply put, the network estimates, for any point in the image, its relation to the rest of the image. Given a sample point, the self-attention maps indicate the likelihood of the surrounding areas being positively related to that point. In the above example, this means that despite there being two cats and two remotes, the network's attention can tell the different instances apart by looking at each area in relation to the others.

The image above shows four sample points and their self-attention maps. These indicate the areas the network assumes are related to each point. In the end, the attention heads are gathered into a solution that encompasses a number of bounding boxes and matching class probability distributions. This solution is based on a number of object queries, or learned positional embeddings, which provide these sample points automatically. Each output embedding is passed through a shared feed-forward network which estimates either the aforementioned detection (bounding box and class probability) or "no object". The result of the process is shown below:
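The decoding step can be sketched as follows. This is a minimal illustration rather than the repository's actual post-processing code: it assumes logits of shape (num_queries, num_classes + 1) with the last index reserved for "no object", and boxes as normalized (cx, cy, w, h), which matches how DETR's outputs are commonly described.

```python
import torch

def decode_predictions(pred_logits, pred_boxes, threshold=0.7):
    """Keep queries whose best real-class probability exceeds `threshold`.

    pred_logits: (num_queries, num_classes + 1), last index is "no object".
    pred_boxes:  (num_queries, 4) as normalized (cx, cy, w, h).
    """
    probs = pred_logits.softmax(-1)[:, :-1]  # drop the "no object" column
    scores, labels = probs.max(-1)
    keep = scores > threshold

    # Convert (cx, cy, w, h) to (x0, y0, x1, y1), still normalized to [0, 1].
    cx, cy, w, h = pred_boxes[keep].unbind(-1)
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)
    return scores[keep], labels[keep], boxes

# Toy example: 3 object queries, 2 real classes + "no object".
logits = torch.tensor([[4.0, 0.0, 0.0],   # confident in class 0
                       [0.0, 0.0, 4.0],   # confident in "no object"
                       [0.0, 4.0, 0.0]])  # confident in class 1
boxes = torch.tensor([[0.5, 0.5, 0.2, 0.2]] * 3)
scores, labels, kept = decode_predictions(logits, boxes)
print(labels)  # the "no object" query is filtered out
```

The query that puts its mass on the "no object" class is dropped; the remaining two become final detections.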

How does DETR differ from other Object Detectors?

All object detection algorithms have their pros and cons; R-CNN (and its derivatives) is a two-stage algorithm, requiring both a region proposal computation and a detection computation. More recent advances such as SSD (Single-Shot Multibox Detector) and YOLO (You Only Look Once) are called single-stage object detectors because they compute their estimates in a single forward pass. The specifics of how they achieve this differ, but one thing these networks have in common is the requirement of "priors".

Priors are anchor boxes, scales, grid sizes, duplicate-detection algorithms such as non-maximum suppression, and similar techniques used to reduce the dimensionality of the problem. They ultimately serve the purpose of downsampling the input, or in some cases of avoiding the computationally expensive region proposal step. By applying these techniques, the problem shifts from a computational issue to a burden on the human designer, who has to estimate and provide these priors before training can begin.
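To make "priors" concrete, here is a hypothetical sketch of the anchor generation used by detectors like Faster R-CNN. The scales and aspect ratios are illustrative values a designer would have to choose by hand before training, which is exactly the kind of decision DETR does away with.

```python
import math

def make_anchors(scales, aspect_ratios):
    """Generate (w, h) anchor shapes for one feature-map cell.

    Each (scale, ratio) pair yields one box with area scale**2 and
    width / height == ratio.
    """
    anchors = []
    for s in scales:
        for r in aspect_ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            anchors.append((w, h))
    return anchors

# 3 scales x 3 ratios = 9 anchors per location, as in Faster R-CNN.
anchors = make_anchors(scales=[128, 256, 512], aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 9
```

Every one of these numbers is a prior: change the typical object sizes in your dataset and you have to revisit them all.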

In contrast, DETR is a direct set-prediction solution to the object detection problem. Based on the aforementioned learned positional embeddings, the network provides a number of estimated bounding boxes and class scores. DETR is not free from prior information, however. For example, the number of estimated bounding boxes before confidence thresholding needs to be manually set before training can begin, and should be "significantly larger than the typical number of objects in an image".
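That fixed set of predictions is trained with a bipartite (Hungarian) matching between predictions and ground-truth objects, so each object is assigned exactly one query. A minimal sketch, with a made-up cost matrix standing in for DETR's combination of class probability, L1 box distance, and generalized IoU:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost of assigning each of 4 predicted boxes (rows) to each of
# 2 ground-truth objects (columns); lower is better.
cost = np.array([[0.1, 0.9],
                 [0.8, 0.2],
                 [0.7, 0.6],
                 [0.5, 0.4]])

# Hungarian matching: picks the prediction-to-object pairing with minimal
# total cost. Unmatched predictions are trained toward "no object".
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))
```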

How do I train DETR for myself?

DETR usually requires a very intensive training schedule. In the original paper, the authors train their model using 16 Nvidia V100 GPUs over 300 epochs, totaling over 3 days of training time (and about 4000 USD at AWS). Feel free to try this yourself, but that is not what this section is about. We are interested in fine-tuning a pretrained DETR model on a personal dataset, potentially with a different number of classes than COCO.

Our DETR fork (found here: https://github.com/aivclab/detr) allows us to do this by changing a few key elements. First, we change the model-building structure to enable any number of classes. Then, we supply the pretrained model's weights (without the class embeddings) to the builder. Additionally, we change the maximum width of images in the random transform to 800 pixels. This should allow for training on most GPUs, but it is advisable to change it back to the original 1333 if your GPU can handle it. By default, we reduce the learning rate for the head and the learning rate for the backbone to 1e-5 and 1e-6 respectively, but you can play around with this yourself.
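The weight-surgery part of this can be sketched as follows. The state dict here is a small stand-in for a real checkpoint, and `param_groups` is a hypothetical helper, not a function from the fork:

```python
import torch

# Stand-in for a checkpoint's state dict; in practice you would load the
# official pretrained DETR weights instead.
state_dict = {
    "backbone.0.body.conv1.weight": torch.zeros(64, 3, 7, 7),
    "class_embed.weight": torch.zeros(92, 256),  # COCO: 91 class slots + "no object"
    "class_embed.bias": torch.zeros(92),
}

# Drop the classification head so the model can be rebuilt with a different
# number of classes; the remaining weights are then loaded with strict=False.
for key in ["class_embed.weight", "class_embed.bias"]:
    state_dict.pop(key, None)

def param_groups(model, lr=1e-5, lr_backbone=1e-6):
    """Separate learning rates for the head and the CNN backbone."""
    return [
        {"params": [p for n, p in model.named_parameters() if "backbone" not in n], "lr": lr},
        {"params": [p for n, p in model.named_parameters() if "backbone" in n], "lr": lr_backbone},
    ]
```

The param groups would then be handed to `torch.optim.AdamW`, which is the optimizer the original DETR training script uses.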

Depending on the number of samples you have in your dataset, it may be preferable to retrain the model from scratch. The original COCO dataset has over 200,000 annotated images with over 1.5 million instances split over 80 classes, so if your dataset is comparable it may improve your training schedule to set pretrained = False in the notebook.

The notebook is found here: https://github.com/aivclab/detr/blob/master/finetune_detr.ipynb

Below, the sample parameters show how simple it is to finetune such a model yourself:

Finetuning DETR is easy. Source: https://github.com/aivclab/detr/blob/master/finetune_detr.ipynb
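As a rough sketch, a fine-tuning run from the command line might look like the following. The flag names follow the original repository's main.py, and --num_classes is assumed here to be the fork's addition, so check the notebook for the exact interface:

```shell
# Hypothetical invocation; paths and values are placeholders.
python main.py \
  --dataset_file coco \
  --coco_path /path/to/your/dataset \
  --num_classes 4 \
  --lr 1e-5 \
  --lr_backbone 1e-6 \
  --epochs 50 \
  --output_dir outputs
```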

Conclusion

DETR is an exciting step forward in the world of object detection. It marks a significant reduction in priors and a simple, easy-to-configure network architecture. It outperforms Faster R-CNN in most tasks without much specialized additional work, though it is still slower than comparable single-stage object detectors. Its simple structure makes it easy to recreate, experiment with, and finetune from the strong baseline provided by the researchers.

I work at the Alexandra Institute, a Danish non-profit company specializing in state-of-the-art IT solutions. Here in the Visual Computing Lab, we focus on utilizing the newest in computer vision and computer graphics research. We are currently exploring techniques to allow smaller companies and individuals to get started with deep learning. We are always open to collaboration!
