Most of the great recent machine learning papers are based on transformers. They are powerful, effective models that have proven to be worth the time and effort spent optimizing them. Recently, Facebook published a new paper, DETR, that uses transformers to match (and on some metrics outperform) the highly optimized, state-of-the-art Faster R-CNN in object detection.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components. The new model is conceptually simple and does not require a specialized library. We show that it significantly outperforms competitive baselines. Training code and pretrained models are available at https://github.com/facebookresearch/detr.
Source: arXiv
The paper examines the weaknesses of current object detection solutions, such as the post-processing steps needed to collapse near-duplicate predictions [1], and proposes novel solutions using the attention mechanism offered by transformers. It also introduces an end-to-end framework without any customized layers, along with code for ease of use and reproduction.
We have all seen the impact transformers have had on NLP and image classification, so it’s exciting to see them applied to object detection, which is generally more challenging than classic image recognition. The power of the self-attention mechanism helps relax the constraints of current object detection solutions. Without further ado, let’s dive into how this model works!
Prerequisites
Before we start, I want to cover some fundamental ideas that are quite interesting, essential to understanding this model, and also quite useful in general.
A bipartite graph (or bigraph) is a graph whose vertices can be divided into two disjoint and independent sets U and V such that every edge connects a vertex in U to one in V. Vertex sets U and V are usually called the parts of the graph.
Source: Wikipedia
The first one is bipartite matching. This stems from graph theory, one of the fundamental building blocks of computer science. A bipartite matching is a set of edges in a bipartite graph chosen such that no two edges share a vertex. If you are not that interested in graph theory, just understand that graph matching algorithms can be quite powerful in neural networks.
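To make this concrete, here is a tiny toy example using SciPy’s linear_sum_assignment, one standard solver for this kind of assignment problem. The cost values are made up purely for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows are predictions, columns are ground-truth objects.
# Entry (i, j) is the cost of matching prediction i to ground truth j
# (the numbers are made up purely for illustration).
cost = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.9, 0.1],
])

# Find the one-to-one assignment with minimum total cost: no two
# predictions may share a ground-truth object, and vice versa.
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))   # pairs (0, 1), (1, 0), (2, 2)
print(cost[pred_idx, gt_idx].sum())  # total cost: 0.4
```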
The second interesting concept is anchor boxes. In object detection, there are mainly two types of models: single-stage and two-stage. Two-stage models produce a set of region proposals from an image (essentially sub-images) using an algorithm such as Selective Search, and then attempt to classify each proposal using a classic CNN. Single-stage models, on the other hand (which are generally faster), use a different concept called anchor boxes.
Anchor boxes are, in a way, the opposite of region proposals: they are a set of predefined bounding boxes of certain heights and widths that the network uses to approximate the objects it is looking for. Although this might sound more constrained, anchor boxes have been shown to be much quicker than region proposals while remaining just as effective.
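As a rough sketch of how such boxes are generated (the stride, scales, and aspect ratios below are arbitrary illustrative choices, not those of any particular detector):

```python
import itertools
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchor boxes for every feature-map cell.

    The scales/ratios here are illustrative; real detectors tune them
    for their dataset.
    """
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        # Center of this feature-map cell, in input-image pixels.
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        for s, r in itertools.product(scales, ratios):
            # Same area for each scale, different aspect ratio w/h = r.
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            anchors.append((cx, cy, w, h))
    return np.array(anchors)

# Even a small 25x38 feature map with stride 32 already yields 8550 anchors.
print(make_anchors(25, 38, stride=32).shape)  # (8550, 4)
```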
The final concept is non-maximum suppression (NMS), which is quite fundamental to object detection. NMS is used to select one bounding box out of many overlapping boxes. Think about it: if you are trying to find a cat, there are going to be several boxes covering that cat at slightly different positions and scales. A high-quality object detection pipeline therefore uses NMS, which computes the Intersection over Union (IoU) between boxes and keeps only the best box among those that overlap beyond a defined threshold.
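Here is a bare-bones NumPy version of the idea, purely for intuition; real pipelines use optimized implementations such as torchvision.ops.nms:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop overlapping ones, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Drop every remaining box that overlaps the best one too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```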
Okay, now that we are done with the theory, let’s get to it.
The end-to-end pipeline

The model starts with a CNN to extract features, feeds those features to a transformer, and then passes the transformer’s output to a feed-forward network (FFN) that makes the final predictions. I have been reviewing a lot of recent popular papers, and this pipeline is being adopted quite frequently; for instance, it was also recently used here:
Facebook & NYU reduce Covid hospital strain – Covid Prognosis Via Self-Supervised Learning
Our loss produces an optimal bipartite matching between predicted and ground truth objects, and then optimize object-specific (bounding box) losses.
Source: arXiv
Their main loss function is based on bipartite matching, and more specifically on the Hungarian algorithm, a "combinatorial optimization algorithm that solves the assignment problem in polynomial time" [2]. I could write a full post about the Hungarian algorithm (but I am not going to). One of the main benefits of this approach is that it produces a set of predictions rather than a list, meaning the predicted boxes are unique, which saves a lot of post-processing time. It also allows the model to predict boxes directly rather than predicting them relative to some initial guesses (with the addition of some regularization, of course).
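To sketch how this could look in practice, here is a simplified matcher: build a cost matrix between every prediction and every ground-truth object, then solve the assignment. This is only an illustration, not the paper’s exact matcher, which also weights these terms and adds a generalized-IoU cost:

```python
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """Simplified DETR-style matching: classification cost + L1 box cost.

    pred_logits: [num_queries, num_classes + 1], pred_boxes: [num_queries, 4]
    gt_labels:   [num_objects],                  gt_boxes:   [num_objects, 4]
    """
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                    # high prob -> low cost
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # pairwise L1 distance
    cost = cost_bbox + cost_class                       # [num_queries, num_objects]
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx  # one unique prediction per ground-truth object
```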
As for the first step, a CNN is used to drive down the dimensions of the image to the most essential features, producing what is called an activation map. A 1×1 convolution then reduces the channel dimension further, and after collapsing the spatial dimensions of the feature map we get the sequence that the transformer expects as input. Before moving to the transformer, this sequence is also supplemented with positional encodings (to preserve the spatial structure of the image).
As for the transformer, for the sake of efficiency, they adjust the classic transformer decoder to decode all objects in parallel rather than sequentially. This is possible because the transformer is permutation-invariant and because positional information has already been injected, so the structure of the image is preserved even within a parallel decoding pipeline.
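Putting the last two paragraphs together, here is a condensed sketch in the spirit of the paper’s minimal PyTorch demo; the learned positional encodings and several other details are simplified here (see the official notebook for the real thing):

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MiniDETR(nn.Module):
    """A condensed DETR-style model: CNN backbone -> transformer -> FFN heads."""

    def __init__(self, num_classes, hidden_dim=256, num_queries=100):
        super().__init__()
        # CNN backbone: keep everything up to the final feature map.
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # 1x1 convolution to reduce the channel dimension for the transformer.
        self.conv = nn.Conv2d(2048, hidden_dim, kernel_size=1)
        self.transformer = nn.Transformer(hidden_dim, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6)
        # Simplified learned positional encodings (assumes feature maps of at
        # most 50x50 locations) and learned object queries that let the
        # decoder predict all objects in parallel.
        self.pos_embed = nn.Parameter(torch.rand(50 * 50, hidden_dim))
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))
        # Prediction heads: class (+1 for "no object") and box coordinates.
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, images):
        x = self.conv(self.backbone(images))     # [B, hidden, H, W]
        B, C, H, W = x.shape
        seq = x.flatten(2).permute(2, 0, 1)      # [H*W, B, hidden] sequence
        seq = seq + self.pos_embed[:H * W].unsqueeze(1)
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)
        out = self.transformer(seq, tgt)         # [num_queries, B, hidden]
        return self.class_head(out), self.bbox_head(out).sigmoid()
```

Each of the query slots either predicts an object or the special "no object" class, which is exactly why no NMS step is needed afterwards.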
Using self- and encoder-decoder attention over these embeddings, the model globally reasons about all objects together using pair-wise relations between them, while being able to use the whole image as context.
Source: arXiv
This is one of my favorite optimizations introduced in this paper. Not only are they using a modern model such as the transformer, but they also managed to improve it so that they can break the image down and process it in parallel without losing its structure.
Results

I don’t want to talk too much about the results since it’s all data that can easily be checked in the paper, but I think it’s safe to say that they are quite good. They have also implemented the paper in PyTorch and provided the code [here](https://colab.research.google.com/github/facebookresearch/detr/blob/colab/notebooks/detr_demo.ipynb). The code doesn’t seem long or complicated; they actually provide a demo there in less than 100 lines of code!
Final thoughts
One of the best things about recent object detection models is that they (mostly) don’t require you to write code to train your models: you can just run "python train.py" and it starts training. You probably need to preprocess your dataset first, but that’s quite minimal work. You can find their main Python training file here. I was thinking about making a tutorial using this code on a real Kaggle challenge (I am doing the VinBigData chest X-ray object detection challenge). Let me know down in the comments if that would be interesting to you.
If you want to receive regular paper reviews about the latest papers in AI & Machine Learning, add your email here & Subscribe!
https://artisanal-motivator-8249.ck.page/5524b8f934
References:
[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko. End-to-End Object Detection with Transformers. arXiv, 2020.
[2] Hungarian algorithm. Wikipedia.