This post is a summary and exploration of the research paper that introduced the Detection Transformer (DETR) architecture. It is intended to give a high-level understanding of the components of the system and to complement a reading of the full-length paper. We hope to explore more such research articles in the future.
DETR (DEtection TRansformer): End-to-End Object Detection with Transformers
The DETR architecture, published by researchers from Facebook AI Research (FAIR), is an attempt to recast object detection in images as a direct set prediction problem. Existing methods model detection as fitting and fine-tuning a prior (anchor boxes, selective-search-based heuristics), predicting features relative to other features and regressing to refine the bounding-box predictions.
These architectures suffer from multiple challenges and require hand-crafted components and filters, including non-maximum suppression (NMS), anchor boxes, and thresholds. Though there has been improvement over time, from the days of R-CNN to Faster R-CNN and YOLOv1 to v5, the elaborate complexity has remained.
DETR attempts to simplify the object detection pipeline. It introduces a set-based global loss (it isn't as complicated as it sounds). The claim is that it performs on par with algorithms that have gone through iterations of optimization, such as Faster R-CNN.
High Level Overview of the Architecture
Figure 1: DETR Architecture
CNN Feature Extractor: The CNN feature extractor is similar to a standard fully convolutional backbone. It converts the input image into a feature map of size H x W x C, which is subsequently flattened into H·W vectors of size C to serve as input to the transformer.
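The flattening step can be sketched in a few lines of numpy. The shapes below are purely illustrative (a ResNet-style backbone with a 256-channel output is assumed; the actual spatial size depends on the input image):

```python
import numpy as np

# Dummy backbone output: a feature map of shape (C, H, W).
# C=256 channels over a 25x34 spatial grid (illustrative values).
C, H, W = 256, 25, 34
feature_map = np.random.rand(C, H, W).astype(np.float32)

# Flatten the spatial dimensions: each of the H*W grid positions
# becomes one C-dimensional token for the transformer encoder.
tokens = feature_map.reshape(C, H * W).T  # shape (H*W, C)

print(tokens.shape)  # (850, 256)
```

Each row of `tokens` corresponds to one spatial location of the feature map, which is why positional encodings must be added before the encoder sees them.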
Transformer Encoder Decoder: There are two components to this part of the pipeline:
Transformer Encoder: The input to this layer is the sequence of H·W feature vectors. The encoder models the input to generate an output sequence based on the computations performed within the various attention blocks of the Transformer architecture (read the appendix of the paper for more details).
Transformer Decoder: The decoder is also a Transformer block, which decodes the N objects in parallel. Learned positional embeddings called object queries are passed as input to the decoder, combined with the encoder's output, and the results are provided to the next part of the architecture.
Prediction Heads: The prediction heads are feed-forward neural networks with ReLU activations. They predict the normalized center coordinates, height, and width of each box with respect to the input image, along with the class label. A fixed set of N bounding boxes is predicted, and a special null ("no object") class represents slots where nothing is detected.
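A box-prediction head of this shape can be sketched as a small 3-layer perceptron in numpy. The dimensions, random initialization, and sigmoid output are illustrative stand-ins, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: 256-d decoder outputs, 3-layer FFN head.
d_model, hidden = 256, 256
W1 = rng.normal(0, 0.02, (d_model, hidden))
W2 = rng.normal(0, 0.02, (hidden, hidden))
W3 = rng.normal(0, 0.02, (hidden, 4))  # (cx, cy, w, h)

def box_head(decoder_out):
    """Map decoder output embeddings to normalized boxes."""
    h = relu(decoder_out @ W1)
    h = relu(h @ W2)
    # sigmoid keeps each value in (0, 1), i.e. relative to image size
    return sigmoid(h @ W3)

N = 5  # number of object slots
decoder_out = rng.normal(size=(N, d_model))
boxes = box_head(decoder_out)
print(boxes.shape)  # (5, 4)
```

A parallel linear head with a softmax over the class labels (plus the null class) produces the classification output for each of the N slots.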
Set prediction Loss:
DETR infers a fixed set of N predictions, where N is significantly larger than the typical number of objects in an image.
The loss first produces an optimal bipartite matching between predicted and ground-truth objects, and then optimizes the object-specific (classification and bounding-box) losses.
Figure 2: Bipartite Matching Loss
y = ground-truth set of objects
ŷ = set of N predictions from DETR
Lmatch is the pair-wise matching cost between a ground-truth object and a prediction. The total cost over all pairs is to be minimized, and the optimal assignment is computed efficiently with the Hungarian algorithm.
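The assignment step itself is standard and available off the shelf. The sketch below uses an arbitrary hand-written cost matrix with `scipy.optimize.linear_sum_assignment`; in DETR the costs would instead combine the class probability and box terms of Lmatch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative pair-wise matching costs between N=5 predictions (rows)
# and N=5 ground-truth slots (columns); lower cost = better match.
cost = np.full((5, 5), 1.0)
cost[0, 1] = cost[1, 3] = cost[2, 0] = cost[3, 4] = cost[4, 2] = 0.1

# Hungarian algorithm: optimal one-to-one assignment minimizing total cost.
pred_idx, gt_idx = linear_sum_assignment(cost)

print(gt_idx)                          # [1 3 0 4 2]
print(cost[pred_idx, gt_idx].sum())    # 0.5
```

Each prediction is matched to exactly one ground-truth slot (and vice versa), which is what makes the loss set-based rather than dependent on prediction order.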
Instead of just defining the loss in equation terms, here is an example derived from the excellent video by Yannic Kilcher:
Let us assume N = 5. On the left is DETR's prediction set, and on the right the ground-truth set, padded with ɸ to length N:

|Prediction||Ground Truth|
|(c1, b)||(c2, b)|
|(ɸ, b)||(ɸ, ~b)|
|(ɸ, b)||(c2, b)|
|(c2, b)||(ɸ, ~b)|
|(c1, b)||(c1, b)|

Where ɸ denotes the "no object" class, c1 and c2 are class labels, b is a bounding box, and ~b marks a padded slot with no ground-truth box.
In the optimal matching, obtained by minimizing the total matching cost via the Hungarian algorithm for this particular forward pass, the predictions would be reordered to something similar to:
|Prediction||Ground Truth|
|(c2, b1)||(c2, b)|
|(ɸ, b2)||(ɸ, ~b)|
|(c2, b4)||(c2, b)|
|(ɸ, b3)||(ɸ, ~b)|
|(c1, b1)||(c1, b)|
Subsequently, a bounding-box loss is computed for each matched pair as a linear combination of the L1 loss and the generalized IoU (GIoU) loss. These losses are then normalized by the number of objects in the batch.
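The L1 + GIoU combination for a single matched pair can be sketched as follows. Boxes are assumed here to be in corner (x1, y1, x2, y2) format, and the loss weights are illustrative:

```python
import numpy as np

def giou(box_a, box_b):
    """Generalized IoU for two boxes in (x1, y1, x2, y2) format."""
    # intersection rectangle (zero area if the boxes do not overlap)
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # smallest box enclosing both — this is what makes GIoU informative
    # (non-zero gradient) even for non-overlapping boxes
    ex1, ey1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    ex2, ey2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    return inter / union - (enclose - union) / enclose

def box_loss(pred, gt, w_l1=5.0, w_giou=2.0):
    """Linear combination of L1 and GIoU loss for one matched pair."""
    l1 = float(np.abs(np.array(pred) - np.array(gt)).sum())
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, gt))

pred = [0.1, 0.1, 0.5, 0.5]
gt = [0.1, 0.1, 0.5, 0.5]
print(box_loss(pred, gt))  # 0.0 for a perfect match
```

Note that GIoU ranges over (-1, 1] and goes negative for disjoint boxes, unlike plain IoU, which is stuck at zero whenever the boxes do not overlap.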
The overall Hungarian loss then comes out to be:
Fig 3: Hungarian Loss
Where σ̂(i) is the optimal assignment computed in the first step.
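Reconstructing the Hungarian loss from Fig 3 in equation form (matching the paper's notation, with p̂ the predicted class probabilities and L_box the bounding-box loss above):

```latex
\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) =
  \sum_{i=1}^{N} \Big[
    -\log \hat{p}_{\hat{\sigma}(i)}(c_i)
    + \mathbb{1}_{\{c_i \neq \varnothing\}}\,
      \mathcal{L}_{\text{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big)
  \Big]
```

The indicator term means the box loss is only applied to slots that contain a real object; padded ɸ slots contribute only the classification term.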
An important focus of the authors' approach has been ease of use and the simplicity of the architecture itself. In the paper, they note that inference for DETR can be implemented in fewer than 50 lines of PyTorch.
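To give a feel for that simplicity, here is a toy-scale sketch of the forward pass in PyTorch. It is not the paper's listing: the ResNet-50 backbone is replaced by a single strided convolution, positional encodings are omitted, dimensions are tiny, and the weights are untrained, so it only demonstrates the data flow and output shapes:

```python
import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    """Pared-down sketch of the DETR forward pass (toy dimensions)."""
    def __init__(self, num_classes=91, d_model=64, n_queries=5):
        super().__init__()
        # stand-in backbone: one strided conv instead of ResNet-50
        self.backbone = nn.Conv2d(3, d_model, kernel_size=8, stride=8)
        self.transformer = nn.Transformer(d_model, nhead=4,
                                          num_encoder_layers=1,
                                          num_decoder_layers=1,
                                          batch_first=True)
        # learned object queries: one embedding per prediction slot
        self.query_embed = nn.Parameter(torch.randn(n_queries, d_model))
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, x):
        f = self.backbone(x)                 # (B, C, H, W) feature map
        B, C, H, W = f.shape
        src = f.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        tgt = self.query_embed.unsqueeze(0).expand(B, -1, -1)
        hs = self.transformer(src, tgt)      # (B, n_queries, C)
        return self.class_head(hs), self.box_head(hs).sigmoid()

model = MiniDETR().eval()
logits, boxes = model(torch.randn(1, 3, 64, 64))
print(logits.shape, boxes.shape)  # (1, 5, 92) and (1, 5, 4)
```

No anchors, no NMS, no post-processing beyond a softmax over the class logits and dropping the "no object" slots: that is the core of the pitch.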
The repository for the paper is linked in the references, feel free to discuss and ask any questions!
Thank you for reading.