DETR, or Detection Transformer, introduced by Nicolas Carion et al. at Facebook AI Research in 2020, recasts object detection as a query-based set prediction problem. Leveraging a transformer architecture, DETR reasons about the image as a whole, which improves detection accuracy.
DETR operates as a single-stage detector, using a global loss based on bipartite matching between predictions and ground-truth labels. This strategy sidesteps the complication, familiar from YOLO, of managing multiple objects within a single grid cell. By adopting this approach, DETR aims to produce exactly one distinct, precise prediction per object, removing the need for complex post-processing steps such as Non-Maximum Suppression (NMS).
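To make the matching idea concrete, here is a minimal sketch of one-to-one bipartite matching with the Hungarian algorithm (via SciPy's `linear_sum_assignment`), using a toy classification-only cost; DETR's actual matching cost also includes L1 and generalized-IoU box terms, and all tensor names here are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Toy example: 5 predicted queries, 2 ground-truth objects.
num_queries, num_classes = 5, 3
pred_logits = torch.randn(num_queries, num_classes + 1)   # +1 for the "no object" class
tgt_labels = torch.tensor([0, 2])                         # ground-truth class indices

# Matching cost: negative predicted probability of each target's class.
# (Real DETR adds L1 and generalized-IoU box costs to this term.)
prob = pred_logits.softmax(-1)                            # [num_queries, num_classes + 1]
cost = -prob[:, tgt_labels]                               # [num_queries, num_targets]

# Hungarian algorithm: a one-to-one assignment of queries to targets.
query_idx, tgt_idx = linear_sum_assignment(cost.detach().numpy())
print(list(zip(query_idx, tgt_idx)))                      # e.g. [(1, 0), (4, 1)]
```

Each ground-truth object ends up matched to exactly one query; all remaining queries are supervised to predict "no object", which is what makes duplicate suppression unnecessary.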
DETR first uses a CNN backbone, such as ResNet-50, to extract a feature map from the input image. This map captures the image's salient details and textures, summarizing the visual information into a compact form that the rest of the pipeline can work with.
After feature extraction, the features are fed into the Transformer encoder together with positional encodings that keep track of where each pixel sits in the image. The encoder processes this information, modeling how different parts of the image relate to one another, no matter how far apart they are. The result is a set of embeddings that carry the context of the whole scene.
These embeddings then help the decoder make sense of what it sees, ensuring predictions are based on a thorough digest of the whole image.
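As a rough sketch of this stage, the code below flattens an (assumed already projected) feature map into a token sequence, adds positional encodings, and runs it through a stock PyTorch Transformer encoder. The dimensions are illustrative, and adding the positional encodings once at the input is a simplification of what DETR does inside every attention layer.

```python
import torch
import torch.nn as nn

d_model, H, W = 256, 25, 34                 # illustrative hidden size and feature-map grid
features = torch.randn(1, d_model, H, W)    # projected CNN features [B, d_model, H, W]
pos_embed = torch.randn(1, d_model, H, W)   # positional encodings of the same shape

# Flatten the spatial grid into a sequence of H*W tokens: [H*W, B, d_model].
src = features.flatten(2).permute(2, 0, 1)
pos = pos_embed.flatten(2).permute(2, 0, 1)

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Self-attention over all H*W tokens produces context-aware scene embeddings.
memory = encoder(src + pos)                 # [H*W, B, d_model]
print(memory.shape)
```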
In the last step, the DETR decoder uses learned object queries to locate and identify objects in the embeddings, i.e., the encoder's outputs. This works like a search that already knows the scene, helping to spot and classify objects accurately.
Subtle: In some cases, the output of the transformer decoder is treated as the final predictions, including bounding boxes, class labels, and confidence scores. However, it is also common to view the last few layers, i.e., a shared feed-forward network applied on top of the decoder and known as the "prediction head", as a separate component that produces the bounding boxes, class predictions, and confidence scores.
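Here is a compact sketch of the decoder-plus-prediction-head view described above, assuming an encoder output is already available. The query count, hidden size, and the two-layer box FFN are illustrative simplifications (DETR's box head is a three-layer MLP, and its queries enter the decoder as additive embeddings rather than directly as the target sequence).

```python
import torch
import torch.nn as nn

d_model, num_queries, num_classes = 256, 100, 91   # illustrative sizes
memory = torch.randn(850, 1, d_model)              # encoder output: [H*W, B, d_model]

# Learned object queries: each query "searches" the encoded scene for one object.
query_embed = nn.Embedding(num_queries, d_model)
tgt = query_embed.weight.unsqueeze(1)              # [num_queries, B, d_model]

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
hs = decoder(tgt, memory)                          # [num_queries, B, d_model]

# Prediction heads shared across all queries: class logits and normalized boxes.
class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
bbox_head = nn.Sequential(                         # small FFN for (cx, cy, w, h)
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4), nn.Sigmoid(),
)
logits, boxes = class_head(hs), bbox_head(hs)
print(logits.shape, boxes.shape)                   # [100, 1, 92], [100, 1, 4]
```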
DETR starts with a CNN backbone processing an image of size $[3, H_0, W_0]$, resulting in a feature map with reduced spatial dimensions $[C, H, W]$ (where $C=2048$, $H=\frac{H_0}{32}$, and $W=\frac{W_0}{32}$).
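These shapes can be reproduced with a truncated torchvision ResNet-50, which is a reasonable stand-in for DETR's backbone; the input resolution below is illustrative.

```python
import torch
import torchvision

# ResNet-50 with the classification head removed; the output of the last
# convolutional stage serves as the feature map.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc

image = torch.randn(1, 3, 800, 1066)        # [B, 3, H0, W0]
features = backbone(image)
print(features.shape)                        # torch.Size([1, 2048, 25, 34]) ≈ [B, C, H0/32, W0/32]
```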
Because we have not yet discussed the Transformer's self-attention mechanism, the need for positional encoding might not be immediately clear. Simply put, unlike MLPs and CNNs, self-attention is a computation module that is agnostic to the position of its inputs. Consider a function $(X, Y) = f(x, y)$, where $f$ represents the self-attention mechanism and $x$ and $y$ are inputs at two positions. If we swap the positions of $x$ and $y$, the positions of the outputs swap as well, giving $(Y, X) = f(y, x)$, yet the output values $X$ and $Y$ themselves remain unchanged.
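This permutation-equivariant behavior can be checked directly with PyTorch's multi-head attention (no positional encodings; the sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2)

x = torch.randn(2, 1, 16)          # two input tokens at positions 0 and 1, batch of 1
x_swapped = x.flip(0)              # same tokens, positions swapped

out, _ = attn(x, x, x)
out_swapped, _ = attn(x_swapped, x_swapped, x_swapped)

# The outputs swap positions along with the inputs, but their values are unchanged:
print(torch.allclose(out.flip(0), out_swapped, atol=1e-6))   # True
```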
Challenge: while the self-attention module can learn the relationship between one input and all the others, it does not inherently understand the distance or positional relationship between them. This notion of distance is crucial in both natural language processing and image processing, e.g., "see you" is different from "you see".
Of course, there are exceptions, such as using self-attention for analyzing multi-channel data (e.g., anomaly detection) where the channels do not have an inherent order; in such cases, the position-agnostic nature of the computation can actually be beneficial.
Why: In object detection, the model should understand the positional relationship between each pixel and its neighbors. In the self-attention of the DETR encoder, only the pixel values are processed, not their positions. Therefore, we need to explicitly embed positional information into the inputs, and this is where positional encoding comes into play.
How-to: DETR adopts the positional encoding scheme from the "Attention Is All You Need" paper but extends it to two dimensions ($x$ and $y$). Following the Transformer's approach, DETR generates positional encodings for the $x$ and $y$ coordinates separately and then concatenates them along the channel dimension. Given the fiddliness of computing positional encodings by hand, pre-existing implementations are generally reused.
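Below is a simplified sketch of such a 2D sine/cosine scheme: half of the channels encode the $y$ coordinate and half encode the $x$ coordinate, and the two halves are concatenated. DETR's reference implementation differs in details such as coordinate normalization, so treat this as illustrative.

```python
import torch

def sine_position_encoding_2d(H, W, d_model=256, temperature=10000.0):
    """Sine/cosine positional encodings on an H x W grid with d_model channels:
    the first half encodes y, the second half encodes x (simplified sketch)."""
    half = d_model // 2
    y = torch.arange(H, dtype=torch.float32).unsqueeze(1)    # [H, 1]
    x = torch.arange(W, dtype=torch.float32).unsqueeze(1)    # [W, 1]
    div = temperature ** (torch.arange(0, half, 2, dtype=torch.float32) / half)

    pe_y = torch.zeros(H, half)
    pe_y[:, 0::2] = torch.sin(y / div)
    pe_y[:, 1::2] = torch.cos(y / div)
    pe_x = torch.zeros(W, half)
    pe_x[:, 0::2] = torch.sin(x / div)
    pe_x[:, 1::2] = torch.cos(x / div)

    # Broadcast each 1-D encoding across the other axis and concatenate: [d_model, H, W].
    pe_y = pe_y.permute(1, 0).unsqueeze(2).expand(half, H, W)
    pe_x = pe_x.permute(1, 0).unsqueeze(1).expand(half, H, W)
    return torch.cat([pe_y, pe_x], dim=0)

pos = sine_position_encoding_2d(H=25, W=34)
print(pos.shape)   # torch.Size([256, 25, 34])
```

The resulting tensor has the same spatial grid as the feature map, so it can be flattened and added to the feature tokens exactly as in the encoder sketch above.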