DETR, or Detection Transformer, introduced by Nicolas Carion et al. at Facebook AI Research in 2020, recasts object detection as a query-based set prediction problem. Leveraging a transformer architecture, DETR reasons about the image as a whole, which improves detection accuracy.
DETR operates as a single-stage detector, using a global loss based on bipartite matching between predictions and ground-truth labels. This strategy sidesteps the complication, familiar from YOLO, of managing multiple objects within a single grid cell. By adopting this approach, DETR aims to produce exactly one distinct, precise prediction per object, removing the need for complex post-processing steps such as Non-Maximum Suppression (NMS).
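To make the matching idea concrete, here is a minimal sketch of one-to-one bipartite matching with the Hungarian algorithm (via SciPy's `linear_sum_assignment`), using a toy classification-only cost; DETR's actual matching cost also includes L1 and generalized-IoU box terms, and all tensor names here are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

# Toy example: 5 predicted queries, 2 ground-truth objects.
num_queries, num_classes = 5, 3
pred_logits = torch.randn(num_queries, num_classes + 1)   # +1 for the "no object" class
tgt_labels = torch.tensor([0, 2])                         # ground-truth class indices

# Matching cost: negative predicted probability of each target's class.
# (Real DETR adds L1 and generalized-IoU box costs to this term.)
prob = pred_logits.softmax(-1)                            # [num_queries, num_classes + 1]
cost = -prob[:, tgt_labels]                               # [num_queries, num_targets]

# Hungarian algorithm: a one-to-one assignment of queries to targets.
query_idx, tgt_idx = linear_sum_assignment(cost.detach().numpy())
print(list(zip(query_idx, tgt_idx)))                      # e.g. [(1, 0), (4, 1)]
```

Each ground-truth object ends up matched to exactly one query; all remaining queries are supervised to predict "no object", which is what makes duplicate suppression unnecessary.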
DETR first uses a CNN backbone, such as ResNet-50, to extract a feature map from the input image. This map captures the image's salient details and textures, summarizing the visual information into a compact form that the rest of the pipeline can work with.
After feature extraction, the features are fed into the Transformer encoder together with positional encodings that keep track of where each pixel sits in the image. The encoder processes this information, modeling how different parts of the image relate to one another, no matter how far apart they are. The result is a set of embeddings that carry the context of the whole scene.
These embeddings then help the decoder make sense of what it sees, ensuring predictions are based on a thorough digest of the whole image.
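As a rough sketch of this stage, the code below flattens an (assumed already projected) feature map into a token sequence, adds positional encodings, and runs it through a stock PyTorch Transformer encoder. The dimensions are illustrative, and adding the positional encodings once at the input is a simplification of what DETR does inside every attention layer.

```python
import torch
import torch.nn as nn

d_model, H, W = 256, 25, 34                 # illustrative hidden size and feature-map grid
features = torch.randn(1, d_model, H, W)    # projected CNN features [B, d_model, H, W]
pos_embed = torch.randn(1, d_model, H, W)   # positional encodings of the same shape

# Flatten the spatial grid into a sequence of H*W tokens: [H*W, B, d_model].
src = features.flatten(2).permute(2, 0, 1)
pos = pos_embed.flatten(2).permute(2, 0, 1)

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Self-attention over all H*W tokens produces context-aware scene embeddings.
memory = encoder(src + pos)                 # [H*W, B, d_model]
print(memory.shape)
```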
In the last step, the DETR decoder uses learned object queries to locate and identify objects in the embeddings, i.e., the encoder's outputs. This works like a search that already knows the scene, helping to spot and classify objects accurately.
Subtle: In some cases, the output of the transformer decoder is treated as the final predictions, including bounding boxes, class labels, and confidence scores. However, it is also common to view the last few layers, i.e., a shared feed-forward network applied on top of the decoder and known as the "prediction head", as a separate component that produces the bounding boxes, class predictions, and confidence scores.
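Here is a compact sketch of the decoder-plus-prediction-head view described above, assuming an encoder output is already available. The query count, hidden size, and the two-layer box FFN are illustrative simplifications (DETR's box head is a three-layer MLP, and its queries enter the decoder as additive embeddings rather than directly as the target sequence).

```python
import torch
import torch.nn as nn

d_model, num_queries, num_classes = 256, 100, 91   # illustrative sizes
memory = torch.randn(850, 1, d_model)              # encoder output: [H*W, B, d_model]

# Learned object queries: each query "searches" the encoded scene for one object.
query_embed = nn.Embedding(num_queries, d_model)
tgt = query_embed.weight.unsqueeze(1)              # [num_queries, B, d_model]

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
hs = decoder(tgt, memory)                          # [num_queries, B, d_model]

# Prediction heads shared across all queries: class logits and normalized boxes.
class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no object"
bbox_head = nn.Sequential(                         # small FFN for (cx, cy, w, h)
    nn.Linear(d_model, d_model), nn.ReLU(),
    nn.Linear(d_model, 4), nn.Sigmoid(),
)
logits, boxes = class_head(hs), bbox_head(hs)
print(logits.shape, boxes.shape)                   # [100, 1, 92], [100, 1, 4]
```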
DETR starts with a CNN backbone processing an image of size $[3, H_0, W_0]$, resulting in a feature map with reduced spatial dimensions $[C, H, W]$ (where $C=2048$, $H=\frac{H_0}{32}$, and $W=\frac{W_0}{32}$).
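These shapes can be reproduced with a truncated torchvision ResNet-50, which is a reasonable stand-in for DETR's backbone; the input resolution below is illustrative.

```python
import torch
import torchvision

# ResNet-50 with the classification head removed; the output of the last
# convolutional stage serves as the feature map.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])   # drop avgpool + fc

image = torch.randn(1, 3, 800, 1066)        # [B, 3, H0, W0]
features = backbone(image)
print(features.shape)                        # torch.Size([1, 2048, 25, 34]) ≈ [B, C, H0/32, W0/32]
```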
Because we have not yet discussed the Transformer's self-attention mechanism, the need for positional encoding might not be immediately clear. Simply put, unlike MLPs and CNNs, self-attention is a computation module that is agnostic to the position of its inputs. Consider a function $(X, Y) = f(x, y)$, where $f$ represents the self-attention mechanism and $x$ and $y$ are inputs at two positions. If we swap the positions of $x$ and $y$, the positions of the outputs swap as well, giving $(Y, X) = f(y, x)$, yet the output values $X$ and $Y$ themselves remain unchanged.
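This permutation-equivariant behavior can be checked directly with PyTorch's multi-head attention (no positional encodings; the sizes are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=16, num_heads=2)

x = torch.randn(2, 1, 16)          # two input tokens at positions 0 and 1, batch of 1
x_swapped = x.flip(0)              # same tokens, positions swapped

out, _ = attn(x, x, x)
out_swapped, _ = attn(x_swapped, x_swapped, x_swapped)

# The outputs swap positions along with the inputs, but their values are unchanged:
print(torch.allclose(out.flip(0), out_swapped, atol=1e-6))   # True
```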
Challenge: while the self-attention module can learn the relationship between one input and all the others, it does not inherently understand the distance or positional relationship between them. This notion of distance is crucial in both natural language processing and image processing, e.g., "see you" is different from "you see".
Of course, there are exceptions, such as using self-attention for analyzing multi-channel data (e.g., anomaly detection) where the channels do not have an inherent order; in such cases, the position-agnostic nature of the computation can actually be beneficial.
Why: In object detection, the model should understand the positional relationship between each pixel and its neighbors. In the self-attention of the DETR encoder, only the pixel values are processed, not their positions. Therefore, we need to explicitly embed positional information into the inputs, and this is where positional encoding comes into play.
How-to: DETR adopts the positional encoding scheme from the "Attention Is All You Need" paper but extends it to two dimensions ($x$ and $y$). Following the Transformer's approach, DETR generates positional encodings for the $x$ and $y$ coordinates separately and then concatenates them along the channel dimension. Given the fiddliness of computing positional encodings by hand, pre-existing implementations are generally reused.
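Below is a simplified sketch of such a 2D sine/cosine scheme: half of the channels encode the $y$ coordinate and half encode the $x$ coordinate, and the two halves are concatenated. DETR's reference implementation differs in details such as coordinate normalization, so treat this as illustrative.

```python
import torch

def sine_position_encoding_2d(H, W, d_model=256, temperature=10000.0):
    """Sine/cosine positional encodings on an H x W grid with d_model channels:
    the first half encodes y, the second half encodes x (simplified sketch)."""
    half = d_model // 2
    y = torch.arange(H, dtype=torch.float32).unsqueeze(1)    # [H, 1]
    x = torch.arange(W, dtype=torch.float32).unsqueeze(1)    # [W, 1]
    div = temperature ** (torch.arange(0, half, 2, dtype=torch.float32) / half)

    pe_y = torch.zeros(H, half)
    pe_y[:, 0::2] = torch.sin(y / div)
    pe_y[:, 1::2] = torch.cos(y / div)
    pe_x = torch.zeros(W, half)
    pe_x[:, 0::2] = torch.sin(x / div)
    pe_x[:, 1::2] = torch.cos(x / div)

    # Broadcast each 1-D encoding across the other axis and concatenate: [d_model, H, W].
    pe_y = pe_y.permute(1, 0).unsqueeze(2).expand(half, H, W)
    pe_x = pe_x.permute(1, 0).unsqueeze(1).expand(half, H, W)
    return torch.cat([pe_y, pe_x], dim=0)

pos = sine_position_encoding_2d(H=25, W=34)
print(pos.shape)   # torch.Size([256, 25, 34])
```

The resulting tensor has the same spatial grid as the feature map, so it can be flattened and added to the feature tokens exactly as in the encoder sketch above.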