Another interesting and effective aspect of YOLO's design is its ability to detect multiple objects within a single grid cell. This capability is particularly important in complex scenes where objects are closely positioned or overlap significantly.

embed (24).svg

This is achieved by equipping each grid cell with multiple anchor boxes. In some versions, such as YOLOv3, the model is configured to make three predictions (three anchor boxes) per grid cell. This expands the neural network's output dimensions from $[85, S, S]$ to $[85 \times 3, S, S]$, enabling each grid cell to predict up to three objects. The prediction stage is the same as in the single-anchor-box version. However, this multiplicity complicates training, especially when aligning predictions with their corresponding labels for the Objectness Loss.
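As a rough illustration of the expanded output, the sketch below splits one grid cell's 255 channel values into three 85-value predictions. The function and constant names are hypothetical, and the breakdown of 85 into 4 box coordinates, 1 objectness score, and 80 class scores is an assumption based on the usual COCO setup rather than something stated above.

```python
NUM_ANCHORS = 3      # three predictions per grid cell, as in YOLOv3
PARAMS_PER_BOX = 85  # assumed: 4 box coords + 1 objectness + 80 class scores


def split_cell_output(cell_vector):
    """Split one grid cell's flat channel vector into per-anchor predictions."""
    expected = NUM_ANCHORS * PARAMS_PER_BOX
    if len(cell_vector) != expected:
        raise ValueError(f"expected {expected} values, got {len(cell_vector)}")
    return [
        cell_vector[a * PARAMS_PER_BOX:(a + 1) * PARAMS_PER_BOX]
        for a in range(NUM_ANCHORS)
    ]


# Dummy data: the value at each position is simply its channel index.
cell = list(range(NUM_ANCHORS * PARAMS_PER_BOX))
detections = split_cell_output(cell)
```

Here `detections[0]` holds channels 0 to 84, `detections[1]` holds channels 85 to 169, and `detections[2]` holds channels 170 to 254.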

Understanding the Complexity - Best Match

The assignment problem arises when we try to match these detections (Detection X, Y, and Z) to the ground truth labels (Label A, B, and C in the figure below).

embed - 2024-02-12T174000.063.svg

The issue is that the predictions are ordered by the network's output (i.e., Detection X is always the first set of 85 parameters out of 85×3=255 parameters, followed by Detection Y, and so on), but the ground truth labels have no inherent order.

In an ideal situation, each detection would correspond to exactly one ground truth label (e.g., Detection X to Label B, Detection Y to Label A, and Detection Z to Label C). Often, however, there are fewer labels than detections. If only two labels are available, which two of the three detections should be matched to them? This can be framed as a bipartite graph matching problem, which is typically solved with the Hungarian algorithm.
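To make the matching problem concrete, here is a minimal sketch that finds the assignment with the highest total IoU by brute force. It is a stand-in for the Hungarian algorithm, not an implementation of it: both give an optimal assignment, but enumeration is only practical for a handful of boxes. The IoU values and the function name are made up for illustration.

```python
from itertools import permutations


def optimal_matching(iou):
    """Return {label_index: detection_index} maximizing total IoU.

    iou[d][l] is the IoU between detection d and label l.
    Assumes there are at least as many detections as labels.
    """
    num_detections = len(iou)
    num_labels = len(iou[0])
    best_total, best_choice = -1.0, None
    # Try every way of giving each label its own distinct detection.
    for choice in permutations(range(num_detections), num_labels):
        total = sum(iou[d][l] for l, d in enumerate(choice))
        if total > best_total:
            best_total, best_choice = total, choice
    return {l: d for l, d in enumerate(best_choice)}


# Rows: Detections X, Y, Z. Columns: Labels A, B. Values are illustrative.
iou_matrix = [
    [0.8, 0.1],
    [0.6, 0.5],
    [0.2, 0.3],
]
matching = optimal_matching(iou_matrix)
```

With these numbers, Label A goes to Detection X and Label B goes to Detection Y, and Detection Z is left without a label.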

YOLO, however, opts for a different strategy and employs a greedy approach, potentially for performance reasons. Specifically, YOLO calculates the IoU (Intersection over Union) score, which measures the overlap between two boxes, for each detection against every label. Each detection is then assigned the label with which it has the highest IoU score, meaning that the detection is responsible for predicting that particular label.
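The greedy step can be sketched as follows. This is an illustrative version under simple assumptions, not YOLO's actual code: boxes are `(x1, y1, x2, y2)` corner tuples, and the box coordinates and function names are invented for the example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_label_per_detection(detections, labels):
    """For each detection, return (index of highest-IoU label, that IoU)."""
    result = []
    for det in detections:
        scores = [iou(det, lab) for lab in labels]
        best = max(range(len(labels)), key=scores.__getitem__)
        result.append((best, scores[best]))
    return result


detections = [(0, 0, 10, 10), (3, 3, 13, 13), (20, 20, 30, 30)]  # X, Y, Z
labels = [(1, 1, 11, 11), (6, 6, 16, 16), (21, 21, 31, 31)]      # A, B, C
candidates = best_label_per_detection(detections, labels)
```

In this example Detections X and Y both pick Label A, while Detection Z picks Label C. Note that a detection overlapping no label at all would still be given label index 0 with an IoU of 0 in this sketch, so a real implementation would need to treat that case as unmatched.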

embed - 2024-02-12T175952.334.svg

A common issue that arises with this method is when a single label is assigned to multiple detections. For instance, Label A might be assigned to both Detection X and Detection Y.

embed - 2024-02-12T180119.225.svg

In such cases, we compare the IoU scores of Detection X and Detection Y with respect to Label A. If Detection X has a higher IoU score, then it is ultimately assigned Label A, while Detection Y is reclassified as unmatched. This process ensures that each label is associated with the detection that most closely aligns with it, thereby improving the precision of the detection task.
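A minimal sketch of this tie-breaking rule is shown below. It assumes each detection has already chosen its best label as a `(label_index, iou)` pair; the function name and the example scores are hypothetical.

```python
def resolve_conflicts(candidates):
    """Keep, for each label, only the detection with the highest IoU.

    candidates[d] is (label_index, iou) for detection d.
    Returns a list whose entry d is the label index kept by detection d,
    or None if detection d ends up unmatched.
    """
    winner = {}  # label_index -> (detection_index, iou)
    for d, (label, score) in enumerate(candidates):
        if label not in winner or score > winner[label][1]:
            winner[label] = (d, score)
    assignment = [None] * len(candidates)
    for label, (d, _) in winner.items():
        assignment[d] = label
    return assignment


# Detections X and Y both chose Label A (index 0); Z chose Label C (index 2).
candidates = [(0, 0.68), (0, 0.47), (2, 0.68)]
assignment = resolve_conflicts(candidates)
```

Detection X keeps Label A because its IoU is higher, Detection Y becomes unmatched (`None`), and Detection Z keeps Label C.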

Understanding the Complexity - Undetection

If we continue with this matching process, we find that Label C is assigned to Detection Z. However, because YOLO's greedy comparison runs in a single step, Detection Y is not re-matched after losing Label A, and Label B is left unassigned, meaning it is not matched by any detection. (Loops with an indeterminate number of iterations should be used sparingly in neural networks, and repeating the greedy step until every label is matched would require exactly that.)

embed - 2024-02-12T180503.924.svg

In such cases, accountability is required. In the current scenario, accountability is straightforward: Detection Y should be held responsible for not detecting Label B, as it is the only unmatched detection. This results in a loss through the False Negative (FN) Objectness Loss: $\mathbb{1}_{i,j}^{\text{obj}} (C_i - \hat{C}_i)^2 = 1 \cdot (1-0)^2 = 1$.
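The bookkeeping for this simple case might look like the sketch below. It assumes the assignment list format of one label index (or `None`) per detection, and it evaluates the squared-error objectness term for the single responsible detection. All names are illustrative.

```python
def find_missed(assignment, num_labels):
    """Return (labels matched by no detection, detections holding no label)."""
    matched = {label for label in assignment if label is not None}
    missed_labels = [l for l in range(num_labels) if l not in matched]
    unmatched_detections = [d for d, l in enumerate(assignment) if l is None]
    return missed_labels, unmatched_detections


def objectness_loss(responsible, target_conf, predicted_conf):
    """Indicator-weighted squared error between target and predicted confidence."""
    indicator = 1.0 if responsible else 0.0
    return indicator * (target_conf - predicted_conf) ** 2


# X holds Label A (0), Y is unmatched, Z holds Label C (2); three labels exist.
assignment = [0, None, 2]
missed_labels, unmatched_detections = find_missed(assignment, num_labels=3)

# Exactly one label is missed and one detection is free, so that detection
# (Y) is held responsible. Target confidence 1, predicted confidence 0.
fn_loss = objectness_loss(responsible=True, target_conf=1.0, predicted_conf=0.0)
```

Here `missed_labels` is `[1]` (Label B), `unmatched_detections` is `[1]` (Detection Y), and `fn_loss` evaluates to 1.0.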

However, the situation is not so simple if there are only two labels and all three detections match the same one.

embed - 2024-02-12T181237.841.svg

In this scenario, a naive (non-YOLO) strategy would hold both Detection Y and Detection Z responsible for not detecting Label B, with each taking 50% of the responsibility (loss). This design is more stable, but YOLO instead uses optimal shape matching for accountability.
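For comparison, the naive shared-responsibility strategy (explicitly not what YOLO does) could be sketched as follows. The function name, the equal-weight split, and the example confidences are assumptions made for illustration.

```python
def shared_fn_loss(predicted_confs, unmatched_detections, target_conf=1.0):
    """Split the false-negative objectness loss equally among unmatched detections.

    predicted_confs[d] is the predicted objectness of detection d.
    Returns {detection_index: weighted squared-error loss}.
    """
    if not unmatched_detections:
        return {}
    share = 1.0 / len(unmatched_detections)  # 50% each when two are unmatched
    return {
        d: share * (target_conf - predicted_confs[d]) ** 2
        for d in unmatched_detections
    }


# Detection X (index 0) won Label A; Y and Z (indices 1, 2) are unmatched
# and both predicted zero confidence, while Label B went undetected.
predicted_confs = [0.9, 0.0, 0.0]
losses = shared_fn_loss(predicted_confs, unmatched_detections=[1, 2])
```

Detections Y and Z each receive a loss of 0.5, which together add up to the full penalty of 1 for the missed Label B.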