YOLO can be implemented using various convolutional neural network architectures. A popular choice historically has been Darknet, though recent adaptations of YOLO have incorporated transformer architectures to enhance performance. A significant challenge in developing YOLO from scratch is the complexity of its loss function. This complexity primarily arises from the need to reconcile the mismatch between the model outputs and the predefined labels. Typically, object labels are provided in the COCO JSON format, which is a standard for object detection tasks.

Example of a simple COCO JSON (LTRB)
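Since the example itself is not reproduced here, below is a minimal hand-written sketch of what such a file could look like (the file name, ids, and category numbering are hypothetical; the `bbox` values are interpreted as $[\text{left}, \text{top}, \text{right}, \text{bottom}]$ to match the examples used later in this section, whereas the official COCO format stores $[x, y, \text{width}, \text{height}]$):

```json
{
  "images": [
    {"id": 1, "file_name": "example.jpg", "width": 320, "height": 320}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 2, "bbox": [50, 50, 100, 100]},
    {"id": 2, "image_id": 1, "category_id": 1, "bbox": [150, 150, 200, 200]}
  ],
  "categories": [
    {"id": 1, "name": "cat"},
    {"id": 2, "name": "dog"}
  ]
}
```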

Based on this COCO JSON, we can see that a traditional MSE or NLL loss cannot be applied directly to bridge the mismatch between model outputs and labels. This is because, even after processing, the COCO JSON provides only the bounding box locations and the categories, whereas the model outputs a feature map with 85 channels per cell.

The loss function of the YOLO model defines how to align the information in the labels with the corresponding feature maps and calculate the loss.

To simplify: Since the fully expanded formula is too complex to read easily, we simply treat the feature map's $b_x, b_y, b_w, b_h$ as the target's bounding box $[x, y, w, h]$. In programming practice, you will see that they are actually obtained from the anchor box through a transformation, i.e., $[x, y, w, h] = f_{b_x, b_y, b_w, b_h}(x_a, y_a, w_a, h_a)$. At the same time, we assume that the model predicts only one bounding box for each cell. Details about a cell predicting multiple bounding boxes will be discussed in a later section.
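For reference, one common concrete form of this transformation is the YOLOv2-style parameterization shown below; the symbols $t_x, t_y, t_w, t_h$ (raw outputs), $(c_x, c_y)$ (cell offset), and $(p_w, p_h)$ (anchor size) are introduced only for this illustration and are not used elsewhere in this text:

$$ b_x = \sigma(t_x) + c_x, \qquad b_y = \sigma(t_y) + c_y, \qquad b_w = p_w e^{t_w}, \qquad b_h = p_h e^{t_h} $$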

A Plausible Way to Think About the Loss Function

We can understand YOLO's loss function by mapping COCO JSON to the YOLO model's output feature map. For example, suppose we have an object Dog at position $[50, 50, 100, 100]$ (in LTRB form) in a $320\times 320$ image and another object Cat at position $[150, 150, 200, 200]$.

We can calculate the corresponding cells of these two objects in the model's output feature map and then transform and assign the positional values to the respective cells; a short code sketch after the steps below illustrates this mapping.

  1. Calculate the center point of each object and determine which cell on a $10 \times 10$ feature map (each cell represents a $32\times 32$ pixel area of the original image) it corresponds to. The center points are $[75, 75]$ for the first object and $[175, 175]$ for the second object, which map to cells $[75, 75] // 32 = [2, 2]$ and $[175, 175] // 32 = [5, 5]$ respectively.

  2. Taking cell $[2, 2]$ as an example, it contains 85 values represented as follows:

    $$ \mathbf{y}[:, 2, 2] = [75, 75, 50, 50, 1, 0, 1, ...] $$

    Here, the first four values $[75, 75, 50, 50]$ represent the bounding box information $[x, y, w, h]$, i.e., the object's center and its width and height in pixels; the subsequent $1$ indicates the presence of an object in this cell; the series starting with $[0, 1, \dots]$ corresponds to the classification labels, where the value $1$ at the second position signifies that the object is a Dog.

  3. For cell $[5, 5]$, it contains 85 values as shown below:

    $$ \mathbf{y}[:, 5, 5] = [175, 175, 50, 50, 1, 1, 0, ...] $$

    The classification label $[1, 0, \dots]$ indicates that the object is a Cat. The assignment of these classification labels, such as the first value denoting a Cat and the second a Dog, is predefined according to the dataset configuration and should be consistent with your labeling schema.

  4. For all other cells, each one contains 85 values as shown below:

    $$ \mathbf{y}[:, i, j] = [0, 0, 0, 0, \underline{0}, 0, 0, ...], \quad (i, j) \notin \{(2, 2), (5, 5)\} $$

    The first 4 values and the last 80 values in the array can be set to any values without impact, as they are irrelevant when the fifth value is set to 0. This fifth value serves as an indicator; when it is set to 0, it signifies that there is no object present in the cell, thus rendering the bounding box coordinates and classification labels unnecessary for the model's calculations.
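Putting the steps above into code, here is a minimal sketch of how the target tensor could be assembled from LTRB pixel boxes; the grid size, stride, class count, and all function and variable names are assumptions made for this illustration rather than part of any particular YOLO implementation:

```python
import numpy as np

def build_target(boxes_ltrb, class_ids, grid=10, stride=32, num_classes=80):
    """Map LTRB pixel boxes onto a (5 + num_classes, grid, grid) target tensor."""
    y = np.zeros((5 + num_classes, grid, grid), dtype=np.float32)
    for (l, t, r, b), cls in zip(boxes_ltrb, class_ids):
        cx, cy = (l + r) / 2, (t + b) / 2                # object center in pixels
        w, h = r - l, b - t                              # width and height in pixels
        col, row = int(cx // stride), int(cy // stride)  # responsible cell
        y[0:4, row, col] = [cx, cy, w, h]                # bounding box values
        y[4, row, col] = 1.0                             # objectness indicator
        y[5 + cls, row, col] = 1.0                       # one-hot class label
    return y

# Dog (class index 1) and Cat (class index 0) from the example above
target = build_target([(50, 50, 100, 100), (150, 150, 200, 200)], [1, 0])
print(target[:7, 2, 2])  # -> [75. 75. 50. 50.  1.  0.  1.]
```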

YOLO Loss Overview

There are three parts in the YOLO loss:

$$ J^{(i)} = \lambda_1 J^{(i)}_{\text{bbox}} + \lambda_2 J^{(i)}_{\text{score}} + \lambda_3 J^{(i)}_{\text{class}} $$

These 3 parts seem to be the same as the previous single-object localization and classification. This is both correct and incorrect. It is correct because YOLO can be understood as a derivation of the previous single-object approach applied to multiple objects, so the structure is similar. However, each cell in YOLO needs to predict one box (or multiple boxes, through subsequent expansions), whereas the previous single-object approach only needed to predict one box in total. In other words, the previous single-object approach can be understood as a special case of YOLO, where the entire image is just one cell. Below, we will discuss the loss for each part step by step.

Bounding Box Loss

To streamline the mathematical formulation, we focus exclusively on the loss calculation for each individual cell. Subsequently, we can aggregate these losses across all cells to compute the total loss.

The first four outputs of the model are $[\hat x, \hat y, \hat w, \hat h]$, and the first four labels are $[x, y, w, h]$. The bounding box loss for a sample $i$ across all its $S_W \times S_H$ cells (the cell indices are omitted for simplicity) is calculated via a squared error function, as detailed below:

$$ J^{(i)}_{\text{bbox}} = \mathbb{1}^{(i)}_{\text{obj}} \left[ (x^{(i)} - \hat{x}^{(i)})^2 + (y^{(i)} - \hat{y}^{(i)})^2 + (\sqrt{w^{(i)}} - \sqrt{\hat{w}^{(i)}})^2 + (\sqrt{h^{(i)}} - \sqrt{\hat{h}^{(i)}})^2 \right] $$
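As a rough sketch of this term in PyTorch (assuming `pred` and `target` are tensors of shape `(batch, 85, S, S)` laid out like the target-building example earlier; this is only an illustration of the formula above, not the loss of any specific YOLO release):

```python
import torch

def bbox_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Squared-error box loss, counted only in cells that contain an object."""
    obj_mask = target[:, 4]                      # 1 where the cell owns an object, else 0
    xy_err = (target[:, 0:2] - pred[:, 0:2]).pow(2).sum(dim=1)
    wh_err = (target[:, 2:4].clamp(min=0).sqrt()
              - pred[:, 2:4].clamp(min=eps).sqrt()).pow(2).sum(dim=1)
    return (obj_mask * (xy_err + wh_err)).sum()
```

The square roots follow the formula above and down-weight errors on large boxes relative to small ones; the `clamp` calls are only guards against negative raw predictions in this simplified setting.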