YOLO (You Only Look Once) was introduced by Joseph Redmon et al. in 2016 as a groundbreaking shift in the paradigm of object detection. Unlike its predecessors, which adopted a two-stage approach, YOLO revolutionized the field with its one-stage detection system, offering a significant boost in speed without substantially sacrificing accuracy.

Traditional object detection methods like R-CNN relied on multi-stage, computationally intensive pipelines that were too slow for real-time applications. YOLO was designed to overcome these limitations by treating detection as a single regression problem within a unified convolutional neural network. This approach enables extremely fast and accurate processing by predicting multiple bounding boxes and class probabilities simultaneously across a grid. By streamlining the detection process into an end-to-end trainable model, YOLO enhances both efficiency and scalability, making it well-suited for real-time applications and addressing the inefficiencies of previous multi-stage methods.

YOLO Inference

In the YOLO model, the inference process starts with an input image $\mathbf{x}$ of dimensions $H \times W$. This image is processed by a neural network, which produces feature maps $\hat{\mathbf{y}}$ of dimensions $S_H \times S_W$. Typically, there is a fixed downsampling ratio between the input dimensions and the feature-map dimensions, such as $32$. For example, if the input size is $320 \times 320$, the output feature maps have dimensions of $10 \times 10$. The process is illustrated in the following figure.

[Figure: the input image is divided into a $10 \times 10$ grid, with each grid cell corresponding to one output pixel of the feature map.]
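
As a minimal sketch of this downsampling relationship (assuming a fixed stride of $32$, as in the example above), the output grid size can be computed directly from the input resolution:

```python
# Minimal sketch: relate the input resolution to YOLO's output grid size.
# Assumes a fixed downsampling stride of 32, as in the example above.

def output_grid_size(height: int, width: int, stride: int = 32) -> tuple[int, int]:
    """Return (S_H, S_W) for an input image of size (H, W)."""
    assert height % stride == 0 and width % stride == 0, "input must be divisible by the stride"
    return height // stride, width // stride

print(output_grid_size(320, 320))  # -> (10, 10)
```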

This step can be viewed as dividing the original image into $10 \times 10$ grid cells, where each cell measures $32 \times 32$ pixels. Each cell corresponds to an output pixel in the feature map, and that pixel is tasked with detecting one or more objects whose centers are located within its grid cell, such as the dog in the figure above.
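
To make the cell assignment concrete, the following sketch maps an object's center point to the grid cell responsible for it (the $32$-pixel cell size follows the example above; the center coordinates are hypothetical values used only for illustration):

```python
# Sketch: find the grid cell responsible for an object, i.e., the cell
# whose area contains the object's center point. The 32-pixel cell size
# matches the 320x320 / 10x10 example; the coordinates are hypothetical.

def responsible_cell(center_x: float, center_y: float, cell_size: int = 32) -> tuple[int, int]:
    """Return (row, col) of the grid cell containing the object's center."""
    return int(center_y // cell_size), int(center_x // cell_size)

# An object centered at pixel (150, 210) falls into grid cell (row 6, col 4).
print(responsible_cell(150.0, 210.0))  # -> (6, 4)
```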

YOLO's Output Feature Maps

Each output pixel (i.e., each grid cell's output) contains an 85-dimensional feature vector. The first four dimensions of this vector specify the coordinates and size of the bounding box ($b_x, b_y, b_w, b_h$) that potentially encloses an object. Because neural networks are usually most effective when outputting values in a limited range (roughly $-1$ to $1$), these four values do not encode pixel coordinates directly; instead, they parameterize a box transformation function. This function transforms a predefined box, known as an anchor box, adjusting it into a bounding box that accurately matches the position and size of the target object.

The fifth dimension $p_{\text{obj}}$ indicates the likelihood of an object being present within this box (objectness score), and the remaining 80 dimensions provide the probabilities that the detected object belongs to each of the different classes.
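The layout of this vector can be made explicit with a short sketch that splits one cell's output into its components (the ordering $[b_x, b_y, b_w, b_h, p_{\text{obj}}, \text{class scores}]$ follows the description above; the vector here is random placeholder data):

```python
import numpy as np

# Sketch: split one grid cell's 85-dimensional output into its parts.
# Assumed layout from the text: [b_x, b_y, b_w, b_h, p_obj, 80 class scores].
cell_output = np.random.randn(85)      # placeholder for one cell's raw prediction

box_params   = cell_output[0:4]        # b_x, b_y, b_w, b_h (raw, pre-transformation)
objectness   = cell_output[4]          # p_obj
class_scores = cell_output[5:]         # one score per class (80 for a COCO-style setup)

print(box_params.shape, class_scores.shape)  # (4,) (80,)
```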

Cells in the grid are tasked with identifying objects whose centers fall within their boundaries and distinguishing background areas where no objects are present. When a cell detects an object, it outputs accurate bounding box coordinates $b_x, b_y, b_w, b_h$, a high objectness score $p_{\text{obj}}$, and high probability for the relevant class, setting the probabilities for other classes to zero. Conversely, cells that detect only background should output a low objectness score $p_{\text{obj}}$. Other values in these cells are generally less critical, as their primary function is to confirm the absence of objects within their boundaries.
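As an illustration of what such an "ideal" output would look like, the sketch below builds the target vector for a cell that contains an object and for a background cell (the box values and the class index are hypothetical; index 16 corresponds to "dog" in a COCO-style 80-class list):

```python
import numpy as np

NUM_CLASSES = 80

def object_target(box: np.ndarray, class_id: int) -> np.ndarray:
    """Ideal 85-d output for a cell whose area contains an object's center."""
    target = np.zeros(5 + NUM_CLASSES)
    target[0:4] = box            # accurate b_x, b_y, b_w, b_h
    target[4] = 1.0              # high objectness score
    target[5 + class_id] = 1.0   # probability 1 for the true class, 0 for the rest
    return target

def background_target() -> np.ndarray:
    """Ideal output for a background cell: only the (low) objectness score matters."""
    return np.zeros(5 + NUM_CLASSES)

dog_cell   = object_target(np.array([0.4, 0.6, 1.2, -0.3]), class_id=16)  # hypothetical values
empty_cell = background_target()
```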

Bounding Box Transformation and Anchor Box

In YOLO, anchor boxes are preset boxes with specified dimensions used across every grid cell. Each grid cell typically employs between one and five anchor boxes. For simplicity, we consider just one anchor box per cell.

Note: The dimensions of anchor boxes are predetermined and consistent across all grid cells. Each anchor box is centered at the grid cell's midpoint, with its height and width defined based on prior knowledge of typical object sizes in the dataset.
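
A sketch of how such anchors could be laid out, assuming a single anchor per cell centered at the cell's midpoint (the $90 \times 90$ pixel anchor size is purely illustrative, not a value from the paper):

```python
import numpy as np

# Sketch: one anchor box per grid cell, centered at the cell's midpoint.
# Grid and cell size follow the 10x10 / 32-pixel example; the anchor
# width/height are illustrative placeholders.

def build_anchors(grid_h: int = 10, grid_w: int = 10, cell_size: int = 32,
                  anchor_w: float = 90.0, anchor_h: float = 90.0) -> np.ndarray:
    """Return an array of shape (grid_h, grid_w, 4) holding (cx, cy, w, h) per cell."""
    anchors = np.zeros((grid_h, grid_w, 4))
    for row in range(grid_h):
        for col in range(grid_w):
            cx = (col + 0.5) * cell_size   # anchor center x = cell midpoint x
            cy = (row + 0.5) * cell_size   # anchor center y = cell midpoint y
            anchors[row, col] = (cx, cy, anchor_w, anchor_h)
    return anchors

anchors = build_anchors()
print(anchors[0, 0])  # anchor of the top-left cell: [16. 16. 90. 90.]
```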

To simplify the visual representation in the figure below, we depict just one dark purple anchor box in each red grid cell.

[Figure: one dark purple anchor box centered in each red grid cell.]

Anchor boxes streamline the initial prediction process by providing a reference framework that the network adjusts to fit the actual observed objects. These adjustments are carried out by a transformation function that uses the parameters $[b_x, b_y, b_h, b_w]$ from the 85-dimensional output vector of the neural network. These parameters reposition and resize the anchor box according to the following model:

$$ \text{prediction box}=f(\text{anchor box}, [b_x, b_y, b_h, b_w]) $$

This function shifts the anchor boxes horizontally and vertically with $b_x$ and $b_y$, aligning them to the detected object's central position. The parameters $b_h$ and $b_w$ are used to scale the anchor box in both dimensions to encapsulate the size of the detected object more accurately.

Further Discussion: the transformation from neural network outputs to bounding box predictions includes four critical computations: the center coordinates $(\hat{x}, \hat{y})$ of the bounding box are determined as $\hat{x} = \sigma(b_x) + c_x$ and $\hat{y} = \sigma(b_y) + c_y$, where $\sigma$ represents the sigmoid function to constrain the outputs to a unit scale, and $(c_x, c_y)$ denote the coordinates of the top-left corner of the grid cell. The width and height $(\hat{w}, \hat{h})$ are dynamically adjusted from the anchor box dimensions $(p_w, p_h)$ through the formulas $\hat{w} = p_w \cdot e^{b_w}$ and $\hat{h} = p_h \cdot e^{b_h}$.
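
These formulas translate directly into a small decoding function, sketched below with coordinates expressed in grid-cell units (the raw outputs and the anchor size in the example call are hypothetical):

```python
import numpy as np

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(b_x: float, b_y: float, b_w: float, b_h: float,
               c_x: float, c_y: float,
               p_w: float, p_h: float) -> tuple[float, float, float, float]:
    """Decode raw network outputs into a bounding box using the formulas above.

    (c_x, c_y) is the top-left corner of the responsible grid cell and
    (p_w, p_h) are the anchor box dimensions, all in grid-cell units.
    """
    x_hat = sigmoid(b_x) + c_x    # center x: at most one cell away from the corner
    y_hat = sigmoid(b_y) + c_y    # center y
    w_hat = p_w * np.exp(b_w)     # anchor width scaled by an exponential factor
    h_hat = p_h * np.exp(b_h)     # anchor height scaled likewise
    return x_hat, y_hat, w_hat, h_hat

# Hypothetical raw outputs for the cell with top-left corner (c_x, c_y) = (4, 6)
# and an anchor of size 2.8 x 2.8 grid cells:
print(decode_box(0.2, -0.5, 0.1, -0.3, c_x=4, c_y=6, p_w=2.8, p_h=2.8))
```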

Multiple Non-Overlapping Objects in a Single Image