Loss Function Practice

Let's consider a practice example using a 5x5 grid to detect a cat and a dog within an image. In this scenario:

The image is divided into a 5x5 grid, resulting in 25 cells.
There is one dog in the right part of the image; its center point falls within a single grid cell.
There is one cat in the left part of the image; its center point also falls within a single grid cell.

Step 1: Grid Cell Assignment

Dog: Let's say the center of the dog falls into cell (3, 4) (using 1-based indexing). This cell is responsible for detecting the dog and predicting its bounding box and class.
Cat: The center of the cat falls into cell (4, 2). This cell is responsible for detecting the cat and predicting its bounding box and class.
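The assignment rule can be sketched in a few lines. This is a minimal illustration, assuming that object centers are given as fractions of the image width and height and that a cell is written as (row, column). The function name `responsible_cell` and the example center coordinates are hypothetical, chosen only so that they reproduce the cells used in this example.

```python
def responsible_cell(cx, cy, grid_size=5):
    """Return the 1-based (row, col) of the grid cell that contains a center point.

    cx, cy are assumed to be normalized to [0, 1]: cx is the fraction of the
    image width (left to right), cy the fraction of the height (top to bottom).
    """
    # Scale to grid units and truncate; clamp so that cx == 1.0 stays in the last cell.
    col = min(int(cx * grid_size), grid_size - 1) + 1
    row = min(int(cy * grid_size), grid_size - 1) + 1
    return row, col


# Hypothetical centers: dog on the right, cat on the lower left.
dog_center = (0.70, 0.50)
cat_center = (0.30, 0.70)

print(responsible_cell(*dog_center))  # (3, 4)
print(responsible_cell(*cat_center))  # (4, 2)
```

Only the cell that contains the center is responsible for the object, even if the object's bounding box extends over several neighboring cells.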

Step 2: Expected Outputs from the Neural Network and Loss Calculation

For Dog Cell (3, 4), we expect 1) a high objectness score, 2) coordinates for the bounding box that accurately enclose the dog, and 3) a high probability for the "dog" class.

The loss for this cell is denoted $\mathcal{L}(j)$, where $j=1$ indexes the first labeled object (the dog) together with its bounding box. It comprises:
- Box Regression Loss: Calculate the difference between the predicted and actual bounding box coordinates and sizes. Assume the ground-truth bounding box for the dog is $(x, y, w, h)$ and the prediction is $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, computed from the model's output $[b_x, b_y, b_h, b_w]$.
- Objectness Loss: Since the cell contains an object, the objectness score should be close to 1. The loss is calculated based on the difference between the predicted objectness score ($\hat{C}_{3,4}$) and the ground truth (1).
- Classification Loss: If "dog" is class 1 and "cat" is class 2, then $p_{3,4}(\text{dog}) = 1$ and $p_{3,4}(\text{cat}) = 0$. The loss is calculated based on the difference from the predictions $\hat{p}_{3,4}(\text{dog})$ and $\hat{p}_{3,4}(\text{cat})$.

For Cat Cell (4, 2), we expect 1) a high objectness score, 2) coordinates for the bounding box that accurately enclose the cat, and 3) a high probability for the "cat" class. Its loss has the same three components as the dog cell's, evaluated against the cat's ground-truth box and with $p_{4,2}(\text{cat}) = 1$ and $p_{4,2}(\text{dog}) = 0$.

For All Other Cells, we expect only low objectness scores, indicating that no object center lies within those cells. Accurate bounding box coordinates and class probabilities are not required.
- No-Object Loss: These cells should have low objectness scores. The loss is the squared error $(C_i - \hat{C}_i)^2$ with ground truth $C_i = 0$, scaled by $\lambda_{\text{noobj}}$ for cells that do not contain an object center.
- Box Regression and Classification Loss: Not applicable, since $\mathbb{1}_{i,j}^{\text{obj}} = 0$ for these cells. That is, whatever the outputs $[b_x, b_y, b_h, b_w]$ and the class predictions are, their contribution to the loss is always 0.
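The components above can be put together in a small NumPy sketch. It rests on several assumptions that are not stated in this section: one predicted box per cell, a per-cell layout of `[x, y, w, h, C, p_dog, p_cat]`, plain squared errors for every term (some formulations compare square roots of the width and height instead), and weights `LAMBDA_COORD = 5.0` and `LAMBDA_NOOBJ = 0.5`. The function name `grid_loss` and the ground-truth box values are hypothetical.

```python
import numpy as np

S = 5                # grid size
LAMBDA_COORD = 5.0   # assumed weight for the box regression term
LAMBDA_NOOBJ = 0.5   # assumed weight for the no-object term


def grid_loss(pred, target):
    """Sum-squared loss over an (S, S, 7) grid laid out as [x, y, w, h, C, p_dog, p_cat]."""
    obj = target[..., 4] == 1.0   # indicator: cell contains an object center
    noobj = ~obj

    parts = {
        # Box and class terms are only evaluated where the indicator is 1.
        "box": np.sum((pred[obj][:, :4] - target[obj][:, :4]) ** 2),
        "objectness": np.sum((pred[obj][:, 4] - target[obj][:, 4]) ** 2),
        "class": np.sum((pred[obj][:, 5:] - target[obj][:, 5:]) ** 2),
        # Empty cells are only penalized on their objectness score.
        "no_object": np.sum((pred[noobj][:, 4] - target[noobj][:, 4]) ** 2),
    }
    total = (LAMBDA_COORD * parts["box"] + parts["objectness"]
             + parts["class"] + LAMBDA_NOOBJ * parts["no_object"])
    return total, parts


# Ground truth: dog in cell (3, 4), cat in cell (4, 2); 0-based indices below.
target = np.zeros((S, S, 7))
target[2, 3] = [0.5, 0.5, 0.4, 0.6, 1.0, 1.0, 0.0]   # hypothetical dog box, class "dog"
target[3, 1] = [0.5, 0.5, 0.3, 0.3, 1.0, 0.0, 1.0]   # hypothetical cat box, class "cat"

# Start from a perfect prediction, then disturb an empty cell.
pred = target.copy()
pred[0, 0, :4] = 9.0   # arbitrary box in an empty cell
pred[0, 0, 5:] = 1.0   # arbitrary class scores in an empty cell
print(grid_loss(pred, target)[0])   # 0.0: box and class outputs of empty cells are ignored

pred[0, 0, 4] = 0.2    # a false-positive objectness score
print(grid_loss(pred, target)[0])   # about 0.02 = LAMBDA_NOOBJ * 0.2**2
```

Keeping the per-component values in `parts` makes it easy to see which term dominates the total during training.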

Guiding Training using the Loss:

Positive Predictions: Ensure cells (3, 4) and (4, 2) accurately predict their respective objects' presence, sizes, and classes.
Negative Predictions: Ensure all other cells correctly predict the absence of objects, minimizing false positives.
Adjustments: Backpropagate the loss and update the model's parameters. A component with a high loss (e.g., box regression) generally contributes larger gradients, so the following iterations correct those predictions most strongly.
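To make the adjustment step concrete, here is a deliberately tiny sketch of gradient descent on a single value. It assumes that the objectness score of the dog cell is produced as the sigmoid of one logit `z` and that the loss is the squared error against the ground truth 1. Both are illustrative assumptions: in a real network, backpropagation computes such gradients for every weight, not for one hypothetical logit.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def objectness_loss(z, target=1.0):
    """Squared error between the target and the predicted objectness sigmoid(z)."""
    return (target - sigmoid(z)) ** 2


def objectness_grad(z, target=1.0):
    """Derivative of objectness_loss with respect to z, via the chain rule."""
    c = sigmoid(z)
    return -2.0 * (target - c) * c * (1.0 - c)


z = 0.0               # hypothetical starting logit, i.e. predicted objectness 0.5
learning_rate = 1.0   # assumed step size
history = []
for _ in range(50):
    history.append(objectness_loss(z))
    z -= learning_rate * objectness_grad(z)   # gradient descent update

print(history[0])                  # 0.25
print(history[-1] < history[0])    # True: the loss has decreased
```

Each update moves the predicted objectness closer to 1, which is the behavior the loss is meant to encourage for cells (3, 4) and (4, 2).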