Single object localization focuses on identifying the precise location of one object in an image by producing a bounding box that specifies its position and size. This task, which is simpler than multi-object detection, serves as a stepping stone toward more complex applications. It often works alongside classification, taking an image as input and generating two outputs: a bounding box for localization and a label for classification.

Bounding Box: A rectangular box that encompasses the object of interest. It is defined by coordinates specifying its position and dimensions (e.g., yxhw).

Class: As in the image classification problem, this is represented by a vector in which each entry gives the probability of the corresponding class.
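To make these two outputs concrete, here is a minimal sketch of a training target for one image, assuming normalized yxhw coordinates and a hypothetical three-class problem (all values are purely illustrative):

```python
import numpy as np

# Bounding box in yxhw format (center y, center x, height, width),
# normalized to [0, 1] relative to the image size (illustrative values).
bbox_yxhw = np.array([0.55, 0.40, 0.30, 0.20])

# One-hot class vector for a hypothetical three-class problem,
# e.g. (cat, dog, bird); here the object is a "dog".
class_vec = np.array([0.0, 1.0, 0.0])

# Combined target: 4 bounding box values followed by the class vector.
target = np.concatenate([bbox_yxhw, class_vec])
print(target.shape)  # (7,)
```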


Single Object Localization

For single-object localization, we typically do not train a model from scratch but instead start from a pre-trained image classification model. The assumption is that a well-trained classification model can already identify the Region of Interest (ROI) in an image. By fine-tuning a pre-trained model, we can achieve faster and more accurate object localization, even with limited training data. In contrast, training from scratch with insufficient data often leads to overfitting and poor performance in real-world applications.
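As a rough illustration of this transfer-learning setup, the sketch below loads an ImageNet-pretrained ResNet50 from torchvision (the `weights` argument assumes torchvision >= 0.13) and freezes its parameters so that only a newly attached localization head would be trained:

```python
import torchvision

# Load an ImageNet-pretrained ResNet50 to serve as the backbone.
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.DEFAULT
)

# Freeze the pretrained weights so that, with limited data, only the
# new localization head (added later) is updated during fine-tuning.
for param in backbone.parameters():
    param.requires_grad = False
```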

Bounding Box

In object detection tasks, a bounding box (i.e., bbox or box) is a rectangular frame used to define the location of an object within an image. This box is defined by coordinates that represent the object’s extent in the image plane. Bounding boxes are essential tools in computer vision as they allow models to understand where objects are located within a scene, enabling further tasks like object classification and tracking.

Bounding Box Formats


In computer vision, bounding boxes are typically represented using common formats such as tlbr (top-left-bottom-right) or yxhw (center y-coordinate, center x-coordinate, height, width). The tlbr format specifies the coordinates of the top-left corner and the bottom-right corner of the rectangle, making it straightforward to delineate the box on the image. Alternatively, the yxhw format describes the bounding box by its center's coordinates along with its height and width, which can be particularly useful for transformations like scaling and rotating.
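The conversion between the two formats is a simple linear mapping; the sketch below shows hypothetical helper functions operating on PyTorch tensors whose last dimension holds the four box values:

```python
import torch

def tlbr_to_yxhw(boxes: torch.Tensor) -> torch.Tensor:
    """Convert (top, left, bottom, right) boxes to (center_y, center_x, height, width)."""
    top, left, bottom, right = boxes.unbind(-1)
    h = bottom - top
    w = right - left
    return torch.stack([top + h / 2, left + w / 2, h, w], dim=-1)

def yxhw_to_tlbr(boxes: torch.Tensor) -> torch.Tensor:
    """Convert (center_y, center_x, height, width) boxes to (top, left, bottom, right)."""
    cy, cx, h, w = boxes.unbind(-1)
    return torch.stack([cy - h / 2, cx - w / 2, cy + h / 2, cx + w / 2], dim=-1)

# Round-trip check on a single box.
box_tlbr = torch.tensor([[10.0, 20.0, 50.0, 100.0]])
box_yxhw = tlbr_to_yxhw(box_tlbr)       # tensor([[30., 60., 40., 80.]])
print(yxhw_to_tlbr(box_yxhw))           # recovers the original tlbr box
```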

Localization Architecture

After selecting a pre-trained model, such as ResNet50, we need to fine-tune the output layer. Typically, the final layers of the pre-trained model, which are designed for classification, are either replaced or adjusted to predict bounding box coordinates.

For a single object, this usually involves replacing the last output layer (of size n_class) with a new one that outputs 5 values: 4 of them represent the bounding box (e.g., yxhw) and 1 represents the confidence score. For a combined localization and classification model, we need n_class + 5 outputs.
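A minimal sketch of this head replacement, assuming a torchvision ResNet50 and a hypothetical `n_class = 3`:

```python
import torch.nn as nn
import torchvision

n_class = 3  # hypothetical number of classes

# Start from a pretrained ResNet50 and swap its final classification layer
# for a head with n_class + 5 outputs:
# 4 bounding box values (yxhw) + 1 confidence score + n_class class scores.
model = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.DEFAULT
)
in_features = model.fc.in_features            # 2048 for ResNet50
model.fc = nn.Linear(in_features, n_class + 5)

# For pure localization without classification, 5 outputs are enough:
# model.fc = nn.Linear(in_features, 5)
```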


Localization Loss Functions

For localization, we need to evaluate whether the model's predicted bounding box is accurate. Additional loss functions are required to guide the training process.

Bounding Box Regression Loss: Bounding box accuracy can be evaluated using a regression loss, such as MSE, which measures the discrepancy between the predicted and ground-truth center point $(x, y)$, height $h$, and width $w$.

The MSE loss for the predicted and ground truth bounding boxes is defined as:

$$ J^{(i)}_{\text{bbox}} = ((x^{(i)} - \hat{x}^{(i)})^2 + (y^{(i)} - \hat{y}^{(i)})^2 + (w^{(i)} - \hat{w}^{(i)})^2 + (h^{(i)} - \hat{h}^{(i)})^2) $$

where the superscript $(i)$ indexes training examples, and $\hat{x}^{(i)}$, $\hat{y}^{(i)}$, $\hat{w}^{(i)}$, and $\hat{h}^{(i)}$ are the predicted values; averaging $J^{(i)}_{\text{bbox}}$ over all $N$ training examples gives the batch loss.
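The sketch below implements this per-example loss and averages it over a batch, assuming predicted and ground-truth boxes are given as `(N, 4)` tensors in yxhw order:

```python
import torch

def bbox_regression_loss(pred_bbox: torch.Tensor, true_bbox: torch.Tensor) -> torch.Tensor:
    """MSE-style bounding box loss: sum of squared errors over (y, x, h, w),
    averaged over the N examples in the batch."""
    per_example = ((pred_bbox - true_bbox) ** 2).sum(dim=-1)  # J_bbox^(i)
    return per_example.mean()

pred = torch.tensor([[0.50, 0.40, 0.30, 0.20]])
true = torch.tensor([[0.55, 0.40, 0.30, 0.25]])
print(bbox_regression_loss(pred, true))  # tensor(0.0050)
```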

Presence Check: Localization usually comes with an additional object presence score (a.k.a. confidence score). This score is necessary because it helps the model distinguish between pure background and actual objects, ensuring accurate detection by penalizing false positives and false negatives.

Object presence is usually trained with binary cross-entropy loss (the loss function used in logistic regression), which encourages the model to make a confident, correct prediction about whether an object is present.

$$ J^{(i)}_{\text{score}} = -\left[c^{(i)} \log(\hat{c}^{(i)}) + (1 - c^{(i)}) \log(1 - \hat{c}^{(i)})\right] $$

where $c^{(i)}$ is the ground truth label indicating the presence of an object, and $\hat{c}^{(i)}$ is the predicted probability of presence.
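A short sketch of this presence loss using PyTorch's built-in binary cross-entropy; the way the two terms are combined at the end (and the weighting factor) is an assumption for illustration, not something specified above:

```python
import torch
import torch.nn.functional as F

# Ground-truth presence indicators c and predicted probabilities c_hat
# (after a sigmoid); the values are illustrative.
c_true = torch.tensor([1.0, 0.0, 1.0])   # object / background / object
c_hat = torch.tensor([0.9, 0.2, 0.6])

score_loss = F.binary_cross_entropy(c_hat, c_true)  # mean of J_score^(i)
print(score_loss)

# A hypothetical combined objective, weighting the bbox term by lambda_bbox:
# total_loss = score_loss + lambda_bbox * bbox_regression_loss(pred_bbox, true_bbox)
```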