R-CNN, introduced by Ross Girshick et al. in 2014, marked a significant milestone in the evolution of object detection frameworks, setting the stage for its successors, Fast R-CNN and Faster R-CNN. These models epitomize the two-stage approach to object detection.

Before R-CNN, object detection depended on hand-crafted features and exhaustive sliding window searches, which were both computationally heavy and less accurate. R-CNN revolutionized the field by introducing Convolutional Neural Networks, allowing automatic learning of rich, hierarchical features from large datasets. By using region proposals to focus on likely object areas, R-CNN enhanced both efficiency and precision in detection. This innovative integration of deep learning with selective region processing overcame the limitations of traditional methods, paving the way for more advanced and scalable object detection frameworks.

R-CNN Inference

During inference, R-CNN first generates region proposals—areas of the image that are likely to contain objects—using a rule-based algorithm. Each proposed region is then processed independently by a neural network to extract features, and a separate machine learning model uses these features to determine whether the region contains an object and, if so, which class it belongs to.

Region Proposal

Region proposal is the initial stage in the R-CNN architecture, where a rule-based algorithm called selective search is applied to an image to propose around 2000 candidate boxes that are likely to contain objects. These proposed boxes identify potential object locations but do not determine the object's class or guarantee exact positional accuracy. Fast R-CNN still relies on selective search but shares feature computation across proposals; Faster R-CNN goes further and replaces selective search with a learned Region Proposal Network, improving both efficiency and accuracy.

The output from selective search is a set of candidate boxes, each described by a 4-tuple $[x, y, w, h]$: $x$ and $y$ give the box's starting point, usually the top-left corner, while $w$ and $h$ are its width and height. The boxes come back in no particular order—they are not sorted by size, position, or likelihood of containing an object—so further processing is needed to organize them for tasks like matching against ground truth labels.
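As a minimal illustration of this box format (the coordinate values below are made up), the $[x, y, w, h]$ tuples can be converted to corner form and ordered, for example by area, before any matching step:

```python
import numpy as np

# Hypothetical selective-search output: N candidate boxes as [x, y, w, h],
# returned in no particular order (toy values for illustration).
boxes_xywh = np.array([
    [48, 20, 100, 150],
    [10, 10,  30,  40],
    [200, 80,  60,  60],
])

def xywh_to_xyxy(boxes):
    """Convert [x, y, w, h] boxes to corner form [x1, y1, x2, y2]."""
    x, y, w, h = boxes.T
    return np.stack([x, y, x + w, y + h], axis=1)

boxes_xyxy = xywh_to_xyxy(boxes_xywh)
areas = boxes_xywh[:, 2] * boxes_xywh[:, 3]
order = np.argsort(-areas)  # one possible ordering: largest box first
print(boxes_xyxy[order])
```

Corner form is convenient because intersection areas (needed later for IoU) reduce to simple min/max operations on the coordinates.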

R-CNN Object Detection Pipeline

R-CNN is not an end-to-end neural network but a hybrid of different methods. However, most of its rule-based algorithms and traditional machine learning components can be replaced with neural networks for better performance.

  1. A candidate box is selected from the image, producing a cropped section $x_{\text{crop}}$, which contains a part of the image where an object might be present. This cropped image is resized to a standard size, typically 224x224 in R-CNN, to ensure consistency in input size for CNNs.
  2. After resizing, each cropped image $x_{\text{crop}}$ is fed into a pretrained neural network model, such as VGG16 or ResNet50, for feature extraction. The model acts as an effective feature detector, producing a feature map as its output. This feature map is then typically flattened into a vector to facilitate the following classification and regression tasks.
  3. Once the features are extracted, they are used in two parallel processes:
    1. An SVM classifier takes the features to predict the class of the object. The output is the class label, which signifies the type of object detected within the candidate box.
    2. A linear regressor is applied to the extracted features to refine the coordinates of the bounding box. This regression step adjusts the position of the candidate box, resulting in the output of corrected bounding box positions.
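The three steps above can be sketched end to end. This is a toy stand-in, not the real pipeline: the "CNN" is replaced by simple average pooling, and the SVM and regressor weights are random rather than trained, so only the data flow (crop → resize → features → class score + box correction) is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_nearest(crop, size=(224, 224)):
    """Nearest-neighbor warp of an HxWx3 crop to a fixed input size
    (real pipelines typically use bilinear resizing)."""
    h, w = crop.shape[:2]
    rows = np.arange(size[0]) * h // size[0]
    cols = np.arange(size[1]) * w // size[1]
    return crop[rows[:, None], cols[None, :]]

def extract_features(x):
    """Stand-in for a pretrained CNN backbone (e.g. VGG16): average-pool
    the 224x224x3 input into a 7x7x3 grid and flatten to a 147-d vector."""
    pooled = x.reshape(7, 32, 7, 32, 3).mean(axis=(1, 3))
    return pooled.reshape(-1) / 255.0

image = rng.integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
x, y, w, h = 100, 50, 80, 120            # one candidate box [x, y, w, h]
crop = image[y:y + h, x:x + w]           # step 1: crop the proposal
inp = resize_nearest(crop)               # step 1: warp to 224x224
feat = extract_features(inp)             # step 2: feature extraction

# Step 3a: per-class linear scores (an SVM would supply trained weights;
# here they are random). 20 object classes + 1 background class.
svm_W = rng.standard_normal((21, feat.size)) * 0.01
pred_class = int(np.argmax(svm_W @ feat))

# Step 3b: linear regressor predicting corrections (dx, dy, dw, dh),
# parameterized relative to the box center and log-scale width/height.
reg_W = rng.standard_normal((4, feat.size)) * 0.001
dx, dy, dw, dh = reg_W @ feat
cx, cy = x + w / 2.0 + dx * w, y + h / 2.0 + dy * h
rw, rh = w * np.exp(dw), h * np.exp(dh)
refined = [cx - rw / 2, cy - rh / 2, rw, rh]
```

In the real system this loop runs once per candidate box, which is exactly why R-CNN is slow and why Fast R-CNN moved feature extraction to run once per image.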

Note: Every candidate box is processed by the same feature extraction model, SVM, and regressor. Sharing these models guarantees consistent processing across inputs, which is crucial for the accuracy and reliability of the detector, and it lets the system generalize from one image to another across varied datasets.

Loss Challenge: Aligning 2000 Candidate Boxes with Fewer Ground Truth Boxes

We must now address the labeling issue mentioned earlier. Selective search generates around 2000 candidate boxes per image, but only $Q$ ground truth boxes are available. The challenge is to associate each candidate box with the correct label.

R-CNN resolves this imbalance with the Intersection over Union (IoU) metric. Each candidate box is compared against the ground truth boxes and assigned either the label of the ground truth box it overlaps best, or a background label when its overlap with every ground truth box falls below a threshold. This ensures every candidate box receives a label, handling the disproportion between the thousands of proposals and the handful of ground truth annotations.
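The matching scheme can be sketched as follows. The threshold value and the use of -1 as the background label are illustrative choices; the original paper tunes different thresholds for different training stages.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes in [x, y, w, h] form."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign_labels(candidates, gt_boxes, gt_classes, pos_thresh=0.5):
    """Give each candidate the class of its best-overlapping ground truth
    box if that IoU >= pos_thresh, otherwise the background label (-1)."""
    labels = []
    for cand in candidates:
        ious = [iou(cand, gt) for gt in gt_boxes]
        best = int(np.argmax(ious))
        labels.append(gt_classes[best] if ious[best] >= pos_thresh else -1)
    return labels

candidates = [[10, 10, 40, 40], [100, 100, 50, 50], [12, 8, 40, 44]]
gt_boxes   = [[10, 10, 40, 40]]       # a single ground truth box
gt_classes = ["dog"]
print(assign_labels(candidates, gt_boxes, gt_classes))
# → ['dog', -1, 'dog']
```

The first candidate coincides with the ground truth (IoU 1.0), the second does not overlap it at all, and the third overlaps heavily enough (IoU ≈ 0.83) to inherit the "dog" label.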
