R-CNN, introduced by Ross Girshick et al. in 2014, marked a significant milestone in the evolution of object detection frameworks, setting the stage for its successors, Fast R-CNN and Faster R-CNN. These models epitomize the two-stage approach to object detection.
Before R-CNN, object detection depended on hand-crafted features and exhaustive sliding window searches, which were both computationally heavy and less accurate. R-CNN revolutionized the field by introducing Convolutional Neural Networks, allowing automatic learning of rich, hierarchical features from large datasets. By using region proposals to focus on likely object areas, R-CNN enhanced both efficiency and precision in detection. This innovative integration of deep learning with selective region processing overcame the limitations of traditional methods, paving the way for more advanced and scalable object detection frameworks.
At inference time, R-CNN starts by generating region proposals, specific areas of the image that are likely to contain objects, using a rule-based algorithm. Each proposed region is then independently processed by a convolutional neural network to extract features. Finally, a separate machine learning model uses these features to determine whether each region contains an object and, if so, which class it belongs to.
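The three-stage inference flow above can be sketched in Python. This is a toy illustration, not the real implementation: `propose_regions`, `extract_features`, and `classify_region` are hypothetical stand-ins for selective search, the CNN, and the downstream classifier.

```python
# Toy sketch of R-CNN inference. The helpers below are hypothetical
# placeholders for selective search, the CNN feature extractor, and the
# per-region classifier; real R-CNN uses learned models for the last two.

def propose_regions(image):
    # Stand-in for selective search: return candidate boxes as [x, y, w, h].
    h, w = len(image), len(image[0])
    return [[0, 0, w // 2, h // 2], [w // 4, h // 4, w // 2, h // 2]]

def extract_features(image, box):
    # Stand-in for the CNN: crop the region and summarize it as one number
    # (mean intensity). A real network would output a feature vector.
    x, y, w, h = box
    crop = [row[x:x + w] for row in image[y:y + h]]
    return sum(sum(row) for row in crop) / max(1, w * h)

def classify_region(features):
    # Stand-in for the classifier: threshold the toy feature.
    return "object" if features > 0.5 else "background"

def rcnn_inference(image):
    # Each proposal is processed independently, as in the original R-CNN.
    detections = []
    for box in propose_regions(image):
        features = extract_features(image, box)
        label = classify_region(features)
        if label != "background":
            detections.append((box, label))
    return detections
```

The key structural point the sketch captures is that every proposal is cropped and classified independently, which is precisely the redundancy that Fast R-CNN later removes by sharing convolutional features across proposals.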
Region proposal is the initial stage of the R-CNN architecture: a rule-based algorithm called selective search is applied to the image to propose around 2000 candidate boxes that are likely to contain objects. These boxes identify potential object locations but do not determine the object's class or guarantee exact positional accuracy. Faster R-CNN later replaces selective search with a neural network, the Region Proposal Network, improving both efficiency and accuracy (Fast R-CNN still relies on selective search for proposals but streamlines the feature extraction that follows).
The output of selective search is a set of candidate boxes, each described by a 4-tuple $[x, y, w, h]$ giving the box's position and size: $x$ and $y$ are the coordinates of the starting point (usually the top-left corner), while $w$ and $h$ are its width and height. The boxes come in no particular order, meaning they are not sorted by size, position, or likelihood of containing a significant object, and they require further processing before tasks like matching with ground truth labels.
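A brief sketch of the box format may help. The two helpers below are illustrative, not part of R-CNN itself: one converts the $[x, y, w, h]$ form to corner coordinates (a common preprocessing step), and one shows a simple way to impose an order on the unordered proposals, here by area.

```python
def xywh_to_corners(box):
    """Convert a box [x, y, w, h] (top-left corner plus width and height)
    into corner form [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def sort_by_area(boxes):
    """One possible way to organize unordered proposals: largest first.
    Selective search itself provides no such ordering."""
    return sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)
```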
R-CNN is not an end-to-end neural network but a hybrid of different methods. However, most of its rule-based algorithms and traditional machine learning models can be replaced with neural networks for better performance.
Note: all images are processed by the same feature extraction model, SVM, and regressor, which ensures consistent processing across different inputs. This uniformity is crucial for the accuracy and reliability of the detector: by sharing standardized models, the system can generalize from one image to another, handling both detection and localization consistently across varied datasets.
We must now address the labeling issue raised earlier in connection with the loss function. Suppose the algorithm generates 2000 candidate boxes but only $Q$ ground truth boxes are available: the challenge is to associate each candidate box with the correct label.
R-CNN resolves this imbalance with the Intersection over Union (IoU) metric. Each candidate box is evaluated against the ground truth boxes and is assigned either the label of the ground truth it overlaps best or no label at all, indicating that the box does not detect any object. This ensures each candidate box is labeled appropriately, effectively managing the disproportion between the roughly 2000 proposed boxes and the much smaller number of ground truth labels.
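The IoU-based assignment described above can be written out directly. The IoU formula is standard; the 0.5 threshold and the `assign_labels` helper are illustrative choices (R-CNN actually tunes its thresholds per component), so treat this as a sketch of the matching idea rather than the paper's exact procedure.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as [x, y, w, h]."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Coordinates of the intersection rectangle.
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def assign_labels(candidates, ground_truths, threshold=0.5):
    """Give each candidate the label of its best-overlapping ground truth
    box if that IoU reaches the threshold, else None (background).
    The threshold value here is illustrative."""
    labels = []
    for box in candidates:
        best_iou, best_label = 0.0, None
        for gt_box, gt_label in ground_truths:
            score = iou(box, gt_box)
            if score > best_iou:
                best_iou, best_label = score, gt_label
        labels.append(best_label if best_iou >= threshold else None)
    return labels
```

With one ground truth box and 2000 candidates, this scheme labels the few well-overlapping candidates positively and everything else as background, which is exactly how the disproportion between proposals and ground truths is absorbed.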