The receptive field is a concept from neuroscience that refers to the specific region of sensory space, such as an area of the visual field, to which a particular neuron responds.
Levine and Shefner (1991) define a receptive field as an "area in which stimulation leads to response of a particular sensory neuron".
Schematic of the receptive field (RF) of a pain receptor in the skin.
In the context of deep learning, especially in convolutional neural networks (CNNs), the receptive field represents the region of the input data (e.g., an image) that influences a particular neuron in a given layer. It determines how much of the input the network considers at once, impacting the model's ability to capture spatial hierarchies and dependencies. A larger receptive field allows the model to recognize broader patterns and contextual information, while a smaller one focuses on finer details.
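To make this concrete, the receptive field of a stack of convolutional layers can be computed from each layer's kernel size and stride. The following sketch uses the standard recurrence for receptive-field growth; the layer configurations are illustrative and not taken from any specific network.

```python
# Sketch: computing the receptive field size of a stack of conv layers.
# Each layer is given as (kernel_size, stride); the configurations below
# are illustrative examples, not any particular architecture.

def receptive_field(layers):
    """Return the receptive field size (in input pixels) of the last layer."""
    rf = 1      # receptive field of the input itself
    jump = 1    # distance (in input pixels) between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                  # striding spreads outputs apart
    return rf

# Three 3x3 convolutions with stride 1: the receptive field grows 3 -> 5 -> 7.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

This illustrates why stacking small kernels is a cheap way to obtain a large receptive field: three 3x3 layers see a 7x7 input region, the same as a single 7x7 layer, but with fewer parameters.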
In a standard fully connected layer, each set of weights matches the input size, meaning it captures global information. But is this always necessary?
Consider bird species identification, where distinguishing features like beak shape, eye color, and claw structure are analyzed individually. For instance, when examining the rings around a bird's eyes, details like claw color or beak shape are irrelevant.
This means a neural network only needs to analyze specific local areas of an image (i.e., small receptive fields), such as the eyes or claws in bird recognition, rather than processing the entire image at once. This approach allows the network to capture the relevant features for detection efficiently.
Note: in the text below, each segment or window denotes a receptive field, as the model processes each of them without considering information from the others.
A straightforward way to implement receptive fields is by dividing a large image into smaller, non-overlapping segments. The neural network will process each segment without considering the others.
For example, as shown in the image on the right, a 320x240 image is divided into 6 rows and 8 columns, with each segment measuring 40x40 pixels.
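This grid split can be written as a single reshape in NumPy. The sketch below follows the running example (a 240-row by 320-column image cut into 40x40 segments); the array contents are just placeholder values.

```python
import numpy as np

# Hypothetical image matching the text: 240 rows x 320 columns,
# filled with placeholder values so each patch is distinguishable.
image = np.arange(240 * 320).reshape(240, 320)
seg_h, seg_w = 40, 40

# reshape splits each axis into (grid index, offset within segment);
# transpose groups the two grid axes first, giving (6, 8, 40, 40):
# one 40x40 patch per grid cell.
segments = image.reshape(240 // seg_h, seg_h, 320 // seg_w, seg_w)
segments = segments.transpose(0, 2, 1, 3)
print(segments.shape)  # (6, 8, 40, 40)
```

Here `segments[r, c]` is exactly the 40x40 patch at grid row `r` and grid column `c`, so each patch can be fed to the network independently.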
However, this approach has a drawback. If meaningful content is split across multiple segments—such as the bird's eye being divided between two adjacent segments—the network may struggle to recognize it as a single object. Processing these segments separately makes it challenging for the network to understand the full context of the object.
Shared Parameters: After segmentation, we use a fully connected layer for each segment. We face a choice: whether to use a separate fully connected layer for each segment or the same one for all. In other words, for $M$ segments, do we use $M$ different fully connected layers, one per segment, or a single layer shared by all segments? In principle, either option is possible.
However, in most neural network applications, particularly in convolutional neural networks, it is common for $M$ segments to share the same fully connected layer, a practice referred to as using shared parameters.
Shared parameters are used across all segments because identical patterns, such as a bird's beak, can appear at different positions within an image: the shape stays the same while the location varies.
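Parameter sharing can be sketched as applying one and the same weight matrix to every segment. The sizes below are assumptions for illustration (M = 4 segments, each a flattened 40x40 patch, mapped to 32 features), not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: M = 4 segments, each flattened to 1600 values
# (a 40x40 patch), mapped to 32 features by ONE shared weight matrix.
M, in_dim, out_dim = 4, 1600, 32
W = rng.normal(size=(in_dim, out_dim))  # shared across all segments
b = np.zeros(out_dim)

segments = rng.normal(size=(M, in_dim))  # placeholder segment data
segments[1] = segments[0]                # same pattern at two positions

# Shared parameters: the SAME (W, b) is applied to every segment,
# so an identical pattern yields identical features wherever it appears.
features = segments @ W + b
print(features.shape)  # (4, 32)
```

Because the weights are shared, `features[0]` and `features[1]` are identical here: the same pattern produces the same response regardless of its position, which is precisely the motivation given above.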
Note: The use of shared parameters serves as the default configuration for all subsequent discussions regarding convolutions.
A simple-yet-effective solution is to introduce overlapping segments. In the figure on the right, an additional set of blue segments is used to overlap with the existing red ones.
This approach ensures that important objects, such as the bird's eye, are fully captured within at least one segment, here a blue one. While the beak may not be fully captured in any blue segment, it is adequately contained within a red one.
The neural network processes each segment, red or blue, in the same way as before: when handling one segment, information from the others is not considered.
This overlapping strategy resembles the shifted-window ("Swin") scheme, which may be explored further in the transformer section if time permits.
You might wonder what happens if neither the blue nor red segments fully capture an important object. The solution is straightforward—introducing more overlapping segments with a smaller shift, such as 1/4 of the segment's width or height, can help ensure better coverage of key features. However, this approach comes with additional costs, as every segment needs to be processed; therefore, more segments mean increased computations.
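The trade-off between coverage and cost can be quantified by counting the windows a given shift produces. The sketch below follows the running example (a 240x320 image with 40x40 windows) and compares a full-width shift with 1/2 and 1/4 shifts.

```python
# Sketch: how the number of windows grows as the shift (stride) shrinks.
# Sizes follow the running example: a 240x320 image with 40x40 windows.

def num_windows(length, window, stride):
    """Number of windows of size `window` sliding by `stride` along one axis."""
    return (length - window) // stride + 1

for stride in (40, 20, 10):  # no overlap, 1/2 shift, 1/4 shift
    rows = num_windows(240, 40, stride)
    cols = num_windows(320, 40, stride)
    print(f"stride {stride}: {rows} x {cols} = {rows * cols} windows")
```

With no overlap there are 6 x 8 = 48 windows; a 1/4 shift (stride 10) already yields 21 x 29 = 609, roughly a 12x increase in the number of segments to process, which is the computational cost the text warns about.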