The receptive field is a concept from neuroscience that refers to the specific region of sensory space, such as an area of the visual field, to which a particular neuron responds.
Levine and Shefner (1991) define a receptive field as an "area in which stimulation leads to response of a particular sensory neuron".
Schematic of the receptive field (RF) of a pain receptor in the skin.
In the context of deep learning, especially in convolutional neural networks (CNNs), the receptive field represents the region of the input data (e.g., an image) that influences a particular neuron in a given layer. It determines how much of the input the network considers at once, impacting the model's ability to capture spatial hierarchies and dependencies. A larger receptive field allows the model to recognize broader patterns and contextual information, while a smaller one focuses on finer details.
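To make this concrete, the receptive field of a stack of convolutional layers can be computed from each layer's kernel size and stride. The following sketch uses the standard recurrence for receptive-field growth; the layer configurations are illustrative and not taken from any specific network.

```python
# Sketch: computing the receptive field size of a stack of conv layers.
# Each layer is given as (kernel_size, stride); the configurations below
# are illustrative examples, not any particular architecture.

def receptive_field(layers):
    """Return the receptive field size (in input pixels) of the last layer."""
    rf = 1      # receptive field of the input itself
    jump = 1    # distance (in input pixels) between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                  # striding spreads outputs apart
    return rf

# Three 3x3 convolutions with stride 1: the receptive field grows 3 -> 5 -> 7.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

This illustrates why stacking small kernels is a cheap way to obtain a large receptive field: three 3x3 layers see a 7x7 input region, the same as a single 7x7 layer, but with fewer parameters.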
In a standard fully connected layer, each set of weights matches the input size, meaning it captures global information. But is this always necessary?
Consider bird species identification, where distinguishing features like beak shape, eye color, and claw structure are analyzed individually. For instance, when examining the rings around a bird's eyes, details like claw color or beak shape are irrelevant.
This means a neural network only needs to analyze specific local areas of an image (i.e., small receptive fields), such as the eyes or claws in bird recognition, rather than processing the entire image at once. This approach allows the network to capture the relevant features for detection efficiently.
Note: in the text below, each segment or window denotes a receptive field, as the model processes each of them without considering information from the others.
A straightforward way to implement receptive fields is by dividing a large image into smaller, non-overlapping segments. The neural network will process each segment without considering the others.
For example, as shown in the image on the right, a 320x240 image is divided into 6 rows and 8 columns, with each segment measuring 40x40 pixels.
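This grid split can be written as a single reshape in NumPy. The sketch below follows the running example (a 240-row by 320-column image cut into 40x40 segments); the array contents are just placeholder values.

```python
import numpy as np

# Hypothetical image matching the text: 240 rows x 320 columns,
# filled with placeholder values so each patch is distinguishable.
image = np.arange(240 * 320).reshape(240, 320)
seg_h, seg_w = 40, 40

# reshape splits each axis into (grid index, offset within segment);
# transpose groups the two grid axes first, giving (6, 8, 40, 40):
# one 40x40 patch per grid cell.
segments = image.reshape(240 // seg_h, seg_h, 320 // seg_w, seg_w)
segments = segments.transpose(0, 2, 1, 3)
print(segments.shape)  # (6, 8, 40, 40)
```

Here `segments[r, c]` is exactly the 40x40 patch at grid row `r` and grid column `c`, so each patch can be fed to the network independently.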
However, this approach has a drawback. If meaningful content is split across multiple segments—such as the bird's eye being divided between two adjacent segments—the network may struggle to recognize it as a single object. Processing these segments separately makes it challenging for the network to understand the full context of the object.
Shared Parameters: After segmentation, we use a fully connected layer for each segment. We face a choice: whether to use a separate fully connected layer for each segment or the same one for all. In other words, for $M$ segments, do we use $M$ different fully connected layers, one per segment, or a single layer shared by all segments? In principle, either option is possible.
However, in most neural network applications, particularly in convolutional neural networks, it is common for $M$ segments to share the same fully connected layer, a practice referred to as using shared parameters.
Shared parameters are used across all segments because identical patterns, such as a bird's beak, can appear at different positions within an image: the shape stays the same while the location varies.
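Parameter sharing can be sketched as applying one and the same weight matrix to every segment. The sizes below are assumptions for illustration (M = 4 segments, each a flattened 40x40 patch, mapped to 32 features), not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: M = 4 segments, each flattened to 1600 values
# (a 40x40 patch), mapped to 32 features by ONE shared weight matrix.
M, in_dim, out_dim = 4, 1600, 32
W = rng.normal(size=(in_dim, out_dim))  # shared across all segments
b = np.zeros(out_dim)

segments = rng.normal(size=(M, in_dim))  # placeholder segment data
segments[1] = segments[0]                # same pattern at two positions

# Shared parameters: the SAME (W, b) is applied to every segment,
# so an identical pattern yields identical features wherever it appears.
features = segments @ W + b
print(features.shape)  # (4, 32)
```

Because the weights are shared, `features[0]` and `features[1]` are identical here: the same pattern produces the same response regardless of its position, which is precisely the motivation given above.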
Note: The use of shared parameters serves as the default configuration for all subsequent discussions regarding convolutions.
A simple-yet-effective solution is to introduce overlapping segments. In the figure on the right, an additional set of blue segments is used to overlap with the existing red ones.
This approach ensures that important objects, such as the bird's eye, are fully captured within at least one segment, here a blue one. While the beak may not be fully captured in any blue segment, it is adequately contained within a red one.
The neural network processes each segment, red or blue, in the same way as before: when handling one segment, information from the others is not considered.
This overlapping strategy resembles the shifted-window ("Swin") scheme, which may be explored further in the transformer section if time permits.
You might wonder what happens if neither the blue nor red segments fully capture an important object. The solution is straightforward—introducing more overlapping segments with a smaller shift, such as 1/4 of the segment's width or height, can help ensure better coverage of key features. However, this approach comes with additional costs, as every segment needs to be processed; therefore, more segments mean increased computations.
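The trade-off between coverage and cost can be quantified by counting the windows a given shift produces. The sketch below follows the running example (a 240x320 image with 40x40 windows) and compares a full-width shift with 1/2 and 1/4 shifts.

```python
# Sketch: how the number of windows grows as the shift (stride) shrinks.
# Sizes follow the running example: a 240x320 image with 40x40 windows.

def num_windows(length, window, stride):
    """Number of windows of size `window` sliding by `stride` along one axis."""
    return (length - window) // stride + 1

for stride in (40, 20, 10):  # no overlap, 1/2 shift, 1/4 shift
    rows = num_windows(240, 40, stride)
    cols = num_windows(320, 40, stride)
    print(f"stride {stride}: {rows} x {cols} = {rows * cols} windows")
```

With no overlap there are 6 x 8 = 48 windows; a 1/4 shift (stride 10) already yields 21 x 29 = 609, roughly a 12x increase in the number of segments to process, which is the computational cost the text warns about.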