In the previous discussion, we explored how multiple filter kernels are convolved with localized regions of an image, known as receptive fields, to produce similarity scores. These scores are collected in what are called activation maps (or feature maps).
When discussing filter kernel sizes, the VGG network offers a useful insight: its architecture almost exclusively uses $3\times3$ filter kernels in its convolutional layers.
Despite their small size, stacking these kernels across multiple convolutional layers, as VGG does, lets the network capture larger and more complex patterns. Each successive convolutional layer enlarges the effective receptive field, so the network integrates information from progressively larger areas of the input image while retaining a fine-grained view of its structure.
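This growth can be quantified with the standard receptive-field recurrence for stride-1 convolutions (the symbols $r_{\ell}$ and $k_{\ell}$ are introduced here purely for illustration, not taken from the text above):

$$
r_{\ell} = r_{\ell-1} + (k_{\ell} - 1), \qquad r_{0} = 1,
$$

so two stacked $3\times3$ layers give $r_{2} = 1 + 2 + 2 = 5$, and three stacked layers give $r_{3} = 7$.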
To understand why a stack of simple $3\times3$ kernels can extract large-area features, let's examine the operation of two stacked convolutional layers (the first layer uses kernel $\mathbf{K}_{1}$, the second uses kernel $\mathbf{K}_{2}$; activation functions are omitted for simplicity).
In the figure, the large grid on the left is our input image, where we highlight two $3\times3$ patches in red and blue. When applying the first convolutional layer, each $3\times3$ patch in the input is compared to the first layer kernel $\mathbf{K}_{1}$, yielding a single “similarity score.” These scores form the middle feature map: the red square corresponds to the red patch, and the blue square corresponds to the blue patch.
Next, the second convolutional layer applies another $3\times3$ kernel, $\mathbf{K}_{2}$, but this time it operates on the middle feature map. Notice that a $3\times3$ region in this feature map corresponds to a larger $5\times5$ region in the original image (assuming stride 1). Consequently, a single neuron in the final feature map, like the green one in the figure, encodes the matching information for that entire $5\times5$ area.
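We can verify this claim numerically. Below is a minimal PyTorch sketch (the tensor sizes and variable names are my own, not from the figure): two stacked $3\times3$ convolutions are applied to a $7\times7$ input, and the gradient of one output value marks exactly the input pixels that influence it.

```python
import torch
import torch.nn as nn

# 7x7 single-channel input; requires_grad lets us trace influence via gradients
x = torch.randn(1, 1, 7, 7, requires_grad=True)

conv1 = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # plays the role of K1
conv2 = nn.Conv2d(1, 1, kernel_size=3, bias=False)  # plays the role of K2

y = conv2(conv1(x))          # intermediate map is 5x5, final map is 3x3
y[0, 0, 1, 1].backward()     # pick the central output value (the "green" neuron)

# Nonzero gradients mark the input pixels that affect that single output value
mask = (x.grad[0, 0] != 0)
print(mask.int())            # a 5x5 block of ones centered in the 7x7 input
print(mask.sum().item())     # 25 -> the neuron "sees" a 5x5 region of the input
```

The printed mask contains exactly 25 nonzero entries, confirming that one value in the second feature map depends on a $5\times5$ patch of the original image.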
Takeaway: This is why deeper networks expand their effective receptive field even with small kernels; a single value in a deeper layer summarizes a much larger portion of the original input.
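For stacks deeper than two layers, the same recurrence can be applied layer by layer. The small helper below (the function name `receptive_field` is hypothetical, written here as an illustration) also accounts for strides, which compound how far each step in a deeper layer reaches in the input.

```python
def receptive_field(kernel_sizes, strides=None):
    """Effective receptive field of stacked conv layers (no dilation assumed)."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) * jump
        jump *= s              # stride of this layer scales later contributions
    return rf

print(receptive_field([3, 3]))     # 5  -> the two-layer example above
print(receptive_field([3, 3, 3]))  # 7  -> a three-layer, VGG-style block
```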