In the previous example, we studied how to use multiple templates (filter kernels) to match local areas (receptive fields) of an image, implementing convolution and generating similarity scores. These output scores are stored in feature maps.

However, the templates used so far have been 3x3, which is quite small. Such small templates cannot fully capture the common patterns in an image, which generally require comparisons over more than a few dozen pixels.

In deep CNNs, this issue is addressed by stacking multiple convolutional layers. Stacking not only allows features to be extracted over larger areas, but also serves other purposes, such as learning abstract concepts and performing nonlinear comparisons through inner products and activation functions. These additional roles are theoretically deep (and complex), so we won't discuss them in detail here.

To gain a deeper understanding of CNNs, the most intuitive starting point is understanding how large-area features are extracted.

Why Does Deeper Mean a Broader Receptive Field?

To understand why simple 3x3 templates can extract large-area features, let's examine the operation of two stacked convolutional layers (each with just a single template, and with activation functions omitted for simplicity).
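Before walking through the figure, it helps to have the general rule in hand: with stride-1 convolutions and no dilation, each extra layer with a k x k kernel widens the receptive field by k - 1 pixels. A minimal sketch of this rule (the helper name `receptive_field` is our own, not from any library):

```python
def receptive_field(kernel_sizes):
    """Side length of the input region seen by one output unit,
    assuming stride-1 convolutions with no dilation."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by k - 1 pixels
    return rf

print(receptive_field([3]))        # one 3x3 layer  -> 3
print(receptive_field([3, 3]))     # two 3x3 layers -> 5
print(receptive_field([3, 3, 3]))  # three layers   -> 7
```

This is why two stacked 3x3 layers behave like a single 5x5 template, which is exactly what the figure below illustrates.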

[Figure: two stacked 3x3 convolutional layers mapping an input image to two successive feature maps]

In the figure, we compare parts of a larger image to templates. The large grid on the left represents our image. We focus on two areas: one highlighted in red and one in blue, each a 3x3 grid.

Now, let's see how they match Template A. We calculate how similar the red 3x3 area is to Template A and record this similarity score as a single red square in the middle feature map. We do the same for the blue area; its result is shown as a blue square.

Similarly, for Template B, we take a yellow 3x3 area from the middle feature map, which corresponds to a larger 5x5 area in the original image (trace it yourself to see how the feature map condenses information), and compare it to Template B. The result of this comparison is a single yellow square in the rightmost feature map.

Therefore, even though we see just a single yellow square in the right feature map, it actually holds template-matching information from the entire 5x5 yellow area of the original image. In other words, one small square in the feature map gives us a summary, a 'similarity score', of a much larger area after it has been compared to a template $\mathcal{W}$.
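The walk-through above can be reproduced numerically. A minimal sketch with a naive 'valid' convolution (our own helper `conv2d_valid`, with random values standing in for the image and templates): a 5x5 input shrinks to a 3x3 feature map after the first 3x3 template, and to a single score after the second, so that one score depends on all 25 input pixels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' cross-correlation: slide the kernel over the image
    and take the inner product at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((5, 5))       # the 5x5 yellow area
template_a = rng.standard_normal((3, 3))  # first-layer template
template_b = rng.standard_normal((3, 3))  # second-layer template

middle = conv2d_valid(image, template_a)  # middle feature map
final = conv2d_valid(middle, template_b)  # one yellow square

print(middle.shape, final.shape)  # (3, 3) (1, 1)
```

Changing any single pixel of the 5x5 input changes `middle` and hence `final`, which is precisely what it means for the output unit's receptive field to cover the whole 5x5 area.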

About visualizing $\mathcal{W}$: since a template is just a small grid of weights, rescaling those weights to grayscale turns it into a tiny image of the pattern it matches.
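A minimal sketch of such a visualization (the helper `kernel_to_image` is our own naming; the Sobel-like kernel is just a familiar example of a $\mathcal{W}$ that detects vertical edges):

```python
import numpy as np

def kernel_to_image(w):
    """Rescale a kernel's weights to the 0-255 range so it can be
    displayed as a small grayscale image."""
    w = np.asarray(w, dtype=float)
    span = w.max() - w.min()
    if span == 0:
        return np.full(w.shape, 128, dtype=np.uint8)  # flat kernel -> mid gray
    return np.rint((w - w.min()) / span * 255).astype(np.uint8)

# A classic vertical-edge template (Sobel-like) as an example W.
w = np.array([[-1, 0, 1],
              [-2, 0, 2],
              [-1, 0, 1]])
img = kernel_to_image(w)
print(img)  # dark left column, mid-gray center, bright right column
```

Passing `img` to any image viewer (e.g. matplotlib's `imshow` with a gray colormap) shows the template as a picture; with learned kernels, this is how the familiar "edge and blob detector" visualizations of first-layer CNN filters are produced.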