Pooling and downsampling are essential techniques in convolutional neural networks (CNNs) for addressing the computational challenges of processing high-resolution data, such as medical images or large-scale visual inputs. Convolution operations, though powerful, become computationally expensive as input dimensions grow, straining resources. Pooling mitigates this by summarizing local regions of feature maps into compact representations, drastically reducing spatial dimensions and subsequent processing demands.
Beyond efficiency, this dimensionality reduction introduces translation invariance—allowing networks to focus on broader patterns rather than exact feature locations—enhancing robustness for tasks like object recognition. Thus, pooling balances computational efficiency with improved generalization, making it indispensable in scaling CNNs for real-world applications.
Pooling uses a non-trainable sliding window with a fixed stride (step size) to systematically reduce the input size. A typical configuration is a $2 \times 2$ window with a stride of $2$: the pixels inside each window are processed together, and because the stride equals the window size, the regions processed at successive steps do not overlap.
Pooling aggregates the pixel values within each window into a single number, reducing the feature map dimensions and, as a consequence, decreasing the network's computational load for better efficiency.
Note: Pooling is a purely spatial processing module, applied independently to each channel.
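As a minimal sketch of this windowing (with made-up values and NumPy as an assumed tool), the following reshapes a single-channel $4 \times 4$ feature map into its non-overlapping $2 \times 2$ windows and shows the halved spatial size:

```python
import numpy as np

# Illustrative 4x4 single-channel feature map (values are made up).
feature_map = np.array([[ 1,  2,  3,  4],
                        [ 4,  5,  6,  7],
                        [ 7,  8,  9, 10],
                        [10, 11, 12, 13]])

k = 2                      # window size; stride equals k, so windows do not overlap
H, W = feature_map.shape
windows = feature_map.reshape(H // k, k, W // k, k)  # axes 1 and 3 index pixels inside a window

print(windows[0, :, 0, :])     # top-left window: [[1 2] [4 5]]
print(H // k, W // k)          # pooled spatial size: 2 2 (halved in each dimension)
# For a multi-channel input of shape (C, H, W), the same windowing is applied
# to each channel independently, so C is unchanged.
```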
Two common methods used in modern CNNs are average pooling and max pooling:
Average pooling calculates the mean of the pixel values within a sliding window on the feature map. In the illustration below, each $2\times 2$ colored box represents a window, and the average of the 4 values inside each window is computed and stored as a scalar.
For instance, the initial window (highlighted in red at the top left corner) averages the values $[1, 2; 4, 5]$, resulting in a scalar value of $3$.
Average pooling functions as a low-pass filter by averaging values, helping preserve general structural and textural information. However, it cannot retain extreme values and is criticized for reducing high-frequency details due to its inherent low-pass filtering nature.
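To make this concrete, here is a minimal NumPy sketch (using the same made-up $4\times 4$ map as above, whose top-left window matches the $[1, 2; 4, 5]$ example) that averages each $2\times 2$ window:

```python
import numpy as np

# Average pooling with a 2x2 window and stride 2 (illustrative values; the
# top-left window is the [1, 2; 4, 5] example from the text).
feature_map = np.array([[ 1,  2,  3,  4],
                        [ 4,  5,  6,  7],
                        [ 7,  8,  9, 10],
                        [10, 11, 12, 13]])

windows = feature_map.reshape(2, 2, 2, 2)    # non-overlapping 2x2 windows
avg_pooled = windows.mean(axis=(1, 3))       # mean of the 4 values in each window
print(avg_pooled)
# [[ 3.  5.]
#  [ 9. 11.]]    <- top-left window: (1 + 2 + 4 + 5) / 4 = 3
```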
Max pooling selects the highest value from each sliding window, emphasizing the most prominent features within the window.
For instance, in the red window $[1, 2; 4, 5]$, the maximum value is $5$; max pooling retains this value while discarding the others in the window. This operation introduces a degree of structural randomness, as the positions of the maximum values vary across different windows.
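The same operation is available as a standard layer in deep-learning frameworks; below is a minimal sketch with PyTorch's `nn.MaxPool2d` (the framework choice is only an assumption for illustration), applied to the same made-up map:

```python
import torch
import torch.nn as nn

# Max pooling with a 2x2 window and stride 2: only the largest value in each
# window is kept (illustrative values; top-left window is [1, 2; 4, 5]).
x = torch.tensor([[ 1.,  2.,  3.,  4.],
                  [ 4.,  5.,  6.,  7.],
                  [ 7.,  8.,  9., 10.],
                  [10., 11., 12., 13.]]).reshape(1, 1, 4, 4)   # (N, C, H, W)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(x).reshape(2, 2))
# tensor([[ 5.,  7.],
#         [11., 13.]])   <- 5 is kept from the [1, 2; 4, 5] window; 1, 2, 4 are discarded
```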
Further Read: This "randomness" in the sampling positions means the operation does not adhere to the requirements of the Nyquist–Shannon sampling theorem; instead, it boosts the retention of high-frequency details even when fewer sampling points are used. However, this method also inadvertently adds noise to the low-frequency components, creating a trade-off: while it excels at highlighting prominent visual information, the introduced noise can slightly obscure the image's smoother, less detailed areas, affecting overall clarity.
Note: A common misconception is that pooling layers reduce the number of parameters in the following convolutional layers. However, this is not true—pooling does not affect the parameters of convolutional layers. The number of parameters in convolutional layers is determined solely by the number and size of the kernels, not the feature map size. Instead, pooling influences the size of the feature maps, which in turn affects the computational cost of these convolution layers.
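A quick back-of-the-envelope check of this point, sketched in Python under assumed layer sizes ($3\times 3$ kernels, 32 input channels, 64 output channels): the parameter count does not depend on the feature-map size, while the multiply-accumulate count drops fourfold after a $2\times 2$ pooling.

```python
# Assumed layer: 3x3 kernels, 32 input channels, 64 output channels, with bias.
c_in, c_out, k = 32, 64, 3

params = c_out * (c_in * k * k + 1)   # independent of the feature-map size
print(params)                          # 18496

def macs(h, w):
    # One k*k*c_in dot product per output pixel and per output channel.
    return c_out * c_in * k * k * h * w

print(macs(64, 64))   # on a 64x64 feature map:          75497472
print(macs(32, 32))   # after 2x2 pooling (32x32 map):   18874368, i.e. 4x fewer
```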
In addition to pooling, methods like interpolation and signal downsampling with anti-aliasing filters are used for downscaling in image and signal processing. However, pooling is more computationally efficient and commonly used in neural networks.
Interpolation involves estimating unknown values within the range of a discrete set of known data points. In the context of image processing, interpolation calculates the pixel values of a new, resized image from the original pixel values. The operations vary by method but generally involve weighted averages of neighboring pixels.
Bilinear Interpolation: the value of a new pixel $P$ at position $(x, y)$ is calculated as a weighted average of the four nearest original pixels. If $P_1$, $P_2$, $P_3$, and $P_4$ are the closest original pixels around the new pixel position, and their weights based on distance to $P$ are $w_1$, $w_2$, $w_3$, and $w_4$, respectively, then $P$ can be estimated as:
$$ \begin{aligned} P &= w_{1}P_{1} + w_{2}P_{2} + w_{3}P_{3} + w_{4}P_{4}\\w_1 &= (x_2 - x)(y_2 - y)\\w_2 &= (x - x_1)(y_2 - y)\\w_3 &= (x_2 - x)(y - y_1)\\w_4 &= (x - x_1)(y - y_1)\end{aligned} $$
Here, $x$ and $y$ are the coordinates of the new pixel, and $x_1$, $x_2$, $y_1$, and $y_2$ are the coordinates of the corners of the pixel square (formed by $P_1$, $P_2$, $P_3$, and $P_4$) in the original image that surrounds the point where the new pixel will be placed. With unit grid spacing, $(x_2 - x_1)(y_2 - y_1) = 1$ and the four weights sum to $1$; for other spacings the weights are divided by $(x_2 - x_1)(y_2 - y_1)$.
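As a small illustration (a sketch assuming unit grid spacing and made-up corner values), the weights above can be applied directly:

```python
# Bilinear interpolation of one new pixel from its four neighbours, assuming
# unit grid spacing (x2 - x1 = y2 - y1 = 1) so the weights sum to 1.
def bilinear(x, y, x1, y1, x2, y2, P1, P2, P3, P4):
    """P1..P4 are the original pixel values at (x1, y1), (x2, y1), (x1, y2), (x2, y2)."""
    w1 = (x2 - x) * (y2 - y)
    w2 = (x - x1) * (y2 - y)
    w3 = (x2 - x) * (y - y1)
    w4 = (x - x1) * (y - y1)
    return w1 * P1 + w2 * P2 + w3 * P3 + w4 * P4

# New pixel at (0.25, 0.75) inside a unit square with corner values 10, 20, 30, 40.
print(bilinear(0.25, 0.75, 0, 0, 1, 1, 10, 20, 30, 40))   # 27.5
```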
**Downsampling with Anti-Aliasing Filtering:** used specifically when reducing the sampling frequency or image size, this approach involves two main operations: filtering and decimation. The anti-aliasing filter is designed to remove high-frequency components from the signal or image that cannot be represented at the lower sampling rate, preventing aliasing. This can be represented as a convolution of the image or signal $S$ with a filter $F$, giving a filtered image or signal $S'$:
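$$ S' = S * F $$

For a chosen downsampling factor $M$, the decimation step then keeps only every $M$-th sample of the filtered result, e.g. $S_{\downarrow}[n] = S'[Mn]$ in one dimension, or every $M$-th pixel along each axis of an image.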