Pooling and downsampling are essential techniques in convolutional neural networks (CNNs) for addressing the computational cost of processing high-resolution data, such as medical images or other large visual inputs. Convolution operations, though powerful, become more expensive as input dimensions grow, because the cost of a convolutional layer scales with the number of spatial positions it has to process. Pooling mitigates this by summarizing local regions of feature maps into compact representations, reducing the spatial dimensions and the cost of all subsequent layers.

Beyond efficiency, this reduction introduces a degree of translation invariance: small shifts in a feature's position often leave the pooled output unchanged, so the network responds to broader patterns rather than exact feature locations. This improves robustness for tasks like object recognition. Pooling thus balances computational efficiency with improved generalization, which makes it a standard component when scaling CNNs to real-world applications.

Pooling

Pooling uses a non-trainable sliding window with a fixed stride (step size) to systematically reduce the input size. A typical configuration is a window of $2 \times 2$ pixels with a stride of $2$. Because the stride equals the window size, the windows do not overlap, and the pixels inside each window are processed together as one group.


Pooling aggregates the pixel values within each window into a single number. This reduces the feature map dimensions and, as a consequence, the computational load of the layers that follow.
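As a worked example of this size reduction, assume no padding and write $H$ for the input height, $k$ for the window size, and $s$ for the stride (these symbols are introduced here for illustration only). The output height is then:

$$ H_{\text{out}} = \left\lfloor \frac{H - k}{s} \right\rfloor + 1 $$

For $H = 224$, $k = 2$, and $s = 2$, this gives $H_{\text{out}} = \lfloor 222 / 2 \rfloor + 1 = 112$. The same formula applies to the width, so a $224 \times 224$ feature map becomes $112 \times 112$, a fourfold reduction in the number of spatial positions.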

Note: Pooling is a purely spatial processing module, applied independently to each channel.

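Two common methods used in modern CNNs are average pooling, which outputs the mean of the values in each window, and max pooling, which outputs the largest value in each window.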
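Both methods can be sketched in a few lines of NumPy. This is a minimal illustration rather than a library implementation: the function name `pool2d`, its parameters, and the `(channels, height, width)` layout are assumptions made for the example, and no padding is applied.

```python
import numpy as np

def pool2d(x, kernel=2, stride=2, mode="max"):
    """Pool a (C, H, W) array with a square window, one channel at a time.

    Hypothetical helper for illustration; assumes no padding.
    """
    c, h, w = x.shape
    out_h = (h - kernel) // stride + 1
    out_w = (w - kernel) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((c, out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = x[:, top:top + kernel, left:left + kernel]
            # Reduce over the two spatial axes only, so channels stay independent.
            out[:, i, j] = reduce(window, axis=(1, 2))
    return out

# One channel, 4 x 4 input holding the values 0..15 row by row.
x = np.arange(16, dtype=float).reshape(1, 4, 4)
max_out = pool2d(x, mode="max")  # [[5, 7], [13, 15]]
avg_out = pool2d(x, mode="avg")  # [[2.5, 4.5], [10.5, 12.5]]
```

With a $2 \times 2$ window and a stride of $2$, the $4 \times 4$ input is reduced to $2 \times 2$ in both cases; only the way each window is summarized differs.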

Note: A common misconception is that pooling layers reduce the number of parameters in the following convolutional layers. This is not true: pooling does not affect the parameters of convolutional layers. The number of parameters in a convolutional layer is determined solely by the number and size of its kernels, not by the feature map size. Instead, pooling reduces the size of the feature maps, which in turn reduces the computational cost of these convolutional layers.
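The distinction can be checked with a little arithmetic. The sketch below counts the parameters and multiply-accumulate operations (MACs) of a single convolutional layer with and without a preceding $2 \times 2$, stride-$2$ pooling step. The layer sizes are hypothetical, and the convolution is assumed to preserve spatial size ("same" padding, stride 1).

```python
def conv_params(c_in, c_out, k):
    """Weights plus one bias per output channel; independent of feature map size."""
    return c_out * (c_in * k * k + 1)

def conv_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulates: one k x k x c_in dot product per output value."""
    return c_in * c_out * k * k * h_out * w_out

# Hypothetical layer: 64 -> 128 channels, 3 x 3 kernels.
c_in, c_out, k = 64, 128, 3

params = conv_params(c_in, c_out, k)               # same in both cases
macs_no_pool = conv_macs(c_in, c_out, k, 56, 56)   # 56 x 56 input, no pooling
macs_pooled = conv_macs(c_in, c_out, k, 28, 28)    # 28 x 28 input after pooling

print(params, macs_no_pool, macs_pooled, macs_no_pool // macs_pooled)
```

The parameter count never takes the feature map size as an input, while the MAC count drops by a factor of four once pooling halves both spatial dimensions.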

Downsampling Methods

In addition to pooling, methods like interpolation and signal downsampling with anti-aliasing filters are used for downscaling in image and signal processing. Within neural networks, however, pooling is generally cheaper to compute and is the more common choice.

Interpolation

Interpolation involves estimating unknown values within the range of a discrete set of known data points. In the context of image processing, interpolation calculates the pixel values of a new, resized image from the original pixel values. The operations vary by method but generally involve weighted averages of neighboring pixels.

**Bilinear Interpolation:** the value of a new pixel $P$ at position $(x, y)$ is calculated as a weighted average of the four nearest original pixels. If $P_1$, $P_2$, $P_3$, and $P_4$ are the closest original pixels around the new pixel position, and their weights based on distance to $P$ are $w_1$, $w_2$, $w_3$, and $w_4$, respectively, then $P$ can be estimated as:

$$ \begin{aligned} P &= w_{1}P_{1} + w_{2}P_{2} + w_{3}P_{3} + w_{4}P_{4}\\w_1 &= (x_2 - x)(y_2 - y)\\w_2 &= (x - x_1)(y_2 - y)\\w_3 &= (x_2 - x)(y - y_1)\\w_4 &= (x - x_1)(y - y_1)\end{aligned} $$


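Here, $x$ and $y$ are the coordinates of the new pixel, and $x_1$, $x_2$, $y_1$, and $y_2$ are the coordinates of the corners of the pixel square (formed by $P_1$, $P_2$, $P_3$, and $P_4$) in the original image that surrounds the point where the new pixel will be placed. The weights pair up with the corners as $P_1$ at $(x_1, y_1)$, $P_2$ at $(x_2, y_1)$, $P_3$ at $(x_1, y_2)$, and $P_4$ at $(x_2, y_2)$, so each weight grows as the new pixel moves toward its corner. The weights as written assume unit pixel spacing, that is, $x_2 - x_1 = y_2 - y_1 = 1$, in which case they sum to $1$; for other spacings, each weight is divided by $(x_2 - x_1)(y_2 - y_1)$.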
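The formula can be turned into a short NumPy sketch. The function name `bilinear_sample` is hypothetical, and the code assumes a single-channel image of at least $2 \times 2$ pixels with unit pixel spacing, indexed as `img[row, column]`, so that $x$ is the column coordinate and $y$ the row coordinate.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Estimate the value of a 2-D image at a real-valued position (x, y).

    Hypothetical helper for illustration; assumes unit pixel spacing.
    """
    h, w = img.shape
    # Top-left corner of the surrounding pixel square, clamped so that
    # the opposite corner (x2, y2) stays inside the image.
    x1 = min(max(int(np.floor(x)), 0), w - 2)
    y1 = min(max(int(np.floor(y)), 0), h - 2)
    x2, y2 = x1 + 1, y1 + 1

    p1, p2 = img[y1, x1], img[y1, x2]  # corners (x1, y1) and (x2, y1)
    p3, p4 = img[y2, x1], img[y2, x2]  # corners (x1, y2) and (x2, y2)

    # Weights exactly as in the formula above.
    w1 = (x2 - x) * (y2 - y)
    w2 = (x - x1) * (y2 - y)
    w3 = (x2 - x) * (y - y1)
    w4 = (x - x1) * (y - y1)
    return w1 * p1 + w2 * p2 + w3 * p3 + w4 * p4

img = np.array([[0.0, 10.0],
                [20.0, 30.0]])
center = bilinear_sample(img, 0.5, 0.5)      # equal weights: 15.0
off_center = bilinear_sample(img, 0.25, 0.75)  # 17.5
```

At the center of the square, all four weights equal $0.25$ and the result is the plain average of the corners. Moving the sample point toward a corner shifts the weight toward that corner's value.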

Downsampling

**Downsampling with Anti-Aliasing Filtering:** reducing the sampling frequency of a signal or the size of an image involves two main operations: filtering and decimation. The anti-aliasing filter removes high-frequency components that cannot be represented at the lower sampling rate, which prevents aliasing. This can be represented by a convolution of the image or signal $S$ with a filter $F$, giving a filtered image or signal $S'$: