Pooling is a technique for adjusting the size of feature maps, typically reducing them to smaller scales. This reduction improves the overall computational efficiency of neural networks.

Problem Statement: In Convolutional Neural Networks (CNNs), the computational cost of convolution operations is significant and can be expressed as $O(h w c_{\text{in}} k^2 c_{\text{out}})$. Here, $h$ represents the height of the input image, $w$ is the width, $c_{\text{in}}$ is the number of input channels, $k$ is the kernel size, and $c_{\text{out}}$ is the number of output channels. Since $k^2$ is constant, we can denote it as $K$. The magnitudes of $c_{\text{in}}$ and $c_{\text{out}}$ typically range from the tens to the hundreds (common values are 32 or 64, and larger networks may use up to 256). The values of $h$ and $w$ are generally determined by the input image dimensions. For toy examples like MNIST and CIFAR10, the images are quite small (28×28 or 32×32), but for real-world images, such as medical MRI images, the pixel dimensions can run into the hundreds or even thousands in both directions. This can lead to very large values for $w$ and $h$, resulting in a significantly high computational load.
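To make the cost formula concrete, the sketch below counts multiply-accumulate operations for one convolution layer under two assumed input sizes; the layer shapes (32 input channels, 64 output channels, a 3×3 kernel) are illustrative, not taken from any specific network.

```python
# Rough cost model for one convolution layer: O(h * w * c_in * k^2 * c_out).
# Assumes stride 1 and "same" padding, so the output keeps the h x w size.

def conv_cost(h, w, c_in, k, c_out):
    """Multiply-accumulate count for a single convolution layer."""
    return h * w * c_in * k * k * c_out

# A 32x32 CIFAR10-sized input vs. a hypothetical 1024x1024 MRI-sized input.
small = conv_cost(32, 32, 32, 3, 64)      # ~18.9 million MACs
large = conv_cost(1024, 1024, 32, 3, 64)  # ~19.3 billion MACs

print(small, large, large // small)  # cost scales linearly with h * w
```

Since every other factor is fixed, the 1024× larger spatial area translates directly into a 1024× higher cost, which is why shrinking $h$ and $w$ early pays off.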

Downscaling Methods Other than Pooling

In image processing, many methods exist for resizing images, notably interpolation (from image processing) and downsampling (from signal processing). Pooling, the technique commonly used in neural networks, stands apart from these traditional methods.

Interpolation involves estimating unknown values within the range of a discrete set of known data points. In the context of image processing, interpolation calculates the pixel values of a new, resized image from the original pixel values. The operations vary by method but generally involve weighted averages of neighboring pixels. For example, in bilinear interpolation, the value of a new pixel $P$ at position $(x, y)$ is calculated as a weighted average of the four nearest original pixels. If $P_1$, $P_2$, $P_3$, and $P_4$ are the closest original pixels around the new pixel position, and their weights based on distance to $P$ are $w_1$, $w_2$, $w_3$, and $w_4$, respectively, then $P$ can be estimated as:

$$ \begin{aligned} P &= w_{1}P_{1} + w_{2}P_{2} + w_{3}P_{3} + w_{4}P_{4}\\w_1 &= (x_2 - x)(y_2 - y)\\w_2 &= (x - x_1)(y_2 - y)\\w_3 &= (x_2 - x)(y - y_1)\\w_4 &= (x - x_1)(y - y_1)\end{aligned} $$

Here, $x$ and $y$ are the coordinates of the new pixel, and $x_1$, $x_2$, $y_1$, and $y_2$ are the corner coordinates of the pixel square (formed by $P_1$, $P_2$, $P_3$, and $P_4$) in the original image that surrounds the point where the new pixel will be placed.
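The weight formulas above can be sketched directly in code. This minimal version assumes the four known pixels sit on a unit grid ($x_2 - x_1 = y_2 - y_1 = 1$), so the four weights sum to 1; the function name and argument order are illustrative.

```python
# Bilinear interpolation following the weight formulas above.
# Assumes a unit grid: x2 - x1 == 1 and y2 - y1 == 1.

def bilinear(x, y, x1, y1, x2, y2, p1, p2, p3, p4):
    """p1..p4 are the pixel values at (x1,y1), (x2,y1), (x1,y2), (x2,y2)."""
    w1 = (x2 - x) * (y2 - y)
    w2 = (x - x1) * (y2 - y)
    w3 = (x2 - x) * (y - y1)
    w4 = (x - x1) * (y - y1)
    return w1 * p1 + w2 * p2 + w3 * p3 + w4 * p4

# At the center of the unit square every weight is 0.25, so the result
# is the plain average of the four corner values.
print(bilinear(0.5, 0.5, 0, 0, 1, 1, 10, 20, 30, 40))  # 25.0
```

Note that at a corner (e.g. $x = x_1$, $y = y_1$) the formula collapses to the corresponding original pixel value, as expected.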

Downsampling with anti-aliasing filtering, specifically when reducing the sampling frequency or image size, involves two main operations: filtering and decimation. The anti-aliasing filter operation is designed to remove high-frequency components from the signal or image that cannot be represented at the lower sampling rate, preventing aliasing. This can be represented by a convolution of the image or signal S with a filter $F$, giving a filtered image or signal $S'$:

$$ S' = S * F $$

where $S * F$ denotes the convolution operation. After filtering, decimation reduces the sampling rate by selecting every Nth sample from $S'$, effectively reducing the data size. If $D$ is the decimated signal or image, and $n$ is the new sampling index, then:

$$ D[n] = S'[Nn] $$

where $N$ is the decimation factor. This process ensures that the new, reduced-size signal or image retains as much of the original's essential information as possible without introducing aliasing artifacts.
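A one-dimensional sketch of this filter-then-decimate pipeline, assuming a simple 3-tap triangular low-pass filter as a stand-in for a real anti-aliasing filter:

```python
# Filter-then-decimate: S' = S * F, followed by D[n] = S'[N n].

def convolve(signal, taps):
    """'Valid' 1-D convolution of a signal with a short filter."""
    k = len(taps)
    return [sum(signal[i + j] * taps[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def decimate(filtered, N):
    """Keep every Nth sample of the filtered signal: D[n] = S'[N n]."""
    return filtered[::N]

S = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
F = [0.25, 0.5, 0.25]        # simple triangular low-pass filter
S_prime = convolve(S, F)     # smoothed signal: [1.0, 2.0, ..., 8.0]
D = decimate(S_prime, 2)     # half the sampling rate
print(D)  # [1.0, 3.0, 5.0, 7.0]
```

Because the filter removes high-frequency content before samples are dropped, the decimated output still summarizes its neighborhood instead of sampling raw, potentially aliased values.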

Both interpolation and downsampling can reduce an input image or feature map to a smaller scale, but they are resource-intensive operations that are generally too costly to apply inside neural networks.

Solution

Logic: Pooling simplifies the computational complexity of image processing by reducing the spatial size of the image or feature map and retaining "important" information. The fundamental logic of pooling lies in its use of a sliding window (like convolution, but without trainable parameters) that traverses the image in discrete steps, known as strides, and aggregates the values within this kernel.

The kernel, often referred to as the pooling window, has a defined size (for example, 2×2 pixels) that determines how many pixels in each region are considered together. The stride dictates how many pixels the kernel moves between operations. For instance, a stride of 2 with a 2×2 kernel means the kernel jumps two pixels at a time, producing non-overlapping regions.


Within each kernel, pooling combines the pixel values into a single value, effectively diminishing the dimensions of the feature maps, which in turn decreases the computational load within the network and enhances computational efficiency.

Methods: Two pooling methods are widely used in modern CNNs: average pooling and max pooling.
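Both methods can be sketched with one sliding-window routine that only differs in how the values inside each window are aggregated. The minimal version below operates on a single-channel feature map stored as nested lists, with a 2×2 window and stride 2.

```python
# 2x2 pooling with stride 2 on a single-channel feature map.
# `op` aggregates the values in each window: max for max pooling,
# avg (defined below) for average pooling.

def pool2d(fmap, k=2, stride=2, op=max):
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - k + 1, stride):
        row = []
        for j in range(0, w - k + 1, stride):
            window = [fmap[i + di][j + dj] for di in range(k) for dj in range(k)]
            row.append(op(window))
        out.append(row)
    return out

def avg(window):
    return sum(window) / len(window)

fmap = [[ 1,  2,  3,  4],
        [ 5,  6,  7,  8],
        [ 9, 10, 11, 12],
        [13, 14, 15, 16]]

print(pool2d(fmap, op=max))  # [[6, 8], [14, 16]]
print(pool2d(fmap, op=avg))  # [[3.5, 5.5], [11.5, 13.5]]
```

Max pooling keeps the strongest activation in each region, while average pooling smooths the region into its mean; either way, the 4×4 input shrinks to 2×2, quartering the downstream spatial cost.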