Padding is the process of adding extra values, typically zeros, around the edges of an image input.

Problem Statement: In CNNs, convolution operations inevitably shrink feature maps, and the loss is concentrated at the image edges. This shrinkage complicates the design of deep networks, particularly when the outputs must be the same size as the input or larger. Without an effective padding strategy, essential details and the structural integrity of the image are lost, hindering the network's ability to analyze and learn from the full extent of the data, especially in deeper layers.

Case: Applying a 3*3 convolution filter to an h*w image shrinks the output feature map to (h-2)*(w-2), because pixels at the image edges lack enough surrounding neighbors for the kernel to be applied without exceeding the image bounds.

https://github.com/yyhtbs-yye/course_images/blob/main/3 (1).gif?raw=true
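To make the case concrete, here is a minimal sketch (assuming NumPy and SciPy are available; the image is random placeholder data) in which convolve2d's "valid" mode applies the kernel only where it fits entirely inside the image:

```python
import numpy as np
from scipy.signal import convolve2d

h, w = 64, 64
image = np.random.rand(h, w)           # placeholder h*w image
kernel = np.ones((3, 3)) / 9.0         # simple 3*3 average filter

# "valid" mode = no padding: the kernel never exceeds the image bounds
out = convolve2d(image, kernel, mode="valid")
print(image.shape, "->", out.shape)    # (64, 64) -> (62, 62)
```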

Analysis

When a convolution with a kernel of size k is applied to a one-dimensional input of length m, the resulting output length is m - k + 1. The formula arises because, as the kernel slides across the input from left to right, it starts aligned with the first element and stops where the kernel's last element aligns with the input's last element. The kernel therefore occupies only m - k + 1 distinct positions before it would extend past the end of the input sequence.
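The formula is easy to verify numerically. A quick sketch, assuming NumPy, where "valid" mode keeps only the positions at which the kernel fits entirely inside the input:

```python
import numpy as np

m, k = 10, 3
x = np.arange(m, dtype=float)          # length-m input
kernel = np.ones(k) / k                # length-k moving average

out = np.convolve(x, kernel, mode="valid")
print(len(out), "==", m - k + 1)       # 8 == 8
```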

This can be visualized in the diagram below: the kernel starts aligned with the first input element, producing the first output element. Each slide to the right yields one new output element, until the kernel's last element reaches the end of the input sequence; that final position produces the last output element.

[Figure: a kernel sliding across a one-dimensional input, producing one output element per position]

Therefore, the output feature map is shorter than the input by k - 1 elements. Sliding the kernel from its first position (kernel start aligned with the input's first element) to its last (kernel end aligned with the input's last element) covers m - k slides, and the +1 in the formula counts the initial position itself, giving m - k + 1 outputs in total.

The concept scales naturally to two dimensions: an h*w image convolved with a k*k kernel yields an output of (h-k+1)*(w-k+1). For example, applying a 3*3 kernel to a 64*64 image yields an output of size 62*62.
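The same reduction shows up in a deep-learning framework. A sketch assuming PyTorch: Conv2d's default padding=0 turns a 64*64 feature map into 62*62.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)  # padding=0 by default
x = torch.randn(1, 1, 64, 64)                # (batch, channels, h, w)
y = conv(x)
print(tuple(x.shape), "->", tuple(y.shape))  # (1, 1, 64, 64) -> (1, 1, 62, 62)
```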

Solution

Logic: to compensate for the loss of k-1 pixels in both the horizontal and vertical dimensions, we can simply pad the image in advance, typically on all sides. For a kernel of size k=3, where k-1=2, we add one pixel of padding to the top, bottom, left, and right edges. A 64*64 image thus becomes 66*66; convolving the padded image with a 3*3 kernel loses two pixels in each dimension, producing a 64*64 output that matches the size of the original input. In general, padding of (k-1)/2 pixels on each side preserves the input size for any odd k.
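A sketch of the fix, again assuming PyTorch: setting padding=1 (that is, (k-1)/2 for k=3) makes the 3*3 convolution size-preserving.

```python
import torch
import torch.nn as nn

# one pixel of padding on every side compensates for the k-1 = 2 lost pixels
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1)
x = torch.randn(1, 1, 64, 64)
y = conv(x)
print(tuple(x.shape), "->", tuple(y.shape))  # (1, 1, 64, 64) -> (1, 1, 64, 64)
```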

Methods: Several padding techniques exist; each extends the image's borders in a different way, such as filling with zeros, replicating the edge pixels, mirroring the border, or wrapping around, as illustrated below.

[Figure: illustration of common padding methods]
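These border-extension choices map directly onto NumPy's np.pad modes; a minimal sketch, assuming the methods of interest are zero ("constant"), replicate ("edge"), mirror ("reflect"), and circular ("wrap") padding:

```python
import numpy as np

image = np.arange(9).reshape(3, 3)     # tiny 3*3 example image
for mode in ("constant", "edge", "reflect", "wrap"):
    padded = np.pad(image, pad_width=1, mode=mode)
    print(mode, padded.shape)          # each mode pads 3*3 up to 5*5
    print(padded)
```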

Padding and Image Structure

To illustrate, let's compare images before and after convolution (using a simple average filter) with and without padding.

[Figure: a ring image after average filtering, with and without padding]

With padding, the convolution process better preserves the original structure of the image. Without padding, the overall shape of the ring may still be recognizable, but crucial details, such as the expected lower values at the outermost pixels, are lost, distorting the image structure. This distortion compounds across subsequent layers, biasing the analysis performed by deeper layers.
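The comparison can be reproduced with a synthetic ring. A sketch assuming NumPy/SciPy, where the ring image is a hypothetical stand-in for the figure above; "valid" mode drops the borders, while "same" mode zero-pads so the output matches the input size:

```python
import numpy as np
from scipy.signal import convolve2d

# synthetic ring image (hypothetical stand-in for the figure above)
yy, xx = np.mgrid[:64, :64]
r = np.hypot(yy - 32, xx - 32)
ring = ((r > 18) & (r < 26)).astype(float)

kernel = np.ones((5, 5)) / 25.0        # simple 5*5 average filter

no_pad = convolve2d(ring, kernel, mode="valid")  # shrinks to 60*60, borders dropped
padded = convolve2d(ring, kernel, mode="same")   # zero-padded, stays 64*64
print(no_pad.shape, padded.shape)      # (60, 60) (64, 64)
```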