Convolution is a mathematical operation performed by sliding a function known as a kernel or filter, denoted by $\mathbf{w}$, over another function or signal, denoted by $\mathbf{x}$.
In the simplest case of a single-channel signal (such as mono audio), the dimensions of $\mathbf{x}$ are represented as $[T, 1]$, where $T$ is the length of the sequence. For a multi-channel signal (such as stereo audio), the dimensions of $\mathbf{x}$ are $[T, C]$ where $C$ denotes the number of channels, e.g., $C=2$ for stereo audio.
In the context of a grayscale image, the dimensions are $[H, W, 1]$, whereas for an RGB color image, which is a more standard format in image processing, the dimensions are $[H, W, C=3]$. For RGB multiview video signals, the dimensions extend to $[T, V, H, W, C=3]$.
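To make these layouts concrete, here is a minimal NumPy sketch of the shapes described above. The specific sizes (16000 samples, 224×224 pixels, 30 frames, 4 views) are arbitrary placeholders chosen for illustration, not values prescribed by the text:

```python
import numpy as np

# Illustrative shapes only; the array contents are random placeholders.
mono_audio   = np.random.randn(16000, 1)            # [T, 1]      mono audio, T samples
stereo_audio = np.random.randn(16000, 2)            # [T, C=2]    stereo audio
gray_image   = np.random.randn(224, 224, 1)         # [H, W, 1]   grayscale image
rgb_image    = np.random.randn(224, 224, 3)         # [H, W, C=3] RGB image
multiview    = np.random.randn(30, 4, 224, 224, 3)  # [T, V, H, W, C=3] RGB multiview video

for name, arr in [("mono", mono_audio), ("stereo", stereo_audio),
                  ("gray", gray_image), ("rgb", rgb_image), ("multiview", multiview)]:
    print(name, arr.shape)
```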
Convolution can be applied to these and other more complex signal types, with the requirement that the kernel's dimensions are compatible with the dimensions of the data.
Let's begin by using convolution to process a one-dimensional signal. First, imagine a segment of a one-dimensional audio signal as depicted on the left below.
Next, we'll apply convolution to modify the signal. Before we proceed, we must predefine the size of the convolution kernel. For an illustrative example, let's choose a kernel size of 3, resulting in a convolution kernel as shown on the right above. For now, we do not really care about its exact values; just treat them as the variables $[w_1, w_2, w_3]$.
Convolution can be summed up in a single phrase: a moving inner product. The process is illustrated below:
Due to the limitations of the drawing, we show only the final result on the left and omit the calculation for each convolution step.
At each position, convolution computes the dot product of the kernel with an overlapping sub-segment of the signal, effectively measuring the similarity between the kernel and each local segment of the input signal. This process results in a new transformed signal that highlights features captured by the kernel, which makes convolution especially useful in applications like image and audio processing where it helps in feature detection and analysis.
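The following is a minimal NumPy sketch of this moving inner product for a one-dimensional signal. The function name, the toy signal, and the kernel values standing in for $[w_1, w_2, w_3]$ are all illustrative assumptions; also note that, as in most deep-learning usage, the kernel is not flipped, so strictly speaking this is a cross-correlation:

```python
import numpy as np

def conv1d_valid(x, w):
    """Moving inner product: slide the kernel w over the signal x and take the
    dot product with each overlapping sub-segment ("valid" positions only)."""
    T, K = len(x), len(w)
    return np.array([np.dot(w, x[t:t + K]) for t in range(T - K + 1)])

# Toy signal and a size-3 kernel whose values stand in for [w1, w2, w3].
x = np.array([1.0, 3.0, 2.0, 0.0, 4.0, 1.0])
w = np.array([0.5, 1.0, 0.5])

print(conv1d_valid(x, w))
# Each output element is the dot product of w with one length-3 window of x,
# e.g. the first is 0.5*1 + 1.0*3 + 0.5*2 = 4.5.
```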
A typical example in practice is a single-channel 2-D convolution, often visualized in the context of image processing.
The process starts with a predefined $3 \times 3$ template, or kernel, such as
$$\mathbf{w} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}.$$
This template is used to analyze the image by focusing on one receptive field at a time (sliding over the whole image), specifically a $3 \times 3$ area centered around the pixel located at coordinates $(x, y)$. This means that, for each position, the area of the image being examined extends from $(x-1, y-1)$ to $(x+1, y+1)$.
For each of these $3 \times 3$ sections of the image, the template's values are multiplied by the corresponding pixel values in the receptive field. These products are then summed together to create a single value. This value acts as a score that indicates how closely the image section matches the pattern the kernel is designed to identify, such as edges or textures.
This score is then assigned to a new image, at the position that corresponds to the central pixel $(x, y)$ of the original image section being analyzed. Through this systematic application across the entire image, a new image is formed. This new image, or feature map, highlights the presence of specific features detected by the kernel throughout the original image, effectively capturing patterns and textures relevant to further analysis or recognition tasks.
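Below is a minimal sketch of this 2-D procedure using the $3 \times 3$ kernel above on a made-up $5 \times 5$ image. Padding and strides are omitted, so the sketch only visits positions where the kernel fits entirely inside the image and the feature map comes out slightly smaller than the input; the function and variable names are illustrative assumptions:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a 2-D kernel over a single-channel image and compute the
    element-wise product-and-sum at each position ("valid" positions only)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kH, x:x + kW]   # the 3x3 receptive field
            out[y, x] = np.sum(patch * kernel)  # score for this position
    return out

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]], dtype=float)

# A small made-up 5x5 "image"; real images would be much larger.
image = np.arange(25, dtype=float).reshape(5, 5)

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (3, 3): one score per 3x3 receptive field
print(feature_map)
```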