Convolution is a mathematical operation in which a kernel (or filter) $\mathbf{w}$ slides over an input $\mathbf{x}$, transforming the data. It applies to data of various dimensions, from simple mono audio to complex 3D video, as long as the kernel's dimensions are compatible with the input's.

1D Convolution

Let's begin by applying convolution to a one-dimensional signal.

A single-channel signal, such as mono audio, has dimensions $[11, 1]$, where $11$ is the sequence length and $1$ is the number of channels.

We will use a convolutional kernel $\mathbf{w}$ of size 3.

[Figure: a size-3 kernel sliding over a length-11 signal]

Convolution is a specialized form of matrix multiplication where, instead of multiplying entire matrices at once, we perform dot products between a small filter (kernel) and different local regions of the input.

This is also known as the moving inner (dot) product, similar to how dot products are used in matrix multiplication, but applied in a localized, overlapping manner. The process is illustrated on the right.

At each position, convolution computes the dot product between the kernel and a sub-segment of the input signal. One way to view it is as measuring similarity, producing a transformed signal that highlights patterns matching the kernel, aiding feature extraction.
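The moving dot product described above can be written directly as a loop. Here is a minimal NumPy sketch; the signal and kernel values are made up for illustration, and only "valid" positions (where the kernel fully overlaps the signal) are computed:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 2., 1., 0., 1., 2., 3., 2.])  # length-11 signal (illustrative values)
w = np.array([1., 0., -1.])                                  # size-3 kernel (illustrative values)

# Slide the kernel over the signal; at each position, take the dot product
# between the kernel and the overlapping sub-segment of the input.
z = np.array([np.dot(w, x[i:i + 3]) for i in range(len(x) - 3 + 1)])
print(z)  # 9 output values: one per valid kernel position
```

Note that an 11-element signal and a size-3 kernel yield $11 - 3 + 1 = 9$ output values, since the kernel must fit entirely inside the signal.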

[Figure: the moving dot product between the kernel and sub-segments of the input]

Note: In signal processing, the kernel array must be reversed before applying the operation (e.g., $w_1, w_2, w_3 \rightarrow w_3, w_2, w_1$). However, in convolutional neural networks, we use the kernel in its original order ($w_1, w_2, w_3$), because the network learns the optimal weights automatically, making the flip irrelevant.
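NumPy exposes both conventions, which makes the flip easy to see. A short sketch (input values are made up for illustration): `np.convolve` flips the kernel, as in signal processing, while `np.correlate` uses it as-is, like a CNN layer does.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
w = np.array([1., 0., -1.])

# Signal-processing convolution: the kernel is flipped before sliding.
conv = np.convolve(x, w, mode='valid')
# Cross-correlation: the kernel is used in its original order (the CNN convention).
corr = np.correlate(x, w, mode='valid')

# Convolution equals cross-correlation with the reversed kernel.
assert np.allclose(conv, np.correlate(x, w[::-1], mode='valid'))
```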

2D Convolution - Single-Channel Input Single-Channel Output

In the case of 2D data, such as a grayscale image, the dimensions are $[H, W, 1]$, where $H$ and $W$ represent height and width, respectively.

The figure on the right illustrates the convolution of a $3\times3$ window (kernel) over a $16\times 16$ image. At each position, the window captures a $3 \times 3$ region of the image, and its values are computed as the dot product between the window and the corresponding overlapping region in the image.

[Animation: a $3\times3$ window sliding over a $16\times16$ image]

To better understand the 2D convolution process step by step, let's consider a numerical example. We use the following $3 \times 3$ kernel:

$$ \mathbf{k} = \begin{bmatrix} 1& 0& 1\\ 0& 1& 0\\ 1& 0& 1\end{bmatrix} $$

Note: the width in spatial dimensions is typically denoted by $W$. To avoid a clash with that notation, we use $\mathbf{k}$ to denote the kernel weights instead of $\mathbf{w}$.

Although the sliding window moves across all pixel positions in the image, we will focus on three specific positions, each highlighted by a different colored cross in the figure below.

[Figure: three kernel positions on the image, each marked by a colored cross]

Let's focus on the blue box, centered at position $(2, 10)$. It captures a small local region of values:

[Figure: the $3\times3$ region captured by the blue box]

$$ \mathbf{x}_{3\times3}[2, 10] = \begin{bmatrix} 0& 0& 0\\ 0& 1& 1\\ 1& 1& 1\end{bmatrix} $$

To compute the convolution at this position, we take the element-wise product of this region with the kernel and sum the results, effectively performing a dot product operation.

$$ \begin{align*}\mathbf{z}[2, 10] & = \text{dot}(\mathbf{x}_{3\times3}[2, 10],\mathbf{k})= \left<\begin{bmatrix} 1& 0& 1\\ 0& 1& 0\\ 1& 0& 1\end{bmatrix}, \begin{bmatrix} 0& 0& 0\\ 0& 1& 1\\ 1& 1& 1\end{bmatrix} \right>\\&= 1\times0+0\times 0+1\times 0+0\times 0+1\times 1+0\times 1+1\times 1+0\times 1+1\times 1\\&=3 \end{align*} $$
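The same computation takes two lines in NumPy: multiply the kernel and the captured region element-wise, then sum.

```python
import numpy as np

k = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
patch = np.array([[0, 0, 0],
                  [0, 1, 1],
                  [1, 1, 1]])  # the 3x3 region captured at (2, 10)

# Element-wise product followed by a sum, i.e., the dot product of the
# flattened kernel and the flattened patch.
z_2_10 = np.sum(k * patch)
print(z_2_10)  # 3
```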

This value is then assigned to the output hidden image $\mathbf{z}$ at the position corresponding to the central pixel $(2, 10)$.

The process systematically iterates across the image, forming the hidden image $\mathbf{z}$. This feature map highlights key patterns and textures detected by the kernel, aiding further analysis or recognition.
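Putting the pieces together, the full sweep over the image is just the single-position computation repeated at every valid location. A minimal sketch, assuming no padding and a stride of 1 (the toy binary image below is made up for illustration):

```python
import numpy as np

def conv2d(x, k):
    """Slide kernel k over image x; at each valid position, take the
    element-wise product with the overlapping region and sum (no kernel flip,
    following the CNN convention used in this article)."""
    kh, kw = k.shape
    H, W = x.shape
    z = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            z[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return z

k = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
image = np.random.randint(0, 2, size=(16, 16))  # toy 16x16 binary image
z = conv2d(image, k)
print(z.shape)  # (14, 14): 16 - 3 + 1 along each axis
```

Without padding, the hidden image shrinks from $16\times16$ to $14\times14$, since the $3\times3$ kernel must fit entirely inside the image.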