Resizing data is a common requirement in neural network processing. For image processing, spatial resizing is usually achieved via pooling and other resampling methods, while efficient channel resizing is often achieved with a $1 \times 1$ convolution.

By treating each spatial position as an independent vector of features, the layer can learn to combine these channels optimally, much like a fully connected layer but with significantly lower computational cost.
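The snippet below is a minimal sketch (assuming PyTorch) of this channel-resizing use: a $1 \times 1$ convolution compresses the channel dimension of a feature map while leaving the spatial dimensions unchanged. The tensor sizes are illustrative, not taken from the text above.

```python
import torch
import torch.nn as nn

# Hypothetical feature map: batch of 8, 256 channels, 32x32 spatial grid.
x = torch.randn(8, 256, 32, 32)

# 1x1 convolution that compresses 256 channels down to 64.
channel_reduce = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1)

y = channel_reduce(x)
print(y.shape)  # torch.Size([8, 64, 32, 32]) -- spatial size is preserved
```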

What is $1 \times 1$ Convolution?

The design of a $1 \times 1$ convolution centers on its role as a channel-wise fully connected layer. For each pixel in the input feature maps, the $1 \times 1$ kernel performs a weighted sum across all input channels to produce a new set of feature channels.


This mechanism effectively transforms the channel space while keeping the spatial dimensions intact. There is no spatial aggregation because the kernel covers only a single pixel at a time.
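To make the "channel-wise fully connected layer" view concrete, the sketch below (again assuming PyTorch; the layer sizes are illustrative) shows that a $1 \times 1$ convolution produces exactly the same output as a linear layer with the same weights applied independently at every spatial position.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 16, 8, 8)              # (batch, in_channels, H, W)
conv = nn.Conv2d(16, 4, kernel_size=1)    # 1x1 kernel: weight shape (4, 16, 1, 1)

# Linear layer sharing the same weights and bias as the 1x1 convolution.
fc = nn.Linear(16, 4)
fc.weight.data = conv.weight.data.view(4, 16)
fc.bias.data = conv.bias.data

out_conv = conv(x)
# Move channels last, apply the linear layer per pixel, move channels back.
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

print(torch.allclose(out_conv, out_fc, atol=1e-6))  # True
```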

Why $1 \times 1$ Convolution is Important

The strength of the $1 \times 1$ convolution lies in its ability to act as an efficient, channel-wise transformation tool. It can reduce or expand the number of channels (and with it the cost of subsequent layers), inject additional non-linearity when followed by an activation function, and let the network learn cross-channel interactions, all without altering the spatial resolution.

This operation is particularly useful in architectures like Inception networks and ResNets, where managing the dimensionality and complexity of feature maps is crucial for performance and efficiency.
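As one illustration of this pattern, here is a minimal sketch (assuming PyTorch) of a ResNet-style bottleneck block: a $1 \times 1$ convolution shrinks the channel count, a $3 \times 3$ convolution mixes spatially on the cheaper representation, and a second $1 \times 1$ convolution restores the original channel count. The channel counts are assumed for illustration.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, reduced, kernel_size=1)             # cheap channel compression
        self.spatial = nn.Conv2d(reduced, reduced, kernel_size=3, padding=1)  # spatial mixing on fewer channels
        self.expand = nn.Conv2d(reduced, channels, kernel_size=1)             # restore channel count
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.spatial(out))
        out = self.expand(out)
        return self.relu(out + x)  # residual connection

x = torch.randn(1, 256, 28, 28)
print(Bottleneck()(x).shape)  # torch.Size([1, 256, 28, 28])
```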

Think Further: Can you think of a way to combine $1 \times 1$ convolutions with other convolutional operations to realize the effect of conventional convolutions more efficiently?