Resizing data is a common requirement in neural network processing. For images, spatial resizing is usually achieved via pooling and other resampling methods, while efficient channel resizing is often achieved with a $1 \times 1$ convolution.
By treating each spatial position as an independent vector of features, the layer can learn to combine these channels optimally, much like a fully connected layer but with significantly lower computational cost.
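A minimal sketch of channel resizing, assuming PyTorch and arbitrary illustrative channel counts (64 in, 32 out) that are not from the text:

```python
import torch
from torch import nn

# A 1x1 convolution that maps 64 input channels to 32 output channels.
# The channel counts here are illustrative assumptions.
conv1x1 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)

x = torch.randn(8, 64, 28, 28)   # (batch, channels, height, width)
y = conv1x1(x)
print(y.shape)                   # torch.Size([8, 32, 28, 28]) -- spatial size unchanged
```

Only the channel dimension changes; height and width pass through untouched.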
The design of a $1 \times 1$ convolution centers on its role as a channel-wise fully connected layer. For each pixel in the input feature maps, the $1 \times 1$ kernel performs a weighted sum across all input channels to produce a new set of feature channels.
This mechanism effectively transforms the channel space while keeping the spatial dimensions intact. There is no spatial aggregation because the kernel covers only a single pixel at a time.
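To make the "channel-wise fully connected layer" view concrete, the sketch below (again assuming PyTorch, with small made-up tensor sizes) checks that a $1 \times 1$ convolution produces the same output as applying its weight matrix to every spatial position as an ordinary fully connected layer:

```python
import torch
from torch import nn

# Assumed sizes for illustration only.
batch, c_in, c_out, h, w = 2, 4, 3, 5, 5

conv = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
x = torch.randn(batch, c_in, h, w)

# Path 1: apply the 1x1 convolution directly.
y_conv = conv(x)

# Path 2: treat every spatial position as an independent feature vector
# and apply the same weights as a fully connected layer.
weight = conv.weight.view(c_out, c_in)            # (c_out, c_in, 1, 1) -> (c_out, c_in)
x_flat = x.permute(0, 2, 3, 1).reshape(-1, c_in)  # (batch*h*w, c_in)
y_fc = (x_flat @ weight.T).reshape(batch, h, w, c_out).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_fc, atol=1e-6))    # True: identical up to float error
```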
The strength of the $1 \times 1$ convolution lies in its ability to act as an efficient, channel-wise transformation tool. It is particularly useful in architectures such as Inception networks and ResNets, where managing the dimensionality and complexity of feature maps is crucial for performance and efficiency.
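As a rough sketch in the spirit of a ResNet bottleneck block (assuming PyTorch; the channel counts 256 and 64 are illustrative, not taken from any particular model), $1 \times 1$ convolutions reduce and then restore the channel count around a $3 \times 3$ convolution, so the expensive spatial convolution operates on fewer channels:

```python
import torch
from torch import nn

# Bottleneck sketch: reduce channels, apply the 3x3 convolution cheaply, expand back.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # reduce channels
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # spatial processing on fewer channels
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),            # expand channels back
)

x = torch.randn(1, 256, 14, 14)
print(bottleneck(x).shape)  # torch.Size([1, 256, 14, 14])

# Parameter comparison against a single full-width 3x3 convolution.
n_bottleneck = sum(p.numel() for p in bottleneck.parameters())
n_full = sum(p.numel() for p in nn.Conv2d(256, 256, kernel_size=3, padding=1).parameters())
print(n_bottleneck, n_full)  # the bottleneck uses far fewer parameters
```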
Think Further: Can you combine $1 \times 1$ convolutions with other convolutional operations to realize a conventional convolution?