Resizing data is a common requirement in neural network processing. For images, spatial resizing is usually achieved via pooling and other resampling methods, while efficient channel resizing is often achieved with a $1 \times 1$ convolution.
By treating each spatial position as an independent vector of features, the layer can learn to combine these channels optimally, much like a fully connected layer but with significantly lower computational cost.
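A minimal sketch of channel resizing, assuming PyTorch and arbitrary illustrative channel counts (64 in, 32 out) that are not from the text:

```python
import torch
from torch import nn

# A 1x1 convolution that maps 64 input channels to 32 output channels.
# The channel counts here are illustrative assumptions.
conv1x1 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)

x = torch.randn(8, 64, 28, 28)   # (batch, channels, height, width)
y = conv1x1(x)
print(y.shape)                   # torch.Size([8, 32, 28, 28]) -- spatial size unchanged
```

Only the channel dimension changes; height and width pass through untouched.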
The design of a $1 \times 1$ convolution centers on its role as a channel-wise fully connected layer. For each pixel in the input feature maps, the $1 \times 1$ kernel performs a weighted sum across all input channels to produce a new set of feature channels.
This mechanism effectively transforms the channel space while keeping the spatial dimensions intact. There is no spatial aggregation because the kernel covers only a single pixel at a time.
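To make the "channel-wise fully connected layer" view concrete, the sketch below (again assuming PyTorch, with small made-up tensor sizes) checks that a $1 \times 1$ convolution produces the same output as applying its weight matrix to every spatial position as an ordinary fully connected layer:

```python
import torch
from torch import nn

# Assumed sizes for illustration only.
batch, c_in, c_out, h, w = 2, 4, 3, 5, 5

conv = nn.Conv2d(c_in, c_out, kernel_size=1, bias=False)
x = torch.randn(batch, c_in, h, w)

# Path 1: apply the 1x1 convolution directly.
y_conv = conv(x)

# Path 2: treat every spatial position as an independent feature vector
# and apply the same weights as a fully connected layer.
weight = conv.weight.view(c_out, c_in)            # (c_out, c_in, 1, 1) -> (c_out, c_in)
x_flat = x.permute(0, 2, 3, 1).reshape(-1, c_in)  # (batch*h*w, c_in)
y_fc = (x_flat @ weight.T).reshape(batch, h, w, c_out).permute(0, 3, 1, 2)

print(torch.allclose(y_conv, y_fc, atol=1e-6))    # True: identical up to float error
```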
The strength of the $1 \times 1$ convolution lies in its ability to act as an efficient, channel-wise transformation tool. It is particularly useful in architectures such as Inception networks and ResNets, where managing the dimensionality and complexity of feature maps is crucial for performance and efficiency.
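As a rough sketch in the spirit of a ResNet bottleneck block (assuming PyTorch; the channel counts 256 and 64 are illustrative, not taken from any particular model), $1 \times 1$ convolutions reduce and then restore the channel count around a $3 \times 3$ convolution, so the expensive spatial convolution operates on fewer channels:

```python
import torch
from torch import nn

# Bottleneck sketch: reduce channels, apply the 3x3 convolution cheaply, expand back.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # reduce channels
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # spatial processing on fewer channels
    nn.ReLU(),
    nn.Conv2d(64, 256, kernel_size=1),            # expand channels back
)

x = torch.randn(1, 256, 14, 14)
print(bottleneck(x).shape)  # torch.Size([1, 256, 14, 14])

# Parameter comparison against a single full-width 3x3 convolution.
n_bottleneck = sum(p.numel() for p in bottleneck.parameters())
n_full = sum(p.numel() for p in nn.Conv2d(256, 256, kernel_size=3, padding=1).parameters())
print(n_bottleneck, n_full)  # the bottleneck uses far fewer parameters
```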
Think Further: Can you combine $1 \times 1$ convolutions with other convolutional operations to realize a conventional convolution?