Traditional machine learning often involves fixed input and output dimensions. For example, a neural network might be designed to take a $16 \times 16$ grayscale handwritten digit image as input and output a class label.

[Figure: a network with a fixed $16 \times 16$ input and a class-label output]

This conventional approach, though effective for many tasks, limits adaptability and scalability in real-world scenarios where images are rarely the same size. This section examines how modern neural networks overcome these constraints by providing adaptable inputs, outputs, and architectures suited for a variety of dynamic applications.

Flexible Spatial Dimensions

For neural networks handling varying input sizes, such as images, a naive approach is to resize all images to a consistent size using conventional image processing tools. However, this is not always optimal: resizing can distort object scales by compressing them horizontally or vertically, and shrinking the image can discard important details. It is therefore often better to process images at their original scale. In convolutional neural networks, two common operators, convolution and pooling, support this:

Twist Forward: In early convolutional neural network architectures like LeNet-5, AlexNet, and the VGG series, convolutions and pooling were used to extract features from images, followed by fully connected layers for downstream tasks such as classification.

Twist Backward: However, fully connected (FC) layers require predefined, fixed input dimensions. Although convolutions and pooling are agnostic to input size, the FC head prevents these early CNNs from accommodating arbitrary input sizes.

[Figure: an early CNN whose fully connected head fixes the input size]
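To make the constraint concrete, here is a minimal PyTorch sketch (the layer sizes are illustrative, not taken from LeNet-5 or any specific paper) of a flatten-then-FC head that hard-codes a $16 \times 16$ input:

```python
import torch
import torch.nn as nn

# A tiny LeNet-style stack: the convolution and pooling layers accept any
# spatial size, but the flatten + fully connected head assumes a 16x16 input.
fixed_net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # (N, 1, 16, 16) -> (N, 8, 16, 16)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # (N, 8, 16, 16) -> (N, 8, 8, 8)
    nn.Flatten(),                                # 8 * 8 * 8 = 512 features
    nn.Linear(8 * 8 * 8, 10),                    # hard-codes the 16x16 assumption
)

print(fixed_net(torch.randn(1, 1, 16, 16)).shape)  # torch.Size([1, 10])
# fixed_net(torch.randn(1, 1, 32, 32))  # fails: flatten yields 2048 features, not 512
```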

Twist Forward: In later neural network architectures like ResNet and Inception, FC layers are still used, but they are preceded by an adapter called Global Average Pooling (GAP). The GAP condenses spatial information into a single value for each feature map, regardless of the input image size. As a result, the combination of Convolution + Pooling + GAP + FC enables support for arbitrary input image sizes.

[Figure: the Convolution + Pooling + GAP + FC pipeline handling arbitrary input sizes]
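Here is a minimal PyTorch sketch of the same idea with GAP inserted before the FC layer; `nn.AdaptiveAvgPool2d(1)` plays the role of GAP here, and the layer sizes are again illustrative:

```python
import torch
import torch.nn as nn

# Convolution + Pooling + GAP + FC: GAP reduces each feature map to a single
# value, so the FC layer always receives a 64-dimensional vector no matter
# what spatial size the input image has.
flexible_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1),   # GAP: (N, 64, H, W) -> (N, 64, 1, 1)
    nn.Flatten(),              # (N, 64)
    nn.Linear(64, 10),
)

for size in (64, 224, 313):    # arbitrary, even non-standard, input sizes
    print(size, flexible_net(torch.randn(1, 3, size, size)).shape)  # always (1, 10)
```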

Twist Backward: However, to keep training and inference efficient, the data within a batch must have a consistent spatial size; otherwise the GPU cannot process the batch as a single tensor. In practice, these models are therefore trained with the same input image size within each batch, while different batches can use different sizes.
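A short sketch of this batching convention, assuming a small GAP-based classifier like the one above:

```python
import random
import torch
import torch.nn as nn

# Same GAP-based design as the previous sketch, compressed into a few layers.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10),
)

# Every image inside a batch shares one spatial size (so the batch is a single
# tensor the GPU can process), but the size changes freely across batches.
for size in (random.choice([224, 256, 320]) for _ in range(3)):
    batch = torch.randn(8, 3, size, size)   # 8 images of the same size
    print(size, net(batch).shape)            # torch.Size([8, 10]) every time
```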

Flexible-In-Corresponding-Output (YOLO Architecture)

Some computer vision applications require flexible outputs. For example, a model must detect objects in different regions of an image, where larger images may contain more objects, and smaller images usually contain fewer objects.

In the figure below, the object detection neural network (referred to as the YOLO backbone) downscales the input image by a factor of $32$ using convolution, pooling, and other layers. Given an input size of $320 \times 320$, the network's output becomes $10 \times 10$.

[Figure: the YOLO backbone downscaling a $320 \times 320$ input to a $10 \times 10$ grid]

In the mosaic-like output, the red box (sized $1\times1$, representing a $32\times 32$ cell in the original image) is responsible for detecting the dog because the dog's center falls within that grid cell.

Because there is no fully connected layer, this downscaling approach elegantly handles varying input sizes and their corresponding outputs. For example, if the input image size increases to $640 \times 640$, the network's output becomes $20 \times 20$. Each position in the $20 \times 20$ grid corresponds to a region in the original $640 \times 640$ image, maintaining the same principle of assigning responsibility for object detection to specific grid cells.
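The stride-32 bookkeeping can be sketched in a few lines of Python; the helper functions below are purely illustrative and not part of any YOLO library:

```python
# Stride-32 bookkeeping for a YOLO-style grid.
STRIDE = 32  # the backbone downscales the input by a factor of 32

def grid_shape(height: int, width: int) -> tuple[int, int]:
    """Output grid size for a given input image size."""
    return height // STRIDE, width // STRIDE

def responsible_cell(center_x: float, center_y: float) -> tuple[int, int]:
    """Grid cell (row, col) whose 32x32 region contains an object's center."""
    return int(center_y // STRIDE), int(center_x // STRIDE)

print(grid_shape(320, 320))            # (10, 10)
print(grid_shape(640, 640))            # (20, 20)
print(responsible_cell(200.0, 150.0))  # (4, 6): rows 128-159, columns 192-223
```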

Twisting Again: For efficient YOLO training and inference, it is often recommended to resize the input to $640 \times 640$; if higher accuracy is desired, this can be increased to $1280 \times 1280$. However, resizing is optional. If you do not resize, then to maintain training efficiency you must crop or pad each image to the same size within a batch, though different batches can have varying sizes.
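If you take the padding route, a minimal sketch might pad every image in a batch to a shared size rounded up to a multiple of $32$; this is a simplified stand-in for the letterbox-style preprocessing that real YOLO implementations use:

```python
import torch
import torch.nn.functional as F

def pad_batch(images: list[torch.Tensor], multiple: int = 32) -> torch.Tensor:
    """Zero-pad (C, H, W) images to one shared size divisible by `multiple`."""
    target_h = max(img.shape[1] for img in images)
    target_w = max(img.shape[2] for img in images)
    # Round the shared size up to the nearest multiple of the backbone stride.
    target_h = ((target_h + multiple - 1) // multiple) * multiple
    target_w = ((target_w + multiple - 1) // multiple) * multiple
    padded = [
        F.pad(img, (0, target_w - img.shape[2], 0, target_h - img.shape[1]))  # pad right and bottom
        for img in images
    ]
    return torch.stack(padded)  # (N, C, target_h, target_w), ready for the GPU

batch = pad_batch([torch.randn(3, 300, 410), torch.randn(3, 280, 330)])
print(batch.shape)  # torch.Size([2, 3, 320, 416])
```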