A residual connection in a neural network refers to a shortcut or skip connection that allows the input to a layer to be added directly to its output, facilitating deeper network training by improving information flow through the network.
Problem Statement: In the context of neural network design, it's hypothesized that a deeper neural network, which includes additional layers on top of a shallower network's structure, could theoretically achieve at least the same level of training performance as its shallower counterpart. This is based on the idea that the extra layers could effectively become identity mappings, passing their inputs through unchanged, thereby preserving the shallower network's behavior. Nevertheless, in practice, training deep neural networks to learn such identity mappings through gradient descent is not straightforward, often leading to difficulties in achieving optimal performance due to issues like vanishing gradients.
Residual connections in neural networks make it easier for information to move from early layers in the network to later ones by creating shortcuts. These shortcuts allow the network to skip some steps in between. This setup lets the network keep some outputs from earlier layers without changing them. In this design, the job of the deeper layers is to learn the small changes, called "residuals," needed to adjust the inputs closer to what we want the final output to be. This means the deeper layers only have to learn the extra adjustments needed, making it simpler for the network to improve its accuracy.
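To make this concrete, here is a minimal, purely illustrative sketch (in PyTorch) of wrapping an arbitrary transformation with a residual connection. The class name ResidualWrapper and the choice of a linear layer are placeholders for illustration, not part of any specific library.

```python
import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    # Wraps a transformation f with a shortcut, so the layer only has to
    # learn the residual adjustment on top of its input.
    def __init__(self, f: nn.Module):
        super().__init__()
        self.f = f  # the transformation on the residual path

    def forward(self, u):
        return u + self.f(u)  # the input is added directly to the output

# If f learns to output (near-)zeros, the wrapped layer behaves as an
# identity mapping and simply passes u through unchanged.
layer = ResidualWrapper(nn.Linear(64, 64))
u = torch.randn(8, 64)
v = layer(u)   # v = u + f(u)
```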
A residual connection is typically implemented within a residual block. In modern neural networks, the design is often block-based: a block is a combination of multiple neural network layers. As in programming, this modular, block-based design makes writing and designing neural networks more intuitive and efficient.
Below is a standard design of a residual block, where the input u is the feature output from the previous layer (which might be a convolutional layer or another residual block), and the output v is the feature extracted by this residual block.
The processing steps in the residual block can be described as below:
v = ReLU(f(u) + u)

Here f(u) is the transformation computed along the residual path (for example, convolution, batch normalization, ReLU, then a second convolution and batch normalization), while u is carried unchanged along the shortcut and added to f(u) before the final ReLU. When f(u) contributes minimally or not at all, effectively making the network learn an identity function for u, the impact of the final ReLU is nuanced. Since inputs are always non-negative (no matter if the input is the image or a rectified feature map), applying ReLU in such scenarios, where it acts directly on the input, does not alter the propagation of u, maintaining the integrity of the information flow.
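As a concrete sketch of such a block, the code below assumes a PyTorch-style basic block with two 3x3 convolutions and batch normalization on the residual path, and assumes the input and output shapes match so the shortcut can be a plain identity (real ResNets insert a 1x1 convolution on the shortcut when the shape changes).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """A sketch of a basic residual block: v = ReLU(f(u) + u)."""
    def __init__(self, channels):
        super().__init__()
        # Residual path f(u): Conv -> BN -> ReLU -> Conv -> BN
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, u):
        f_u = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(u)))))
        return F.relu(f_u + u)  # add the shortcut, then apply the final ReLU

# Example: the block preserves the spatial size and channel count.
block = ResBlock(64)
u = torch.randn(1, 64, 56, 56)
v = block(u)
print(v.shape)  # torch.Size([1, 64, 56, 56])
```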
We concatenate a series of Residual Blocks (ResBlocks) to construct a ResNet (Residual Network), e.g., ResNet-18. The architecture of a ResNet is segmented into three main parts: initial layers, a sequence of ResBlocks, and final layers, each serving a distinct purpose on the path from input data to predictions.
The initial layers of a ResNet architecture are designed for preliminary feature extraction and channel depth conversion. This part typically begins with a convolutional layer that has a large kernel size and stride, aiming to capture basic patterns such as edges and textures from the input images. This is followed by batch normalization and ReLU activation to introduce non-linearity and stabilize the training process. A max pooling layer is included to further reduce the dimensionality of the feature maps, making the network more computationally efficient and invariant to small spatial shifts in the input.
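A minimal sketch of these initial layers is shown below, assuming the choices commonly used in ResNet-18: a 7x7 convolution with stride 2 that expands 3 input channels to 64, followed by batch normalization, ReLU, and a 3x3 max pooling with stride 2. The exact hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

# A sketch of the initial layers ("stem") of a ResNet-style network:
# coarse feature extraction and channel depth conversion (3 -> 64),
# then downsampling via max pooling.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)   # a dummy RGB image
features = stem(x)
print(features.shape)             # torch.Size([1, 64, 56, 56])
```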