A deeper network should match or surpass the performance of a shallower one, as additional layers could function as identity mappings, passing inputs unchanged and preserving the shallower network’s learned function.
Unfortunately, deep networks struggle to learn these identity mappings: the identity function is just one point in the vast space of functions a deep network can represent, so training is unlikely to reach it on its own.
A residual connection in a neural network is a shortcut, or skip connection, that adds a layer's input directly to its output, improving information flow and making deeper networks easier to train. Its motivation is precisely the difficulty, described above, of training very deep networks.
Think Further: Can you use MAP to solve this problem?
Residual connections can be thought of as an "inductive bias" in neural networks: instead of requiring the network to learn an identity mapping, they provide the identity mapping directly. By carrying information from earlier layers to later layers with minimal change, these shortcuts allow the network to bypass intermediate steps when they are not needed.
Consequently, the network can preserve outputs from previous layers unchanged. In this architecture, the primary task of deeper layers is to learn minor modifications, termed "residuals," which refine the inputs to more closely align with the desired final output.
In practice, a residual connection is typically implemented within a module known as a residual block, which is a standard component of the block-based design common in contemporary neural networks. A typical residual block design takes an input $\mathbf z$, which is the output from a previous layer (such as a convolutional layer or another residual block), and produces an output $\mathbf h$, representing the features extracted by the residual block.
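In symbols, if we write $F(\mathbf z)$ for the transformation computed by the block's internal layers (a notation introduced here only for illustration), the block computes, up to the final activation,

$$\mathbf h = \mathbf z + F(\mathbf z),$$

so the layers need only learn the residual correction $F(\mathbf z)$; choosing $F(\mathbf z) = \mathbf 0$ recovers the identity mapping.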
The processing steps within a residual block begin with the input passing through a convolutional layer, followed by batch normalization. A ReLU activation function is then applied. The data moves through another convolutional layer, undergoes another batch normalization, and the resulting output is added back to the original input via a skip connection, provided their dimensions align; if not, the input is transformed to ensure compatibility. Finally, another ReLU activation is applied after this addition.
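The sketch below illustrates these steps in PyTorch, assuming a block that keeps spatial size and channel count fixed so the skip connection can add the input directly; the class name `ResidualBlock` and the 3x3 kernels are choices made for this illustration, not mandated by the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: conv -> BN -> ReLU -> conv -> BN, skip addition, ReLU."""

    def __init__(self, channels: int):
        super().__init__()
        # Both convolutions preserve spatial size and channel count,
        # so the skip connection can add the input without any transformation.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(z)))
        residual = self.bn2(self.conv2(residual))
        # Skip connection: add the unchanged input, then apply the final ReLU.
        h = self.relu(z + residual)
        return h
```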
Sum Aggregation Before ReLU: The placement of ReLU after the addition in the residual path is driven by practical considerations rather than strict theory. It introduces mild nonlinearity while allowing the network to learn an identity function when the residual contribution is minimal or absent.
For residual connections to effectively aggregate the unmodified input with the transformed output, the intermediate modules typically need to maintain consistent dimensions, including both spatial size and channel count. This consistency ensures that skip connections can directly add inputs to outputs without requiring additional transformations, preserving the integrity and simplicity of information flow within the network.
In practice, residual blocks often change dimensions, for example through downsampling or channel expansion, as demonstrated in the original ResNet architectures. In these cases, the skip connection itself must apply a transformation (e.g., a convolution with stride greater than one or a projection layer) to realign dimensions before the aggregation.
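A minimal sketch of such a block, again assuming PyTorch, uses a strided 1x1 convolution as the projection on the skip path; the class name `DownsamplingResidualBlock` and the specific strides and widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DownsamplingResidualBlock(nn.Module):
    """Residual block whose main path reduces spatial size and changes channel count;
    the skip path uses a strided 1x1 projection so the two can still be added."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut: realigns spatial size and channel count before the addition.
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(z)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(self.projection(z) + residual)
```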
On the other hand, certain applications, such as image enhancement tasks, specifically benefit from purely residual connections without any dimension changes. In these scenarios, the residual connections serve strictly to enhance features without spatial or channel alterations, thereby maintaining the exact dimensional consistency and directly adding the original input to the transformed output.
When we stack a series of residual blocks (ResBlocks), we form a ResNet (Residual Network, e.g., ResNet-18). The architecture of a ResNet is divided into three key sections: the initial layers, a sequence of ResBlocks, and the final layers.
Each of these parts plays a specific role in processing input data and generating predictions.
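As a sketch of this three-part layout, the snippet below stacks the `ResidualBlock` class from the earlier example between a small set of initial layers and a classification head. The class name `TinyResNet`, the block count, and the channel widths are arbitrary choices for illustration and far smaller than, say, ResNet-18.

```python
import torch
import torch.nn as nn

class TinyResNet(nn.Module):
    """Illustrative layout: initial layers, a stack of ResBlocks, final layers."""

    def __init__(self, num_classes: int = 10, channels: int = 64, num_blocks: int = 4):
        super().__init__()
        # Initial layers: lift the 3-channel image to the working channel count
        # and reduce spatial resolution.
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Sequence of ResBlocks (the ResidualBlock class sketched earlier).
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(num_blocks)])
        # Final layers: global average pooling and a linear classifier.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.blocks(self.stem(x))
        x = self.pool(x).flatten(1)
        return self.fc(x)


# Usage: a batch of two 3x64x64 images yields two 10-way logit vectors.
logits = TinyResNet()(torch.randn(2, 3, 64, 64))
```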