When training deep neural networks, two common convergence issues that can severely hamper learning are vanishing gradients and exploding gradients. Both problems are rooted in the architecture and dynamics of backpropagation in deep models.

Recall the Chain Rule

Gradients in a neural network are computed using the chain rule of calculus.


For a given weight in layer $l-2$, denoted as $w^{(l-2)}$, the gradient of the loss function $J$ with respect to this weight is calculated as:

$$ \begin{align*}\frac{\partial J}{\partial w^{(l-2)}} &= \frac{\partial J}{\partial a^{(l+1)}} \cdot \frac{\partial a^{(l+1)}}{\partial z^{(l+1)}}\cdot \frac{\partial z^{(l+1)}}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial z^{(l)}}\cdot \frac{\partial z^{(l)}}{\partial a^{(l-1)}}\cdot \frac{\partial a^{(l-1)}}{\partial z^{(l-1)}}\cdot \frac{\partial z^{(l-1)}}{\partial w^{(l-2)}} \\ &= \frac{\partial J}{\partial a^{(l+1)}} \cdot \sigma'({z^{(l+1)}})\cdot w^{(l+1)} \cdot \sigma'({z^{(l)}})\cdot w^{(l)}\cdot \sigma'({z^{(l-1)}})\cdot a^{(l-2)} \end{align*} $$

Here, $a^{(l+1)} = \sigma(z^{(l+1)})$ is the activation and $z^{(l+1)}$ is the linear combination (pre-activation) at layer $l+1$. The gradient is thus a long product of per-layer activation derivatives and weights, and this repeated multiplication across layers is what can make the gradient extremely small or extremely large.
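To make the multiplication concrete, here is a minimal numeric sketch of the product above. Every value in it (the weights, pre-activations, and the loss derivative) is an illustrative placeholder rather than anything from the text; the point is only how quickly the product shrinks when each $\sigma'$ factor is at most $0.25$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Hypothetical values for the quantities appearing in the chain above.
dJ_da = 1.0                                     # dJ/da^(l+1), taken as 1
w_l_plus_1, w_l = 0.5, 0.5                      # w^(l+1), w^(l)
z_l_plus_1, z_l, z_l_minus_1 = 0.3, -0.2, 0.1   # pre-activations
a_l_minus_2 = 0.8                               # activation multiplying w^(l-2)

grad = (dJ_da
        * sigmoid_prime(z_l_plus_1) * w_l_plus_1
        * sigmoid_prime(z_l) * w_l
        * sigmoid_prime(z_l_minus_1) * a_l_minus_2)
print(grad)  # roughly 0.003: already small after only three sigma' factors
```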

Derivatives of Activation Functions

The sigmoid activation function, defined as $\sigma(z) = \frac{1}{1 + e^{-z}}$, has the derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. This derivative attains its maximum value of $0.25$ at $z = 0$ and decays toward $0$ as $z$ moves away from $0$ in either direction.
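A quick way to see how sharply $\sigma'(z)$ falls off is to evaluate it at a few points. This is a small illustrative sketch, not code from the text:

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# sigma' is symmetric in z, so negative inputs behave the same way.
for z in [0.0, 1.0, 2.0, 5.0, 10.0]:
    print(f"sigma'({z:>4}) = {sigmoid_prime(z):.6f}")

# sigma'(0) = 0.25 is the maximum; by z = 5 the derivative is already
# below 0.007, and by z = 10 it is below 5e-5.
```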


In contrast, the ReLU activation function, defined as $\text{ReLU}(z) = \max(0, z)$, has a derivative that is either $0$ for $z < 0$ or $1$ for $z > 0$. This means that the gradient can either be blocked entirely or passed through unchanged, without any attenuation.
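The ReLU derivative is just as easy to sketch; the only subtlety is the conventional choice of $0$ at exactly $z = 0$ (the helper below is hypothetical, shown only for contrast with the sigmoid case):

```python
def relu_grad(z):
    # Derivative of ReLU(z) = max(0, z): 1 for z > 0, 0 otherwise
    # (the value at z = 0 is a convention; 0 is the common choice).
    return 1.0 if z > 0 else 0.0

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(z, relu_grad(z))  # gradients are either blocked (0) or passed unchanged (1)
```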

Sigmoid Activation and Gradient Vanishing

In the case of sigmoid activation with small weights initialized at $w^{(l)} = 0.1$, backpropagation gives a clear illustration of the vanishing gradient problem. Assume, for simplicity, that the loss derivative at the output is $1$; this is the value that starts the backward pass. The derivative of the sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, is at most $0.25$ (attained at $z = 0$) and becomes much smaller as $z$ moves away from that point.


During backpropagation, at each layer $l$ the gradient $\frac{\partial J}{\partial z^{(l)}}$ is the product of the gradient from the subsequent layer $\frac{\partial J}{\partial z^{(l+1)}}$, the weight $\frac{\partial z^{(l+1)}}{\partial a^{(l)}} = w^{(l+1)}$, and the derivative of the sigmoid at that layer $\frac{\partial a^{(l)}}{\partial z^{(l)}} = \sigma'(z^{(l)})$:

$$ \frac{\partial J}{\partial z^{(l)}} = \frac{\partial J}{\partial z^{(l+1)}} \cdot w^{(l+1)} \cdot \sigma'(z^{(l)}) $$

Because $\sigma'(z^{(l)})$ never exceeds $0.25$ and the initialized weight $w^{(l)} = 0.1$ is itself small, this product shrinks the gradient substantially at every layer, leading to an exponential decay in its magnitude as it propagates backward; by the time it reaches the early layers of the network it is negligible. The attenuation is worse in deeper networks, where the gradient must travel through more layers, each one scaling it down by the small derivative and weight values.
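The decay can also be simulated directly with the recurrence above. This is a minimal sketch under the same assumptions as the text (every weight equal to $0.1$, pre-activations near $0$ so that $\sigma'$ sits at its maximum of $0.25$, and a loss derivative of $1$ at the output); the depth of 10 layers and the variable names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

n_layers = 10   # illustrative depth
w = 0.1         # every weight initialized at 0.1
z = 0.0         # pre-activations at 0, where sigma'(z) = 0.25 is largest
grad = 1.0      # dJ/dz at the output, taken to be 1

# Backward recurrence: dJ/dz^(l) = dJ/dz^(l+1) * w^(l+1) * sigma'(z^(l))
for layer in range(n_layers, 0, -1):
    grad = grad * w * sigmoid_prime(z)
    print(f"layer {layer:2d}: |dJ/dz| = {grad:.3e}")

# Each step multiplies the gradient by at most 0.1 * 0.25 = 0.025,
# so after 10 layers it is on the order of 0.025**10, about 1e-16.
```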

ReLU Activation and Gradient Explosion

For ReLU activation with large weights initialized at $w^{(l)} = 5$, the dynamics of backpropagation differ significantly. The ReLU function passes gradients back unchanged when a neuron is active ($z > 0$), its derivative being $1$ in that case, and blocks them completely when the neuron is inactive ($z \leq 0$). The activation function itself therefore does not attenuate the gradient for active neurons. However, with large initial weights, each backward step through an active neuron multiplies the gradient by $w^{(l)} = 5$, so the gradient flowing back from the output can grow exponentially and explode by the time it reaches the early layers.
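Under the same simplifying assumptions as the sigmoid example (a loss derivative of $1$ at the output, every neuron active so each ReLU derivative is $1$, and every weight equal to $5$), each backward step multiplies the gradient by $5$. For an illustrative depth of 10 layers:

$$ \left|\frac{\partial J}{\partial z^{(1)}}\right| \approx 1 \cdot \underbrace{(5 \cdot 1) \cdots (5 \cdot 1)}_{10 \text{ layers}} = 5^{10} \approx 9.8 \times 10^{6}, $$

compared with $(0.1 \cdot 0.25)^{10} \approx 10^{-16}$ in the sigmoid case. It is the same multiplicative mechanism, pushed in the opposite direction.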