Gradient vanishing is a challenge encountered when training deep neural networks: the gradients of the network's loss function become progressively smaller as they are propagated backward through the network. This happens because gradients are multiplied layer by layer during backpropagation, so their magnitude shrinks exponentially for layers closer to the input. As a result, these early layers learn very slowly or not at all, making the network difficult to train effectively, especially as its depth increases.
The issue of gradient vanishing is often attributed to activation functions like sigmoid or tanh, which can significantly reduce the gradient's magnitude. For the sigmoid, the derivative is σ'(x) = σ(x)(1 − σ(x)): it peaks at only 0.25 when x = 0 and approaches zero once x is greater than about 6 or less than about -6, where the curve flattens out.
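As a quick check on these numbers, here is a minimal sketch (assuming NumPy) that evaluates the sigmoid derivative σ'(x) = σ(x)(1 − σ(x)) at a few points:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_derivative(0.0))   # 0.25   -> the maximum possible value
print(sigmoid_derivative(6.0))   # ~0.0025 -> already close to zero
print(sigmoid_derivative(-6.0))  # ~0.0025 -> symmetric, also close to zero
```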
When a network consists of just a single layer, this compression of the gradient may not be notably detrimental. However, as networks become deeper, with more layers and consequently more activation functions positioned sequentially, the effect amplifies. Backpropagation computes gradients using the chain rule, which multiplies these small derivative values across multiple layers; for example, ten stacked sigmoids contribute a factor of at most 0.25^10 ≈ 10^-6. This repeated multiplication can drive the gradient toward zero, severely hampering the network's ability to learn from data effectively.
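To make this concrete, the following is a minimal sketch (assuming PyTorch; the depth, layer width, and dummy data are arbitrary illustrative choices) that prints the gradient norm of each linear layer's weights in a deep, sigmoid-activated network. The layers nearest the input typically receive gradients several orders of magnitude smaller than those near the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stack of Linear + Sigmoid blocks (depth and width chosen only for illustration).
depth = 10
layers = []
for _ in range(depth):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers += [nn.Linear(32, 1)]
model = nn.Sequential(*layers)

x = torch.randn(64, 32)   # dummy inputs
y = torch.randn(64, 1)    # dummy targets
loss = nn.MSELoss()(model(x), y)
loss.backward()

# Gradient magnitude of each Linear layer's weights, from the input side to the output side.
for i, module in enumerate(model):
    if isinstance(module, nn.Linear):
        print(f"layer {i:2d} weight-grad norm: {module.weight.grad.norm().item():.2e}")
```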
The accompanying diagram illustrates how gradient calculation via backpropagation is impacted in such scenarios. In essence, with each additional layer and sigmoid activation, the potential for the gradient to vanish increases, posing a challenge for the training of deep neural networks.