In Multilayer Perceptrons (MLPs), computing gradients is relatively straightforward because the forward pass consists of smooth operations such as matrix multiplication and activation functions like Sigmoid. However, architectures like Convolutional Neural Networks (CNNs) include components such as Max Pooling and the ReLU activation that are non-differentiable at certain points (e.g., $x=0$), necessitating the use of subgradients to handle the points where the derivative is undefined.
A subgradient provides a viable optimization direction for non-differentiable functions, which is crucial for gradient descent and for ensuring convergence of the loss function $J$. In modern neural networks, predefined subgradients are often used to simplify optimization.
This method avoids the computational burden of dynamically computing gradients in non-smooth areas by applying predetermined directions for optimization across the entire loss landscape, including non-differentiable regions. This facilitates efficient backpropagation and reduces computational overhead.
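For example, the derivative of ReLU is undefined at $x = 0$; in practice a single subgradient is fixed there ahead of time (conventionally $0$, although any value in $[0, 1]$ would be valid). Below is a minimal NumPy sketch of this predefined choice; the function names are illustrative and not taken from any particular library:

```python
import numpy as np

def relu(x):
    """Forward pass: elementwise max(0, x); non-differentiable at x = 0."""
    return np.maximum(0.0, x)

def relu_subgradient(x):
    """Predefined subgradient: 1 where x > 0, otherwise 0.
    At x = 0 any value in [0, 1] is a valid subgradient; 0 is a common choice."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))              # [0. 0. 3.]
print(relu_subgradient(x))  # [0. 0. 1.]
```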
Gradient routing is a technique for handling conditional operations, such as if-else statements, within differentiable programming. It governs how gradients are propagated through functions in which the input determines which computation path is taken. Unlike the non-differentiable corner cases above, it does not require subgradients.
Maximum Function - A common example is determining the maximum of two values $a$ and $b$:
```python
def max(a, b):
    if a > b: return a
    else: return b
```
In this scenario, the gradient of the $\max(a, b)$ function depends on which input, $a$ or $b$, contributes to the output, essentially "routing" the gradient through the path of the chosen input.
The mechanism of gradient determination can be detailed as follows:
When $a > b$, the function $\max(a, b)$ selects $a$ as the output. In this case, the gradient with respect to $a$ is $1$, indicating that any small increase in $a$ directly increases the function's output by the same amount. Conversely, the gradient with respect to $b$ is $0$ since increasing $b$ does not affect the output as long as $a$ remains larger. This reflects the idea that the gradient "flows" through $a$, while $b$ has no influence in this scenario.
$$ \nabla \max(a, b)\,\Big|_{(a,\,b)=(2,\,1)}=\begin{bmatrix} 1 \\ 0 \end{bmatrix} $$
Conversely, if $a \leq b$, the function $\max(a, b)$ outputs $b$. Here, the gradient with respect to $b$ is $1$, signifying that an incremental increase in $b$ will raise the output value identically. The gradient with respect to $a$ becomes $0$, highlighting that $a$ does not contribute to the output when $b$ is larger, and thus, any changes in $a$ do not impact the function's result.
$$ \nabla \max(a, b)\,\Big|_{(a,\,b)=(1,\,2)}=\begin{bmatrix} 0 \\ 1 \end{bmatrix} $$
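This routing behaviour can be sketched in a few lines of plain Python: the forward pass records which input was selected, and the backward pass sends the upstream gradient only along that path (a minimal illustration; the helper names are hypothetical):

```python
def max_forward(a, b):
    """Return the larger value and remember which input was selected."""
    return (a, "a") if a > b else (b, "b")

def max_backward(selected, upstream=1.0):
    """Route the upstream gradient to the selected input; the other input gets 0."""
    return (upstream, 0.0) if selected == "a" else (0.0, upstream)

z, sel = max_forward(2, 1)
print(max_backward(sel))  # (1.0, 0.0) -- matches the gradient at a=2, b=1

z, sel = max_forward(1, 2)
print(max_backward(sel))  # (0.0, 1.0) -- matches the gradient at a=1, b=2
```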
Let’s consider the case where $x = 1$ and see how the gradient $\frac{dz}{dx}$ is calculated.
The diagram illustrates a computational graph in which two functions, $a = x^2$ and $b = 2x$, feed into a $\max$ function that selects the greater value. Given $x=1$, the computed values are $a=1$ and $b=2$, so the Max node outputs $z=2$, corresponding to $b$. The backward gradient, shown by the dashed red line, indicates that the derivative $\frac{dz}{dx}$ propagates back through the branch associated with $b$, where the local derivative is $2$, because the Max function outputs the value from $b$. The path through $a$ contributes nothing, since $b > a$ at $x = 1$. Thus, the gradient $\frac{dz}{dx}$ is entirely determined by the path through $b$, giving $\frac{dz}{dx} = 2$.
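As a sanity check, the same gradient can be reproduced by applying the chain rule by hand; the short sketch below (plain Python, reusing the values from the example above) makes the routing step explicit:

```python
x = 1.0
a = x ** 2                # forward: a = 1
b = 2 * x                 # forward: b = 2
z = max(a, b)             # forward: z = 2 (the b branch wins)

# The Max node routes the upstream gradient dz/dz = 1 to the winning branch only.
dz_da, dz_db = (1.0, 0.0) if a > b else (0.0, 1.0)
da_dx = 2 * x             # derivative of x^2
db_dx = 2.0               # derivative of 2x
dz_dx = dz_da * da_dx + dz_db * db_dx
print(dz_dx)              # 2.0 -- entirely from the b branch
```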
Sorting Function - Sorting, which rearranges elements into a specified order, can also be given gradients through the concept of a "routing matrix." This matrix records the rearrangement, which is exactly the information backpropagation needs. For example, in a scenario where elements are sorted into a new order ($1\rightarrow3$, $2\rightarrow1$, $3\rightarrow2$), the routing process can be visualized as below:
The routing matrix of this sorting process looks like:
$$ \frac{d}{dh}\text{sorting}=\begin{bmatrix} 0 & 0& 1\\ 1 & 0& 0\\ 0 & 1& 0\\ \end{bmatrix} $$
Here, the columns and rows of the matrix represent the original and new positions of the elements, respectively, with a $1$ indicating that an input is routed to that output position. This tells the backpropagation algorithm how to direct gradients appropriately.
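A small NumPy sketch (the input values below are an illustrative assumption) shows how the permutation recorded during the forward sort doubles as the routing matrix for the backward pass:

```python
import numpy as np

h = np.array([3.0, 1.0, 2.0])   # illustrative input
order = np.argsort(h)           # original index of each element in the sorted output
sorted_h = h[order]             # forward pass: [1. 2. 3.]

# Routing matrix P with a 1 wherever an input position feeds an output position,
# so that sorted_h == P @ h.
P = np.zeros((h.size, h.size))
P[np.arange(h.size), order] = 1.0

# Backward pass: an upstream gradient on the sorted output is routed back
# to the original positions via the transpose of P.
grad_sorted = np.array([0.1, 0.2, 0.3])
grad_h = P.T @ grad_sorted      # each element receives the gradient of the slot it moved to

print(P)
print(grad_h)                   # [0.3 0.1 0.2]
```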
Key Points: