Neural network optimization relies on computing local gradients for each module, which is crucial for backpropagation. In MLPs, differentiability is generally ensured through operations like matrix multiplication and activation functions such as Sigmoid. However, architectures such as CNNs introduce operations like Max Pooling and the ReLU activation, which are non-differentiable at certain points (e.g., ReLU at $x=0$), necessitating the use of subgradients.
The subgradient offers a solution for non-differentiable functions by providing a viable descent direction, essential for gradient descent and for ensuring convergence of the loss value $J$. While subgradients can be computed dynamically to approximate gradients in non-smooth regions, modern neural networks typically rely on predefined subgradients. This approach simplifies optimization by directly applying predetermined directions across the entire loss landscape, including its non-differentiable regions, enabling efficient backpropagation without on-the-fly gradient approximation, which is computationally expensive.
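As a minimal sketch of a predefined subgradient (the function names below are illustrative, not taken from any particular framework), ReLU's derivative can simply be fixed to $0$ at the non-differentiable point $x=0$, one valid choice from the subdifferential $[0, 1]$:

```python
import numpy as np

def relu(x):
    # Forward pass: max(x, 0) applied elementwise.
    return np.maximum(x, 0.0)

def relu_subgradient(x):
    # Predefined subgradient: 1 where x > 0, and 0 elsewhere.
    # At x = 0 the true derivative is undefined; fixing it to 0
    # (any value in [0, 1] would be a valid subgradient) avoids
    # any on-the-fly approximation during backpropagation.
    return np.where(x > 0.0, 1.0, 0.0)
```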
Gradient routing is a critical concept in subgradient descent for handling functions such as maximum, minimum, sorting, and shuffling, where each output is copied directly from specific inputs.
Maximum Function - A common example is determining the maximum of two values $a$ and $b$:
```python
def max(a, b):
    # The output is exactly one of the inputs.
    if a > b:
        return a
    else:
        return b
```
In this scenario, the gradient of the $\max(a, b)$ function depends on which input, $a$ or $b$, contributes to the output, essentially "routing" the gradient through the path of the chosen input. The mechanism of gradient determination can be detailed as follows:
When $a > b$, the function $\max(a, b)$ selects $a$ as the output. In this case, the gradient with respect to $a$ is $1$, indicating that any small increase in $a$ directly increases the function's output by the same amount. Conversely, the gradient with respect to $b$ is $0$, since increasing $b$ does not affect the output as long as $a$ remains larger. This reflects the idea that the gradient "flows" through $a$, while $b$ has no influence in this scenario.
$$\nabla \max(a, b)\Big|_{a=2,\, b=1}=\begin{bmatrix} 1 \\ 0 \end{bmatrix}$$
Conversely, if $a < b$, the function $\max(a, b)$ outputs $b$. Here, the gradient with respect to $b$ is $1$, signifying that an incremental increase in $b$ will raise the output value identically. The gradient with respect to $a$ becomes $0$, highlighting that $a$ does not contribute to the output when $b$ is larger, and thus any changes in $a$ do not impact the function's result.
$$\nabla \max(a, b)\Big|_{a=1,\, b=2}=\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
In the special case where $a = b$, the gradients need to be defined manually. A common choice is to assign $\frac{\partial}{\partial a} \max(a, b) = 0.5$ and $\frac{\partial}{\partial b} \max(a, b) = 0.5$, suggesting that both inputs contribute equally to a slight increase in the function's output. This allocation is arbitrary and serves as a simple way to handle the indeterminacy at the point of equality, ensuring that gradient descent can still proceed.
$$\nabla \max(a, b)\Big|_{a=1,\, b=1}=\begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$$
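A minimal Python sketch of this routing (the helper names `max_forward` and `max_backward` are hypothetical, not from any specific library) makes the three cases explicit:

```python
def max_forward(a, b):
    # Forward pass: return the larger of the two inputs.
    return a if a > b else b

def max_backward(a, b, upstream_grad=1.0):
    # Gradient routing for max(a, b): the upstream gradient flows
    # only to the input that produced the output.
    if a > b:
        return upstream_grad, 0.0                      # d/da = 1, d/db = 0
    elif a < b:
        return 0.0, upstream_grad                      # d/da = 0, d/db = 1
    else:
        # Tie: split the gradient equally (an arbitrary but common convention).
        return 0.5 * upstream_grad, 0.5 * upstream_grad

# max_backward(2, 1) -> (1.0, 0.0)
# max_backward(1, 2) -> (0.0, 1.0)
# max_backward(1, 1) -> (0.5, 0.5)
```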
This nuanced approach to gradient routing allows for the dynamic computation of gradients based on the operational context, ensuring that the optimization process remains robust, even when dealing with non-differentiable functions.
Sorting Function - The sorting function, which rearranges elements in a specified order, can also be associated with gradients through the concept of a "routing matrix." This matrix records how elements are rearranged, which is crucial for backpropagation in neural networks. For example, when elements are sorted into a new order ($1\rightarrow3$, $2\rightarrow1$, $3\rightarrow2$), the routing matrix of the sorting process looks like:
$$ \frac{d}{dh}\text{sorting}=\begin{bmatrix} 0 & 0& 1\\ 1 & 0& 0\\ 0 & 1& 0\\ \end{bmatrix} $$
Here, the matrix rows correspond to the original positions of elements and the columns to their new (sorted) positions, with a $1$ indicating that an input is routed to that output position. This setup tells the backpropagation algorithm how to direct gradients appropriately.
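A small NumPy sketch (the helper names here are illustrative, not from an actual framework) shows how such a routing matrix can be built from the sort order and used to send gradients back to their original positions during backpropagation; with the input below, the sort reproduces exactly the $1\rightarrow3$, $2\rightarrow1$, $3\rightarrow2$ routing above:

```python
import numpy as np

def sort_forward(h):
    # Ascending sort; order[j] is the original index routed to output position j.
    order = np.argsort(h)
    return h[order], order

def routing_matrix(order):
    # Rows = original positions, columns = new (sorted) positions,
    # matching the d/dh sorting matrix above.
    n = len(order)
    R = np.zeros((n, n))
    R[order, np.arange(n)] = 1.0
    return R

def sort_backward(order, upstream_grad):
    # Backpropagation: route each output gradient back to its source input.
    return routing_matrix(order) @ upstream_grad

h = np.array([0.3, 0.1, 0.2])       # sorting sends 1->3, 2->1, 3->2, as in the text
y, order = sort_forward(h)           # y = [0.1, 0.2, 0.3]
print(routing_matrix(order))         # [[0,0,1],[1,0,0],[0,1,0]]
print(sort_backward(order, np.array([1.0, 2.0, 3.0])))   # [3., 1., 2.]
```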
Key Points: