Neural network optimization relies on computing a local gradient for each module, which backpropagation then chains together. In MLPs, differentiability is generally ensured because the building blocks are operations like matrix multiplication and smooth activation functions such as Sigmoid. However, architectures like CNNs introduce operations such as Max Pooling and the ReLU activation, which are non-differentiable at certain points (e.g., $x=0$ for ReLU), necessitating the use of subgradients.

The subgradient offers a solution for non-differentiable functions by providing a viable descent direction, which is essential for gradient descent and for ensuring convergence of the loss value $J$. While subgradients could be computed dynamically to approximate gradients in non-smooth regions, modern neural networks typically rely on predefined subgradients. This simplifies optimization by directly applying predetermined directions across the entire loss landscape, including its non-differentiable regions, enabling efficient backpropagation without on-the-fly gradient approximation, which would be computationally expensive.
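As a minimal sketch of this idea (illustrative function names, not any framework's API), the snippet below implements ReLU with a predefined subgradient, fixing the derivative at $x=0$ to $0$, a common convention:

import numpy as np

def relu_forward(x):
    # Forward pass: elementwise max(x, 0); non-differentiable only at x == 0.
    return np.maximum(x, 0.0)

def relu_backward(x, grad_out):
    # Predefined subgradient: 1 where x > 0 and 0 where x <= 0.
    # The value at x == 0 is an arbitrary but fixed choice from the
    # subdifferential [0, 1]; no on-the-fly approximation is required.
    return grad_out * (x > 0.0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu_backward(x, np.ones_like(x)))   # [0. 0. 1.]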

Subgradient for Gradient Routing

Gradient routing is a key concept in subgradient descent for handling functions such as maximum, minimum, sorting, and shuffling, where each output is taken directly from a specific input: the gradient must be sent back only along the path of that input.

Maximum Function - A common example is determining the maximum of two values $a$ and $b$:

def max(a, b):
    # The output is a copy of one input; the gradient is later
    # routed back along the same path chosen here.
    if a > b: return a
    else: return b

In this scenario, the gradient of the $\max(a, b)$ function depends on which input, $a$ or $b$, produces the output, essentially "routing" the gradient through the path of the chosen input:

$$ \frac{\partial \max(a, b)}{\partial a}=\begin{cases} 1 & \text{if } a > b\\ 0 & \text{otherwise} \end{cases} \qquad \frac{\partial \max(a, b)}{\partial b}=\begin{cases} 0 & \text{if } a > b\\ 1 & \text{otherwise} \end{cases} $$

At the tie point $a = b$, both choices are valid subgradients; the convention above simply follows the branch taken in the code.

This approach to gradient routing computes gradients dynamically from whichever path was actually taken in the forward pass, keeping the optimization process well-defined and robust even when dealing with non-differentiable functions.
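A minimal sketch of this routing (hypothetical helper functions, not a specific framework's autograd API): the forward pass records which branch was taken, and the backward pass sends the entire upstream gradient to that input and zero to the other.

def max_forward(a, b):
    # Forward pass: also record which input produced the output.
    a_won = a > b
    return (a if a_won else b), a_won

def max_backward(a_won, grad_out):
    # Route the whole upstream gradient to the winning input, zero to the other.
    return (grad_out, 0.0) if a_won else (0.0, grad_out)

value, a_won = max_forward(2.0, 5.0)
print(max_backward(a_won, 1.0))   # (0.0, 1.0): the gradient flows to b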

Sorting Function - The sorting function, which rearranges elements into a specified order, can also be given a gradient through the concept of a "routing matrix." This matrix records the rearrangement, which is exactly what backpropagation needs. For example, when elements move from their original positions to new positions ($1\rightarrow3$, $2\rightarrow1$, $3\rightarrow2$), the routing process can be visualized as below:

[Figure: elements routed from original positions $1, 2, 3$ to new positions $3, 1, 2$]

The routing matrix of this sorting process looks like:

$$ \frac{d}{dh}\text{sorting}=\begin{bmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\\ \end{bmatrix} $$

Here, the matrix columns correspond to the original (input) positions and the rows to the new (output) positions, with a $1$ at row $i$, column $j$ indicating that output $i$ is taken from input $j$. During backpropagation, multiplying the upstream gradient by the transpose of this matrix routes each output's gradient back to the input that produced it.
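The same routing can be sketched in NumPy (illustrative code, not a library implementation); the input values are chosen so that ascending sorting reproduces the $1\rightarrow3$, $2\rightarrow1$, $3\rightarrow2$ example above:

import numpy as np

h = np.array([0.7, 0.1, 0.4])            # original values; sorting sends 1->3, 2->1, 3->2
order = np.argsort(h)                     # output position i is taken from input position order[i]
sorted_h = h[order]

# Routing (Jacobian) matrix: rows = new positions, columns = original positions.
routing = np.zeros((len(h), len(h)))
routing[np.arange(len(h)), order] = 1.0

grad_out = np.array([1.0, 2.0, 3.0])      # upstream gradient w.r.t. the sorted output
grad_h = routing.T @ grad_out             # the transpose routes each gradient back to its source
print(routing)                            # matches the matrix above
print(grad_h)                             # [3. 1. 2.]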

Key Points:

Subgradient for Non-Differentiable Functions