In optimization problems with multiple variables and multidimensional outputs, it is important to understand how gradients are computed and aggregated. This presentation introduces the Jacobian matrix and the weight gradient matrix, and explains how they are used to aggregate gradients when optimizing a single objective function, with a particular focus on optimizing the weight matrix $\mathbf{W}$.
The Jacobian matrix collects all first-order partial derivatives of a vector-valued function and takes the form:
$$ \frac{d\mathbf{y}}{d\mathbf{x}}= \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} $$
Here, $\mathbf{y}$ is a vector-valued function of $\mathbf{x}$, and each entry of the matrix measures the influence of one input on one output. In the simple linear case $\mathbf{y} = \mathbf{W}\mathbf{x}$, the Jacobian is easy to derive: it is simply $\mathbf{W}$.
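As a quick sanity check, here is a minimal NumPy sketch (the dimensions and random values are arbitrary) that approximates the Jacobian of $\mathbf{y} = \mathbf{W}\mathbf{x}$ with central finite differences and confirms it equals $\mathbf{W}$:

```python
import numpy as np

# Minimal sketch: numerically verify that the Jacobian of y = W @ x
# with respect to x is simply W. Sizes and values are arbitrary.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def f(x):
    return W @ x

eps = 1e-6
J = np.zeros((m, n))
for j in range(n):
    dx = np.zeros(n)
    dx[j] = eps
    # central finite difference approximates the j-th column of the Jacobian
    J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)

print(np.allclose(J, W, atol=1e-6))  # True: dy/dx == W
```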
Example: If we have a function:
$$ \mathbf{y} =\begin{bmatrix}y_1 \\y_2\end{bmatrix}\quad \text{where } y_1 = x_1^2 + x_2, \quad y_2 = \sin(x_1 x_2), $$
Then the Jacobian matrix is:
$$ \frac{d\mathbf{y}}{d\mathbf{x}}=\begin{bmatrix}\frac{\partial (x_1^2 + x_2)}{\partial x_1} & \frac{\partial (x_1^2 + x_2)}{\partial x_2} \\\frac{\partial \sin(x_1 x_2)}{\partial x_1} & \frac{\partial \sin(x_1 x_2)}{\partial x_2}\end{bmatrix}=\begin{bmatrix}2x_1 & 1 \\x_2 \cos(x_1 x_2) & x_1 \cos(x_1 x_2)\end{bmatrix} $$
This Jacobian matrix provides insight into how changes in the input $\mathbf{x}$ affect the output $\mathbf{y}$ locally in a linearized sense.
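The same kind of check works for the worked example above. The sketch below (again using finite differences, at an arbitrary test point) compares the numerical Jacobian against the analytic matrix just derived:

```python
import numpy as np

# Minimal sketch: verify the worked example by comparing a finite-difference
# Jacobian of y = [x1^2 + x2, sin(x1*x2)] with the analytic matrix above.
def f(x):
    x1, x2 = x
    return np.array([x1**2 + x2, np.sin(x1 * x2)])

def analytic_jacobian(x):
    x1, x2 = x
    return np.array([[2 * x1, 1.0],
                     [x2 * np.cos(x1 * x2), x1 * np.cos(x1 * x2)]])

x = np.array([0.7, -1.3])  # arbitrary test point
eps = 1e-6
J = np.zeros((2, 2))
for j in range(2):
    dx = np.zeros(2)
    dx[j] = eps
    J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)

print(np.allclose(J, analytic_jacobian(x), atol=1e-5))  # True
```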
In machine learning, however, the quantity we optimize is the weight matrix $\mathbf{W}$, not the input vector $\mathbf{x}$. A traditional Jacobian of $\mathbf{y}$ with respect to all $m \times n$ weights is rarely used for this purpose because it is extremely sparse: each output depends on only a small subset of the weights, so the vast majority of entries in the full Jacobian are zero.
Consider the partial derivative matrix below:
$$ \frac{d\mathbf{y}}{d\mathbf{W}}= \begin{bmatrix} \frac{\partial y_1}{\partial w_{11}} & \frac{\partial y_1}{\partial w_{12}} & \cdots & \frac{\partial y_1}{\partial w_{mn}} \\ \frac{\partial y_2}{\partial w_{11}} & \frac{\partial y_2}{\partial w_{12}} & \cdots & \frac{\partial y_2}{\partial w_{mn}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial w_{11}} & \frac{\partial y_m}{\partial w_{12}} & \cdots & \frac{\partial y_m}{\partial w_{mn}} \end{bmatrix} $$
If you visualize a single layer of a neural network, you will notice that $y_1$ is not connected to $w_{21}$, or to any other weight whose first subscript is not 1. In other words, each output node $y_i$ depends only on the weights in row $i$ of $\mathbf{W}$, so most entries of the matrix above are zero, as the sketch below illustrates.
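To make the sparsity concrete, here is a small sketch (dimensions and values are again arbitrary) that assembles the full Jacobian of $\mathbf{y} = \mathbf{W}\mathbf{x}$ with respect to all $m \cdot n$ weights and counts the nonzero entries: only $m \cdot n$ of the $m \cdot m \cdot n$ entries survive.

```python
import numpy as np

# Minimal sketch: build the full Jacobian of y = W @ x with respect to
# all m*n weights (columns indexed by the flattened w_11, ..., w_mn)
# and measure how sparse it is.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

J_full = np.zeros((m, m * n))
for i in range(m):
    for k in range(m):
        for j in range(n):
            # y_i = sum_j w_ij x_j, so dy_i/dw_kj = x_j only when k == i
            J_full[i, k * n + j] = x[j] if k == i else 0.0

nonzero = np.count_nonzero(J_full)
print(f"{nonzero} of {J_full.size} entries are nonzero")  # m*n of m*m*n
```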
Therefore, instead of working with this sparse matrix directly, we use a compact weight gradient matrix that keeps only the nonzero gradients that actually influence learning. This is more computationally efficient and mirrors the layer's structure, retaining exactly the weight-output relationships needed to update the model during training:
$$ \frac{d\mathbf{y}}{d\mathbf{W}}= \begin{bmatrix} \frac{\partial y_1}{\partial w_{11}} & \frac{\partial y_1}{\partial w_{12}} & \cdots & \frac{\partial y_1}{\partial w_{1n}} \\ \frac{\partial y_2}{\partial w_{21}} & \frac{\partial y_2}{\partial w_{22}} & \cdots & \frac{\partial y_2}{\partial w_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial w_{m1}} & \frac{\partial y_m}{\partial w_{m2}} & \cdots & \frac{\partial y_m}{\partial w_{mn}} \end{bmatrix} $$
In the model $\mathbf{y} = \mathbf{W}\mathbf{x}$, each output is $y_i = \sum_j w_{ij} x_j$, so $\frac{\partial y_i}{\partial w_{ij}} = x_j$ and the matrix evaluates to:
$$ \frac{d\mathbf{y}}{d\mathbf{W}}= \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ x_1 & x_2 & \cdots & x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_1 & x_2 & \cdots & x_n \end{bmatrix} $$
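A short sketch (again with arbitrary dimensions and values) confirms this numerically: perturbing $w_{ij}$ changes only $y_i$, at the rate $x_j$, so every row of the recovered matrix equals $\mathbf{x}$.

```python
import numpy as np

# Minimal sketch: confirm that dy_i/dw_ij = x_j for y = W @ x, i.e. every
# row of the compact weight gradient matrix equals the input x.
rng = np.random.default_rng(0)
m, n = 3, 4
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

eps = 1e-6
G = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        # perturbing w_ij only moves y_i; the change per unit weight is x_j
        G[i, j] = ((Wp @ x)[i] - (Wm @ x)[i]) / (2 * eps)

print(np.allclose(G, np.tile(x, (m, 1)), atol=1e-6))  # True: each row is x
```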