In neural networks, especially beyond simple models like $\sigma(wx+b)$, we often handle derivative calculations involving multiple parameters, where $w$ becomes a vector $\mathbf{w}$ rather than a scalar. The same is true for the input $\mathbf{x}$, because networks rely heavily on matrix multiplications.
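As a minimal sketch of what that looks like (the two-dimensional $\mathbf{w}$, the numeric values, and the function names below are illustrative assumptions, not taken from the text), here is $\sigma(\mathbf{w}^\top\mathbf{x} + b)$ together with its gradient with respect to the vector $\mathbf{w}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical small example: w and x are vectors, b is a scalar.
w = np.array([0.5, -1.2])
x = np.array([2.0, 0.3])
b = 0.1

z = w @ x + b          # scalar pre-activation
y = sigmoid(z)         # scalar output

# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)); by the chain rule,
# the gradient of y with respect to the vector w is itself a vector:
# dy/dw = sigmoid'(z) * x.
grad_w = sigmoid(z) * (1.0 - sigmoid(z)) * x
print(grad_w)          # one partial derivative per component of w
```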

Partial Derivatives: The Product Rule

Consider the function $y = g(x) \cdot h(x)$, for which we want to compute $dy/dx$. Let us define two “intermediate” functions, $u=g(x)$ and $v=h(x)$. Then $y$ becomes a function of the two variables $u$ and $v$: $y(u, v) = u \cdot v$.
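To keep the later checks concrete, here is a small setup with illustrative choices $g(x) = x^2$ and $h(x) = \sin x$ (these particular functions are assumptions for the example, not part of the argument):

```python
import numpy as np

def g(x):
    return x ** 2       # u = g(x), an illustrative choice

def h(x):
    return np.sin(x)    # v = h(x), an illustrative choice

def y(x):
    u, v = g(x), h(x)
    return u * v        # y(u, v) = u * v
```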


Although ultimately $u$ and $v$ both depend on the single variable $x$, we can temporarily regard $u$ and $v$ as independent variables. Afterward, we will track how each depends on $x$.

Step 1: The Gradient of $y$ with respect to $(u, v)$

We begin by taking partial derivatives of $y$ with respect to $u$ and $v$. Collecting these partial derivatives into a row vector (a common choice in many chain-rule formulations) gives us:

$$ \nabla_{u,v} y \;=\; \begin{bmatrix} \frac{\partial y}{\partial u} &\frac{\partial y}{\partial v} \end{bmatrix}. $$

For $y(u,v) = u\cdot v$, these partials are:

$$ \frac{\partial y}{\partial u} = v,\quad\frac{\partial y}{\partial v} = u. $$

So,

$$ \nabla_{u,v} y \;=\; \begin{bmatrix} v & u \end{bmatrix}. $$

Dimensionally, $\nabla_{u,v} y$ is a $1 \times 2$ matrix (a row vector).
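As a quick sanity check (the numerical values of $u$, $v$, and the step size below are arbitrary), these partials can be verified with finite differences, treating $u$ and $v$ as independent inputs:

```python
def y(u, v):
    return u * v

u, v, eps = 1.5, -0.7, 1e-6   # arbitrary test point and step size

# Central differences approximate the partial derivatives of y(u, v).
dy_du = (y(u + eps, v) - y(u - eps, v)) / (2 * eps)
dy_dv = (y(u, v + eps) - y(u, v - eps)) / (2 * eps)

print(dy_du, v)   # dy/du should be close to v
print(dy_dv, u)   # dy/dv should be close to u
```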

Step 2: The Derivatives of $(u, v)$ with respect to $x$

Next, we look at how $u$ and $v$ individually depend on $x$. We form another vector that collects $\frac{du}{dx}$ and $\frac{dv}{dx}$. In column-vector form,

$$ \nabla_{x}(u, v) \;=\; \begin{bmatrix} \frac{\partial u}{\partial x} \\[6pt] \frac{\partial v}{\partial x} \end{bmatrix} \;=\; \begin{bmatrix} \frac{d}{dx}\bigl(g(x)\bigr) \\[6pt] \frac{d}{dx}\bigl(h(x)\bigr) \end{bmatrix}. $$

This is a $2 \times 1$ matrix (a column vector).
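Continuing the illustrative $g(x) = x^2$, $h(x) = \sin x$ from above (again an assumption for the example, as is the evaluation point), the column vector can be assembled directly from the ordinary derivatives $g'(x) = 2x$ and $h'(x) = \cos x$:

```python
import numpy as np

def dg(x):
    return 2 * x        # du/dx for the illustrative g(x) = x**2

def dh(x):
    return np.cos(x)    # dv/dx for the illustrative h(x) = sin(x)

x0 = 0.8                          # arbitrary evaluation point
grad_x_uv = np.array([[dg(x0)],   # stack the two derivatives
                      [dh(x0)]])  # into a 2 x 1 column vector
print(grad_x_uv.shape)            # (2, 1)
```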

Matrix Multiplication is Chain Rule Multiplication

The chain rule tells us that when $y$ depends on $x$ through both $u$ and $v$, we must sum the contributions from each path $(x \to u \to y)$ and $(x \to v \to y)$.
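Written out explicitly, that sum over the two paths is the standard multivariable chain rule:

$$ \frac{dy}{dx} \;=\; \frac{\partial y}{\partial u}\,\frac{du}{dx} \;+\; \frac{\partial y}{\partial v}\,\frac{dv}{dx}. $$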

One elegant way to formalize that is through matrix multiplication:


  1. $\nabla_{u,v}\, y$ (the row vector of partial derivatives of $y$ with respect to $u$ and $v$),
  2. multiplied by $\nabla_x (u,v)$ (the column vector of derivatives of $u$ and $v$ with respect to $x$).

The $1 \times 2$ row vector times the $2 \times 1$ column vector yields a $1 \times 1$ result, and that single entry is exactly the chain-rule sum:

$$ \frac{dy}{dx} \;=\; \nabla_{u,v}\, y \;\nabla_x(u,v) \;=\; \begin{bmatrix} v & u \end{bmatrix} \begin{bmatrix} \frac{du}{dx} \\[6pt] \frac{dv}{dx} \end{bmatrix} \;=\; v\,\frac{du}{dx} + u\,\frac{dv}{dx} \;=\; h(x)\,g'(x) + g(x)\,h'(x), $$

which is precisely the product rule.
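Putting the pieces together in code (same illustrative $g$ and $h$ as before; the evaluation point is arbitrary), the row-times-column product matches a direct finite-difference estimate of $dy/dx$:

```python
import numpy as np

def g(x): return x ** 2            # u = g(x), illustrative choice
def h(x): return np.sin(x)         # v = h(x), illustrative choice
def y(x): return g(x) * h(x)       # y = u * v

x0, eps = 0.8, 1e-6                # arbitrary point and step size

u, v = g(x0), h(x0)
row = np.array([[v, u]])           # 1 x 2: gradient of y w.r.t. (u, v)
col = np.array([[2 * x0],          # 2 x 1: du/dx and dv/dx
                [np.cos(x0)]])

chain_rule = (row @ col).item()    # 1 x 1 matrix product -> scalar dy/dx
finite_diff = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)

print(chain_rule, finite_diff)     # the two values should agree closely
```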