In the previous section, we saw what computational graphs and the chain rule are and how they relate. The essence of the chain rule is to compute local gradients hop by hop and then obtain the global gradient as the product of these local gradients.
In this section, we first compute the gradient of a concrete function numerically. We then explore how the principle of dynamic programming, namely combining the chain rule with caching of intermediate values, can be applied to neural networks to implement backpropagation and thereby reduce computational cost.
Consider the same function $y=\sin((x+1)^2)$; we want to calculate the gradient $\frac{dy}{dx}$.
Using the chain rule, we can analytically derive its derivative with respect to $x$:
$$ \frac{d\sin((x+1)^2)}{dx}=\cos((x+1)^2)\cdot2(x+1)\cdot1 \\ \text{OR}\\ \frac{d\sin((x+1)^2)}{dx}=\cos(h)\cdot2z\cdot 1 $$
The primary consideration in forward propagation is which of the two equivalent expressions to evaluate: $\cos((x+1)^2)\cdot2(x+1)\cdot1$ or $\cos(h)\cdot2z\cdot1$, where $z=x+1$ and $h=z^2$ are the intermediate values in the computation graph.
In neural networks, we opt for $\cos(h)\cdot2z\cdot1$, which involves the variables $x$, $z$, and $h$, because it requires only 5 computational steps. In contrast, the alternative expression, $\cos((x+1)^2)\cdot2(x+1)\cdot1$, recomputes $(x+1)$ and $(x+1)^2$ and therefore requires more operations.
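To make the cost difference concrete, here is a minimal Python sketch (the variable names are ours, not from the text) that evaluates both forms of the derivative at the same input; the second form simply reuses the intermediate values $z$ and $h$:

```python
import math

x = 1.0

# Form 1: recompute (x + 1) and (x + 1)^2 inside the gradient expression.
grad_recomputed = math.cos((x + 1) ** 2) * 2 * (x + 1) * 1

# Form 2: reuse the intermediate values z and h produced on the way to y.
z = x + 1       # z = x + 1
h = z ** 2      # h = z^2
grad_cached = math.cos(h) * 2 * z * 1

print(grad_recomputed, grad_cached)  # identical values: 4*cos(4) ≈ -2.6146
```

The two results are identical; the only difference is how much work is repeated.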
This concept is important in the forward propagation stage of neural network training. During forward propagation, given the input values (e.g., $x$ here), we compute and store the output values of each node (e.g., $z$, $h$, and $y$) layer by layer. These stored values simplify the subsequent gradient calculation.
Example: When $x$ is given a specific value, say $x=1$, we can directly compute the values of $z$ and $h$ as $z=x+1=2$ and $h=z^2=4$.
To reiterate: if we evaluate the graph from $x$ to $y$ and store all intermediate values $z$, $h$, and $y$, the derivative calculation simplifies significantly:
$$ \frac{d\sin((x+1)^2)}{dx}=\cos(h)\cdot2z\cdot1=\cos(4)\cdot 4 \cdot1=4\cos(4) $$
This is called forward propagation: we compute from input to output, caching each intermediate value along the way.
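As an illustrative sketch (the function names `forward` and `grad` are our own), the forward pass below evaluates the graph from $x$ to $y$ while caching every intermediate value; the gradient is then assembled directly from the cache, and a finite-difference check confirms the result:

```python
import math

def forward(x):
    """Evaluate y = sin((x + 1)^2), caching every intermediate value."""
    z = x + 1
    h = z ** 2
    y = math.sin(h)
    return y, {"z": z, "h": h, "y": y}

def grad(x):
    """dy/dx via the chain rule, assembled from the cached z and h."""
    _, cache = forward(x)
    return math.cos(cache["h"]) * 2 * cache["z"] * 1

x = 1.0
analytic = grad(x)                                        # 4 * cos(4) ≈ -2.6146
eps = 1e-6
numeric = (forward(x + eps)[0] - forward(x - eps)[0]) / (2 * eps)
print(analytic, numeric)                                  # the two estimates agree
```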
Next, we introduce a simple neural network with a squared loss function. We aim to compute the gradients of the loss $J$ with respect to the two weights $w_1$ and $w_2$, represented as $\left[\frac{\partial{J}}{\partial{w_1}} \quad \frac{\partial{J}}{\partial{w_2}}\right]$, through the framework of a computation graph as shown below.
First, we compute the derivative of $J$ with respect to the weight $w_1$, which by the chain rule factors into a product of local gradients:
$$ \frac{\partial{J}}{\partial{w_1}}=\frac{d{J}}{d{\hat{y}}}\cdot \frac{d{\hat{y}}}{d{r}}\cdot \frac{\partial{r}}{\partial{h}}\cdot \frac{dh}{dz}\cdot \frac{dz}{dw_1} $$
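The exact form of each node comes from the computation graph in the figure, which is not reproduced here, so the sketch below assumes one common instantiation: $z = w_1 x$, $h = \sigma(z)$, $r = w_2 h$, $\hat{y} = \sigma(r)$, and $J = (\hat{y} - y)^2$ with a sigmoid activation $\sigma$. Under these assumptions, each factor in the chain-rule product above becomes one line of the backward pass, evaluated from cached forward values:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward_backward(x, y, w1, w2):
    # Forward pass: cache every intermediate value (assumed architecture).
    z = w1 * x             # z = w1 * x
    h = sigmoid(z)         # h = sigma(z)
    r = w2 * h             # r = w2 * h
    y_hat = sigmoid(r)     # y_hat = sigma(r)
    J = (y_hat - y) ** 2   # squared loss

    # Backward pass: one local gradient per chain-rule factor, reusing cached values.
    dJ_dyhat = 2 * (y_hat - y)        # dJ/dy_hat
    dyhat_dr = y_hat * (1 - y_hat)    # d sigma(r)/dr, written with the cached y_hat
    dr_dh = w2                        # partial r / partial h
    dh_dz = h * (1 - h)               # d sigma(z)/dz, written with the cached h
    dz_dw1 = x                        # partial z / partial w1
    dr_dw2 = h                        # partial r / partial w2

    dJ_dw1 = dJ_dyhat * dyhat_dr * dr_dh * dh_dz * dz_dw1
    dJ_dw2 = dJ_dyhat * dyhat_dr * dr_dw2
    return J, dJ_dw1, dJ_dw2

print(forward_backward(x=1.0, y=0.0, w1=0.5, w2=-0.3))
```

Note how the backward pass never re-evaluates the network: every factor is computed from values that the forward pass has already cached.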