Now that we've established that our neural network can act as a universal approximator and gained insight into the fundamental principles of optimization algorithms, specifically "gradient descent," we can proceed to optimize our first neural network. Our journey begins by framing neural network training as an optimization problem.

Neural Network Training as an Optimization Problem

Neural network training can be modeled and realized using computational graphs that capture both the network's structure and a loss function. The aim of training a neural network is to find the optimal weights, e.g., $w_1$ and $w_2$, that minimize the difference between the predicted output $\hat{y}(x)$ and the actual label $y$. This process is visualized in the computational graph below:

[Figure: computational graph of the toy network, from the input $x$ through the weights $w_1$ and $w_2$ to the loss]

The optimization problem for neural network training focuses on minimizing the loss value $J$, which quantifies the error between the prediction $\hat{y}=f(x;\theta)$ and the label $y$. For this toy example, the optimization problem can be written as:

$$ \begin{aligned} \arg\min_{w_1, w_2} &\quad L\left(f(x;w_1, w_2), y\right) \\ \text{e.g.,} \quad \arg\min_{w_1, w_2} &\quad (y-\sigma(w_2\sigma(w_1x)))^2 \end{aligned} $$

In this fully unfolded form, the expression explicitly reveals the neural network's activation functions $\sigma$, the weights $w_1$ and $w_2$, and the squared error function. In other words, by turning the "knobs" $w_1$ and $w_2$, we adjust the output $\hat{y}$ to be as close as possible to the desired label $y$.
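To make this concrete, here is a minimal sketch of this loss as code, assuming a single scalar training pair $(x, y)$ and the sigmoid as the activation $\sigma$; the specific values below are hypothetical:

```python
import numpy as np

def sigma(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, x, y):
    """Squared error of the two-weight toy network: (y - sigma(w2 * sigma(w1 * x)))**2."""
    y_hat = sigma(w2 * sigma(w1 * x))
    return (y - y_hat) ** 2

# Turning the "knobs" w1 and w2 moves the prediction, and hence the loss:
x, y = 1.5, 1.0                   # hypothetical training pair
print(loss(0.1, 0.1, x, y))       # ~0.24: prediction far from the label
print(loss(2.0, 3.0, x, y))       # ~0.003: prediction much closer
```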

By Hand: Calculating the Gradient for a Simplified Model

With the toy neural network in hand, we aim to optimize $w_1$ and $w_2$ by minimizing the squared error loss, denoted as $J(w_1, w_2)$. This process involves calculating the gradients of the loss function with respect to both weights, a task simplified by applying the chain rule.

[Figure: the computational graph annotated with the intermediate variables used in the chain rule]

The chain rule breaks down the complex derivative calculation into manageable steps by defining intermediate variables (computed one per line in the code sketch after the list):

  1. Input to First Layer: $z = w_1x$
  2. Activation of First Layer: $h = \sigma(z)=\sigma(w_1x)$
  3. Input to Second Layer: $u = w_2h=w_2\sigma(w_1x)$
  4. Activation of Second Layer: $\hat{y} = \sigma(u)=\sigma(w_2\sigma(w_1x))$
  5. Loss Function: $J=(\hat{y} - y)^2=(\sigma(w_2\sigma(w_1x))-y)^2$
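Expressed in code (reusing `sigma` from the earlier snippet), a forward pass that caches each of these intermediate values might look like the sketch below; the decomposition matters because the backward pass reuses $z$, $h$, $u$, and $\hat{y}$:

```python
def forward(w1, w2, x, y):
    """Forward pass that returns every intermediate value from steps 1-5."""
    z = w1 * x              # 1. input to first layer
    h = sigma(z)            # 2. activation of first layer
    u = w2 * h              # 3. input to second layer
    y_hat = sigma(u)        # 4. activation of second layer (the prediction)
    J = (y_hat - y) ** 2    # 5. loss
    return z, h, u, y_hat, J
```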

The derivative of the loss $J$ with respect to $w_1$ involves a series of multiplications of derivatives of these intermediate functions using the chain rule:

$$ \begin{aligned}\frac{\partial J}{\partial w_1} &= \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial u} \cdot \frac{\partial u}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial w_1}\\&= 2(\hat{y}-y)\cdot \sigma'(u)\cdot w_2\cdot \sigma'(z)\cdot x \\&= 2(\sigma(w_2\sigma(w_1x)) - y) \cdot \sigma'(w_2\sigma(w_1x)) \cdot w_2 \cdot \sigma'(w_1x) \cdot x \end{aligned} $$

This formula captures the essence of backpropagation: the gradient is computed step by step from the output layer back to the input layer, and the result is then used to update $w_1$ and $w_2$ so as to minimize the loss $J$.
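As a sketch of this product in code (reusing `sigma`, `loss`, and `forward` from the snippets above, together with the sigmoid identity $\sigma'(z) = \sigma(z)(1-\sigma(z))$), with a finite-difference sanity check on hypothetical values:

```python
def sigma_prime(z):
    """Sigmoid derivative: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigma(z)
    return s * (1.0 - s)

def grad_w1(w1, w2, x, y):
    """dJ/dw1 as the chain-rule product derived above."""
    z, h, u, y_hat, _ = forward(w1, w2, x, y)
    return 2.0 * (y_hat - y) * sigma_prime(u) * w2 * sigma_prime(z) * x

# Sanity check against a central finite difference (hypothetical values):
w1, w2, x, y = 0.5, -0.3, 1.5, 1.0
eps = 1e-6
numeric = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
print(grad_w1(w1, w2, x, y), numeric)  # the two numbers should match closely
```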

The process for calculating $\frac{\partial J}{\partial w_2}$ follows similar logic, with an even shorter chain because $w_2$ sits closer to the output:

$$ \frac{\partial J}{\partial w_2} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial u} \cdot \frac{\partial u}{\partial w_2} = 2(\hat{y}-y)\cdot \sigma'(u)\cdot h $$
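Completing the sketch, here is $\partial J/\partial w_2$ in code together with a few plain gradient-descent steps; the initial weights and learning rate below are arbitrary choices for illustration:

```python
def grad_w2(w1, w2, x, y):
    """dJ/dw2: a shorter chain, since w2 sits directly before the output."""
    _, h, u, y_hat, _ = forward(w1, w2, x, y)
    return 2.0 * (y_hat - y) * sigma_prime(u) * h

# A few steps of gradient descent on the toy problem:
w1, w2, lr = 0.1, 0.1, 1.0        # hypothetical initial weights and learning rate
x, y = 1.5, 1.0
for _ in range(100):
    g1, g2 = grad_w1(w1, w2, x, y), grad_w2(w1, w2, x, y)
    w1, w2 = w1 - lr * g1, w2 - lr * g2
print(loss(w1, w2, x, y))         # much smaller than the initial ~0.24
```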