Now that we've established that our neural network can serve as a universal approximator and gained insight into the fundamental principles of optimization algorithms, specifically "gradient descent," we can proceed to optimize our first neural network. Our journey begins by framing neural network training as an optimization problem.

Calculate Gradient for a Simplified Model

With the toy neural network in hand, we aim to optimize $w_1$ and $w_2$ by minimizing the squared error loss, denoted as $\ell(w_1, w_2)$. This process involves calculating the gradients of the loss function with respect to both weights, a task simplified by applying the chain rule.

$$ f(x)=\sigma\left(w_2\sigma\left(w_1x\right)\right) $$


The chain rule breaks down the complex derivative calculation into manageable steps by defining intermediate variables:

  1. Input to First Layer: $z = w_1x$
  2. Activation of First Layer: $h = \sigma(z)=\sigma(w_1x)$
  3. Input to Second Layer: $u = w_2h=w_2\sigma(w_1x)$
  4. Activation of Second Layer: $\hat{y} = \sigma(u)=\sigma(w_2\sigma(w_1x))$
  5. Loss Function: $L=(\hat{y} - y)^2=(\sigma(w_2\sigma(w_1x))-y)^2$
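To make these steps concrete, here is a minimal NumPy sketch of the forward pass that records each intermediate quantity. The variable names `z`, `h`, `u`, `y_hat` mirror the definitions above; the specific values of `w1`, `w2`, `x`, and `y` are arbitrary placeholders for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(w1, w2, x, y):
    """Forward pass of f(x) = sigma(w2 * sigma(w1 * x)), keeping intermediates."""
    z = w1 * x            # input to first layer
    h = sigmoid(z)        # activation of first layer
    u = w2 * h            # input to second layer
    y_hat = sigmoid(u)    # activation of second layer (model output)
    L = (y_hat - y) ** 2  # squared error loss
    return z, h, u, y_hat, L

# Example with arbitrary values
z, h, u, y_hat, L = forward(w1=0.5, w2=-1.2, x=2.0, y=1.0)
print(f"prediction={y_hat:.4f}, loss={L:.4f}")
```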

The derivative of the loss $L$ with respect to $w_1$ involves a series of multiplications of derivatives of these intermediate functions using the chain rule:

$$ \begin{aligned}\frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial u} \cdot \frac{\partial u}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial w_1}\\&= 2(\hat{y}-y)\cdot \sigma'(u)\cdot w_2\cdot \sigma'(z)\cdot x \\&= 2(\sigma(w_2\sigma(w_1x)) - y) \cdot \sigma'(w_2\sigma(w_1x)) \cdot w_2 \cdot \sigma'(w_1x) \cdot x \end{aligned} $$

This formula captures the essence of backpropagation: the gradient is computed step by step from the output layer back toward the input layer, yielding the quantities gradient descent needs to update $w_1$ and $w_2$ and minimize the loss $L$.
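As a sanity check, the closed-form gradient above can be coded directly and compared against a numerical finite-difference estimate. This sketch reuses the `sigmoid` and `forward` helpers from the previous snippet; the test values are again arbitrary.

```python
def sigmoid_prime(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def grad_w1(w1, w2, x, y):
    """dL/dw1 via the chain rule: 2(y_hat - y) * sigma'(u) * w2 * sigma'(z) * x."""
    z, h, u, y_hat, _ = forward(w1, w2, x, y)
    return 2 * (y_hat - y) * sigmoid_prime(u) * w2 * sigmoid_prime(z) * x

# Finite-difference check of the analytic gradient
w1, w2, x, y, eps = 0.5, -1.2, 2.0, 1.0, 1e-6
numeric = (forward(w1 + eps, w2, x, y)[-1] - forward(w1 - eps, w2, x, y)[-1]) / (2 * eps)
print(grad_w1(w1, w2, x, y), numeric)  # the two values should agree closely
```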

The calculation of $\frac{\partial L}{\partial w_2}$ follows the same logic; the chain is shorter because $w_2$ sits closer to the output:

$$ \frac{\partial{L}}{\partial{w_2}}=2(\sigma\left(w_2\sigma\left(w_1x\right)\right) - y)\cdot(\sigma'\left(w_2\sigma\left(w_1x\right)\right))\cdot(\sigma\left(w_1x\right)) $$

You might have already noticed the repeated computations shared by these two gradients. Backpropagation exploits this overlap by caching and reusing intermediate results rather than recomputing them, in the spirit of dynamic programming.
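The sketch below illustrates that reuse: the upstream error term $2(\hat{y}-y)\cdot\sigma'(u)$ is computed once and shared between both gradients. It builds on the `forward` and `sigmoid_prime` helpers defined above; the structure is illustrative, not a full backpropagation library.

```python
def backward(w1, w2, x, y):
    """Compute both gradients, reusing the shared upstream term (backpropagation)."""
    z, h, u, y_hat, L = forward(w1, w2, x, y)
    dL_du = 2 * (y_hat - y) * sigmoid_prime(u)  # shared by both gradients
    dL_dw2 = dL_du * h                          # dL/dw2 = dL/du * du/dw2
    dL_dw1 = dL_du * w2 * sigmoid_prime(z) * x  # dL/dw1 continues down the chain
    return dL_dw1, dL_dw2, L
```

A single gradient-descent step would then update `w1 -= lr * dL_dw1` and `w2 -= lr * dL_dw2` for some learning rate `lr`.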

Calculate Gradient for Dataset

In real machine learning problems, models are typically trained on extensive datasets rather than single data points. This requires adjusting our optimization strategy to account for the entire dataset by optimizing the average loss across all samples $(x^{(i)},y^{(i)})$ in the dataset, rather than the loss for a single sample. The optimization objective becomes:

$$ \min_{w_1, w_2} \frac{1}{N}\sum_{i=1}^{N}{L\left(\sigma\left(w_2\sigma\left(w_1x^{(i)}\right)\right), y^{(i)}\right)} $$
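In code, this amounts to averaging the per-sample gradients before taking each step. The sketch below is a full-batch gradient-descent loop built on the `backward` helper above; the toy dataset, initial weights, learning rate, and step count are all made-up choices for illustration.

```python
# Toy dataset and hyperparameters (arbitrary choices for illustration)
xs = np.array([0.5, 1.0, -1.5, 2.0])
ys = np.array([1.0, 0.0, 1.0, 0.0])
w1, w2, lr = 0.1, -0.1, 0.5

for step in range(1000):
    results = [backward(w1, w2, x_i, y_i) for x_i, y_i in zip(xs, ys)]
    g1 = np.mean([r[0] for r in results])  # average dL/dw1 over the dataset
    g2 = np.mean([r[1] for r in results])  # average dL/dw2 over the dataset
    w1, w2 = w1 - lr * g1, w2 - lr * g2    # gradient-descent update

mean_loss = np.mean([backward(w1, w2, x_i, y_i)[2] for x_i, y_i in zip(xs, ys)])
print(f"final average loss: {mean_loss:.4f}")
```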