As we established, our neural network acts as a function parameterized by its internal weights, $\theta$. For any given input ${x}^{(i)}$, the network processes it and produces an output value $z^{(i)}$:

$$ z^{(i)} = f_\theta({x}^{(i)}) $$

In machine learning terminology, you will frequently see this output $z$ referred to as $\hat{y}$ (pronounced "y-hat"). This $\hat{y}$ is the model's prediction.

More precisely, in the context of binary classification, $z$ represents the probability that the observed outcome belongs to the positive class (usually denoted $y=1$).

Therefore, the relationship between our model output and the probability is direct:

$$ p(y^{(i)}=1 \mid x^{(i)}, \theta) = z^{(i)} $$

And naturally, since the probabilities of the two outcomes must sum to $1$, the probability of the opposite outcome ($y=0$) is simply the remainder:

$$ p(y^{(i)}=0 \mid x^{(i)}, \theta) = 1 - z^{(i)} $$
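To make the complementarity concrete, here is a minimal numerical sketch. The value of $z$ is an assumed placeholder, not one produced by a real network:

```python
import math

# Assumed model output for one example: a probability in (0, 1).
# In practice z^{(i)} would come from the network f_theta(x^{(i)}).
z = 0.8

p_y1 = z        # p(y = 1 | x, theta)
p_y0 = 1 - z    # p(y = 0 | x, theta)

# The two outcomes are exhaustive, so their probabilities sum to 1.
assert math.isclose(p_y1 + p_y0, 1.0)
```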

The "Exponent Trick"

To use our Negative Log-Likelihood (NLL) formula, we need a single mathematical expression for $p(y^{(i)} \mid x^{(i)}, \theta)$, rather than two separate "if/then" rules for the 1s and 0s.

A brilliant mathematical trick to unify these probabilities is to use the actual observed target value, $y^{(i)}$, as an exponent.

Since in binary classification $y^{(i)}$ can only ever take the value of exactly $0$ or $1$, we can write the combined probability formulation like this:

$$ p(y^{(i)} \mid x^{(i)}, \theta) = \left(z^{(i)}\right)^{y^{(i)}} \left(1 - z^{(i)}\right)^{(1 - y^{(i)})} $$

Let's break down why this elegant, concise form works perfectly:

- When $y^{(i)} = 1$: the expression becomes $\left(z^{(i)}\right)^{1} \left(1 - z^{(i)}\right)^{0} = z^{(i)}$, recovering our first rule.
- When $y^{(i)} = 0$: the expression becomes $\left(z^{(i)}\right)^{0} \left(1 - z^{(i)}\right)^{1} = 1 - z^{(i)}$, recovering the second.

Deriving the Final Loss Function

Now, we can take this unified probability expression and substitute it directly into our NLL framework from the previous section:

$$ \min_{\theta} \sum_{i=1}^{N} -\log \left[ \left(z^{(i)}\right)^{y^{(i)}} \left(1 - z^{(i)}\right)^{(1 - y^{(i)})} \right] $$
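As a sanity check, the summation above can be computed directly. A minimal sketch, with the batch of outputs and labels assumed for illustration:

```python
import math

# Assumed batch of model outputs z^{(i)} and observed labels y^{(i)}.
zs = [0.9, 0.2, 0.7]
ys = [1, 0, 1]

# NLL exactly as written above: sum_i  -log( z^y * (1 - z)^(1 - y) )
nll = sum(-math.log(z ** y * (1 - z) ** (1 - y)) for z, y in zip(zs, ys))
```

Each term contributes $-\log z^{(i)}$ when $y^{(i)}=1$ and $-\log(1 - z^{(i)})$ when $y^{(i)}=0$, so confident correct predictions add little to the loss while confident wrong ones add a lot.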

To simplify this into its final, computable form, we just apply standard logarithm rules. First, the log of a product becomes the sum of the logs: