As we established, our neural network acts as a function parameterized by its internal weights, $\theta$. For any given input ${x}^{(i)}$, the network processes it and produces an output $z^{(i)}$:
$$ z^{(i)} = f_\theta({x}^{(i)}) $$
In machine learning terminology, you will frequently see this output $z$ referred to as $\hat{y}$ (pronounced "y-hat"). This $\hat{y}$ is the model's prediction.
More precisely, in the context of binary classification, $z^{(i)}$ represents the probability that the observed outcome belongs to the positive class (usually denoted as $y=1$).
Therefore, the relationship between our model output and the probability is direct:
$$ p(y^{(i)}=1 \mid x^{(i)}, \theta) = z^{(i)} $$
And naturally, since the two probabilities must sum to $1$, the probability of the opposite outcome ($y=0$) is simply the remainder:
$$ p(y^{(i)}=0 \mid x^{(i)}, \theta) = 1 - z^{(i)} $$
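To make this concrete, here is a minimal sketch of these two probabilities in code. The one-dimensional "network" `f_theta` is a hypothetical stand-in (a single logistic unit with made-up weights `w` and `b`) chosen only so that its output lands in $(0, 1)$ and can be read as a probability:

```python
import math

def f_theta(x, w, b):
    """Toy stand-in for the network: a logistic unit whose output is in (0, 1).
    w and b are hypothetical parameters playing the role of theta."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

z = f_theta(2.0, w=0.8, b=-0.5)  # model output z = p(y=1 | x, theta)
p_y1 = z                          # probability of the positive class
p_y0 = 1.0 - z                    # probability of the negative class
print(p_y1, p_y0, p_y1 + p_y0)    # the two probabilities sum to 1
```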
To use our Negative Log-Likelihood (NLL) formula, we need a single mathematical expression for $p(y^{(i)} \mid x^{(i)}, \theta)$, rather than two separate "if/then" rules for the 1s and 0s.
A brilliant mathematical trick to unify these probabilities is to use the actual observed target value, $y^{(i)}$, as an exponent.
Since in binary classification $y^{(i)}$ can only ever take the value of exactly $0$ or $1$, we can write the combined probability formulation like this:
$$ p(y^{(i)} \mid x^{(i)}, \theta) = \left(z^{(i)}\right)^{y^{(i)}} \left(1 - z^{(i)}\right)^{(1 - y^{(i)})} $$
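The exponent trick is easy to verify numerically. This short sketch evaluates the unified formula for both possible labels (the value `z = 0.9` is an arbitrary example output):

```python
def p(y, z):
    """Unified Bernoulli likelihood: z**y * (1 - z)**(1 - y)."""
    return z ** y * (1 - z) ** (1 - y)

z = 0.9          # example model output, p(y=1 | x, theta)
print(p(1, z))   # collapses to z when y = 1
print(p(0, z))   # collapses to 1 - z when y = 0
```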
Let's break down why this elegant, concise form works perfectly:
- When $y^{(i)} = 1$: the first factor is $\left(z^{(i)}\right)^{1} = z^{(i)}$ and the second is $\left(1 - z^{(i)}\right)^{0} = 1$, so the expression collapses to $z^{(i)}$, exactly as required.
- When $y^{(i)} = 0$: the first factor is $\left(z^{(i)}\right)^{0} = 1$ and the second is $\left(1 - z^{(i)}\right)^{1} = 1 - z^{(i)}$, so the expression collapses to $1 - z^{(i)}$.
Now, we can take this unified probability expression and substitute it directly into our NLL framework from the previous section:
$$ \min_{\theta} \sum_{i=1}^{N} -\log \left[ \left(z^{(i)}\right)^{y^{(i)}} \left(1 - z^{(i)}\right)^{(1 - y^{(i)})} \right] $$
To simplify this into its final, computable form, we just apply standard logarithm rules. First, the log of a product becomes the sum of the logs:
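Applying $\log(ab) = \log a + \log b$ inside the brackets gives:

$$ \min_{\theta} \sum_{i=1}^{N} -\left[ \log \left(z^{(i)}\right)^{y^{(i)}} + \log \left(1 - z^{(i)}\right)^{(1 - y^{(i)})} \right] $$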