In the previous coin-toss example, we made a major assumption: we were flipping the exact same coin every single time. That’s why our parameter $z$ (the probability of landing on heads) was a single fixed number.
But the real world is rarely that simple. What if the coin changes every time we flip it? What if we are trying to predict outcomes where the underlying probability depends entirely on the specific situation at hand?
This brings us to Conditioned Likelihood, which is the beating heart of how neural networks handle probability.
Let’s step away from coins and look at a scenario that might feel a bit more familiar: predicting whether a student will pass or fail a final exam.
If we just looked at the historical pass rate of the class, we might say $z = 0.7$ (a 70% chance of passing). But that’s a fixed parameter. It assumes every single student has the exact same 70% chance of passing, regardless of their effort.
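Concretely, the fixed-parameter view just estimates $z$ as the historical frequency. A minimal sketch, using made-up toy results (1 = passed, 0 = failed):

```python
# Fixed-parameter model: one z for everyone, estimated as the
# historical pass rate of the class (hypothetical toy data).
results = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

z = sum(results) / len(results)
print(z)  # → 0.7
```

Every student gets the same $z = 0.7$, no matter what they actually did.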
We know that’s not true! A student's chance of passing is conditioned on other factors, like how many hours they studied. Let’s call "hours studied" our input data, $x$.
Now, our probability parameter $z$ is no longer a fixed number. It is a dynamic value that changes depending on the input $x$.
If $z$ changes based on $x$, we need a mathematical function to calculate $z$ for every new student. This is exactly what a neural network does.
A neural network acts as a complex function that takes the input $x$ (study hours) and transforms it into the parameter $z$ (probability of passing). But the network itself has its own internal knobs and dials—the weights and biases that define how it processes information. In machine learning, we group all of these internal network parameters under the symbol $\theta$.
Therefore, the probability $z$ is actually a function of both the student's specific input and the network's current settings:
$$ z = f_\theta(x) $$
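To make this concrete, here is a minimal sketch of $f_\theta$: a single neuron with one weight and one bias standing in for a full network. The weight and bias values are invented for illustration, not trained:

```python
import math

# Hypothetical network parameters theta (made-up values, not trained).
w = 0.8   # weight: how much each study hour shifts the odds
b = -3.0  # bias: baseline when x = 0

def f_theta(x):
    """Map study hours x to a pass probability z via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

print(f_theta(2.0))   # ≈ 0.198: little study, low chance of passing
print(f_theta(10.0))  # ≈ 0.993: lots of study, high chance of passing
```

The sigmoid squashes the neuron's output into $(0, 1)$, so $z$ is always a valid probability, and different inputs $x$ yield different values of $z$.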
Because our probability parameter $z$ is now generated by our model, our likelihood function needs an upgrade. We are no longer calculating the likelihood of the data given a fixed $z$. Instead, we are calculating the likelihood of observing the outcomes $\mathcal{Y}$, conditioned on the inputs $\mathcal{X}$, under the model parameters $\theta$:
$$ \mathcal{L}(\theta) = p(\mathcal{Y} \mid \mathcal{X}, \theta) $$
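A sketch of what this conditioned likelihood computes, reusing the hypothetical single-neuron $f_\theta$ from above on a made-up dataset of four students: each outcome $y_i$ is Bernoulli with its own $z_i = f_\theta(x_i)$, and assuming the observations are independent, their probabilities multiply.

```python
import math

w, b = 0.8, -3.0  # hypothetical network parameters theta

def f_theta(x):
    """Map study hours x to a pass probability z via a sigmoid."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Toy dataset: (hours studied x, passed? y), invented for illustration.
data = [(1.0, 0), (3.0, 0), (6.0, 1), (9.0, 1)]

# L(theta) = product over students of p(y_i | x_i, theta),
# where each y_i is Bernoulli with parameter z_i = f_theta(x_i).
likelihood = 1.0
for x, y in data:
    z = f_theta(x)
    likelihood *= z if y == 1 else (1.0 - z)

print(likelihood)  # ≈ 0.49
```

Note that each student contributes a different factor, because each has a different $z_i$; training a network amounts to adjusting $\theta$ so this product gets as large as possible.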
Let's look at two specific students to see how this works in practice: