The likelihood function, $\mathcal{L}$, is defined as the probability of observing a specific set of data, $\mathcal{Y}$, under a probability distribution parameterized by $z$:
$$ \mathcal{L}(z) = p(\mathcal{Y}\mid z) $$
Where:
- $\mathcal{L}(z)$ is the likelihood of the parameter value $z$ given the observed data,
- $\mathcal{Y}$ is the observed set of data,
- $p(\mathcal{Y}\mid z)$ is the probability of that data under the distribution with parameter $z$.
Example: Bernoulli Distribution (Coin Toss)
Suppose we have a sequence of independent coin tosses:
$$ \mathcal{Y}=[\text{heads}, \text{heads}, \text{tails}] $$
To model this, we treat each toss as a Bernoulli trial. The parameter $z$ represents the probability of the coin landing on heads. Therefore, the probabilities for single events are directly tied to this parameter:
$$ p(y^{(i)} = \text{heads} \mid z) = z \quad p(y^{(i)} = \text{tails} \mid z) = 1 - z $$
Because the coin tosses are independent events, the joint probability of the entire sequence is simply the product of their individual probabilities:
$$ \mathcal{L}(z) = p(\mathcal{Y}\mid z) = p(y^{(1)}=\text{heads}\mid z) \cdot p(y^{(2)}=\text{heads}\mid z) \cdot p(y^{(3)}=\text{tails}\mid z) $$
We can rewrite this likelihood strictly in terms of the distribution parameter $z$:
$$ \mathcal{L}(z) = z \cdot z \cdot (1 - z) = z^2(1 - z) $$
If we assume the coin is biased with the parameter $z = 0.8$ (meaning an 80% chance of heads and a 20% chance of tails), we can plug this value in to compute the overall likelihood of observing this exact sequence:
$$ p(\mathcal{Y} \mid z) = 0.8 \times 0.8 \times 0.2 = 0.128 $$
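This arithmetic can be reproduced in a few lines of Python. This is a minimal sketch; the `likelihood` function and the string encoding of tosses are our own illustrative choices, not part of the original text:

```python
def likelihood(z, tosses):
    """Joint probability of a toss sequence, assuming independent
    Bernoulli trials with P(heads) = z."""
    result = 1.0
    for toss in tosses:
        # Each factor is z for heads, (1 - z) for tails.
        result *= z if toss == "heads" else 1 - z
    return result

data = ["heads", "heads", "tails"]
print(round(likelihood(0.8, data), 3))  # 0.8 * 0.8 * 0.2 -> 0.128
```

Because the trials are independent, the loop simply multiplies the per-toss probabilities, mirroring the product formula above.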
In the previous example, we assumed we already knew the coin's bias ($z = 0.8$). But in real-world scenarios, we observe the data first and have to work backward to figure out the parameter.
Suppose we observe the sequence $\mathcal{Y}=[\text{heads}, \text{heads}, \text{tails}]$, but we don't know the true value of $z$. How do we decide which parameter best describes our coin?
Let's test three different hypotheses for the parameter $z$ using our likelihood formula $\mathcal{L}(z) = z^2(1 - z)$:
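This comparison is easy to carry out in code. The sketch below evaluates $\mathcal{L}(z) = z^2(1 - z)$ at a few candidate values; the specific candidates (0.2, 0.5, 0.8) are illustrative choices of ours, not prescribed by the text:

```python
def likelihood(z):
    # L(z) = z^2 * (1 - z) for the sequence [heads, heads, tails]
    return z**2 * (1 - z)

# Evaluate a few candidate hypotheses for z and compare.
for z in [0.2, 0.5, 0.8]:
    print(f"z = {z}: L(z) = {likelihood(z):.3f}")
```

Whichever candidate yields the largest likelihood is the one that best explains the observed sequence among those tested; a finer grid (or calculus) would locate the overall maximizer.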