We just saw that different values of our parameter, $z$, give different likelihoods, and that some fit the data clearly better than others. Rather than guessing and checking values at random, we want a systematic way to pinpoint the value of $z$ that produces the highest likelihood. This is where we move from simply selecting a good $z$ to mathematically optimizing $z$.
Maximum Likelihood Estimation (MLE) is the formal method we use to find this optimal parameter. It frames our goal as a mathematical optimization problem:
$$ \hat{z} = \underset{z}{\text{arg max}} \ \mathcal{L}(z) \quad\text{where}\quad\mathcal{L}(z) = p(\mathcal{Y}\mid z)=\prod_{i=1}^{N} p(y^{(i)}\mid z) $$
In plain English, $\hat{z}$ (pronounced "z-hat") represents our final, best estimate. We are looking for the specific value of $z$ that maximizes the likelihood function, $\mathcal{L}(z)$, given our full set of observations, $\mathcal{Y}=[y^{(1)}, \cdots, y^{(N)}]$. Writing the likelihood as a product assumes that the observations are independent of one another given $z$.
Applying MLE to the Coin Toss Example
Let's return to our coin toss sequence: $\mathcal{Y}=[\text{heads}, \text{heads}, \text{tails}]$. We want to estimate the true probability of flipping heads, which is our unknown parameter $z$.
We already established the likelihood equation for this specific sequence by treating $z$ as an unknown variable:
$$ \mathcal{L}(z) = p(\mathcal{Y} \mid z) = z \times z \times (1-z) = z^2 - z^3 $$
Essentially, we are looking to maximize the probability of seeing this specific outcome.
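As a quick sanity check, the product form of the likelihood can be compared against the simplified expression $z^2 - z^3$. The following is a minimal sketch, assuming the tosses are encoded as the strings "heads" and "tails"; the function name `likelihood` and the test values of $z$ are illustrative choices, not part of the original derivation.

```python
from math import prod

def likelihood(observations, z):
    """Likelihood of an independent coin-toss sequence when P(heads) = z."""
    # Each heads contributes a factor z, each tails a factor (1 - z).
    return prod(z if y == "heads" else 1.0 - z for y in observations)

tosses = ["heads", "heads", "tails"]

# Compare the product form with the simplified polynomial at a few values of z.
for z in (0.25, 0.5, 0.75):
    direct = likelihood(tosses, z)
    closed_form = z**2 - z**3
    print(f"z={z:.2f}  product={direct:.6f}  z^2 - z^3={closed_form:.6f}")
```

The two columns should agree for every $z$, since both compute $z \times z \times (1-z)$.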
To find the maximum of this function, we turn to calculus. If you picture the likelihood function as a curve on a graph, the highest point (the maximum) occurs exactly where the curve flattens out at the very top. At this peak, the slope (or derivative) is exactly zero.
(Note: the $q$ in the figure corresponds to $z$ in this section.)
Therefore, to determine the best estimate for $z$, we take the derivative of the likelihood function with respect to $z$ and set it equal to zero:
$$ \frac{d}{dz} p(\mathcal{Y} \mid z) = \frac{d}{dz}(z^2 - z^3) = 0 $$
Solving this equation gives the stationary points of $p(\mathcal{Y} \mid z)$: the values of $z$ where the function may reach a maximum, a minimum, or merely level off (an inflection point). Applying the power rule to $z^2 - z^3$ gives us:
$$ 2z - 3z^2 = 0 $$
Now, we simply solve for $z$ to find where the curve flattens out. We can factor out a $z$:
$$ z(2 - 3z) = 0 $$
This equation has two possible solutions (roots): $z = 0$ and $z = \frac{2}{3}$.
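Both roots can be checked by substituting them into the derivative. The sketch below uses Python's `fractions.Fraction` so the arithmetic is exact; the helper name `d_likelihood` is a hypothetical label for the derivative $2z - 3z^2$ obtained above.

```python
from fractions import Fraction

def d_likelihood(z):
    """Derivative of z^2 - z^3 with respect to z, from the power rule."""
    return 2 * z - 3 * z**2

# Both candidate roots should make the derivative exactly zero.
roots = [Fraction(0), Fraction(2, 3)]
for z in roots:
    print(f"z = {z}: derivative = {d_likelihood(z)}")

# A value that is not a root, for contrast: the slope is nonzero at z = 1/2.
print(f"z = 1/2: derivative = {d_likelihood(Fraction(1, 2))}")
```

Exact fractions avoid the small rounding error that would appear if $\frac{2}{3}$ were stored as a floating-point number.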
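If we substitute these back into our likelihood equation, $z = 0$ gives $\mathcal{L}(0) = 0$, the lowest possible likelihood: a coin that never lands heads could not have produced our data. In contrast, $z = \frac{2}{3}$ gives $\mathcal{L}(\frac{2}{3}) = \frac{4}{9} \times \frac{1}{3} = \frac{4}{27} \approx 0.148$, the maximum likelihood for our specific data. The second derivative, $2 - 6z$, confirms this: it equals $-2$ at $z = \frac{2}{3}$, so the curve bends downward there. Therefore, our Maximum Likelihood Estimate is $\hat{z} = \frac{2}{3}$, the value of the coin's bias under which the observed sequence is most probable.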
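The calculus result can also be cross-checked numerically. The sketch below evaluates the likelihood on an evenly spaced grid over $[0, 1]$ and picks the grid point with the largest value. The grid resolution `steps` is an arbitrary assumption. For comparison, it also computes the fraction of heads in the data, which is the well-known closed-form maximum likelihood estimate for independent coin tosses (a general result not derived in this section).

```python
def likelihood(z):
    """Likelihood of [heads, heads, tails] when P(heads) = z."""
    return z * z * (1 - z)

# Evaluate the likelihood on an evenly spaced grid over [0, 1].
steps = 3000  # assumed resolution; a multiple of 3 so that 2/3 lies on the grid
grid = [i / steps for i in range(steps + 1)]
z_best = max(grid, key=likelihood)

# Closed-form shortcut for coin tosses: the fraction of heads in the data.
tosses = ["heads", "heads", "tails"]
z_closed = tosses.count("heads") / len(tosses)

print(f"grid search: z = {z_best:.4f}, likelihood = {likelihood(z_best):.4f}")
print(f"heads / N:   z = {z_closed:.4f}")
```

A grid search like this only approximates the maximizer to within the grid spacing, which is why the derivative-based solution is preferred when it is available.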