In our previous discussions, we focused heavily on Maximum Likelihood Estimation (MLE), which asks: "Given the parameters, how probable is the observed data?" MLE is exclusively concerned with finding the parameters $\theta$ such that the distribution parameters $z(x, \theta)$ best explain our observed data $y$.
However, MLE assumes we enter the problem with a blank slate: it does not account for any pre-existing beliefs or domain knowledge. If our observed sample is small or unrepresentative, MLE will confidently overfit to it.
Maximum A Posteriori (MAP) estimation provides a more comprehensive approach. Instead of maximizing only the likelihood of the data, MAP seeks the parameter $\theta$ that maximizes the posterior probability of the parameters given the data, into which our prior beliefs enter through Bayes' rule.
Since we have already established the Bayesian framework, we can express the core objective of MAP directly. MAP optimizes the posterior distribution by combining the likelihood of the observed data with our prior belief about the parameters:
$$ \arg\max_\theta p(\theta|y, x) = \arg\max_\theta \frac{p(y|\theta, x)\, p(\theta)}{p(y|x)} = \arg\max_\theta\, p(y|\theta, x)\, p(\theta) $$

The evidence $p(y|x)$ does not depend on $\theta$, so it drops out of the maximization.
By multiplying the likelihood by a prior distribution, we embed our preconceived notions into the estimation process. A well-chosen prior can dominate a small or biased sample, protecting the model from overfitting to it, whereas an overly strong prior can drown out genuine signal in the data.
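In log space this trade-off becomes a sum: MAP maximizes log-likelihood plus log-prior, so the prior's precision controls how hard the estimate is pulled toward it. A minimal sketch, using a hypothetical example (a Gaussian mean with a Gaussian prior, where the MAP estimate has a closed form as a precision-weighted average) rather than anything from the text above:

```python
import numpy as np

def map_gaussian_mean(y, mu0, tau2, sigma2=1.0):
    """MAP estimate of a Gaussian mean under a N(mu0, tau2) prior.

    The closed form is a precision-weighted average of the prior mean
    and the data: strong priors (small tau2) pull the estimate toward mu0.
    """
    n = len(y)
    return (mu0 / tau2 + np.sum(y) / sigma2) / (1.0 / tau2 + n / sigma2)

y = np.array([2.1, 1.9, 2.3])                      # small, possibly biased sample
print(map_gaussian_mean(y, mu0=0.0, tau2=100.0))   # weak prior: near the sample mean
print(map_gaussian_mean(y, mu0=0.0, tau2=0.01))    # strong prior: near mu0 = 0
```

With `tau2` large the prior term vanishes and MAP recovers the MLE (the sample mean); with `tau2` small the prior dominates and the data barely move the estimate.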
Let's look at a coin toss to see how MAP corrects for limited data. The parameter $z$ represents the probability of getting heads, meaning $z$ must fall strictly between 0 and 1.
Because $z$ is a probability, a Gaussian prior (whose support covers the entire real line) is a poor fit. The natural choice is a Beta distribution, denoted $\text{Beta}(\alpha, \beta)$, precisely because its support is bounded between $0$ and $1$. The parameters $\alpha$ and $\beta$ can be read as prior pseudo-counts for heads and tails, respectively.
Let's assume we believe the coin is mostly fair, but we want to leave room for uncertainty. We might assign a prior of $\alpha=2$ and $\beta=2$, which centers our belief around $z = 0.5$.
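We can confirm that $\text{Beta}(2, 2)$ encodes this belief with a quick sketch, using the standard closed-form expressions for the Beta distribution's mean and mode:

```python
# Beta(alpha, beta): mean = alpha / (alpha + beta),
#                    mode = (alpha - 1) / (alpha + beta - 2)  (for alpha, beta > 1)
alpha, beta = 2, 2
mean = alpha / (alpha + beta)            # 0.5
mode = (alpha - 1) / (alpha + beta - 2)  # 0.5
print(mean, mode)                        # both 0.5: belief centered on a fair coin
```

Both the mean and the mode sit at $z = 0.5$, while the distribution's spread leaves room for the coin to be unfair.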
Now, suppose we flip the coin three times and observe the following data $\mathcal{Y}$: 2 heads, 1 tail.
$$ p(z|\mathcal{Y}) \propto \underbrace{z^2(1-z)}_{\text{Likelihood}} \times \underbrace{z^{\alpha-1}(1-z)^{\beta-1}}_{\text{Prior}} = z^{\alpha+1}(1-z)^{\beta} $$

Substituting our prior parameters $\alpha=2$ and $\beta=2$ gives $p(z|\mathcal{Y}) \propto z^3(1-z)^2$.
Finding the peak (the mode) of this new distribution is straightforward: it is proportional to a $\text{Beta}(4, 3)$ density, whose mode is $\frac{4-1}{4+3-2} = \frac{3}{5}$, so the MAP estimate is $z = 3/5$.
Notice how MAP pulls the estimate away from the extreme MLE calculation ($2/3$) and closer to our prior belief of a fair coin ($1/2$). This "pull" is the essence of regularization.
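The pull is easy to see numerically. A minimal sketch that locates both estimates for our 2-heads, 1-tail data by evaluating the (unnormalized) likelihood and posterior on a grid:

```python
import numpy as np

z = np.linspace(0.0, 1.0, 100_001)   # candidate values for P(heads)
likelihood = z**2 * (1 - z)          # p(data | z) for 2 heads, 1 tail
prior = z * (1 - z)                  # Beta(2, 2) density, up to a constant
posterior = likelihood * prior       # z^3 (1 - z)^2, up to a constant

z_mle = z[np.argmax(likelihood)]     # ~2/3
z_map = z[np.argmax(posterior)]      # ~3/5
print(z_mle, z_map)
```

The MLE lands near $2/3$, the MAP estimate near $3/5$: the prior has shifted the answer partway toward the fair-coin belief of $1/2$, exactly as the derivation above predicts.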