In the previous discussion, we often focused on solving for the probability $p(y|\theta, \mathbf{x})$. However, why not solve for $p(\theta|y,\mathbf{x})$ instead? Expressed in natural language, the goal becomes: find the parameter $\theta$ that is most probable given the observations $y$ and the data $\mathbf{x}$. This is precisely the task addressed by Maximum A Posteriori (MAP) estimation.
Maximum A Posteriori (MAP) extends Maximum Likelihood Estimation (MLE) by incorporating prior knowledge into statistical inference.
In essence, while MLE focuses solely on the data, MAP provides a more balanced perspective by incorporating both the data and prior knowledge.
In the Bayesian inference context, $\theta$ is regarded as the underlying cause of the observations $y$, and the main subject of our discussion is to infer that cause. In machine-learning terminology, $p(\theta|y, \mathbf{x})$ is known as the posterior distribution, and the corresponding $p(\theta)$ is called the prior distribution.
Note: in most modern machine learning models, particularly discriminative models, $\mathbf{x}$ is treated as a constant given beforehand and therefore always appears on the right side of the conditioning bar.
We use Bayes' theorem to describe the relationship between the posterior and the prior:
$$ \underbrace{p(\theta|y, \mathbf{x})}_{\text{posterior probability}} = \frac{\overbrace{p(y|\theta, \mathbf{x})}^{\text{likelihood}} \times \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y|\mathbf{x})}_{\text{marginal likelihood}}} $$
In MLE, the primary focus is on the likelihood term $p(y|\theta, \mathbf{x})$. When the distribution of the training data matches the actual data well, maximizing the likelihood yields an accurate estimate of $\theta$. However, if the training data is biased, the estimated $\theta$ performs well only on the training data and generalizes poorly to real-world data. The prior $p(\theta)$ can partially correct this bias by providing an initial belief, especially when data is insufficient or skewed.
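To make this concrete, here is a small sketch comparing MLE with MAP under an assumed conjugate Beta prior on a coin's heads probability. With a small, skewed sample, the prior pulls the estimate back toward fairness; the prior pseudo-counts and flip counts below are illustrative assumptions, not taken from the text.

```python
# Illustrative comparison: MLE vs. MAP for a coin's heads probability theta.
# Assumed Beta(5, 5) prior centered on fairness; small, skewed sample.
alpha, beta = 5.0, 5.0   # prior pseudo-counts (assumption for illustration)
heads, tails = 9, 1      # a small sample that happens to be heads-heavy

# MLE follows the data alone.
theta_mle = heads / (heads + tails)

# MAP is the mode of the Beta(heads + alpha, tails + beta) posterior.
theta_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(theta_mle)  # 0.9
print(theta_map)  # 13/18 ≈ 0.722, shrunk toward the prior mean of 0.5
```

Note how the prior acts like extra "virtual" coin flips: the stronger the prior (larger pseudo-counts), the more the estimate resists a small biased sample.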
In Bayesian statistics, a reasonable prior not only provides a starting point for parameter estimation but also helps correct the bias introduced by training data, enhancing the model's ability to generalize to new data. A strong prior might dominate the posterior probability, reducing the influence of the data, which is suitable for scenarios with limited or low-quality data. However, an overly strong prior can make the model insensitive to true signals in the data, so the choice of prior must be made carefully.
In practice, the selection of priors should be based on domain knowledge or prior experience. For instance, in the case of coin tossing, if we suspect the coin is slightly biased toward heads, we could choose a prior deviating from $0.5$, such as $\theta \sim \mathcal{N}(0.6, 0.1)$, encoding that belief while maintaining some uncertainty.
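As a sketch, the MAP estimate under such a Gaussian prior has no closed form for a binomial likelihood, but it can be found by a grid search over the log-posterior. The prior parameters and flip counts below are assumptions chosen for illustration.

```python
import numpy as np

# Sketch: MAP estimate of a coin's bias theta under an assumed Gaussian
# prior theta ~ N(0.6, 0.1), with 7 heads and 3 tails observed.
heads, tails = 7, 3
theta = np.linspace(0.01, 0.99, 981)  # grid over admissible biases

# Log-likelihood of the flips plus Gaussian log-prior (up to constants).
log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)
log_prior = -0.5 * ((theta - 0.6) / 0.1) ** 2

theta_map = theta[np.argmax(log_likelihood + log_prior)]
print(theta_map)  # lands between the prior mean 0.6 and the MLE 0.7
```

Working in log space is the usual trick: products of small probabilities become sums, which avoids numerical underflow and leaves the argmax unchanged.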
Optional: Another advantage of the Bayesian method is that the posterior probability can be updated continuously. With each new data point, the current posterior can serve as the prior for the next analysis, allowing us to iteratively refine our understanding of the parameters as new information becomes available.
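This incremental update can be sketched with a Beta-Binomial model, where conjugacy reduces the posterior-becomes-prior step to a simple count update. The batch sizes below are illustrative assumptions.

```python
# Sketch of sequential Bayesian updating with a Beta-Binomial coin model.
# Conjugacy keeps the posterior in the Beta family at every step.
alpha, beta = 1.0, 1.0  # start from a uniform Beta(1, 1) prior

# Three batches of coin flips arrive over time (illustrative counts).
for heads, tails in [(3, 1), (2, 2), (6, 0)]:
    # Today's posterior Beta(alpha + heads, beta + tails)
    # serves as tomorrow's prior.
    alpha += heads
    beta += tails

# Identical to a single batch update on all 14 flips: Beta(1+11, 1+3).
print(alpha, beta)                  # 12.0 4.0
print(alpha / (alpha + beta))       # posterior mean 0.75
```

The order of the batches does not matter here: sequential updating and one batch update over all the data yield the same posterior, which is what makes this style of online inference coherent.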
Thus, compared to MLE, the Bayesian approach offers a more flexible and comprehensive framework to handle various uncertainties and situations of data insufficiency. It not only focuses on fitting the data but also emphasizes synthesizing prior knowledge with new data to make more reasonable inferences.
The core objective of MAP is to optimize the posterior distribution:
$$ \argmax_\theta p(\theta|y, \mathbf{x}) = \argmax_\theta \frac{p(y|\theta, \mathbf{x}) \times p(\theta)}{p(y|\mathbf{x})} = \argmax_\theta \, p(y|\theta, \mathbf{x}) \times p(\theta) $$

Since the marginal likelihood $p(y|\mathbf{x})$ does not depend on $\theta$, it can be dropped from the optimization: MAP reduces to maximizing the product of the likelihood and the prior.
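Because the marginal likelihood $p(y|\mathbf{x})$ is constant in $\theta$, maximizing the full posterior gives the same argmax as maximizing likelihood $\times$ prior. A toy check on a discrete grid (the model and all numbers are assumptions for illustration):

```python
import numpy as np

# Toy model: theta on a discrete grid, binomial-style likelihood for
# 8 heads and 2 tails, and a Gaussian-shaped prior (all illustrative).
theta = np.linspace(0.05, 0.95, 19)
likelihood = theta**8 * (1 - theta)**2
prior = np.exp(-0.5 * ((theta - 0.5) / 0.15) ** 2)
prior /= prior.sum()  # normalize the prior over the grid

unnormalized = likelihood * prior            # numerator of Bayes' theorem
posterior = unnormalized / unnormalized.sum()  # divide by p(y|x)

# Normalizing rescales every entry by the same constant,
# so the argmax is unchanged.
assert np.argmax(posterior) == np.argmax(unnormalized)
print(theta[np.argmax(posterior)])
```

This is why MAP implementations typically never compute $p(y|\mathbf{x})$ at all: the denominator would require integrating over all $\theta$, and the optimization does not need it.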