In the previous discussion, we often focused on solving for the probability $p(y|\theta, \mathbf{x})$. However, why not solve for $p(\theta|y,\mathbf{x})$ instead? Expressed in natural language, the goal becomes: find the parameter $\theta$ that is most probable given the observations $y$ and the data $\mathbf{x}$. This is precisely the task addressed by Maximum A Posteriori (MAP) estimation.
Maximum A Posteriori (MAP) extends Maximum Likelihood Estimation (MLE) by incorporating prior knowledge into statistical inference.
In essence, while MLE focuses solely on the data, MAP provides a more balanced perspective by incorporating both the data and prior knowledge.
In the Bayesian inference context, $\theta$ is regarded as the underlying cause of the observations $y$, and the main subject of our discussion is to infer that cause. In machine-learning terminology, $p(\theta|y, \mathbf{x})$ is known as the posterior distribution, and the corresponding $p(\theta)$ is called the prior distribution.
Note: in most modern machine learning models, particularly discriminative models, $\mathbf{x}$ is treated as a constant given beforehand and therefore always appears on the right side of the conditioning bar.
We use Bayes' theorem to describe the relationship between the posterior and the prior:
$$ \underbrace{p(\theta|y, \mathbf{x})}_{\text{posterior probability}} = \frac{\overbrace{p(y|\theta, \mathbf{x})}^{\text{likelihood}} \times \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(y|\mathbf{x})}_{\text{marginal likelihood}}} $$
In MLE, the primary focus is on the likelihood term $p(y|\theta, \mathbf{x})$. When the distribution of the training data matches the actual data well, maximizing the likelihood yields an accurate estimate of $\theta$. However, if the training data is biased, the estimated $\theta$ performs well only on the training data and generalizes poorly to real-world data. The prior $p(\theta)$ can partially correct this bias by providing an initial belief, especially when data is insufficient or skewed.
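To make this concrete, here is a small sketch comparing MLE with MAP under an assumed conjugate Beta prior on a coin's heads probability. With a small, skewed sample, the prior pulls the estimate back toward fairness; the prior pseudo-counts and flip counts below are illustrative assumptions, not taken from the text.

```python
# Illustrative comparison: MLE vs. MAP for a coin's heads probability theta.
# Assumed Beta(5, 5) prior centered on fairness; small, skewed sample.
alpha, beta = 5.0, 5.0   # prior pseudo-counts (assumption for illustration)
heads, tails = 9, 1      # a small sample that happens to be heads-heavy

# MLE follows the data alone.
theta_mle = heads / (heads + tails)

# MAP is the mode of the Beta(heads + alpha, tails + beta) posterior.
theta_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(theta_mle)  # 0.9
print(theta_map)  # 13/18 ≈ 0.722, shrunk toward the prior mean of 0.5
```

Note how the prior acts like extra "virtual" coin flips: the stronger the prior (larger pseudo-counts), the more the estimate resists a small biased sample.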
In Bayesian statistics, a reasonable prior not only provides a starting point for parameter estimation but also helps correct the bias introduced by training data, enhancing the model's ability to generalize to new data. A strong prior might dominate the posterior probability, reducing the influence of the data, which is suitable for scenarios with limited or low-quality data. However, an overly strong prior can make the model insensitive to true signals in the data, so the choice of prior must be made carefully.
In practice, the selection of priors should be based on domain knowledge or prior experience. For instance, in the case of coin tossing, if we suspect the coin is slightly biased toward heads, we could choose a prior deviating from $0.5$, such as $\theta \sim \mathcal{N}(0.6, 0.1)$, encoding that belief while maintaining some uncertainty.
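As a sketch, the MAP estimate under such a Gaussian prior has no closed form for a binomial likelihood, but it can be found by a grid search over the log-posterior. The prior parameters and flip counts below are assumptions chosen for illustration.

```python
import numpy as np

# Sketch: MAP estimate of a coin's bias theta under an assumed Gaussian
# prior theta ~ N(0.6, 0.1), with 7 heads and 3 tails observed.
heads, tails = 7, 3
theta = np.linspace(0.01, 0.99, 981)  # grid over admissible biases

# Log-likelihood of the flips plus Gaussian log-prior (up to constants).
log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)
log_prior = -0.5 * ((theta - 0.6) / 0.1) ** 2

theta_map = theta[np.argmax(log_likelihood + log_prior)]
print(theta_map)  # lands between the prior mean 0.6 and the MLE 0.7
```

Working in log space is the usual trick: products of small probabilities become sums, which avoids numerical underflow and leaves the argmax unchanged.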
Optional: Another advantage of the Bayesian method is that the posterior probability can be updated continuously. With each new data point, the current posterior can serve as the prior for the next analysis, allowing us to iteratively refine our understanding of the parameters as new information becomes available.
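This incremental update can be sketched with a Beta-Binomial model, where conjugacy reduces the posterior-becomes-prior step to a simple count update. The batch sizes below are illustrative assumptions.

```python
# Sketch of sequential Bayesian updating with a Beta-Binomial coin model.
# Conjugacy keeps the posterior in the Beta family at every step.
alpha, beta = 1.0, 1.0  # start from a uniform Beta(1, 1) prior

# Three batches of coin flips arrive over time (illustrative counts).
for heads, tails in [(3, 1), (2, 2), (6, 0)]:
    # Today's posterior Beta(alpha + heads, beta + tails)
    # serves as tomorrow's prior.
    alpha += heads
    beta += tails

# Identical to a single batch update on all 14 flips: Beta(1+11, 1+3).
print(alpha, beta)                  # 12.0 4.0
print(alpha / (alpha + beta))       # posterior mean 0.75
```

The order of the batches does not matter here: sequential updating and one batch update over all the data yield the same posterior, which is what makes this style of online inference coherent.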
Thus, compared to MLE, the Bayesian approach offers a more flexible and comprehensive framework to handle various uncertainties and situations of data insufficiency. It not only focuses on fitting the data but also emphasizes synthesizing prior knowledge with new data to make more reasonable inferences.
The core objective of MAP is to optimize the posterior distribution:
$$ \argmax_\theta p(\theta|y, \mathbf{x}) = \argmax_\theta \frac{p(y|\theta, \mathbf{x}) \times p(\theta)}{p(y|\mathbf{x})} = \argmax_\theta \, p(y|\theta, \mathbf{x}) \times p(\theta) $$

Since the marginal likelihood $p(y|\mathbf{x})$ does not depend on $\theta$, it can be dropped from the optimization: MAP reduces to maximizing the product of the likelihood and the prior.
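Because the marginal likelihood $p(y|\mathbf{x})$ is constant in $\theta$, maximizing the full posterior gives the same argmax as maximizing likelihood $\times$ prior. A toy check on a discrete grid (the model and all numbers are assumptions for illustration):

```python
import numpy as np

# Toy model: theta on a discrete grid, binomial-style likelihood for
# 8 heads and 2 tails, and a Gaussian-shaped prior (all illustrative).
theta = np.linspace(0.05, 0.95, 19)
likelihood = theta**8 * (1 - theta)**2
prior = np.exp(-0.5 * ((theta - 0.5) / 0.15) ** 2)
prior /= prior.sum()  # normalize the prior over the grid

unnormalized = likelihood * prior            # numerator of Bayes' theorem
posterior = unnormalized / unnormalized.sum()  # divide by p(y|x)

# Normalizing rescales every entry by the same constant,
# so the argmax is unchanged.
assert np.argmax(posterior) == np.argmax(unnormalized)
print(theta[np.argmax(posterior)])
```

This is why MAP implementations typically never compute $p(y|\mathbf{x})$ at all: the denominator would require integrating over all $\theta$, and the optimization does not need it.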