In our previous discussion on Maximum A Posteriori (MAP) estimation, we saw how incorporating prior beliefs helps anchor our models against the noise of small or biased datasets.
However, MAP shares a fundamental limitation with Maximum Likelihood Estimation (MLE): they both ultimately produce a single, solitary answer.
MAP is a point estimate. It finds the peak (the mode) of the posterior distribution and discards the rest. If the posterior distribution is a vast, rolling mountain range of possibilities, MAP simply plants a flag at the highest peak and ignores the shape of the mountain.
Bayesian Estimation, on the other hand, is about embracing the entire landscape. Instead of asking "What is the single most probable parameter?", Bayesian estimation asks,
"What is the full range of plausible parameters, and how likely is each one?" It retains the entire posterior distribution, carrying forward a complete picture of our uncertainty.
The core concept of pure Bayesian estimation is that we never collapse our beliefs down to a single, absolute estimate for a parameter $z$. Instead, we treat $z$ as a random variable and use the entire posterior distribution $p(z|y)$ to represent our updated uncertainty after seeing the data $y$. This complete distribution then serves as the foundation for all downstream tasks, preserving our uncertainty rather than discarding it.
We mathematically define this update using Bayes' theorem:
$$ p(z|y)=\frac{p(y|z) p(z)}{p(y)}\propto p(y|z) p(z) $$
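To make this update concrete, here is a minimal numerical sketch using a hypothetical setup: $z$ is a coin's heads-probability, the data $y$ is 7 heads in 10 flips, and we place a flat prior on $z$. We evaluate $p(y|z)\,p(z)$ on a grid and normalize, which plays the role of dividing by $p(y)$:

```python
import numpy as np

# Hypothetical example: z = heads-probability, y = 7 heads in 10 flips.
z = np.linspace(0, 1, 1001)         # grid of candidate parameter values
dz = z[1] - z[0]
prior = np.ones_like(z)             # flat prior p(z)
likelihood = z**7 * (1 - z)**3      # Bernoulli likelihood p(y|z), up to a constant
unnormalized = likelihood * prior   # p(y|z) p(z)

# p(y) is just the normalizing constant: the integral of p(y|z) p(z) over z
posterior = unnormalized / (unnormalized.sum() * dz)

print((posterior * dz).sum())       # integrates to 1
print(z[np.argmax(posterior)])      # the MAP estimate keeps only this peak: 0.7
```

Note that MAP would report only the final number, `0.7`, whereas the Bayesian approach keeps the whole `posterior` array as its answer.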
The proportionality at the end of this chain highlights a crucial practical shortcut in Bayesian inference: because $p(y)$ does not depend on $z$, we can often work with the unnormalized product $p(y|z)\,p(z)$ alone.
Note: When utilizing conjugate priors, Bayesian inference becomes significantly more tractable. Since the posterior distribution is known to belong to the same probability family as the prior, we do not need to compute the full equation $p(z|y)=\frac{p(y|z) p(z)}{p(y)}$. Instead, we can simply multiply the likelihood and the prior, $p(y|z) p(z)$, to algebraically identify the updated distribution parameters of the posterior. The marginal likelihood, $p(y)$, can be safely ignored during this step as it is independent of $z$ and functions purely as a normalizing constant.
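As a sketch of this shortcut, consider the standard conjugate pair of a Beta prior with a Bernoulli likelihood (reusing the hypothetical coin data of 7 heads in 10 flips). Conjugacy means the posterior is again a Beta distribution, and the update reduces to simple parameter arithmetic with no integral over $p(y)$:

```python
# Hypothetical conjugate example: Beta prior + Bernoulli likelihood.
# Prior Beta(a, b); data: k heads in n flips.
# Conjugacy gives the posterior in closed form: Beta(a + k, b + n - k).
a, b = 2, 2               # prior pseudo-counts (2 heads, 2 tails "imagined")
k, n = 7, 10              # observed data
a_post, b_post = a + k, b + (n - k)
print(a_post, b_post)     # posterior is Beta(9, 5) -- p(y) never computed
```

The prior parameters act as pseudo-counts that are simply added to the observed counts, which is why the update is purely algebraic.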
The fully Bayesian approach computes the predictive distribution by calculating a weighted average over every possible value of $z$, weighted by how probable that $z$ is according to our posterior. Mathematically, this requires integration:
$$ p(y'|y) = \int p(y'|z) p(z|y) dz $$
The integral averages the predictive density $p(y'|z)$ over all plausible values of $z$, so the prediction automatically reflects our remaining uncertainty about the parameter rather than committing to a single point estimate.
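Continuing the hypothetical Beta-Bernoulli example with posterior $\text{Beta}(9, 5)$: the predictive probability of the next flip landing heads is $\int z \, p(z|y)\,dz$, the posterior mean, which for a Beta distribution has the closed form $\tfrac{a}{a+b}$. The sketch below checks the analytic value against a grid approximation of the integral:

```python
import numpy as np

# Hypothetical posterior from the earlier conjugate update: Beta(9, 5).
a_post, b_post = 9, 5

# Analytic predictive: p(y'=1|y) = posterior mean = a / (a + b).
analytic = a_post / (a_post + b_post)

# Numerical version of the integral  p(y'=1|y) = ∫ z p(z|y) dz  on a grid.
z = np.linspace(1e-6, 1 - 1e-6, 100001)
dz = z[1] - z[0]
post = z**(a_post - 1) * (1 - z)**(b_post - 1)  # Beta density, unnormalized
post /= post.sum() * dz                         # normalize numerically
numeric = (z * post).sum() * dz

print(analytic, numeric)    # both approximately 0.6429
```

The two values agree, illustrating that the predictive distribution is a posterior-weighted average over every candidate $z$, not an evaluation at a single best $z$.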