In our previous discussion on Maximum A Posteriori (MAP) estimation, we saw how incorporating prior beliefs helps anchor our models against the noise of small or biased datasets.
However, MAP shares a fundamental limitation with Maximum Likelihood Estimation (MLE): they both ultimately produce a single, solitary answer.
MAP is a point estimate. It finds the peak (the mode) of the posterior distribution and discards the rest. If the posterior distribution is a vast, rolling mountain range of possibilities, MAP simply plants a flag at the highest peak and ignores the shape of the mountain.
Bayesian Estimation, on the other hand, is about embracing the entire landscape. Instead of asking "What is the single most probable parameter?", Bayesian estimation asks,
"What is the full range of plausible parameters, and how likely is each one?" It retains the entire posterior distribution, carrying forward a complete picture of our uncertainty.
The core concept of pure Bayesian estimation is that we never collapse our beliefs down to a single, absolute estimate for a parameter $z$. Instead, we treat $z$ as a random variable and use the entire posterior distribution $p(z|y)$ to represent our updated uncertainty after seeing the data $y$. This complete distribution then serves as the foundation for all downstream tasks, preserving our uncertainty rather than discarding it.
We mathematically define this update using Bayes' theorem:
$$ p(z|y)=\frac{p(y|z) p(z)}{p(y)}\propto p(y|z) p(z) $$
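To make this update concrete, here is a minimal numerical sketch using a hypothetical setup: $z$ is a coin's heads-probability, the data $y$ is 7 heads in 10 flips, and we place a flat prior on $z$. We evaluate $p(y|z)\,p(z)$ on a grid and normalize, which plays the role of dividing by $p(y)$:

```python
import numpy as np

# Hypothetical example: z = heads-probability, y = 7 heads in 10 flips.
z = np.linspace(0, 1, 1001)         # grid of candidate parameter values
dz = z[1] - z[0]
prior = np.ones_like(z)             # flat prior p(z)
likelihood = z**7 * (1 - z)**3      # Bernoulli likelihood p(y|z), up to a constant
unnormalized = likelihood * prior   # p(y|z) p(z)

# p(y) is just the normalizing constant: the integral of p(y|z) p(z) over z
posterior = unnormalized / (unnormalized.sum() * dz)

print((posterior * dz).sum())       # integrates to 1
print(z[np.argmax(posterior)])      # the MAP estimate keeps only this peak: 0.7
```

Note that MAP would report only the final number, `0.7`, whereas the Bayesian approach keeps the whole `posterior` array as its answer.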
The proportionality at the end of this chain highlights a crucial practical shortcut in Bayesian inference: because $p(y)$ does not depend on $z$, we can often work with the unnormalized product $p(y|z)\,p(z)$ alone.
Note: When utilizing conjugate priors, Bayesian inference becomes significantly more tractable. Since the posterior distribution is known to belong to the same probability family as the prior, we do not need to compute the full equation $p(z|y)=\frac{p(y|z) p(z)}{p(y)}$. Instead, we can simply multiply the likelihood and the prior, $p(y|z) p(z)$, to algebraically identify the updated distribution parameters of the posterior. The marginal likelihood, $p(y)$, can be safely ignored during this step as it is independent of $z$ and functions purely as a normalizing constant.
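As a sketch of this shortcut, consider the standard conjugate pair of a Beta prior with a Bernoulli likelihood (reusing the hypothetical coin data of 7 heads in 10 flips). Conjugacy means the posterior is again a Beta distribution, and the update reduces to simple parameter arithmetic with no integral over $p(y)$:

```python
# Hypothetical conjugate example: Beta prior + Bernoulli likelihood.
# Prior Beta(a, b); data: k heads in n flips.
# Conjugacy gives the posterior in closed form: Beta(a + k, b + n - k).
a, b = 2, 2               # prior pseudo-counts (2 heads, 2 tails "imagined")
k, n = 7, 10              # observed data
a_post, b_post = a + k, b + (n - k)
print(a_post, b_post)     # posterior is Beta(9, 5) -- p(y) never computed
```

The prior parameters act as pseudo-counts that are simply added to the observed counts, which is why the update is purely algebraic.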
The fully Bayesian approach computes the predictive distribution by calculating a weighted average over every possible value of $z$, weighted by how probable that $z$ is according to our posterior. Mathematically, this requires integration:
$$ p(y'|y) = \int p(y'|z) p(z|y) dz $$
The integral averages the predictive density $p(y'|z)$ over all plausible values of $z$, so the prediction automatically reflects our remaining uncertainty about the parameter rather than committing to a single point estimate.
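Continuing the hypothetical Beta-Bernoulli example with posterior $\text{Beta}(9, 5)$: the predictive probability of the next flip landing heads is $\int z \, p(z|y)\,dz$, the posterior mean, which for a Beta distribution has the closed form $\tfrac{a}{a+b}$. The sketch below checks the analytic value against a grid approximation of the integral:

```python
import numpy as np

# Hypothetical posterior from the earlier conjugate update: Beta(9, 5).
a_post, b_post = 9, 5

# Analytic predictive: p(y'=1|y) = posterior mean = a / (a + b).
analytic = a_post / (a_post + b_post)

# Numerical version of the integral  p(y'=1|y) = ∫ z p(z|y) dz  on a grid.
z = np.linspace(1e-6, 1 - 1e-6, 100001)
dz = z[1] - z[0]
post = z**(a_post - 1) * (1 - z)**(b_post - 1)  # Beta density, unnormalized
post /= post.sum() * dz                         # normalize numerically
numeric = (z * post).sum() * dz

print(analytic, numeric)    # both approximately 0.6429
```

The two values agree, illustrating that the predictive distribution is a posterior-weighted average over every candidate $z$, not an evaluation at a single best $z$.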