In any supervised machine learning problem, we are trying to learn a relationship between an input $x$ and an output $y$ using some parameters $\theta$.

A fully Bayesian approach separates the "perfect" underlying physics of the world from the messy reality of our measurements. We define the underlying, clean, deterministic model output as $z$:

$$ z = f_\theta(x) $$

Here, $z$ represents the exact, uncorrupted mechanism. If we knew the true parameters $\theta$ perfectly, and the universe were flawless, $z$ would be exactly what we observe.

However, we rarely observe this pure $z$ directly in the real world. Instead, we model our actual observation $y$ as a probabilistic outcome, drawn from some distribution $\mathcal{D}$ that is governed by our clean model output $z(x, \theta)$:

$$ y|x, \theta \sim \mathcal{D}(z(x,\theta)) $$
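As a concrete sketch of this generative story, suppose $f_\theta$ is a simple linear function and $\mathcal{D}$ is a Gaussian centered at $z$ (both choices are illustrative assumptions, not prescribed by the framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(x, theta):
    # Clean, deterministic model output z = f_theta(x).
    # theta = (slope, intercept): an illustrative linear choice.
    slope, intercept = theta
    return slope * x + intercept

theta_true = (2.0, -1.0)   # hypothetical "true" parameters
sigma = 0.5                # observation noise scale (assumed Gaussian D)

x = np.linspace(0.0, 1.0, 20)
z = f_theta(x, theta_true)             # z = f_theta(x), never observed directly
y = rng.normal(loc=z, scale=sigma)     # y | x, theta ~ N(z, sigma^2)
```

Here `y` is the messy observation we actually record, while `z` stays internal to the model.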

Because the relationship between our model and our observations is probabilistic, the Bayesian objective is not to settle for a single, definitive estimate of $\theta$. Instead, we want to capture the full picture of our uncertainty by finding the complete posterior distribution $p(\theta|x, y)$ using Bayes' theorem:

$$ \overbrace{p(\theta|x, y)}^{\text{Posterior}}=\frac{p(y|x, \theta) p(\theta)}{p(y|x)} \propto p(y|x, \theta) p(\theta) $$

By using the entire distribution of $\theta$, we can capture a complete landscape of uncertainty.
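For a low-dimensional $\theta$, the unnormalized posterior $p(y|x,\theta)\,p(\theta)$ can be evaluated on a grid and normalized numerically. A minimal sketch for a scalar slope parameter with a Gaussian likelihood and a broad Gaussian prior (all of these modeling choices are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = theta * x + Gaussian noise, with a scalar theta.
theta_true, sigma = 2.0, 0.5
x = np.linspace(0.0, 1.0, 20)
y = theta_true * x + rng.normal(scale=sigma, size=x.shape)

# Grid over theta, and an assumed N(0, 10^2) prior p(theta).
thetas = np.linspace(-5.0, 5.0, 1001)

def log_gauss(v, mean, scale):
    # Log density of a Gaussian, written out by hand.
    return -0.5 * ((v - mean) / scale) ** 2 - np.log(scale * np.sqrt(2 * np.pi))

# log p(y | x, theta) at every grid point (sum over data points).
log_lik = log_gauss(y[None, :], thetas[:, None] * x[None, :], sigma).sum(axis=1)
log_prior = log_gauss(thetas, 0.0, 10.0)

# Posterior ∝ likelihood × prior; normalize numerically over the grid.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum() * (thetas[1] - thetas[0])   # density values on the grid

theta_map = thetas[np.argmax(post)]   # grid maximum, should land near theta_true
```

Note that the evidence $p(y|x)$ never has to be computed explicitly: normalizing over the grid plays its role, which is exactly why the proportionality in Bayes' theorem suffices.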

## Inference Using the Posterior Predictive Distribution

Once we have our posterior distribution $p(\theta|x, y)$, the goal is to leverage it to make predictions for new, unseen data. Given a new input $x'$, we want to infer the corresponding output $y'$.

In a purely Bayesian framework, we do not collapse our findings to predict a single point value for $y'$. Instead, we formulate the posterior predictive distribution, denoted as $p(y'|x', x, y)$. This represents our belief about a new observation $y'$, conditioned not just on the new input $x'$, but on everything we learned from our training data $(x, y)$.

To construct this predictive distribution, we must utilize the entire posterior landscape we mapped out earlier. Rather than discarding our uncertainty by picking one "best" set of parameters, we integrate (marginalize) over the entire parameter space of $\theta$:

$$ p(y'|x', x, y) = \int p(y'|x', \theta) p(\theta|x, y) d\theta = \mathbb{E}_{\theta\sim \text{Posterior}}[p(y'|x', \theta)] $$

This integral acts as an infinite, weighted ensemble of models. Conceptually, it works like this:

  1. Generate a Distribution Prediction: For every possible parameter configuration $\theta$, we ask the model what it predicts for the new observation: $p(y'|x', \theta)$. Since $\theta$ is held fixed in this conditional, this is simply the model's output distribution for the input $x'$, with distribution parameters $z(x', \theta)$.
  2. Weight the Prediction: We weight that specific prediction by how probable that parameter configuration is, which is exactly our posterior: $p(\theta|x, y)$.
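In practice this integral is rarely tractable in closed form, so it is typically approximated by Monte Carlo: draw samples of $\theta$ from the posterior, generate a prediction from each, and pool them. A sketch reusing the illustrative Gaussian model (the posterior samples below are stand-ins; in a real workflow they would come from MCMC, variational inference, or a grid approximation):

```python
import numpy as np

rng = np.random.default_rng(2)

sigma = 0.5          # observation noise scale (assumed known)
x_new = 0.8          # new input x'

# Stand-in posterior samples of a scalar theta, here drawn from an
# assumed N(2.0, 0.2^2) posterior purely for illustration.
theta_samples = rng.normal(loc=2.0, scale=0.2, size=5000)

# Step 1: each theta yields a predictive distribution N(theta * x', sigma^2);
# draw one y' per posterior sample.
y_new_samples = rng.normal(loc=theta_samples * x_new, scale=sigma)

# Step 2: pooling the samples weights each prediction by how probable its
# theta is, approximating E_{theta ~ posterior}[ p(y' | x', theta) ].
pred_mean = y_new_samples.mean()
pred_std = y_new_samples.std()
```

The spread of `y_new_samples` reflects both the observation noise (`sigma`) and the remaining uncertainty about $\theta$ (the spread of `theta_samples`), which is exactly the ensemble behavior the integral describes.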

This formulation elegantly merges two distinct sources of uncertainty into our final prediction: