In any supervised machine learning problem, we are trying to learn a relationship between an input $x$ and an output $y$ using some parameters $\theta$.
A fully Bayesian approach separates the "perfect" underlying physics of the world from the messy reality of our measurements. We define the underlying, clean, deterministic model output as $z$:
$$ z = f_\theta(x) $$
Here, $z$ represents the exact, uncorrupted mechanism. If we knew the true parameters $\theta$ perfectly, and the universe were flawless, $z$ would be exactly what we observe.
However, we rarely observe this pure $z$ directly in the real world. Instead, we model our actual observation $y$ as a probabilistic outcome, drawn from some distribution $\mathcal{D}$ that is governed by our clean model output $z(x, \theta)$:
$$ y|x, \theta \sim \mathcal{D}(z(x,\theta)) $$
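This generative story is easy to simulate. The sketch below assumes, purely for illustration, that $f_\theta$ is a line and that $\mathcal{D}$ is a Gaussian with a known noise scale of 0.3; neither choice is prescribed above, but Gaussian noise is the most common concrete instance:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, x):
    # Hypothetical clean model: a line, z = theta[0] + theta[1] * x.
    return theta[0] + theta[1] * x

true_theta = np.array([1.0, 2.0])   # illustrative "true" parameters
x = np.linspace(0.0, 1.0, 50)

z = f(true_theta, x)                # the exact, uncorrupted mechanism
y = rng.normal(loc=z, scale=0.3)    # y | x, theta ~ D(z) with D = Normal(z, 0.3)
```

The observed `y` scatters around the clean `z`; the job of inference is to recover plausible values of `theta` from `y` alone.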
Because the relationship between our model and our observations is probabilistic, the Bayesian objective is not to settle for a single, definitive estimate of $\theta$. Instead, we capture the full picture of our uncertainty by finding the complete posterior distribution $p(\theta|x, y)$ using Bayes' theorem:
$$ \overbrace{p(\theta|x, y)}^{\text{Posterior}}=\frac{p(y|x, \theta) p(\theta)}{p(y|x)} \propto p(y|x, \theta) p(\theta) $$
By working with the entire distribution of $\theta$ rather than a single point estimate, we retain the complete landscape of our uncertainty about the parameters.
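For low-dimensional $\theta$, the unnormalized posterior $p(y|x, \theta)\,p(\theta)$ can be evaluated directly on a grid. The sketch below uses the same illustrative assumptions as before (linear $f_\theta$, Gaussian likelihood with noise scale 0.3) plus $\mathcal{N}(0, 2)$ priors on each parameter; all of these are stand-in choices, not requirements of the framework:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data from an assumed linear model with Gaussian noise.
x = np.linspace(0.0, 1.0, 50)
true_theta = np.array([1.0, 2.0])
y = rng.normal(true_theta[0] + true_theta[1] * x, 0.3)

# Grid over the two parameters (intercept t0, slope t1).
t0 = np.linspace(-1.0, 3.0, 200)
t1 = np.linspace(0.0, 4.0, 200)
T0, T1 = np.meshgrid(t0, t1)

# log p(y | x, theta): Gaussian log-likelihood summed over data points.
Z = T0[..., None] + T1[..., None] * x                    # shape (200, 200, 50)
log_lik = stats.norm.logpdf(y, loc=Z, scale=0.3).sum(axis=-1)

# log p(theta): independent Normal(0, 2) priors (illustrative choice).
log_prior = stats.norm.logpdf(T0, 0.0, 2.0) + stats.norm.logpdf(T1, 0.0, 2.0)

# Unnormalized log posterior; normalize numerically over the grid.
log_post = log_lik + log_prior
post = np.exp(log_post - log_post.max())
post /= post.sum()
```

Note that the evidence $p(y|x)$ never has to be computed analytically: normalizing over the grid plays its role, which is exactly why the proportionality $\propto p(y|x,\theta)\,p(\theta)$ suffices.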
Once we have our posterior distribution $p(\theta|x, y)$, the goal is to leverage it to make predictions for new, unseen data. Given a new input $x'$, we want to infer the corresponding output $y'$.
In a purely Bayesian framework, we do not collapse our findings to predict a single point value for $y'$. Instead, we formulate the posterior predictive distribution, denoted as $p(y'|x', x, y)$. This represents our belief about a new observation $y'$, conditioned not just on the new input $x'$, but on everything we learned from our training data $(x, y)$.
To construct this predictive distribution, we must utilize the entire posterior landscape we mapped out earlier. Rather than discarding our uncertainty by picking one "best" set of parameters, we integrate (marginalize) over the entire parameter space of $\theta$:
$$ p(y'|x', x, y) = \int p(y'|x', \theta) p(\theta|x, y) d\theta = \mathbb{E}_{\theta\sim \text{Posterior}}[p(y'|x', \theta)] $$
This integral acts as an infinite, weighted ensemble of models. Conceptually, it works like this: for every plausible setting of $\theta$, we ask what prediction it implies for $y'$, weight that prediction by the posterior probability of that $\theta$ given the training data, and combine all of the weighted predictions.
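In practice this integral is rarely solved in closed form; the expectation form on the right suggests a Monte Carlo approximation instead. The sketch below assumes we already have posterior samples of $\theta$ (here faked as draws around illustrative values; in practice they would come from MCMC or a grid approximation) and keeps the same hypothetical linear model with Gaussian noise of scale 0.3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in posterior samples of theta = (intercept, slope). In a real
# workflow these would be drawn from p(theta | x, y), e.g. via MCMC.
theta_samples = rng.normal([1.0, 2.0], [0.1, 0.15], size=(4000, 2))

x_new = 0.5  # new input x'

# For each posterior sample, push x' through the model and draw
# y' ~ Normal(f_theta(x'), 0.3). The collection of y' values is an
# empirical approximation of the posterior predictive p(y' | x', x, y).
z_new = theta_samples[:, 0] + theta_samples[:, 1] * x_new
y_new = rng.normal(z_new, 0.3)

pred_mean = y_new.mean()
lo, hi = np.percentile(y_new, [2.5, 97.5])  # 95% predictive interval
```

Because each sample of $\theta$ is used in proportion to its posterior probability, the average over `y_new` is exactly the weighted ensemble described above, approximated with finitely many members.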
This formulation elegantly merges two distinct sources of uncertainty into our final prediction: