Up to now, our Bayesian story has been completely general: we define a clean, deterministic mechanism $z=f_\theta(x)$, then treat the real observation $y$ as noisy data sampled from some distribution $\mathcal{D}$ whose parameters are set by $z(x,\theta)$.
To make this tractable, we now choose a pairing where the prior and likelihood “fit together” algebraically. The result is one of the few supervised learning models where the full posterior is available in closed form: Bayesian Linear Regression.
We assume the clean output is linear in the parameters:
$$ z=f_\theta(x)=x^\top \theta $$
Then we model our real measurement as Gaussian noise around that clean mechanism:
$$ y \mid x,\theta \sim \mathcal{N}(z,\sigma^2)=\mathcal{N}(x^\top \theta,\sigma^2) $$
This is exactly the “clean physics + messy measurement” split, but now with a specific $\mathcal{D}$ (a Normal distribution).
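As a minimal sketch of this generative story, the following simulates data from the model; the dimensions, noise level, and `theta_true` are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and noise level (assumptions, not from the text).
d, N = 3, 100
theta_true = rng.normal(size=d)   # the fixed "clean mechanism" parameters
sigma = 0.5                       # measurement noise standard deviation

# Clean mechanism z = x^T theta, then noisy observation y ~ N(z, sigma^2).
X = rng.normal(size=(N, d))       # inputs stacked row-wise
z = X @ theta_true                # deterministic part
y = z + sigma * rng.normal(size=N)  # messy measurement around it
```

The key point is the two-stage structure: `z` is deterministic given `X` and `theta_true`; only the final observation `y` is random.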
Let the training set be $\{(x^{(i)},y^{(i)})\}_{i=1}^N$, with $x^{(i)}\in\mathbb{R}^d$ and $\theta\in\mathbb{R}^d$.
Stack the inputs row-wise into a design matrix $x\in\mathbb{R}^{N\times d}$ (reusing the symbol $x$ for the stacked matrix) and the outputs into a vector $y\in\mathbb{R}^N$. Then the full likelihood becomes:
$$ p(y\mid x,\theta)=\mathcal{N}(y\mid x\theta,\sigma^2 I) $$
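Because the noise covariance is $\sigma^2 I$, this multivariate Gaussian factors into $N$ independent terms, and its log-density reduces to a scaled residual sum of squares. A small sketch (function name and signature are my own):

```python
import numpy as np

def log_likelihood(theta, X, y, sigma):
    """log p(y | X, theta) = log N(y | X theta, sigma^2 I).

    With isotropic noise this is just
    -||y - X theta||^2 / (2 sigma^2) - (N/2) log(2 pi sigma^2).
    """
    N = y.shape[0]
    resid = y - X @ theta
    return -0.5 * (resid @ resid) / sigma**2 - 0.5 * N * np.log(2 * np.pi * sigma**2)
```

Maximizing this in $\theta$ is ordinary least squares; the Bayesian treatment instead keeps the whole function and combines it with the prior below.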
We encode our belief about plausible parameters using a Gaussian prior:
$$ \theta \sim \mathcal{N}(\mu_0,\Sigma_0) \quad\Longrightarrow\quad p(\theta)=\mathcal{N}(\theta\mid \mu_0,\Sigma_0) $$
This is the same “weights shouldn’t be wildly large” intuition that later shows up as L2-style shrinkage when we do MAP.
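To make the shrinkage connection concrete, take the common special case $\mu_0=0$ and $\Sigma_0=\tau^2 I$ (an assumption for illustration; the general prior above allows any mean and covariance). Then the MAP objective is

$$
\hat\theta_{\text{MAP}}
= \arg\max_\theta\,\bigl[\log p(y\mid x,\theta)+\log p(\theta)\bigr]
= \arg\min_\theta\,\Bigl[\tfrac{1}{2\sigma^2}\lVert y - x\theta\rVert^2 + \tfrac{1}{2\tau^2}\lVert\theta\rVert^2\Bigr],
$$

which is exactly ridge regression with penalty strength $\lambda=\sigma^2/\tau^2$: a tighter prior (smaller $\tau$) means stronger shrinkage.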
Bayes’ theorem tells us:
$$ p(\theta\mid x,y)\propto p(y\mid x,\theta)p(\theta) $$
Because both the likelihood and prior are Gaussian in $\theta$, the posterior is also Gaussian: