Up to now, our Bayesian story has been completely general: we define a clean, deterministic mechanism $z=f_\theta(x)$, then treat the real observation $y$ as noisy data sampled from some distribution $\mathcal{D}$ whose parameters are set by $z(x,\theta)$.

To make this tractable, we now choose a conjugate pairing: one where the prior and likelihood “fit together” algebraically. The result is one of the few supervised learning models where the full posterior is available in closed form: Bayesian Linear Regression.

Model: a linear “clean mechanism” with Gaussian measurement noise

We assume the clean output is linear in the parameters:

$$ z=f_\theta(x)=x^\top \theta $$

Then we model our real measurement as Gaussian noise around that clean mechanism:

$$ y \mid x,\theta \sim \mathcal{N}(z,\sigma^2)=\mathcal{N}(x^\top \theta,\sigma^2) $$

This is exactly the “clean physics + messy measurement” split, but now with a specific $\mathcal{D}$ (a Normal distribution).
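A minimal simulation of this generative story makes the split concrete. Everything numeric here (the true weights, the noise scale, the input) is an illustrative assumption, not part of the model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: true weights and noise scale are made up.
theta_true = np.array([1.5, -0.8])   # the "clean mechanism" weights
sigma = 0.3                          # measurement-noise standard deviation

# One input vector x in R^2; the clean output z is deterministic.
x = np.array([2.0, 1.0])
z = x @ theta_true                   # z = x^T theta

# The real measurement is Gaussian noise around the clean output.
y = z + sigma * rng.normal()

print(z)   # clean mechanism output: 2.2
print(y)   # noisy observation, close to z
```

The deterministic part ends at `z`; only the last line introduces randomness, which is exactly where $\mathcal{D}$ enters.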

Dataset notation and matrix form

Let the training set be $\{(x^{(i)},y^{(i)})\}_{i=1}^N$, with $x^{(i)}\in\mathbb{R}^d$ and $\theta\in\mathbb{R}^d$.

Stack the inputs row-wise into a design matrix $x\in\mathbb{R}^{N\times d}$ (row $i$ is $x^{(i)\top}$) and the outputs into a vector $y\in\mathbb{R}^N$. Then the full likelihood becomes:

$$ p(y\mid x,\theta)=\mathcal{N}(y\mid x\theta,\sigma^2 I) $$
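This multivariate Gaussian with diagonal covariance $\sigma^2 I$ is just the product of $N$ independent scalar Gaussians, one per data point. A quick sketch can verify that the two forms of the log-likelihood agree numerically (sizes and parameter values are illustrative assumptions; the design matrix is written `X` in code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sizes and parameters (assumptions for this sketch).
N, d = 5, 2
X = rng.normal(size=(N, d))          # design matrix, one input per row
theta = np.array([1.5, -0.8])
sigma = 0.3
y = X @ theta + sigma * rng.normal(size=N)

def log_normal(val, mean, var):
    """Log density of a scalar Gaussian N(mean, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (val - mean) ** 2 / (2 * var)

# Per-point log-likelihoods, summed (the noise is independent per point).
ll_pointwise = sum(log_normal(y[i], X[i] @ theta, sigma**2) for i in range(N))

# Matrix form: log N(y | X theta, sigma^2 I).
resid = y - X @ theta
ll_matrix = -0.5 * N * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

print(np.isclose(ll_pointwise, ll_matrix))  # True: the two forms agree
```

The agreement is exact up to floating-point error, because stacking independent Gaussians is precisely what the covariance $\sigma^2 I$ encodes.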

Prior: Gaussian belief over weights

We encode our belief about plausible parameters using a Gaussian prior:

$$ \theta \sim \mathcal{N}(\mu_0,\Sigma_0) \quad\Longrightarrow\quad p(\theta)=\mathcal{N}(\theta\mid \mu_0,\Sigma_0) $$

This is the same “weights shouldn’t be wildly large” intuition that later shows up as L2-style shrinkage when we do MAP.
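To see that connection concretely, take the common special case $\mu_0 = 0$ and $\Sigma_0 = \tau^2 I$. The negative log posterior (dropping $\theta$-independent constants) is then

$$ -\log p(\theta\mid x,y) = \frac{1}{2\sigma^2}\,\lVert y - x\theta\rVert^2 + \frac{1}{2\tau^2}\,\lVert\theta\rVert^2 + \text{const}, $$

so the MAP estimate minimizes squared error plus an L2 penalty with weight $\lambda = \sigma^2/\tau^2$: exactly ridge regression. A tighter prior (smaller $\tau$) means stronger shrinkage.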

Posterior: Gaussian × Gaussian ⇒ Gaussian

Bayes’ theorem tells us:

$$ p(\theta\mid x,y)\propto p(y\mid x,\theta)p(\theta) $$

Because both the likelihood and prior are Gaussian in $\theta$, the posterior is also Gaussian: