Now that we understand how MAP balances data with prior beliefs, let's see how this works in practice when training a classification or regression model.
In machine learning, our parameters are the model weights, denoted as $\theta$. What is a reasonable "prior belief" to have about model weights before we see any data? Generally, we believe that weights should not be wildly large, because massive weights usually mean the model is memorizing noise (overfitting).
We can represent this belief mathematically by assuming our weights are normally distributed around zero. This is our Gaussian prior: $\theta \sim \mathcal{N}(0, \sigma^2)$.
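To make the prior concrete, here is a minimal sketch (assuming an isotropic prior where each weight is drawn independently, with a hypothetical default of $\sigma = 1$) that evaluates the log-density of a weight vector under $\mathcal{N}(0, \sigma^2)$:

```python
import numpy as np

def log_gaussian_prior(theta, sigma=1.0):
    """Log-density of an isotropic Gaussian prior N(0, sigma^2) over the weights.

    Each weight is treated as independent, so the log-densities add:
    log p(theta) = -d * log(sigma * sqrt(2*pi)) - ||theta||^2 / (2 * sigma^2).
    """
    theta = np.asarray(theta, dtype=float)
    d = theta.size
    return -d * np.log(sigma * np.sqrt(2 * np.pi)) - np.dot(theta, theta) / (2 * sigma**2)

# The prior considers small weights more probable than large ones:
print(log_gaussian_prior([0.1, -0.2]) > log_gaussian_prior([5.0, -4.0]))  # True
```

Notice how the log-prior penalizes the squared norm of the weights: doubling a weight quadruples its contribution to the penalty.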
Let's plug this prior into our MAP framework. We want to find the model parameters $\theta$ that maximize the posterior probability:
$$ \theta_{\text{MAP}} = \arg\max_\theta {p(y \mid \theta, x) p(\theta)} $$
In machine learning, it is standard practice to minimize a loss function rather than maximize a probability. We can flip this maximization problem into a minimization problem by taking the negative natural logarithm ($-\log$). Because logarithms turn multiplication into addition, our objective becomes:
$$ \theta_{\text{MAP}} = \arg\min_\theta \left[-\log p(y \mid \theta, x) - \log p(\theta)\right] $$
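Because $-\log$ is strictly decreasing, the maximizer of the posterior and the minimizer of its negative log are the same point. A quick numerical sanity check on a toy 1-D grid (both densities here are hypothetical stand-ins, defined only up to constants):

```python
import numpy as np

# Grid of candidate parameter values.
thetas = np.linspace(-3.0, 3.0, 601)

# Stand-in densities: a "likelihood" peaked at theta = 1 and an N(0, 1) prior,
# each up to a normalizing constant (constants don't move the argmax/argmin).
likelihood = np.exp(-0.5 * (thetas - 1.0) ** 2)  # stand-in for p(y | theta, x)
prior = np.exp(-0.5 * thetas ** 2)               # stand-in for p(theta)

posterior = likelihood * prior
neg_log = -np.log(likelihood) - np.log(prior)

# Maximizing the product and minimizing the negative-log sum agree:
print(thetas[np.argmax(posterior)] == thetas[np.argmin(neg_log)])  # True
```

Here both criteria pick $\theta = 0.5$, halfway between the likelihood's peak at $1$ and the prior's peak at $0$, which is exactly the data-versus-prior compromise MAP is designed to make.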
Let's look closer at that second term. The formula for our Gaussian (Normal) prior distribution is:
$$ \log p(\theta) = \log\left(\frac{1}{(\sigma\sqrt{2\pi})^{d}} \exp\left(-\frac{||\theta||^{2}}{2\sigma^{2}}\right)\right) $$

(Here $d$ is the number of weights: each weight gets its own independent Gaussian factor, and multiplying those factors together collects the $d$ normalizing constants into $(\sigma\sqrt{2\pi})^{d}$ and the exponents into $||\theta||^{2}$.)

Using standard logarithm rules, we can break this into two separate pieces:

$$ \log p(\theta) = \log\left(\frac{1}{(\sigma\sqrt{2\pi})^{d}}\right) - \frac{||\theta||^{2}}{2\sigma^{2}} $$

Notice that the first piece, $\log\left(\frac{1}{(\sigma\sqrt{2\pi})^{d}}\right)$, is just a constant. It contains no $\theta$. Since adding or subtracting a constant doesn't change where the minimum of a function is located, we can safely ignore it during optimization.
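If the "constants don't move the minimum" step feels hand-wavy, a two-line numerical check makes it concrete (the objective below is an arbitrary stand-in curve, and the constant plays the role of the dropped log-term for $\sigma = 1$):

```python
import numpy as np

# A theta-free constant shifts the whole curve up or down,
# but the location of the minimum stays put.
theta = np.linspace(-2.0, 2.0, 401)
objective = (theta - 0.7) ** 2                         # stand-in loss, minimized at 0.7
constant = -np.log(1.0 / np.sqrt(2 * np.pi))           # the dropped log-term for sigma = 1

print(np.argmin(objective) == np.argmin(objective + constant))  # True
```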
When we drop the constant and substitute our simplified prior back into the main objective, we get:
$$ \theta_{\text{MAP}} = \arg\min_\theta \left(J_{\text{NLL}}(\theta) + \frac{||\theta||^{2}}{2\sigma^{2}}\right) $$
(Note: $J_{\text{NLL}}(\theta) = -\log p(y \mid \theta, x)$ is simply shorthand for our standard negative log-likelihood (NLL) loss.)
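To see the final objective in action, here is a small sketch on toy linear-regression data (the dataset, noise level, learning rate, and the choice $\sigma^2 = 1$ are all hypothetical). For Gaussian observation noise with unit variance, $J_{\text{NLL}}(\theta)$ is $\frac{1}{2}||X\theta - y||^2$ up to constants, so the MAP objective becomes the familiar ridge-regression problem, and gradient descent on it should recover the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: y = X @ w_true + Gaussian noise.
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

sigma2 = 1.0  # prior variance (hypothetical choice)

def map_objective(theta):
    """NLL of unit-variance Gaussian noise (up to constants) plus the prior penalty."""
    nll = 0.5 * np.sum((X @ theta - y) ** 2)
    penalty = np.dot(theta, theta) / (2 * sigma2)
    return nll + penalty

# Gradient descent on the MAP objective:
#   grad = X^T (X theta - y)  +  theta / sigma2
theta = np.zeros(3)
for _ in range(500):
    grad = X.T @ (X @ theta - y) + theta / sigma2
    theta -= 0.01 * grad

# The same problem has a closed form: (X^T X + I / sigma2)^{-1} X^T y.
theta_ridge = np.linalg.solve(X.T @ X + np.eye(3) / sigma2, X.T @ y)
print(np.allclose(theta, theta_ridge, atol=1e-4))  # True
```

The penalty term $\frac{||\theta||^2}{2\sigma^2}$ is exactly the quadratic weight penalty the closed-form ridge solution encodes: a smaller prior variance $\sigma^2$ means a stronger pull of the weights toward zero.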