Now that we understand how MAP balances data with prior beliefs, let's see how this works in practice when training a classification or regression model.
In machine learning, our parameters are the model weights, denoted as $\theta$. What is a reasonable "prior belief" to have about model weights before we see any data? Generally, we believe that weights should not be wildly large, because massive weights usually mean the model is memorizing noise (overfitting).
We can represent this belief mathematically by assuming our weights are normally distributed around zero. This is our Gaussian prior: $\theta \sim \mathcal{N}(0, \sigma^2)$.
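To make the prior concrete, here is a minimal sketch (assuming an isotropic prior where each weight is drawn independently, with a hypothetical default of $\sigma = 1$) that evaluates the log-density of a weight vector under $\mathcal{N}(0, \sigma^2)$:

```python
import numpy as np

def log_gaussian_prior(theta, sigma=1.0):
    """Log-density of an isotropic Gaussian prior N(0, sigma^2) over the weights.

    Each weight is treated as independent, so the log-densities add:
    log p(theta) = -d * log(sigma * sqrt(2*pi)) - ||theta||^2 / (2 * sigma^2).
    """
    theta = np.asarray(theta, dtype=float)
    d = theta.size
    return -d * np.log(sigma * np.sqrt(2 * np.pi)) - np.dot(theta, theta) / (2 * sigma**2)

# The prior considers small weights more probable than large ones:
print(log_gaussian_prior([0.1, -0.2]) > log_gaussian_prior([5.0, -4.0]))  # True
```

Notice how the log-prior penalizes the squared norm of the weights: doubling a weight quadruples its contribution to the penalty.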
Let's plug this prior into our MAP framework. We want to find the model parameters $\theta$ that maximize the posterior probability:
$$ \theta_{\text{MAP}} = \arg\max_\theta {p(y \mid \theta, x) p(\theta)} $$
In machine learning, it is standard practice to minimize a loss function rather than maximize a probability. We can flip this maximization problem into a minimization problem by taking the negative natural logarithm ($-\log$). Because logarithms turn multiplication into addition, our objective becomes:
$$ \theta_{\text{MAP}} = \arg\min_\theta \left[-\log p(y \mid \theta, x) - \log p(\theta)\right] $$
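Because $-\log$ is strictly decreasing, the maximizer of the posterior and the minimizer of its negative log are the same point. A quick numerical sanity check on a toy 1-D grid (both densities here are hypothetical stand-ins, defined only up to constants):

```python
import numpy as np

# Grid of candidate parameter values.
thetas = np.linspace(-3.0, 3.0, 601)

# Stand-in densities: a "likelihood" peaked at theta = 1 and an N(0, 1) prior,
# each up to a normalizing constant (constants don't move the argmax/argmin).
likelihood = np.exp(-0.5 * (thetas - 1.0) ** 2)  # stand-in for p(y | theta, x)
prior = np.exp(-0.5 * thetas ** 2)               # stand-in for p(theta)

posterior = likelihood * prior
neg_log = -np.log(likelihood) - np.log(prior)

# Maximizing the product and minimizing the negative-log sum agree:
print(thetas[np.argmax(posterior)] == thetas[np.argmin(neg_log)])  # True
```

Here both criteria pick $\theta = 0.5$, halfway between the likelihood's peak at $1$ and the prior's peak at $0$, which is exactly the data-versus-prior compromise MAP is designed to make.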
Let's look closer at that second term. The formula for our Gaussian (Normal) prior distribution is:
$$ \log p(\theta) = \log\left(\frac{1}{(\sigma\sqrt{2\pi})^{d}} \exp\left(-\frac{||\theta||^{2}}{2\sigma^{2}}\right)\right) $$

(Here $d$ is the number of weights: each weight gets its own independent Gaussian factor, and multiplying those factors together collects the $d$ normalizing constants into $(\sigma\sqrt{2\pi})^{d}$ and the exponents into $||\theta||^{2}$.)

Using standard logarithm rules, we can break this into two separate pieces:

$$ \log p(\theta) = \log\left(\frac{1}{(\sigma\sqrt{2\pi})^{d}}\right) - \frac{||\theta||^{2}}{2\sigma^{2}} $$

Notice that the first piece, $\log\left(\frac{1}{(\sigma\sqrt{2\pi})^{d}}\right)$, is just a constant. It contains no $\theta$. Since adding or subtracting a constant doesn't change where the minimum of a function is located, we can safely ignore it during optimization.
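If the "constants don't move the minimum" step feels hand-wavy, a two-line numerical check makes it concrete (the objective below is an arbitrary stand-in curve, and the constant plays the role of the dropped log-term for $\sigma = 1$):

```python
import numpy as np

# A theta-free constant shifts the whole curve up or down,
# but the location of the minimum stays put.
theta = np.linspace(-2.0, 2.0, 401)
objective = (theta - 0.7) ** 2                         # stand-in loss, minimized at 0.7
constant = -np.log(1.0 / np.sqrt(2 * np.pi))           # the dropped log-term for sigma = 1

print(np.argmin(objective) == np.argmin(objective + constant))  # True
```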
When we drop the constant and substitute our simplified prior back into the main objective, we get:
$$ \theta_{\text{MAP}} = \arg\min_\theta \left(J_{\text{NLL}}(\theta) + \frac{||\theta||^{2}}{2\sigma^{2}}\right) $$
(Note: $J_{\text{NLL}}(\theta) = -\log p(y \mid \theta, x)$ is simply shorthand for our standard negative log-likelihood (NLL) loss.)
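To see the final objective in action, here is a small sketch on toy linear-regression data (the dataset, noise level, learning rate, and the choice $\sigma^2 = 1$ are all hypothetical). For Gaussian observation noise with unit variance, $J_{\text{NLL}}(\theta)$ is $\frac{1}{2}||X\theta - y||^2$ up to constants, so the MAP objective becomes the familiar ridge-regression problem, and gradient descent on it should recover the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression: y = X @ w_true + Gaussian noise.
X = rng.normal(size=(50, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

sigma2 = 1.0  # prior variance (hypothetical choice)

def map_objective(theta):
    """NLL of unit-variance Gaussian noise (up to constants) plus the prior penalty."""
    nll = 0.5 * np.sum((X @ theta - y) ** 2)
    penalty = np.dot(theta, theta) / (2 * sigma2)
    return nll + penalty

# Gradient descent on the MAP objective:
#   grad = X^T (X theta - y)  +  theta / sigma2
theta = np.zeros(3)
for _ in range(500):
    grad = X.T @ (X @ theta - y) + theta / sigma2
    theta -= 0.01 * grad

# The same problem has a closed form: (X^T X + I / sigma2)^{-1} X^T y.
theta_ridge = np.linalg.solve(X.T @ X + np.eye(3) / sigma2, X.T @ y)
print(np.allclose(theta, theta_ridge, atol=1e-4))  # True
```

The penalty term $\frac{||\theta||^2}{2\sigma^2}$ is exactly the quadratic weight penalty the closed-form ridge solution encodes: a smaller prior variance $\sigma^2$ means a stronger pull of the weights toward zero.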