In this section, we will delve into the Bayesian estimation framework and explore regularization techniques, particularly focusing on their application in linear regression models. We'll discuss the implications of choosing different model complexities and how regularization can help prevent overfitting, using the maximum likelihood estimation (MLE) approach as our foundation.

Model Complexity in Linear Regression

Linear regression is a fundamental statistical method used to model the relationship between a dependent variable and one or more independent variables. The choice of model complexity, such as deciding between a simple linear model $y = \theta_0 + \theta_1x$ and a more complex polynomial model $y = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3$, is crucial in preventing overfitting and ensuring good generalization.

Consider an example where we fit polynomial models of different orders to a dataset:

(Figure: polynomial models of different orders fit to the same training data.)

Fitting a high-order polynomial, like a 5th-order one, may yield a model that fits the training data very well but performs poorly on unseen data. This overfitting issue is where regularization techniques come into play.
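A minimal sketch of such an experiment is shown below. The synthetic dataset (a noisy linear relationship) and the specific polynomial orders are assumptions chosen only to make the effect visible; they are not details from the example above.

```python
# A minimal sketch of the overfitting experiment described above. The dataset,
# noise level, and polynomial orders are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# A small noisy sample from an underlying linear relationship y = 1 + 2x.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 1.0 + 2.0 * x_train + rng.normal(scale=0.2, size=x_train.shape)

# A dense, noise-free grid used to measure generalization error.
x_test = np.linspace(0.0, 1.0, 200)
y_test = 1.0 + 2.0 * x_test

for order in (1, 3, 5):
    # Least-squares polynomial fit of the given order.
    coeffs = np.polyfit(x_train, y_train, deg=order)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"order {order}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Typically the higher-order fit achieves a lower training error while its test error grows, which is exactly the overfitting pattern the regularization techniques below are designed to control.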

Understanding Regularization

Regularization is a technique for reducing model complexity and preventing overfitting by adding a penalty term to the loss function during training. This penalty discourages large coefficients and helps the model generalize better to new, unseen data. The two most common forms are L1 regularization, which penalizes the sum of the absolute values of the coefficients, and L2 regularization, which penalizes the sum of their squares; both can be viewed as relaxations of the more direct, but intractable, L0 penalty described next.

L0 regularization aims to make the parameter vector $\theta$ as sparse as possible; its regularized objective is:

$$ Q(\theta) = J(y, \hat{y}(\theta)) + \lambda \, \|\theta\|_0 $$

where $\|\theta\|_0$ denotes the number of non-zero entries of $\theta$.

However, the L0 penalty is non-differentiable, and minimizing it amounts to a combinatorial search over subsets of parameters, which is why L1 and L2 regularization are typically used in practice as tractable approximations of L0 regularization.
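To see the sense in which L1 acts as a sparsity-inducing surrogate while L2 merely shrinks coefficients, the sketch below fits both penalties to a synthetic problem in which only three of twenty coefficients are truly non-zero. The data, penalty strengths, and the use of scikit-learn's `Lasso` and `Ridge` estimators are all illustrative assumptions rather than details from the text.

```python
# Sketch: L1 (lasso) zeroes out most irrelevant coefficients, L2 (ridge) only
# shrinks them. Synthetic data and penalty strengths are illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

X = rng.normal(size=(100, 20))
theta_true = np.zeros(20)
theta_true[:3] = [2.0, -1.5, 0.5]              # only 3 truly non-zero weights
y = X @ theta_true + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)             # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)             # L2 penalty

print("non-zero coefficients (lasso):", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
print("non-zero coefficients (ridge):", int(np.sum(np.abs(ridge.coef_) > 1e-8)))
```

The lasso typically reports a coefficient count close to the true sparsity, while the ridge solution keeps all coefficients non-zero, only smaller.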

For L2 regularization, the modified loss function becomes:

$$ J_{\text{map}}(y, \hat{y}(\theta)) = J_{\text{mle}}(y, \hat{y}(\theta)) + \lambda \sum_i \theta_i^2 = J_{\text{mle}}(y, \hat{y}(\theta)) + \lambda ||\theta||^2 $$
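For a linear model, this L2-regularized objective can be minimized in closed form: assuming $J_{\text{mle}}$ is the usual sum of squared errors, adding $\lambda \|\theta\|^2$ changes the normal equations from $X^\top X \, \theta = X^\top y$ to $(X^\top X + \lambda I) \, \theta = X^\top y$. A minimal NumPy sketch, with made-up data for illustration:

```python
# Ridge regression via the regularized normal equations; the data and the
# lambda values are made up for illustration.
import numpy as np

def ridge_fit(X, y, lam):
    """Return argmin_theta ||y - X @ theta||^2 + lam * ||theta||^2."""
    n_features = X.shape[1]
    # The penalty adds lam to the diagonal of X^T X before solving.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=50)

print(ridge_fit(X, y, lam=0.0))    # ordinary least squares (the MLE solution)
print(ridge_fit(X, y, lam=10.0))   # coefficients shrunk toward zero
```

Setting $\lambda = 0$ recovers ordinary least squares; increasing $\lambda$ shrinks the coefficients toward zero, trading a small increase in bias for a reduction in variance.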

Maximum A Posteriori Estimation

In the Bayesian estimation framework, maximum a posteriori (MAP) estimation plays a central role: instead of maximizing the likelihood alone, as in MLE, we also weight parameters by their prior probability. The MAP estimate is defined as:

$$ \theta_{\text{MAP}} = \arg\max_\theta \ln\big( p(y \mid x, \theta) \, p(\theta) \big) $$

This says that we seek the parameter $\theta$ that maximizes the log of the likelihood of the observed outputs $y$ (given the inputs $x$ and the parameters $\theta$) multiplied by the prior probability $p(\theta)$; by Bayes' rule, this product is proportional to the posterior $p(\theta \mid x, y)$. In optimization it is usually more convenient to minimize the negative of this quantity, which gives the equivalent formulation:

$$ \begin{aligned} \theta_{\text{MAP}} &= \arg\min_\theta \Big( -\ln\big( p(y \mid x, \theta) \, p(\theta) \big) \Big) \\ &= \arg\min_\theta \Big( -\big( \ln p(y \mid x, \theta) + \ln p(\theta) \big) \Big) \\ &= \arg\min_\theta \Big( -\ln p(y \mid x, \theta) - \ln p(\theta) \Big) \end{aligned} $$
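To connect this back to the L2-regularized loss $J_{\text{map}}$ introduced earlier, consider two common modeling choices that are assumed here rather than stated above: a Gaussian likelihood with noise variance $\sigma^2$, so that $-\ln p(y \mid x, \theta)$ is the sum of squared errors up to an additive constant, and a zero-mean Gaussian prior on each $\theta_i$ with variance $\tau^2$. Dropping the constants, which do not affect the minimizer, and taking $J_{\text{mle}}$ to be the sum of squared errors:

$$ \begin{aligned} \theta_{\text{MAP}} &= \arg\min_\theta \Big( \frac{1}{2\sigma^2} \sum_i \big( y_i - \hat{y}_i(\theta) \big)^2 + \frac{1}{2\tau^2} \sum_i \theta_i^2 \Big) \\ &= \arg\min_\theta \Big( \sum_i \big( y_i - \hat{y}_i(\theta) \big)^2 + \frac{\sigma^2}{\tau^2} \sum_i \theta_i^2 \Big) \\ &= \arg\min_\theta \Big( J_{\text{mle}}\big(y, \hat{y}(\theta)\big) + \lambda \, \|\theta\|^2 \Big), \qquad \lambda = \frac{\sigma^2}{\tau^2} \end{aligned} $$

In other words, under these assumptions MAP estimation with a Gaussian prior recovers exactly the L2-regularized objective, with the regularization strength $\lambda$ determined by how concentrated the prior is relative to the observation noise.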