Model complexity and regularization are central concepts in machine learning that help ensure models generalize well to unseen data. Achieving the right balance between a model that is too simple and a model that is too complex is a key challenge in developing robust predictive systems. When a model is too simple, it tends to underfit; it cannot capture the underlying patterns of the data and ends up with high error both on the training set and on new, unseen data. When a model is too complex, it risks overfitting, essentially memorizing the noise or minor fluctuations in the training set at the expense of good performance on unseen data.

Model Complexity - Overfitting and Underfitting

In many scenarios, this trade-off in complexity can be illustrated by simple methods such as linear regression. Linear regression, in its basic form, attempts to find a linear relationship between one or more independent variables and a dependent variable.

A standard linear model might look like $y = \theta_0 + \theta_1 x$, where $\theta_0$ and $\theta_1$ are parameters. While this model can be effective for straightforward data relationships, real-world data often exhibits non-linear or more intricate patterns, making a simple linear model prone to underfitting.
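Fitting such a model amounts to solving a least-squares problem. A minimal sketch with NumPy, using hypothetical synthetic data where the true relationship really is linear ($\theta_0 = 2$, $\theta_1 = 3$):

```python
import numpy as np

# Hypothetical data: a linear trend plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.shape)

# Least-squares fit of y = theta0 + theta1 * x.
# np.polyfit returns coefficients from highest degree down.
theta1, theta0 = np.polyfit(x, y, deg=1)
```

On data like this the recovered parameters land close to the true values; the trouble described below begins when the underlying relationship is not linear at all.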


To encode these complexities, one might introduce higher-order terms and turn the basic linear regression into a polynomial regression (for instance, including $x^2, x^3, \dots$). This additional complexity can allow the model to better represent data with curvature or other non-linear trends. However, if the polynomial degree becomes too large relative to the amount or nature of the data, the model may start to fit random noise rather than genuine structure, which leads to overfitting.
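The effect is easy to demonstrate: on a noisy sample from a sine wave (a hypothetical dataset chosen for illustration), training error keeps falling as the polynomial degree grows, even after the extra flexibility is only fitting noise:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def train_mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, deg=degree)
    y_hat = np.polyval(coeffs, x)
    return np.mean((y - y_hat) ** 2)

# Training error shrinks monotonically with degree -- but the
# high-degree fit is chasing the noise, not the underlying sine wave.
errors = {d: train_mse(d) for d in (1, 3, 10)}
```

Because the models are nested, raising the degree can never increase training error, which is exactly why training error alone cannot detect overfitting.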

Contrasting Underfitting vs. Overfitting

To diagnose these problems effectively, one typically keeps track of the model’s performance on both training data and a separate validation set.

A common strategy is to start with a simpler model and incrementally increase its complexity, observing how the model’s error evolves.
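That strategy can be sketched directly: fit polynomials of increasing degree, record training and validation error for each, and pick the degree where validation error bottoms out. The data here is again a hypothetical noisy sine sample:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Hold out the last 20 points as a validation set.
x_train, x_val = x[:40], x[40:]
y_train, y_val = y[:40], y[40:]

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

train_err, val_err = {}, {}
for degree in range(1, 12):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    val_err[degree] = mse(coeffs, x_val, y_val)

# Training error keeps dropping with degree; validation error is
# U-shaped, and its minimum suggests a suitable complexity.
best_degree = min(val_err, key=val_err.get)
```

The divergence between the two curves, rather than either curve alone, is the diagnostic signal: falling training error with rising validation error indicates overfitting.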

Regularization

Regularization techniques help prevent overfitting by imposing constraints on the magnitude of the model’s parameters. The core idea is to add a penalty term to the model’s loss function during training, discouraging excessively large parameter values that can lead to overly complex decision boundaries. Several forms of regularization exist, but the two most popular are L1 regularization (also known as Lasso) and L2 regularization (also known as Ridge). L1 adds a penalty proportional to the absolute value of each parameter, promoting sparsity in the parameter vector by driving some coefficients to exactly zero. L2 adds a penalty proportional to the square of each parameter’s value, spreading the penalty more evenly across parameters and shrinking them toward zero without making them exactly zero.
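The qualitative difference between the two penalties shows up clearly in their shrinkage (proximal) operators, which is one way gradient-based solvers apply them. A minimal sketch, using an arbitrary made-up coefficient vector: L2 rescales every coefficient toward zero, while L1 subtracts a constant and clips small coefficients to exactly zero.

```python
import numpy as np

def l2_shrink(theta, lam):
    """Ridge-style shrinkage: every coefficient is scaled toward zero."""
    return theta / (1.0 + lam)

def l1_soft_threshold(theta, lam):
    """Lasso-style soft-thresholding: coefficients smaller than lam
    in magnitude become exactly zero, producing sparsity."""
    return np.sign(theta) * np.maximum(np.abs(theta) - lam, 0.0)

theta = np.array([3.0, -0.4, 0.05, 1.2])
ridge_out = l2_shrink(theta, lam=0.5)            # all entries shrunk, none zero
lasso_out = l1_soft_threshold(theta, lam=0.5)    # small entries driven to zero
```

With this penalty strength, the L1 operator zeroes out the two small coefficients while the L2 operator merely shrinks them, which is the sparsity-versus-shrinkage contrast described above.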

A more theoretical form of regularization is L0 regularization, which directly counts the number of non-zero parameters and penalizes that count. However, the non-differentiable nature of the “count” function makes L0 regularization difficult to optimize in practice, which is why L1 and L2 are commonly used as workable approximations.

Mathematically, if $J_{\text{mle}}(y, \hat{y}(\theta))$ is the typical loss function (like mean squared error in linear regression), L2 regularization modifies it to include a sum of squared parameters:

$$ J_{\text{map}}(y, \hat{y}(\theta)) = J_{\text{mle}}(y, \hat{y}(\theta)) + \lambda \sum_i \theta_i^2. $$

Here, $\lambda$ is a regularization coefficient that controls the strength of the penalty term. Higher values of $\lambda$ force the parameters to remain smaller in magnitude, reducing overfitting at the potential risk of underfitting if the penalty becomes too strong.
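For linear regression, this L2-penalized objective has the closed-form minimizer $\theta = (X^\top X + \lambda I)^{-1} X^\top y$, which makes the effect of $\lambda$ easy to verify numerically. A sketch on hypothetical random data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
theta_true = np.array([4.0, -2.0, 1.0, 0.0, 3.0])
y = X @ theta_true + rng.normal(scale=0.5, size=50)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||y - X @ theta||^2 + lam * ||theta||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Larger lambda shrinks the fitted parameter vector toward zero.
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in (0.0, 10.0, 1000.0)]
```

The norm of the fitted parameter vector decreases monotonically as $\lambda$ grows, matching the description above: strong enough penalties eventually shrink the model into underfitting.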

In practice, achieving a good fit involves carefully tuning model complexity and regularization strength. One can experiment with different model architectures, different degrees of polynomial regression, or different regularization coefficients. Additional techniques can help improve the balance between underfitting and overfitting. For instance, early stopping monitors model performance on a validation set during training and halts the process when validation metrics stop improving. Data augmentation artificially expands the training set with modified or synthetic samples, making it harder for the model to overfit. Feature engineering and proper data preprocessing can also ensure that the model receives the most relevant and clean representations of the input data.
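Early stopping, in particular, is simple to sketch. The loop below trains a linear model by gradient descent on hypothetical synthetic data, checks validation error after every step, and halts once it has gone a fixed number of steps (the "patience") without improvement, keeping the best parameters seen:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 10))
theta_true = rng.normal(size=10)
y = X @ theta_true + rng.normal(scale=0.5, size=80)
X_tr, X_val, y_tr, y_val = X[:60], X[60:], y[:60], y[60:]

theta = np.zeros(10)
lr, patience = 0.01, 5
best_val, wait = np.inf, 0
for step in range(2000):
    # Gradient of mean squared error on the training set.
    grad = 2 * X_tr.T @ (X_tr @ theta - y_tr) / len(y_tr)
    theta -= lr * grad
    val_mse = np.mean((X_val @ theta - y_val) ** 2)
    if val_mse < best_val - 1e-8:
        best_val, best_theta, wait = val_mse, theta.copy(), 0
    else:
        wait += 1
        if wait >= patience:  # validation stopped improving: halt early
            break
```

The patience threshold and improvement tolerance here are illustrative choices; in practice they are tuned alongside the learning rate.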

Takeaway