In this section, we will explore the concept of regression loss using Maximum Likelihood Estimation (MLE), a fundamental aspect of statistical modeling, particularly within the context of linear regression.

Understanding the Relationship Between Output and Input

Consider a scenario where you observe an output $y$ that is believed to be linearly related to an input $x$. This relationship can be modeled as:

$$ y = wx + \epsilon $$

Here, $y$ is the observed output, $x$ is the input, $w$ is the weight or coefficient that we aim to estimate, and $\epsilon$ represents the error term. The error term $\epsilon$ is crucial because it accounts for the discrepancies between our model's predictions and the actual observed values. It captures unmodeled effects or random noise: there may be features pertinent to predicting $y$ that we have omitted, or there may be inherent randomness in our observations.
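
To make the setup concrete, here is a minimal sketch (assuming NumPy) that generates synthetic data from this model. The true weight `w_true`, the noise scale `sigma`, and the sample size are arbitrary illustrative values, not anything prescribed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) values for the true weight and noise scale.
w_true, sigma = 2.0, 0.5
N = 100

x = rng.uniform(-3.0, 3.0, size=N)    # inputs
eps = rng.normal(0.0, sigma, size=N)  # unmodeled effects / random noise
y = w_true * x + eps                  # observed outputs
```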

If the error term aggregates many small, independent unmodeled effects, the Central Limit Theorem suggests that $\epsilon$ is approximately Gaussian. Consider the following illustration:

[Figure: observations scattered around a fitted line, with residuals shown as green dots]

In this figure, the green dots represent the residuals $\epsilon = y - \hat{y}$. If you ignore the $x$-axis and look at the residuals within any vertical slice, their distribution resembles a Gaussian.

Assumptions on Error Distribution

Now, let's assume that the error $\epsilon$ for each sample is independently and identically distributed (IID) according to a Gaussian distribution with mean zero and variance $\sigma^2$. Mathematically, this is represented as:

$$ \epsilon \sim N(0, \sigma^2) $$

The probability density function of $\epsilon$ is then:

$$ p(\epsilon) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{\epsilon}{\sigma}\right)^2} $$

Since $y = wx + \epsilon$ and $wx$ is deterministic once $x$ and $w$ are fixed, the distribution of $y$ conditioned on $x$ and $w$ is the distribution of $\epsilon$ shifted to have mean $wx$:

$$ p(y | x, w) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y - wx}{\sigma}\right)^2} $$
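
As a quick sketch of evaluating this conditional density (continuing the NumPy example above; the function name is just for illustration):

```python
def p_y_given_x(y, x, w, sigma):
    """Density of y given x and w: Gaussian with mean w*x and variance sigma**2."""
    return np.exp(-0.5 * ((y - w * x) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
```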

Likelihood Function in Linear Regression

The likelihood of observing the outcomes $y^{(i)}$ given inputs $x^{(i)}$ for $i = 1, \ldots, N$, assuming the samples are independent, is the product of the per-sample densities:

$$ L(w) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{y^{(i)} - wx^{(i)}}{\sigma}\right)^2} $$
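
Continuing the sketch, the likelihood is just the product of these per-sample densities. In floating point this product underflows quickly as $N$ grows, which is one practical reason, besides algebraic convenience, to work with its logarithm:

```python
def likelihood(w, x, y, sigma):
    # Product of per-sample densities p(y_i | x_i, w).
    # For large N this underflows toward 0.0 in floating point.
    return np.prod(p_y_given_x(y, x, w, sigma))
```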

By taking the logarithm of $L(w)$, we simplify the product into a summation, yielding the log-likelihood:

$$ l(w) = N \ln \frac{1}{\sigma\sqrt{2\pi}} - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y^{(i)} - wx^{(i)})^2 $$
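
The first term does not depend on $w$, so maximizing $l(w)$ amounts to minimizing the sum of squared residuals. The sketch below (still using the synthetic `x`, `y`, and `sigma` from above) checks this numerically by comparing a grid-search MLE against the closed-form least-squares estimate for a no-intercept model:

```python
def log_likelihood(w, x, y, sigma):
    N = len(y)
    return (N * np.log(1.0 / (sigma * np.sqrt(2.0 * np.pi)))
            - np.sum((y - w * x) ** 2) / (2.0 * sigma ** 2))

# Grid-search the MLE and compare with the closed-form least-squares
# solution for a no-intercept model: w = sum(x*y) / sum(x*x).
w_grid = np.linspace(0.0, 4.0, 401)
w_mle = w_grid[np.argmax([log_likelihood(w, x, y, sigma) for w in w_grid])]
w_ls = np.sum(x * y) / np.sum(x * x)
print(w_mle, w_ls)  # the two estimates agree up to the grid resolution
```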