As previously assumed, the model's output, $\hat{y}^{(i)}$, represents the parameters of a distribution that models the ground truth label, $y^{(i)}$.
Now, let us apply this framework to gain insight into regression.
Using linear regression as an example, the predicted value $\hat{y}^{(i)}$ for each instance $i$ is expressed as:
$$ \hat{y}^{(i)} = \mathbf{w}\cdot\mathbf{x}^{(i)}+b. $$
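As a concrete illustration, here is a minimal NumPy sketch of this prediction step; the weights `w`, bias `b`, and inputs `X` are made-up values chosen only for the example:

```python
import numpy as np

# Illustrative parameters and inputs (not from the text).
w = np.array([2.0, -1.0])            # weight vector
b = 0.5                              # bias
X = np.array([[1.0, 3.0],
              [0.0, 2.0],
              [4.0, 1.0]])           # one row per instance x^(i)

# y_hat^(i) = w . x^(i) + b, computed for all instances at once
y_hat = X @ w + b
print(y_hat)                         # [-0.5 -1.5  7.5]
```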
The predicted value $\hat{y}^{(i)}$ is assumed to be the mean of a Gaussian distribution, and the observed label $y^{(i)}$ is assumed to follow this distribution:
$$ y^{(i)} \sim \mathcal{N}(\hat{y}^{(i)}, \sigma^2) $$
In regression analysis, it is commonly assumed that the variance $\sigma^2$ remains constant across all instances.
The probability density function for each observed label $y^{(i)}$ is then determined by the mean $\hat{y}^{(i)}$ and the standard deviation $\sigma$:
$$ p\left(y^{(i)}|\mu=\hat{y}^{(i)},\sigma\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right). $$
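This density is easy to sanity-check numerically. The sketch below, with illustrative values for $y^{(i)}$, $\hat{y}^{(i)}$, and $\sigma$, evaluates the formula directly and compares it against SciPy's Gaussian density:

```python
import numpy as np
from scipy.stats import norm

y, y_hat, sigma = 1.3, 1.0, 0.5      # illustrative label, prediction, and std. dev.

# Density from the formula above
p_manual = 1.0 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((y - y_hat) / sigma) ** 2)

# Same density from SciPy's implementation
p_scipy = norm.pdf(y, loc=y_hat, scale=sigma)

assert np.isclose(p_manual, p_scipy)
print(p_manual)                      # ≈ 0.666
```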
Under our model, and assuming the instances are independent, maximizing the likelihood of observing the labels $y^{(i)}$ amounts to solving:
$$ \argmax_{\theta}\prod_{i=1}^{N} \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right). $$
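Numerically, this likelihood is just the per-instance densities multiplied together. A small sketch with illustrative labels and fixed predictions:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
y     = np.array([1.3, 0.2, -0.7])   # observed labels (illustrative)
y_hat = np.array([1.0, 0.0, -1.0])   # predictions f_theta(x^(i)) (illustrative)

# Likelihood: product of per-instance Gaussian densities
likelihood = np.prod(norm.pdf(y, loc=y_hat, scale=sigma))
print(likelihood)
```

For large $N$ this product underflows to zero in floating point, which is one practical reason the derivation continues with its logarithm.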
Here $\hat{y}^{(i)}=f_\theta(\mathbf{x}^{(i)})$. Following the NLL framework, we take the negative logarithm of this likelihood and minimize it, which gives the optimization objective:
$$ \argmin_{\theta} -\sum_{i=1}^{N} \log\left[\frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right)\right]. $$
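A quick numerical check, reusing the illustrative values above, confirms that this objective is exactly the negative logarithm of the likelihood product:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
y     = np.array([1.3, 0.2, -0.7])   # illustrative labels
y_hat = np.array([1.0, 0.0, -1.0])   # illustrative predictions

# Negative log-likelihood: sum of per-instance negative log-densities
nll = -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))

# Equals the negative log of the likelihood product
likelihood = np.prod(norm.pdf(y, loc=y_hat, scale=sigma))
assert np.isclose(nll, -np.log(likelihood))
print(nll)
```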
Using the property $\log(ab) = \log a + \log b$, the logarithm of the product inside the square brackets decomposes into two separate terms:
$$ \argmin_{\theta} -\sum_{i=1}^{N} \left[\log\left(\frac{1}{\sigma \sqrt{2 \pi}}\right) + \log \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right)\right]. $$
We can safely ignore the constant term $\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right)$, since summing a constant merely shifts the objective vertically without changing the location of its minimum. This simplification results in:
$$ \argmin_{\theta} -\sum_{i=1}^{N} \log \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right). $$
Because $\log$ and $\exp$ cancel each other, this further reduces to:
$$ \argmin_{\theta} \color{#ff0000}- \color{#000000}\sum_{i=1}^{N} \left[\color{#ff0000}-\color{#000000}\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right] . $$
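As a final sanity check, the sketch below (same illustrative values) verifies that this reduced expression equals $\frac{1}{2\sigma^2}\sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$, and that the full NLL differs from it only by the constant term we dropped:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
y     = np.array([1.3, 0.2, -0.7])   # illustrative labels
y_hat = np.array([1.0, 0.0, -1.0])   # illustrative predictions
N = len(y)

# The two minus signs cancel, leaving a scaled sum of squared errors
reduced    = -np.sum(-0.5 * ((y - y_hat) / sigma) ** 2)
scaled_sse = np.sum((y - y_hat) ** 2) / (2 * sigma ** 2)
assert np.isclose(reduced, scaled_sse)

# The full NLL equals this expression plus the dropped constant
nll      = -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))
constant = N * np.log(sigma * np.sqrt(2 * np.pi))
assert np.isclose(nll, scaled_sse + constant)
```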