As previously assumed, the model's output, $\hat{y}^{(i)}$, represents the parameters of a distribution that models the ground truth label, $y^{(i)}$.
Now, let us apply this framework to gain insight into regression.
Using linear regression as an example, the predicted value $\hat{y}^{(i)}$ for each instance $i$ is expressed as:
$$ \hat{y}^{(i)} = \mathbf{w}\cdot\mathbf{x}^{(i)}+b. $$
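As a concrete illustration, here is a minimal NumPy sketch of this prediction step; the weights `w`, bias `b`, and inputs `X` are made-up values chosen only for the example:

```python
import numpy as np

# Illustrative parameters and inputs (not from the text).
w = np.array([2.0, -1.0])            # weight vector
b = 0.5                              # bias
X = np.array([[1.0, 3.0],
              [0.0, 2.0],
              [4.0, 1.0]])           # one row per instance x^(i)

# y_hat^(i) = w . x^(i) + b, computed for all instances at once
y_hat = X @ w + b
print(y_hat)                         # [-0.5 -1.5  7.5]
```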
The predicted value $\hat{y}^{(i)}$ is assumed to be the mean of a Gaussian distribution, and the observed label $y^{(i)}$ is assumed to follow this distribution:
$$ y^{(i)} \sim \mathcal{N}(\hat{y}^{(i)}, \sigma^2) $$
In regression analysis, it is commonly assumed that the variance $\sigma^2$ remains constant across all instances.
The probability density function for each observed label $y^{(i)}$ is then determined by the mean $\hat{y}^{(i)}$ and the standard deviation $\sigma$:
$$ p\left(y^{(i)}|\mu=\hat{y}^{(i)},\sigma\right) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right). $$
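This density is easy to sanity-check numerically. The sketch below, with illustrative values for $y^{(i)}$, $\hat{y}^{(i)}$, and $\sigma$, evaluates the formula directly and compares it against SciPy's Gaussian density:

```python
import numpy as np
from scipy.stats import norm

y, y_hat, sigma = 1.3, 1.0, 0.5      # illustrative label, prediction, and std. dev.

# Density from the formula above
p_manual = 1.0 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-0.5 * ((y - y_hat) / sigma) ** 2)

# Same density from SciPy's implementation
p_scipy = norm.pdf(y, loc=y_hat, scale=sigma)

assert np.isclose(p_manual, p_scipy)
print(p_manual)                      # ≈ 0.666
```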
Under our model, and assuming the instances are independent, maximizing the likelihood of observing the labels $y^{(i)}$ amounts to solving:
$$ \argmax_{\theta}\prod_{i=1}^{N} \frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right). $$
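Numerically, this likelihood is just the per-instance densities multiplied together. A small sketch with illustrative labels and fixed predictions:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
y     = np.array([1.3, 0.2, -0.7])   # observed labels (illustrative)
y_hat = np.array([1.0, 0.0, -1.0])   # predictions f_theta(x^(i)) (illustrative)

# Likelihood: product of per-instance Gaussian densities
likelihood = np.prod(norm.pdf(y, loc=y_hat, scale=sigma))
print(likelihood)
```

For large $N$ this product underflows to zero in floating point, which is one practical reason the derivation continues with its logarithm.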
Here $\hat{y}^{(i)}=f_\theta(\mathbf{x}^{(i)})$. Following the NLL framework, we take the negative logarithm of this likelihood and minimize it, which gives the optimization objective:
$$ \argmin_{\theta} -\sum_{i=1}^{N} \log\left[\frac{1}{\sigma \sqrt{2 \pi}} \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right)\right]. $$
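A quick numerical check, reusing the illustrative values above, confirms that this objective is exactly the negative logarithm of the likelihood product:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
y     = np.array([1.3, 0.2, -0.7])   # illustrative labels
y_hat = np.array([1.0, 0.0, -1.0])   # illustrative predictions

# Negative log-likelihood: sum of per-instance negative log-densities
nll = -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))

# Equals the negative log of the likelihood product
likelihood = np.prod(norm.pdf(y, loc=y_hat, scale=sigma))
assert np.isclose(nll, -np.log(likelihood))
print(nll)
```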
Using the property $\log(ab) = \log a + \log b$, the logarithm of the product inside the square brackets decomposes into two separate terms:
$$ \argmin_{\theta} -\sum_{i=1}^{N} \left[\log\left(\frac{1}{\sigma \sqrt{2 \pi}}\right) + \log \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right)\right]. $$
We can safely ignore the constant term $\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right)$, since summing a constant merely shifts the objective vertically without changing the location of its minimum. This simplification results in:
$$ \argmin_{\theta} -\sum_{i=1}^{N} \log \exp\left(-\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right). $$
Because $\log$ and $\exp$ cancel each other, this further reduces to:
$$ \argmin_{\theta} \color{#ff0000}- \color{#000000}\sum_{i=1}^{N} \left[\color{#ff0000}-\color{#000000}\frac{1}{2}\left(\frac{y^{(i)} - \hat{y}^{(i)}}{\sigma}\right)^2\right] . $$
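As a final sanity check, the sketch below (same illustrative values) verifies that this reduced expression equals $\frac{1}{2\sigma^2}\sum_{i}\left(y^{(i)} - \hat{y}^{(i)}\right)^2$, and that the full NLL differs from it only by the constant term we dropped:

```python
import numpy as np
from scipy.stats import norm

sigma = 0.5
y     = np.array([1.3, 0.2, -0.7])   # illustrative labels
y_hat = np.array([1.0, 0.0, -1.0])   # illustrative predictions
N = len(y)

# The two minus signs cancel, leaving a scaled sum of squared errors
reduced    = -np.sum(-0.5 * ((y - y_hat) / sigma) ** 2)
scaled_sse = np.sum((y - y_hat) ** 2) / (2 * sigma ** 2)
assert np.isclose(reduced, scaled_sse)

# The full NLL equals this expression plus the dropped constant
nll      = -np.sum(norm.logpdf(y, loc=y_hat, scale=sigma))
constant = N * np.log(sigma * np.sqrt(2 * np.pi))
assert np.isclose(nll, scaled_sse + constant)
```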