In standard regression tasks, we aim to predict a continuous, real-valued target $y \in (-\infty, +\infty)$.
To model this probabilistically, we assume that the target variable $y$, given the input $x$, follows a Gaussian (Normal) distribution.
$$ y^{(i)} \mid x^{(i)},\theta \sim \mathcal{N}(\mu(x^{(i)},\theta), \sigma^2) $$
Our neural network processes the input $x^{(i)}$ and outputs the predicted mean of this distribution, $\mu(x^{(i)},\theta) = z^{(i)} = f_\theta(x^{(i)})$. The variance $\sigma^2$ is typically treated as a fixed constant in standard regression.
The likelihood of observing our target $y$ given the predicted mean $z$ is defined by the Gaussian probability density function:
$$ p(y^{(i)} \mid x^{(i)}, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)}-z^{(i)})^2}{2\sigma^2}\right) $$
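As a quick sketch in plain Python (the helper name `gaussian_pdf` is our own, not from any library), this density can be evaluated directly; a prediction near the target is assigned a higher likelihood than a distant one:

```python
import math

def gaussian_pdf(y, z, sigma2=1.0):
    """Density of y under N(z, sigma2): the per-example likelihood."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma2)
    return coeff * math.exp(-((y - z) ** 2) / (2.0 * sigma2))

# A prediction close to the target scores a higher likelihood than one far away.
print(gaussian_pdf(2.0, 2.1))  # close prediction
print(gaussian_pdf(2.0, 5.0))  # distant prediction
```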
Fitting $\theta$ by maximum likelihood means maximizing the joint likelihood over all $N$ training examples. Taking the negative natural logarithm of each factor turns this into an equivalent minimization of a sum:
$$ \min_{\theta} \sum_{i=1}^{N} -\log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)}-z^{(i)})^2}{2\sigma^2}\right) \right] $$
Using logarithm rules (the log of a product is the sum of logs, and the log cancels the exponential), we get:
$$ \sum_{i=1}^{N} -\log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)}-z^{(i)})^2}{2\sigma^2}\right) \right] = \sum_{i=1}^{N}\frac{(y^{(i)}-z^{(i)})^2}{2\sigma^2} + \sum_{i=1}^{N}\frac{1}{2}\log(2\pi\sigma^2) $$
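A numeric sanity check of this expansion (hypothetical helper names, standard library only): the negative log of the density equals the quadratic term plus the constant $\frac{1}{2}\log(2\pi\sigma^2)$.

```python
import math

def neg_log_pdf(y, z, sigma2=1.0):
    """-log of the Gaussian density, computed directly."""
    p = (1.0 / math.sqrt(2.0 * math.pi * sigma2)) * math.exp(
        -((y - z) ** 2) / (2.0 * sigma2)
    )
    return -math.log(p)

def expanded(y, z, sigma2=1.0):
    """The same quantity after applying the log rules."""
    return (y - z) ** 2 / (2.0 * sigma2) + 0.5 * math.log(2.0 * math.pi * sigma2)

# The two forms agree up to floating-point rounding.
print(neg_log_pdf(2.0, 1.5), expanded(2.0, 1.5))
```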
Because any terms that do not depend on the model parameters $\theta$ (which enter only through $z^{(i)}$) will not affect the optimizer during training, we can safely drop them. If we assume $\sigma^2$ is a constant, the entire second term $\frac{1}{2}\log(2\pi\sigma^2)$ is a constant.
Dropping the additive constant, along with the positive scale factor $\frac{1}{2\sigma^2}$ (which does not change the minimizer), leaves a loss proportional to the squared difference between the target and the prediction:
$$ \min_{\theta} \sum_{i=1}^{N} -\log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)}-z^{(i)})^2}{2\sigma^2}\right) \right] \rightarrow \min_{\theta} \overbrace{\sum_{i=1}^{N}(y^{(i)}-z^{(i)})^2}^{\text{SSE}} $$
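To see that dropping the constants changes nothing that matters, the following sketch (our own helper names, $\sigma^2 = 1$) compares the exact NLL with the SSE surrogate on two candidate prediction vectors: the two objectives differ by the same additive constant either way, so they rank candidates identically.

```python
import math

def nll(ys, zs, sigma2=1.0):
    """Exact Gaussian negative log-likelihood, summed over the dataset."""
    return sum(
        0.5 * math.log(2.0 * math.pi * sigma2) + (y - z) ** 2 / (2.0 * sigma2)
        for y, z in zip(ys, zs)
    )

def sse(ys, zs):
    """Sum of squared errors: what remains after dropping the constants."""
    return sum((y - z) ** 2 for y, z in zip(ys, zs))

ys = [1.0, 2.0, 3.0]
# With sigma2 = 1, NLL and SSE/2 differ by N * 0.5 * log(2*pi),
# regardless of the predictions, so the argmin is the same.
for zs in ([1.1, 1.9, 3.2], [0.0, 0.0, 0.0]):
    print(nll(ys, zs) - sse(ys, zs) / 2.0)  # same constant each time
```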
By minimizing the NLL under a Gaussian assumption, we naturally arrive at the Sum of Squared Errors (SSE) loss function. Since dividing by the constant $N$ does not change the minimizer, it is equivalent to the mean squared error (MSE) in the context of minimization.
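As a final illustration (a toy grid search over constant predictors, not a training recipe), SSE and MSE select the same minimizer, which lands at the grid point nearest the sample mean:

```python
ys = [1.0, 2.0, 4.0]  # toy targets; mean(ys) = 7/3
candidates = [c / 10.0 for c in range(61)]  # constant predictions 0.0 .. 6.0

def sse(c):
    return sum((y - c) ** 2 for y in ys)

def mse(c):
    # Scaling a loss by the positive constant 1/N leaves the argmin unchanged.
    return sse(c) / len(ys)

best_sse = min(candidates, key=sse)
best_mse = min(candidates, key=mse)
print(best_sse, best_mse)  # both pick the grid point nearest mean(ys)
```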