In multi-class classification, we aim to classify an input into one of several mutually exclusive categories, where the target $y^{(i)}$ belongs to a set of classes $\{1, \dots, K\}$ or is represented as a one-hot vector.

To model this probabilistically, we assume that the target variable $y$, given the input $x$, follows a Categorical distribution.

$$ y^{(i)} \mid x^{(i)},\theta \sim \text{Categorical}(z_1^{(i)}, \dots, z_K^{(i)}) $$

Our neural network processes the input $x$ and outputs a vector of predicted class probabilities, $z^{(i)} = f_\theta(x^{(i)})$, typically produced by a softmax output layer so that the entries are nonnegative and sum to one.

For a one-hot encoded target $y_k^{(i)} \in \{0,1\}$ with $\sum_{k=1}^K y_k^{(i)} = 1$, the output $z_k^{(i)}$ represents the predicted probability that observation $i$ belongs to class $k$.
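As a concrete illustration (with made-up numbers for $K = 3$ classes), a one-hot target and a valid probability vector look like this:

```python
import numpy as np

# Hypothetical predicted probabilities for K = 3 classes (e.g. a softmax output)
z = np.array([0.1, 0.7, 0.2])

# One-hot target: the true class is k = 1 (0-indexed)
y = np.array([0.0, 1.0, 0.0])

assert np.isclose(z.sum(), 1.0)   # valid Categorical parameters
assert y.sum() == 1.0             # exactly one active class

# z[k] is the predicted probability of class k
print(z[np.argmax(y)])  # probability assigned to the true class: 0.7
```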

Deriving the Final Loss Function

The likelihood of observing our one-hot target $y^{(i)}$ given the predicted probabilities $z^{(i)}$ is defined by the product of the probabilities raised to the power of their one-hot labels:

$$ p(y^{(i)} \mid x^{(i)}, \theta) = \prod_{k=1}^K \left(z_k^{(i)}\right)^{y_k^{(i)}} $$

Because only the true class has $y_k^{(i)} = 1$ (and all others are $0$), this product perfectly isolates the predicted probability of the correct class.
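A quick sanity check of this "selection" behavior, reusing the illustrative numbers from above:

```python
import numpy as np

z = np.array([0.1, 0.7, 0.2])   # predicted class probabilities (illustrative)
y = np.array([0.0, 1.0, 0.0])   # one-hot target, true class k = 1

# prod_k z_k^{y_k}: every factor with y_k = 0 equals 1,
# leaving only the probability of the true class
likelihood = np.prod(z ** y)
print(likelihood)  # 0.7, exactly z[1]
```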

Applying the Negative Log-Likelihood (NLL)

We take the negative natural logarithm of the likelihood and sum over all $N$ independent training examples, then minimize the result with respect to $\theta$:

$$ \min_{\theta} \sum_{i=1}^{N} -\log \left[ \prod_{k=1}^K \left(z_k^{(i)}\right)^{y_k^{(i)}} \right] $$

Using logarithm rules (the log of a product is the sum of logs, and exponents can be brought down as multipliers), we get:

$$ \sum_{i=1}^{N} -\log \left[ \prod_{k=1}^K \left(z_k^{(i)}\right)^{y_k^{(i)}} \right] = \sum_{i=1}^{N} \sum_{k=1}^K -y_k^{(i)} \log\left(z_k^{(i)}\right) $$
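The two sides of this identity can be checked numerically for a single example (same illustrative numbers as before):

```python
import numpy as np

z = np.array([0.1, 0.7, 0.2])   # predicted probabilities (illustrative)
y = np.array([0.0, 1.0, 0.0])   # one-hot target

lhs = -np.log(np.prod(z ** y))   # -log of the likelihood product
rhs = -np.sum(y * np.log(z))     # sum form after applying the log rules

assert np.isclose(lhs, rhs)
print(lhs)  # -log(0.7) ≈ 0.3567
```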

Every term depends on the model outputs $z_k^{(i)}$ through the active target class, so there are no $\theta$-independent constants to drop. Minimizing the NLL therefore yields the Categorical Cross-Entropy (CE) loss directly:

$$ \min_{\theta} \sum_{i=1}^{N} -\log \left[ \prod_{k=1}^K \left(z_k^{(i)}\right)^{y_k^{(i)}} \right] \rightarrow \min_{\theta} \overbrace{\sum_{i=1}^{N} \sum_{k=1}^K -y_k^{(i)} \log\left(z_k^{(i)}\right)}^{\text{Cross-Entropy}} $$
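The final double sum translates into a few lines of NumPy. This is a minimal sketch (the function name, `eps` guard, and batch values are illustrative, not from the text), computing the CE loss over a batch of $N$ examples:

```python
import numpy as np

def cross_entropy(Y, Z, eps=1e-12):
    """Categorical cross-entropy summed over N examples.

    Y: (N, K) one-hot targets; Z: (N, K) predicted probabilities.
    eps guards against log(0) for numerical safety.
    """
    return -np.sum(Y * np.log(Z + eps))

# Illustrative batch of N = 2 examples, K = 3 classes
Y = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0]])
Z = np.array([[0.1, 0.7, 0.2],
              [0.5, 0.3, 0.2]])

loss = cross_entropy(Y, Z)
print(loss)  # -log(0.7) - log(0.5) ≈ 1.0498
```

Because each target row is one-hot, each example contributes only $-\log$ of the probability assigned to its true class, matching the derivation above.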