In PyTorch, a wide variety of classification models are available, ranging from image classification and object detection to architectures built on recurrent and self-attention mechanisms. These models predominantly use cross-entropy as their training loss. In this tutorial, we explore the rationale behind the widespread adoption of cross-entropy as a loss function and examine its relationship with negative log-likelihood loss.

We will explore the theory and applications of Kullback-Leibler (KL) divergence and cross-entropy in classification, emphasizing their roles in model optimization and the importance of loss functions in enhancing prediction accuracy.

Problem Statement

In classification tasks, it is crucial to align the model's output $\hat{y}^{(i)}$ distribution with the target label ${y}^{(i)}$ distribution for each individual prediction $i$. This alignment is fundamental in assessing the model's capability to accurately predict labels for unseen data.

Key Point: The distribution we are talking about is not the distribution of the samples, but the distribution of output classes for each sample.

Example: If our model's output for sample $i$ includes 10 classes, this length-10 output vector can be viewed as a multinomial distribution, i.e., the probability of occurrence for each category. Thus, the distribution comparison here is between this output's multinomial distribution and the multinomial distribution of the sample's label. For the sample's label's multinomial distribution, the probability is 1 at the label's position and 0 elsewhere.
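
To make this concrete, here is a minimal PyTorch sketch that builds both distributions for a single sample; the logits and the label index are made up purely for illustration:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw model outputs (logits) for one sample and 10 classes.
logits = torch.randn(10)

# Softmax turns the logits into a probability distribution over the 10 classes.
q = F.softmax(logits, dim=0)   # model's predicted distribution; entries sum to 1

# Suppose the true class of this sample is index 3 (an arbitrary choice).
label = torch.tensor(3)
p = F.one_hot(label, num_classes=10).float()  # probability 1 at the label, 0 elsewhere

print(q)  # 10 probabilities that sum to 1
print(p)  # tensor([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
```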

Prior Knowledge - Entropy

Entropy, denoted as $H(p)$, measures the average amount of information produced by a stochastic source of data. For a discrete random variable with probability mass function $p(x)$, entropy is defined as:

$$ H(p) = -\sum_{x} p(x) \log p(x) $$

The concept of entropy is central in information theory, capturing the unpredictability or uncertainty of a random variable. Higher entropy means more unpredictability.
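
As a quick numerical check of this definition, the sketch below computes the entropy of two made-up distributions, one uniform and one nearly deterministic:

```python
import torch

def entropy(p: torch.Tensor) -> torch.Tensor:
    """H(p) = -sum_x p(x) log p(x); assumes p is a valid probability vector."""
    mask = p > 0                       # 0 * log(0) is treated as 0 by convention
    return -(p[mask] * p[mask].log()).sum()

uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])  # maximally unpredictable over 4 outcomes
peaked  = torch.tensor([0.97, 0.01, 0.01, 0.01])  # almost deterministic

print(entropy(uniform))  # ~1.386 nats (= log 4): high uncertainty
print(entropy(peaked))   # ~0.168 nats: low uncertainty
```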

Kullback-Leibler Divergence

Kullback-Leibler (KL) divergence measures the dissimilarity between two probability distributions $p$ and $q$: it quantifies the extra bits needed, on average, to encode samples drawn from $p$ using a code optimized for $q$. A lower value implies more similar distributions. For a discrete variable $x$, the KL divergence of an approximating distribution $q$ from a target distribution $p$ is mathematically defined as:

$$ D_{KL}(p \parallel q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \sum_{x} p(x) \log p(x) - \sum_{x} p(x) \log q(x) $$

Here, the summation iterates over all possible values of $x$, computing the product of the probability $p(x)$ and the logarithm of the ratio $\frac{p(x)}{q(x)}$ at each $x$. In the context of classification, $p$ represents the true label distribution, often a one-hot vector indicating the correct class, while $q$ embodies the model's predicted probabilities across classes.

Optional: KL divergence is not symmetric; in general $D_{KL}(p \parallel q) \neq D_{KL}(q \parallel p)$, so swapping $p$ and $q$ changes the measured discrepancy. This asymmetry does not undermine its usefulness for evaluating and optimizing prediction accuracy: a lower KL divergence still indicates a model that more closely approximates the true label distribution.
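
The following sketch computes $D_{KL}(p \parallel q)$ directly from the definition and illustrates the asymmetry numerically; the two distributions are made up for illustration:

```python
import torch

def kl_div(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)); assumes q > 0 wherever p > 0."""
    mask = p > 0                       # terms with p(x) = 0 contribute nothing
    return (p[mask] * (p[mask] / q[mask]).log()).sum()

p = torch.tensor([0.70, 0.20, 0.10])  # "true" target distribution
q = torch.tensor([0.50, 0.30, 0.20])  # model's approximating distribution

print(kl_div(p, q))  # ~0.085
print(kl_div(q, p))  # ~0.092, a different value: KL divergence is asymmetric
```

In practice you would more likely call torch.nn.functional.kl_div, which expects the prediction argument as log-probabilities; the manual version above simply makes the definition explicit.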

Cross-Entropy

Observe that the first term, $\sum_{x} p(x) \log p(x)$, is the negative of the entropy of $p$, i.e., $-H(p)$. In machine learning, $p$ is the given distribution of label categories, so this term is a constant with respect to the model and can be ignored during optimization. Minimizing $D_{KL}(p \parallel q)$ over $q$ is therefore equivalent to minimizing the remaining term:

$$ H(p, q)=- \sum_{x} p(x) \log q(x) $$

This expression is the definition of cross-entropy, denoted $H(p, q)$.

Cross-entropy effectively quantifies the inefficiency of using a predicted probability distribution $q(x)$ to encode events compared to the true distribution $p(x)$. Imagine you have a codebook designed based on $q(x)$ for encoding messages, but the real messages come from $p(x)$. The cross-entropy $H(p, q)$ measures, on average, how many extra bits you would need per message using this mismatched codebook compared to an ideal one tailored to $p(x)$. If $q(x)$ perfectly matches $p(x)$, there's no inefficiency, and the cross-entropy equals the true entropy of $p(x)$. However, any discrepancy between $q(x)$ and $p(x)$ increases this value, reflecting additional bits needed due to the prediction error in $q(x)$.

Trivia: Cross-entropy measures the total number of bits required on average, while KL divergence specifies the extra bits needed beyond what is optimal; in other words, $H(p, q) = H(p) + D_{KL}(p \parallel q)$.
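
Reusing the made-up distributions and helpers from the earlier sketches, we can verify this relationship numerically:

```python
import torch

def entropy(p):
    mask = p > 0
    return -(p[mask] * p[mask].log()).sum()

def kl_div(p, q):
    mask = p > 0
    return (p[mask] * (p[mask] / q[mask]).log()).sum()

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    mask = p > 0
    return -(p[mask] * q[mask].log()).sum()

p = torch.tensor([0.70, 0.20, 0.10])  # true label distribution
q = torch.tensor([0.50, 0.30, 0.20])  # predicted distribution

print(cross_entropy(p, q))        # ~0.887: total average cost with the mismatched code
print(entropy(p) + kl_div(p, q))  # ~0.887: optimal cost plus extra cost, matching the line above
```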

Written in terms of $p$ and $q$, this equation may look unfamiliar next to classification outputs $\hat{y}$ and labels $y$. By substituting the predictions and labels into the expression, we obtain: