In ordinal regression, we aim to predict a categorical target where the classes have a strict, meaningful rank, but the mathematical distance between them is unknown. The target is an ordered discrete label $y^{(i)} \in \{1, 2, \dots, K\}$ where $1 \prec 2 \prec \dots \prec K$.

Applications of Ordinal Regression

Unlike standard regression (which assumes equal spacing between values) or multi-class classification (which assumes no relationship between classes), ordinal regression explicitly respects the rank. Common real-world examples include:

- Product or movie star ratings (1–5 stars)
- Survey responses on a Likert scale (strongly disagree to strongly agree)
- Disease severity stages (mild, moderate, severe)
- Education levels (high school, bachelor's, master's, doctorate)

Definition of the Ordinal Framework (Latent Variable Theory)

Ordinal data is not naturally captured by a single standard probability distribution such as a Gaussian or Poisson. Instead, it is modeled using a framework built on an unobservable, underlying continuous value known as a latent variable, $y^*$.

We divide this continuous latent space using $K-1$ strictly ordered thresholds (or cutpoints), $\delta$. If we have $K$ categories, the thresholds are $\delta_1 < \delta_2 < \dots < \delta_{K-1}$. The observed ordinal class $y$ is determined by where the continuous $y^*$ falls relative to these cutpoints:

$$ y = k \quad \text{if} \quad \delta_{k-1} < y^* \leq \delta_k $$

(Where the outer boundaries are theoretically defined as $\delta_0 = -\infty$ and $\delta_K = \infty$).
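The thresholding rule above can be sketched directly in code. This is a minimal illustration (the cutpoint values are assumed for the example, not taken from the text):

```python
import numpy as np

def latent_to_class(y_star, deltas):
    """Map a latent value y* to an ordinal class in {1, ..., K}.

    deltas: strictly increasing cutpoints [delta_1, ..., delta_{K-1}];
    the outer boundaries delta_0 = -inf and delta_K = +inf are implicit.
    """
    # searchsorted with side="left" returns the index of the first
    # cutpoint >= y*, which is exactly (k - 1) under the rule
    # delta_{k-1} < y* <= delta_k.
    return int(np.searchsorted(deltas, y_star, side="left")) + 1

# Example with K = 4 classes, i.e. three cutpoints:
deltas = [-1.0, 0.5, 2.0]
latent_to_class(0.0, deltas)   # falls in (delta_1, delta_2], so class 2
```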

Because the thresholds $\delta$ (and hence the spacings between them) are learned during training rather than fixed in advance, the model fully preserves the "comparable but not arithmetically operable" nature of ordinal data.

Conditional Likelihood

To model this probabilistically, rather than predicting the class directly, our neural network models the cumulative probability: the probability that the target is strictly greater than a specific class $k$.

For a $K$-class problem, the network processes the input $x$ and outputs $K-1$ logits, representing our sequence of threshold decisions: $z^{(i)} = \{z_1^{(i)}, z_2^{(i)}, \dots, z_{K-1}^{(i)}\} = f_\theta(x^{(i)})$.
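As a concrete sketch of this output shape, here is a minimal stand-in for $f_\theta$: a single linear layer producing $K-1$ logits. The values of $K$, the feature dimension, and the weights are all assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5   # number of ordinal classes (assumed for illustration)
D = 8   # input feature dimension (assumed)

# A minimal stand-in for f_theta: one linear layer with K-1 outputs.
# A real model would be a deeper network; only the head shape matters here.
W = rng.normal(size=(D, K - 1))
b = np.zeros(K - 1)

def f_theta(x):
    """Return the K-1 threshold logits z_1, ..., z_{K-1} for input x."""
    return x @ W + b

x = rng.normal(size=D)
z = f_theta(x)
assert z.shape == (K - 1,)  # one logit per threshold decision
```

The key design point is that the head emits $K-1$ values, one per cutpoint, rather than $K$ softmax logits.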

We apply the sigmoid function $\sigma(\cdot)$ to these logits to obtain the conditional probability that the actual class is greater than $k$:

$$ P(y^{(i)} > k \mid x^{(i)}, \theta) = \sigma(z_k^{(i)}) = \frac{1}{1 + e^{-z_k^{(i)}}} $$
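A small numeric sketch of this step, with example logit values assumed. It also shows one convenient consequence: per-class probabilities can be recovered by differencing adjacent cumulative terms, since $P(y = k) = P(y > k-1) - P(y > k)$ with $P(y > 0) = 1$ and $P(y > K) = 0$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Example logits for K = 4 classes (three threshold decisions; values assumed).
z = np.array([2.0, 0.5, -1.5])

# Cumulative probabilities P(y > k | x) for k = 1, ..., K-1.
p_gt = sigmoid(z)

# Per-class probabilities by telescoping differences. Note: raw sigmoids are
# not guaranteed to be monotonically decreasing; a well-trained model (or a
# rank-consistent head) should produce decreasing p_gt so these stay >= 0.
p_class = -np.diff(np.concatenate(([1.0], p_gt, [0.0])))
assert np.isclose(p_class.sum(), 1.0)  # probabilities sum to 1 by telescoping
```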

Deriving the Final Loss Function

To compute the likelihood of observing our target $y^{(i)} = c$, we encode the true class $c$ as a binary vector $t^{(i)}$ of length $K-1$. This vector holds the ground-truth answers to the sequential threshold questions:
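Under the cumulative formulation above, the natural encoding sets $t_k = 1$ exactly when the true class lies above threshold $k$, i.e. $c > k$. A minimal sketch (the helper name is mine, not from the text):

```python
import numpy as np

def make_targets(c, K):
    """Encode true class c in {1, ..., K} as a length K-1 binary vector.

    t_k = 1 if c > k (the true class lies above threshold k), else 0,
    matching the model's K-1 cumulative predictions P(y > k | x).
    """
    return (c > np.arange(1, K)).astype(float)
```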