During the validation phase, you often encounter evaluation metrics: in image quality assessment, for example, there are metrics like PSNR and SSIM; for object detection, there is mean Average Precision (mAP). Sometimes humans are even involved in validation to produce subjective scores, such as the mean opinion score.

We might wonder why these metrics are rarely used directly in neural network training. The reason is simple: most of these functions are not differentiable, which makes them difficult to optimize. Unlike operations such as max, which merely route gradients through one branch, their non-differentiability can hardly be worked around with subgradients. Below, we will see that some metrics are piecewise constant and discontinuous, which breaks gradient backpropagation entirely.

Precision as an Example

Let’s think about the precision metric $\text{TP}/(\text{TP}+\text{FP})$. The core difficulty in directly optimizing precision using gradient-based methods lies in its dependence on a threshold function that classifies model outputs into binary predictions (0 or 1).

$$ \text{predictions} = \begin{cases} 1 & \text{if }\hat{y} > 0.5 \\ 0 & \text{otherwise} \end{cases} $$

$$ \text{TP} = \sum (\text{predictions} \times \text{targets}) $$
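To see the problem concretely, here is a minimal PyTorch sketch (the tensor values are made up for illustration) that computes precision with the hard threshold above. The comparison operation produces a constant tensor that is detached from the autograd graph, so no gradient can ever reach $\hat{y}$:

```python
import torch

y_hat = torch.tensor([0.6, 0.3, 0.8], requires_grad=True)  # model outputs
y = torch.tensor([1.0, 0.0, 1.0])                           # ground-truth labels

predictions = (y_hat > 0.5).float()   # hard threshold: yields a constant tensor
tp = (predictions * y).sum()          # true positives
fp = (predictions * (1 - y)).sum()    # false positives
precision = tp / (tp + fp)

print(precision)                # tensor(1.)
print(precision.requires_grad)  # False -> backpropagation to y_hat is impossible
```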

The if condition is non-differentiable. You may wonder about $\max(a(x), b(x))$: it is also a non-differentiable function, so why does training work with the max but not with the threshold? The difference becomes clear once we write $\max(a(x), b(x))$ as a piecewise function:

$$ \max(a(x), b(x)) = \begin{cases} a(x) & \text{if }a(x) > b(x) \\ b(x) & \text{otherwise} \end{cases} $$

The gradient of the max is undefined only at points where $a(x)=b(x)$; everywhere else it is well-defined and equals either $a'(x)$ or $b'(x)$. In contrast, for the threshold function described above, the gradient is 0 everywhere except at $\hat{y} = 0.5$, where it is undefined. A unit whose gradient is zero almost everywhere is sometimes referred to as a "dead neuron." Similarly, with the ReLU function, the gradient is 0 for $x \leq 0$ (taking the subgradient 0 at $x = 0$), which is why units stuck in this region are commonly said to suffer from the "dead ReLU" problem.
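As a small sketch of this difference (the functions $a(x) = 2x$ and $b(x) = x^2$ are arbitrary choices for illustration, and PyTorch is assumed), autograd simply routes the gradient through whichever branch of the max is active, whereas the step function above would contribute no gradient at all:

```python
import torch

x = torch.tensor(0.3, requires_grad=True)
a = 2.0 * x               # a(x) = 2x
b = x ** 2                # b(x) = x^2
m = torch.maximum(a, b)   # a(0.3) = 0.6 > b(0.3) = 0.09, so the max follows a

m.backward()
print(x.grad)             # tensor(2.) -> the gradient of the active branch, a'(x) = 2
```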

Consider another perspective on the precision metric: if the model predicts $\hat{y} = 0.6$ for a sample and later predicts $\hat{y} = 0.7$ for the same sample, both values yield the same positive prediction. This means that the change in $\hat{y}$ (which can be traced back to changes in the model parameters) does not affect the precision calculation at all; the relationship between $\hat{y}$ and the objective $J$ is broken. As a result, computing the gradients needed for optimization becomes problematic. In effect, if one were to use the values 0.6 and 0.7 directly during optimization, one would essentially be optimizing a surrogate loss function rather than the precision metric itself.
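A quick numerical check of this (again with made-up values): nudging $\hat{y}$ from 0.6 to 0.7 leaves the hard precision unchanged, so the finite-difference "derivative" with respect to that output is exactly zero:

```python
import torch

def hard_precision(y_hat, y):
    """Precision computed with a hard 0.5 threshold (not differentiable)."""
    pred = (y_hat > 0.5).float()
    tp = (pred * y).sum()
    fp = (pred * (1 - y)).sum()
    return tp / (tp + fp)

y = torch.tensor([1.0, 0.0, 1.0])
p1 = hard_precision(torch.tensor([0.6, 0.3, 0.8]), y)
p2 = hard_precision(torch.tensor([0.7, 0.3, 0.8]), y)
print(p1, p2)  # tensor(1.) tensor(1.) -> identical, so the metric gives no learning signal
```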

Surrogate Loss

To address this issue, an option is to use surrogate losses—differentiable approximations that mimic the behavior of the precision metric but allow for smooth gradient propagation during optimization.

A surrogate loss for precision replaces the non-differentiable threshold function with a smooth, continuous approximation.

Giving It a Try

For instance, instead of applying a hard cutoff to the model output, one can use the raw output $\hat{y}$ directly in the metric computation. In this surrogate approach, the true positives and false positives are computed in a differentiable manner. Specifically, we define the soft true positives as

$$ \text{Soft TP} = \sum_i \hat{y}^{(i)} \, y^{(i)}, $$

and the soft false positives as

$$ \text{Soft FP} = \sum_i \hat{y}^{(i)} \, (1 - y^{(i)}). $$

Using these quantities, the soft precision is given by

$$ \text{Soft Precision} = \frac{\text{Soft TP}}{\text{Soft TP} + \text{Soft FP}}. $$

This modification ensures that even small changes in $\hat{y}$ lead to proportional changes in the soft true positives and false positives, which in turn provide meaningful gradient information. As a result, the model can learn effectively through gradient descent, continuously updating its parameters to improve the surrogate metric.
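Putting this together, here is a minimal sketch of the surrogate, written in PyTorch as an assumption about the training setup; the function name `soft_precision_loss` and the tensor values are illustrative, and `y_hat` is taken to be a sigmoid probability. Because every operation is differentiable, gradients flow back to the model output. Note that we minimize $1 - \text{Soft Precision}$, since optimizers minimize losses:

```python
import torch

def soft_precision_loss(y_hat, y, eps=1e-8):
    """Differentiable surrogate: 1 - soft precision.

    y_hat: predicted probabilities in [0, 1]
    y:     binary ground-truth labels
    """
    soft_tp = (y_hat * y).sum()            # Soft TP = sum_i y_hat_i * y_i
    soft_fp = (y_hat * (1 - y)).sum()      # Soft FP = sum_i y_hat_i * (1 - y_i)
    soft_precision = soft_tp / (soft_tp + soft_fp + eps)
    return 1.0 - soft_precision            # minimize loss <=> maximize soft precision

y_hat = torch.tensor([0.6, 0.3, 0.8], requires_grad=True)
y = torch.tensor([1.0, 0.0, 1.0])

loss = soft_precision_loss(y_hat, y)
loss.backward()
print(y_hat.grad)  # non-zero gradients: small changes in y_hat now matter
```

In practice, a surrogate like this is often combined with a standard classification loss (such as binary cross-entropy) rather than used in isolation.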