A loss function (also called a cost function) is a mathematical function that quantifies the difference between a model's predicted output and the actual output. It measures the error in the model's predictions, thereby formulating the optimization goal of minimizing this error during training.
Common examples of loss functions include Mean Squared Error (MSE) for regression tasks and Negative Log Likelihood (NLL) for classification tasks.
$$ \mathcal{L}_\text{MSE}(\hat{y}, y) = (y-\hat{y})^2 $$
$$ \begin{align*}\mathcal{L}_\text{NLL}(\hat{y}, y) = &-y\log(\hat{y}) \\&- (1-y)\log(1-\hat{y})\end{align*} $$
If you compare the two curves, you will observe that the NLL curve decreases more steeply around $\hat{y} = 0.2$; overall, however, both curves exhibit a similar shape.
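To make the comparison concrete, here is a minimal NumPy sketch of the two per-sample losses defined above. The function names `mse_loss` and `nll_loss` and the evaluation points are my own choices for illustration, not part of the original material.

```python
import numpy as np

def mse_loss(y_hat, y):
    """Squared error between prediction and label, matching the MSE formula above."""
    return (y - y_hat) ** 2

def nll_loss(y_hat, y, eps=1e-12):
    """Binary negative log-likelihood (cross-entropy); eps guards against log(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Compare the two losses for a true label y = 1 at a few predicted probabilities.
for y_hat in [0.1, 0.2, 0.5, 0.9]:
    print(f"y_hat={y_hat:.1f}  MSE={mse_loss(y_hat, 1.0):.3f}  NLL={nll_loss(y_hat, 1.0):.3f}")
```

Running this shows the NLL values changing much faster than the MSE values as $\hat{y}$ moves through the region around 0.2, which matches the steeper curve noted above.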
Some might wonder why Mean Squared Error (MSE) is not suitable for logistic regression in classification. The issue primarily concerns the convexity of the loss function $J(\theta)$: when MSE is used as the loss for logistic regression, the resulting optimization problem becomes non-convex. This non-convexity stems from the fact that logistic regression produces probabilities via the sigmoid function, a non-linear transformation.
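One way to see the problem, assuming the usual setup $\hat{y} = \sigma(z)$ where $z$ is the model's logit and $\sigma$ is the sigmoid, is to compare the gradients of the two losses with respect to $z$:

$$ \frac{\partial}{\partial z}\,\mathcal{L}_\text{MSE}(\sigma(z), y) = 2\,(\sigma(z) - y)\,\sigma(z)\,(1 - \sigma(z)) $$

$$ \frac{\partial}{\partial z}\,\mathcal{L}_\text{NLL}(\sigma(z), y) = \sigma(z) - y $$

The extra factor $\sigma(z)(1 - \sigma(z))$ vanishes whenever the sigmoid saturates, so the MSE gradient can be nearly zero even when the prediction is badly wrong, whereas the NLL gradient stays proportional to the error. This is exactly what flattens parts of the MSE surface in the plot below.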
https://yyhtbs-yye.github.io/#/plotlyrender?data=https://raw.githubusercontent.com/yyhtbs-yye/plotly_json/refs/heads/main/logi_reg_nll_vs_mse_cube.json
The MSE plot on the right demonstrates how the sigmoid nonlinearity complicates the loss surface. Although there is a distinct valley, an initial guess of the parameter values that starts in the top yellow region sits where the gradient is very small, so gradient descent progresses slowly in that area.
Conversely, employing a steeper loss function such as NLL, shown on the left, "pushes down" the flat yellow area and yields a nearly bowl-shaped, convex loss landscape. This improves the gradient almost everywhere: apart from the region near the very bottom, the landscape has large gradient values that accelerate convergence to the bowl's base.
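As a rough way to reproduce this qualitative picture, the sketch below evaluates both losses for a one-parameter logistic model $\hat{y} = \sigma(\theta x)$ over a grid of $\theta$ values. The toy dataset, the parameter range, and the $\theta < -5$ cutoff are all made up for illustration; this is a 1-D analogue of the surfaces in the linked plot, not a reproduction of them.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset (illustrative only): positive labels at x > 0, negative at x < 0.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

thetas = np.linspace(-10, 10, 201)
eps = 1e-12

mse_surface, nll_surface = [], []
for theta in thetas:
    y_hat = sigmoid(theta * x)
    mse_surface.append(np.mean((y - y_hat) ** 2))
    p = np.clip(y_hat, eps, 1 - eps)
    nll_surface.append(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

# Compare slope magnitudes far from the optimum, where the MSE surface plateaus.
mse_slope = np.abs(np.gradient(mse_surface, thetas))
nll_slope = np.abs(np.gradient(nll_surface, thetas))
print("max |dMSE/dtheta| for theta < -5:", mse_slope[thetas < -5].max())
print("max |dNLL/dtheta| for theta < -5:", nll_slope[thetas < -5].max())
```

The printed slopes should show the MSE surface flattening out in the region of badly wrong parameters while the NLL surface still has a sizeable gradient there, which is the 1-D analogue of the yellow plateau versus the bowl in the plots above.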