The figure compares the error values produced by different loss functions when the true label $y$ is 1, as the prediction $\hat{y}$ varies.
In the figure, the subfigure on the left denotes Mean Squared Error (MSE) loss, while the subfigure on the right denotes the Negative Log-Likelihood (NLL) loss.
$$ \text{MSE}(\hat{y}, y) = (y-\hat{y})^2 $$
$$ \text{NLL}(\hat{y}, y) = -\left[ y\log(\hat{y}) + (1-y)\log(1-\hat{y}) \right] $$
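The two definitions above can be evaluated directly. The following minimal Python sketch (the function names `mse` and `nll` are our own) compares both losses for the case $y = 1$ as $\hat{y}$ moves away from the true label:

```python
import math

def mse(y_hat, y):
    # Squared difference between prediction and label
    return (y - y_hat) ** 2

def nll(y_hat, y):
    # Negative log-likelihood (binary cross-entropy)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# With the true label y = 1, compare both losses as y_hat drifts toward 0
for y_hat in (0.9, 0.5, 0.1):
    print(f"y_hat={y_hat}: MSE={mse(y_hat, 1):.4f}, NLL={nll(y_hat, 1):.4f}")
```

Note how the NLL penalty grows without bound as $\hat{y} \to 0$, while the MSE penalty is capped at 1.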
The figure shows that both curves approach zero as $\hat{y}$ approaches the true label, but the NLL curve rises far more steeply as the prediction moves away from it.
This raises the critical issue of whether MSE is appropriate for logistic regression classification tasks, which can be examined from multiple angles.
Optimization Difficulty: When MSE is used as the loss function for logistic regression, the optimization problem becomes non-convex. This non-convexity arises because logistic regression produces probabilities through the sigmoid function, a non-linear transformation; composing it with a squared error can create flat regions and multiple local minima, hindering effective training of the model. In contrast, the cross-entropy (NLL) loss yields a convex loss surface with a unique global minimum, making optimization more straightforward and reliable.
Probabilistic Interpretation: MSE is typically associated with regression tasks that predict continuous outcomes, and its probabilistic interpretation assumes the errors follow a Gaussian distribution. Logistic regression, however, models the probabilities of categorical outcomes, and its standard loss function, the NLL, has a direct probabilistic interpretation: it is the negative log-likelihood of observing each category under a Bernoulli model. Using MSE in this context obscures the probabilistic meaning and relationships inherent in logistic regression.
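To make the Bernoulli connection concrete, the sketch below (with our own helper name `bernoulli_nll`) shows that summing the NLL over a dataset is exactly the negative log of the joint Bernoulli likelihood, so minimizing NLL is maximum-likelihood estimation:

```python
import math

def bernoulli_nll(y_hat, y):
    # Negative log of the Bernoulli likelihood
    # P(y | y_hat) = y_hat^y * (1 - y_hat)^(1 - y)
    likelihood = (y_hat ** y) * ((1 - y_hat) ** (1 - y))
    return -math.log(likelihood)

preds = [0.9, 0.2, 0.7]   # predicted probabilities of class 1
labels = [1, 0, 1]        # observed binary labels

total_nll = sum(bernoulli_nll(p, t) for p, t in zip(preds, labels))
# Exponentiating the negated total recovers the product of per-sample
# likelihoods, i.e. the joint probability of the observed labels.
joint_likelihood = math.exp(-total_nll)
print(joint_likelihood)  # equals 0.9 * 0.8 * 0.7
```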
The choice of loss function significantly impacts how we interpret data and errors. Generally, a loss function is pivotal in establishing the relationship between the predicted values and the actual labels for each sample. Additionally, the chosen loss function reflects the underlying assumptions of our prediction model.
In the conception of the Mean Squared Error (MSE) loss function, the design choices are not arbitrary but the result of deliberate and nuanced reasoning.
Taken together, the MSE design balances mathematical rigor, practical utility, and conceptual clarity, yielding a loss function that is both robust and interpretable in guiding model optimization.
It is worth noting that the loss function usually reflects our needs. Although we can theoretically define any loss function, optimizing it is often fraught with difficulties. This is exemplified when attempting to optimize performance metrics derived from the Confusion Matrix, such as precision, sensitivity, and specificity.
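The difficulty can be seen by computing these metrics directly. In the sketch below (helper name `confusion_metrics` is our own), the metrics depend on hard 0/1 predictions, so they are piecewise constant in the model parameters: their gradient is zero almost everywhere, which is why they cannot be optimized by gradient descent and a surrogate loss like NLL is used instead.

```python
def confusion_metrics(y_true, y_pred):
    # Tally the four cells of a binary confusion matrix
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "precision":   tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # a.k.a. recall
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }

# Hard thresholded predictions: small parameter changes that do not flip
# any prediction leave every metric unchanged, so gradients are zero.
metrics = confusion_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(metrics)
```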