Neural networks, written as $\hat{\mathbf{y}} = f_\theta(\mathbf{x})$, do not natively formulate an optimization problem. Instead, we measure how far the predictions $\hat{\mathbf{y}}$ deviate from the ground truth $\mathbf{y}$ using a loss function, which quantifies the error.
The aim is to minimize this quantified difference between $\hat{\mathbf{y}}$ and $\mathbf{y}$, and this is where optimization comes into play, guiding adjustments to the model parameters $\theta$. Neural network training, at its core, revolves around minimizing this loss to improve the model's predictions.
For a single sample, the loss function is defined as:
$$ \mathcal{L}: \bigl(\hat{\mathbf{y}}^{(i)}, \mathbf{y}^{(i)}\bigr) \mapsto J^{(i)}, $$
where $\hat{\mathbf{y}}^{(i)} = f_\theta(\mathbf{x}^{(i)})$ is the model output for input $\mathbf{x}^{(i)}$. Minimizing the aggregate loss across all training samples is the central goal of neural network training.
Here are two common examples:
In regression tasks where the output is a continuous univariate value, the squared error loss for a sample is given as follows:
$$ J^{(i)}=\mathcal{L}(\hat{y}^{(i)}, {y}^{(i)}) = (\hat{y}^{(i)} - y^{(i)})^2. $$
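As a minimal sketch (plain Python, not tied to any particular framework), the per-sample squared error can be computed directly:

```python
def squared_error(y_hat: float, y: float) -> float:
    """Per-sample squared error loss: (y_hat - y)^2."""
    return (y_hat - y) ** 2

# A prediction of 2.5 against a ground truth of 3.0 gives a loss of 0.25.
print(squared_error(2.5, 3.0))  # 0.25
```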
In binary classification tasks, the Negative Log-Likelihood (NLL) loss is given by
$$ J^{(i)}=\mathcal{L}(\hat{y}^{(i)}, {y}^{(i)}) = - \bigl[y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})\bigr], $$
which quantifies how far the predicted probability $\hat{y}^{(i)}$ diverges from the true label $y^{(i)} \in \{0, 1\}$. Similarly, though less common, clustering uses the Within-Cluster Sum of Squares (WCSS) loss, which measures the variance within each cluster to encourage compact groupings.
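A minimal NumPy sketch of the per-sample NLL above; the `eps` clipping is an added assumption to keep `log(0)` from occurring for extreme predictions:

```python
import numpy as np

def nll_loss(y_hat: float, y: int, eps: float = 1e-12) -> float:
    """Per-sample negative log-likelihood for binary classification.

    y_hat is the predicted probability of the positive class; y is 0 or 1.
    Predictions are clipped away from 0 and 1 to avoid log(0).
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1.0 - y_hat)))

# A confident correct prediction incurs low loss; a confident wrong one is penalized heavily.
print(nll_loss(0.9, 1))  # ~0.105
print(nll_loss(0.1, 1))  # ~2.303
```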
For all training samples, the single-sample losses are aggregated, forming the overall loss function $J(\theta)$:
$$ \small J(\theta) = \sum_{i}{\mathcal{L}\bigl(f_\theta(\mathbf{x}^{(i)}), \mathbf{y}^{(i)}\bigr)} \quad \text{or} \quad J(\theta) = \frac{1}{N}\sum_{i}{\mathcal{L}\bigl(f_\theta(\mathbf{x}^{(i)}), \mathbf{y}^{(i)}\bigr)}, $$
and training seeks $\argmin_{\theta}J(\theta)$.
The left equation simply sums the losses over all samples, which is simpler to implement because it involves no averaging. The right equation divides by the number of samples $N$; this mean form is generally preferred because it keeps the magnitude of the loss and its gradient independent of the dataset (or mini-batch) size, so the gradient is not amplified as the number of samples grows.
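The two aggregation choices correspond to sum versus mean reduction. A short sketch using the squared error from above (the toy arrays and the `reduction` flag are assumptions for illustration):

```python
import numpy as np

def total_loss(y_hat: np.ndarray, y: np.ndarray, reduction: str = "mean") -> float:
    """Aggregate per-sample squared errors with either a sum or a mean reduction."""
    per_sample = (y_hat - y) ** 2
    return float(per_sample.sum() if reduction == "sum" else per_sample.mean())

y_hat = np.array([2.5, 0.0, 2.0])
y = np.array([3.0, -0.5, 2.0])
print(total_loss(y_hat, y, "sum"))   # 0.5
print(total_loss(y_hat, y, "mean"))  # ~0.167
```

With the mean reduction, doubling the number of samples does not double the loss (or its gradient), which is exactly the scaling behavior described above.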
Note: In machine learning tasks, the input variables are denoted as $\mathbf{x}$ and are typically provided, whereas the parameters to be optimized are represented by $\theta$. The function $J(\theta)$ defines the loss surface, illustrating how the overall loss $J$ changes in response to variations in $\theta$. Optimizing a neural network, such as through gradient descent, fundamentally involves seeking the lowest points or 'valleys' on the $J(\theta)$ surface.
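To make the "seeking valleys" picture concrete, here is a bare-bones gradient descent loop on a toy quadratic surface; the surface $J(\theta) = \theta_1^2 + 2\theta_2^2$, the step size, and the iteration count are assumptions chosen only for illustration:

```python
import numpy as np

def grad_J(theta: np.ndarray) -> np.ndarray:
    """Gradient of a toy loss surface J(theta) = theta_1^2 + 2 * theta_2^2."""
    return np.array([2.0 * theta[0], 4.0 * theta[1]])

theta = np.array([3.0, -2.0])  # an arbitrary starting point on the surface
lr = 0.1                        # learning rate (step size)
for _ in range(100):
    theta = theta - lr * grad_J(theta)  # step downhill toward the nearest valley
print(theta)  # approaches the minimum at [0, 0]
```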
In simpler models such as logistic regression and K-means, which have only a few parameters, the loss functions (NLL for logistic regression, WCSS for clustering) can be visualized directly, offering insight into their behavior and into how gradient descent operates, as shown in the figures below.
https://yyhtbs-yye.github.io/#/plotlyrender?data=https://raw.githubusercontent.com/yyhtbs-yye/plotly_json/refs/heads/main/logistic_unique_minima_cube_scale.json
The Negative Log-Likelihood (NLL) loss function for logistic regression, illustrated as $\text{NLL}(\sigma(w_1 x_1 + w_2 x_2 + b), y)$, depends on just two parameters, $\theta = [w_1, w_2]$, with the bias $b$ held fixed. This configuration results in a loss surface with a single minimum: under gradient descent, every choice of initial parameters converges toward the same bowl-shaped (convex) basin, which is the global minimum. Reaching the global minimum is the goal, as it corresponds to the lowest possible loss across all parameter settings.
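As a sketch of how such a surface can be tabulated, the total NLL can be evaluated over a grid of $(w_1, w_2)$ values with $b$ held fixed; the toy data, grid range, and fixed bias below are assumptions, and the linked figure itself is rendered separately from a prepared JSON file:

```python
import numpy as np

# Toy 2D binary classification data (assumed for illustration).
X = np.array([[0.5, 1.2], [1.5, 0.3], [-1.0, -0.8], [-0.4, -1.5]])
y = np.array([1, 1, 0, 0])
b = 0.0  # bias held fixed so the surface depends only on (w1, w2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_nll(w1, w2):
    """Total NLL over the dataset for a given (w1, w2)."""
    p = np.clip(sigmoid(X @ np.array([w1, w2]) + b), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Evaluate the loss surface on a 50 x 50 grid of parameter values.
w1s, w2s = np.meshgrid(np.linspace(-4, 4, 50), np.linspace(-4, 4, 50))
surface = np.vectorize(total_nll)(w1s, w2s)
print(surface.shape)  # (50, 50) grid of J values forming a single convex bowl
```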
https://yyhtbs-yye.github.io/#/plotlyrender?data=https://raw.githubusercontent.com/yyhtbs-yye/plotly_json/refs/heads/main/kmeans_loss_surf.json
The Within-Cluster Sum of Squares (WCSS) loss function for K-means, shown here for a univariate dataset with $K=2$, exhibits a flower-like shape with two optimal solutions. The petals correspond to different starting points for gradient descent and lead to distinct convergence paths: some descend into the left minimum and others into the right, each reaching a local minimum. Though these regions appear separate, the two solutions are equivalent in practice, differing only in the labeling of the cluster centers.
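Likewise, the WCSS surface for a univariate dataset with two candidate centers $(c_1, c_2)$ can be evaluated directly; swapping $c_1$ and $c_2$ leaves the loss unchanged, which is why the two minima are equivalent up to relabeling (the toy data below are an assumption):

```python
import numpy as np

x = np.array([-2.0, -1.5, -1.8, 1.6, 2.1, 1.9])  # toy univariate data (assumed)

def wcss(c1: float, c2: float) -> float:
    """Within-Cluster Sum of Squares: each point contributes its squared
    distance to the nearest of the two candidate cluster centers."""
    return float(np.minimum((x - c1) ** 2, (x - c2) ** 2).sum())

# The loss is symmetric under swapping the centers, hence two equivalent minima.
print(wcss(-1.8, 1.9), wcss(1.9, -1.8))  # identical values
```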