It is valuable to understand loss surfaces, including their definition and properties such as convexity. It also helps to be familiar with visualizations of low-dimensional loss surfaces for a few example models, because these make it easier to understand intuitively why certain gradient descent algorithms work.

Definition of Loss Surface

The purpose of a loss surface is to describe the relationship between the loss value $J$ and the parameters $\theta$. In other words, it lets us analyse what happens to the loss value $J$ when we change the parameters $\theta$.

$$ J(\theta) = \sum_i L(f(x^{(i)}; \theta), y^{(i)}) $$

Here, the loss function $L$ is usually defined on the model's output $\hat{y}$ and the label $y$, i.e., $L(\hat{y}, y)$, while the loss surface $J(\theta)$ is defined on the parameters $\theta$.
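As a minimal sketch of this definition, the value of $J$ at one parameter setting is simply the sum of per-example losses. The model $f$ and loss $L$ below are toy placeholders (a linear model with squared error), not anything fixed by the text above:

```python
import numpy as np

def loss_surface_value(theta, X, y, f, L):
    """Evaluate J(theta) = sum_i L(f(x_i; theta), y_i) for one parameter setting."""
    return sum(L(f(x_i, theta), y_i) for x_i, y_i in zip(X, y))

# Toy linear model and squared-error loss (illustrative choices).
f = lambda x, theta: theta[0] * x + theta[1]      # f(x; w, b) = w*x + b
L = lambda y_hat, y: (y_hat - y) ** 2             # squared error
X = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 3.0, 5.0])

print(loss_surface_value([2.0, 1.0], X, y, f, L))  # J at theta = [w=2, b=1] -> 0.0
```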

Visualization of Loss Surface

Visualizing the loss surface usually involves sampling each parameter $w_1, w_2, \cdots, w_k \in \theta$ within a certain range. For instance, if we have two parameters $w_1$ and $w_2$, and each parameter is sampled at 10 different values, we end up with $10 \times 10$ combinations of candidate parameters, where each candidate is a vector of 2 parameter values.

By summing the loss $\sum_i L(f(x^{(i)}; w_1, w_2), y^{(i)})$ over all input-output pairs $(x^{(i)}, y^{(i)})$ for a given candidate, we obtain the loss value $J$ for that specific candidate $[w_1, w_2]$. These values can be organized as triples $(w_1, w_2, J)$ and visualized in 3D space, where the x-axis represents $w_1$, the y-axis represents $w_2$, and the z-axis represents $J$.
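Here is a minimal sketch of that grid procedure, using a toy two-parameter linear model $f(x; w_1, w_2) = w_1 x + w_2$ and squared-error loss (both illustrative assumptions), together with a 3D plot of the resulting $(w_1, w_2, J)$ triples:

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data for a two-parameter model f(x; w1, w2) = w1 * x + w2 (illustrative choice).
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Sample each parameter at 10 values -> 10 x 10 candidate combinations.
w1_grid, w2_grid = np.meshgrid(np.linspace(-1, 5, 10), np.linspace(-2, 4, 10))

# For every candidate [w1, w2], sum the squared-error loss over all (x_i, y_i) pairs.
J = np.zeros_like(w1_grid)
for i in range(w1_grid.shape[0]):
    for j in range(w1_grid.shape[1]):
        y_hat = w1_grid[i, j] * X + w2_grid[i, j]
        J[i, j] = np.sum((y_hat - y) ** 2)

# Visualize the (w1, w2, J) triples as a 3D surface.
ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(w1_grid, w2_grid, J, cmap="viridis")
ax.set_xlabel("$w_1$"); ax.set_ylabel("$w_2$"); ax.set_zlabel("$J$")
plt.show()
```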

Loss Surface Visualization Example

For some simpler models with only two or three parameters, such as logistic regression or the K-means algorithm, we can easily visualize the loss function. Through these visualizations, we can gain insight into its characteristics.

Consider the Negative Log-Likelihood (NLL) loss function $\text{NLL}(\sigma(w\cdot x+b), y)$ for a logistic regression with just two parameters $\theta=[w, b]$, shown on the left in the figure below. The loss surface has a single basin. Regardless of the initial parameter values, gradient descent will eventually converge to this sole basin. Its lowest point is referred to as the global minimum because, among all possible parameter combinations, only this region attains the minimum loss value.

Figure: loss surfaces for logistic regression NLL (left) and K-means WCSS with $K=2$ (right). https://prod-files-secure.s3.us-west-2.amazonaws.com/98e1154f-7d91-43c3-9865-4920a452df39/192efd7f-1ece-483f-98ae-e84138a1ae92/newplot_(2)_(1).png
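A sketch that produces this kind of NLL surface for a toy 1-D dataset follows; the data values and grid ranges are illustrative assumptions, not the ones behind the figure above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy 1-D binary classification data (illustrative values).
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def nll(w, b):
    """Negative log-likelihood of logistic regression with parameters (w, b)."""
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))          # sigma(w*x + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Evaluate the NLL over a grid of (w, b) values and plot the surface.
w_grid, b_grid = np.meshgrid(np.linspace(-1, 6, 50), np.linspace(-4, 4, 50))
J = np.vectorize(nll)(w_grid, b_grid)

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(w_grid, b_grid, J, cmap="viridis")
ax.set_xlabel("$w$"); ax.set_ylabel("$b$"); ax.set_zlabel("NLL")
plt.show()
```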

Now consider a univariate dataset and the K-means algorithm with $K=2$. Visualizing its Within-Cluster Sum of Squares (WCSS) loss function involves sampling the positions of the two cluster centres, as shown on the right side of the figure above. The surface resembles a flower and contains two optimal solutions (see the two small circles at the centre of the contour plot). Consequently, gradient descent-based algorithms may converge to different results depending on the starting point: starting in the left petal may lead to the left basin (a local minimum), while starting in the right petal may lead to the right basin (another local minimum). Although these two basins represent the same solution in practical terms (they differ only by swapping the cluster-centre indices), they are mathematically distinct.
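Similarly, the WCSS surface can be sketched for a hypothetical univariate dataset with two well-separated groups (again, data values and grid ranges are assumptions); the contour plot exposes the two symmetric minima described above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Univariate data forming two well-separated groups (illustrative values).
x = np.array([0.0, 0.2, 0.4, 2.6, 2.8, 3.0])

def wcss(c1, c2):
    """Within-Cluster Sum of Squares: each point is assigned to its nearest centre."""
    d = np.minimum((x - c1) ** 2, (x - c2) ** 2)
    return np.sum(d)

# Evaluate WCSS over a grid of centre positions (c1, c2).
c1_grid, c2_grid = np.meshgrid(np.linspace(-1, 4, 100), np.linspace(-1, 4, 100))
J = np.vectorize(wcss)(c1_grid, c2_grid)

# The contour plot shows two symmetric minima near (0.2, 2.8) and (2.8, 0.2).
plt.contour(c1_grid, c2_grid, J, levels=30)
plt.xlabel("$c_1$"); plt.ylabel("$c_2$")
plt.show()
```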

Gradient Descent is not about Visualization

In actual gradient descent optimization for neural networks, we don't comprehensively sample the entire parameter space; instead, we track how a single parameter vector changes from one update to the next. Moreover, neural networks typically have a very high-dimensional loss surface because of their numerous parameters, making it impractical to compute a complete, analyzable loss surface in real-world scenarios. This underscores the need to approach loss surface visualizations with caution and recognize their limitations.
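To make the contrast concrete, here is a minimal gradient-descent sketch on the same kind of toy linear model used earlier (an illustrative assumption): it only ever evaluates the loss gradient at the current parameter vector, never on a grid.

```python
import numpy as np

# Toy data and squared-error loss for f(x; w, b) = w*x + b (illustrative choices).
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Gradient descent follows a single parameter vector; it never samples the whole grid.
theta = np.array([0.0, 0.0])          # one starting point [w, b]
lr = 0.02
for step in range(500):
    y_hat = theta[0] * X + theta[1]
    grad = np.array([np.sum(2 * (y_hat - y) * X),   # dJ/dw
                     np.sum(2 * (y_hat - y))])      # dJ/db
    theta -= lr * grad                # move downhill from the current point only

print(theta)   # approaches [2.0, 1.0] without ever evaluating J on a grid
```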

Convex and Non-Convex Functions

The two loss surfaces mentioned earlier represent different types of functions: convex and non-convex. Let's clarify their definitions.

To begin, let's define a convex function. A function $f(x)$ is considered convex if it satisfies the following inequality for any two points $x_1$ and $x_2$ and any value of $\lambda$ between 0 and 1:

$$ f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2) $$

Intuitively, this means that any chord (a straight line segment connecting two points on the function's graph) lies on or above the graph between those two points.
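As a quick numerical illustration of this definition, the hypothetical helper below checks the inequality on sampled point pairs; note that sampling can only refute convexity, never prove it.

```python
import numpy as np

def is_convex_on_samples(f, xs, lambdas):
    """Check the convexity inequality on sampled point pairs (can only refute, not prove)."""
    for x1 in xs:
        for x2 in xs:
            for lam in lambdas:
                lhs = f(lam * x1 + (1 - lam) * x2)
                rhs = lam * f(x1) + (1 - lam) * f(x2)
                if lhs > rhs + 1e-12:
                    return False
    return True

xs = np.linspace(-3, 3, 25)
lambdas = np.linspace(0, 1, 11)
print(is_convex_on_samples(lambda x: x ** 2, xs, lambdas))     # True: x^2 is convex
print(is_convex_on_samples(lambda x: np.sin(x), xs, lambdas))  # False: sin is not convex
```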