When discussing the optimization of neural networks, we often mention the challenges of "non-convex optimization." Many people wonder why it is so important to emphasize whether an optimization problem is "convex." What makes convexity special?

In short, if an optimization problem is "convex," it has a particularly favorable structure: in theory (or under suitable conditions), any local optimum of a convex optimization problem is also a global optimum, and under strict convexity that optimum is unique. If the problem is "non-convex," the situation is far less straightforward: we may encounter multiple local optima, saddle points, and other complications.

The training of neural networks fundamentally involves optimizing a high-dimensional, multivariable function. Due to the network structure (such as the nonlinear activation functions in multilayer perceptrons), this optimization problem is often non-convex. Non-convex optimization presents many challenges, such as how to avoid getting trapped in poor local optima and how to better utilize Stochastic Gradient Descent (SGD) or its variants.

To fully understand all this, we must start with the concept of "convexity" itself: clarifying what "convex sets" and "convex functions" are, and why they make an optimization problem so well behaved that any local optimum is also a global optimum. Afterwards, we can explore what difficulties arise when this convex structure is broken and we enter the "non-convex" world, along with potential remedies.

Convex Function and Optimization

Definition of a Convex Function: A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if, for any $w, u \in \mathbb{R}^n$ and $\alpha \in [0,1]$, the following condition (the two-point form of Jensen's inequality) holds:

$$ f(\alpha w + (1 - \alpha) u) \leq \alpha f(w) + (1 - \alpha) f(u). $$

Geometrically, this means that the graph of the function always lies on or below the line segment (chord) connecting any two points on the graph. In other words, if you draw a straight line between any two points on the function's graph, the entire segment lies above or coincides with the graph. A common example is the quadratic function $f(w) = w^2$ in one dimension, which is convex; you can visualize it as having a "bowl" shape, as shown in the illustration on the left.

[Figure: left, a convex function (a bowl-shaped quadratic) whose chords lie on or above the graph; right, a non-convex function for which some chords dip below the graph.]

In contrast, for a non-convex function, as shown in the right figure, a chord connecting two points on the graph does not necessarily lie above the function.
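To make the definition concrete, here is a minimal numerical sketch (the functions and sample points are my own choices for illustration): it confirms that the inequality holds everywhere for the convex $f(w) = w^2$ and exhibits a point where a non-convex function such as $\sin(w)$ violates it.

```python
import numpy as np

def convexity_gap(f, w, u, alpha):
    """Chord value minus function value; a negative gap violates the definition."""
    return alpha * f(w) + (1 - alpha) * f(u) - f(alpha * w + (1 - alpha) * u)

rng = np.random.default_rng(0)

# f(w) = w^2 is convex: the gap is never negative (it equals alpha*(1-alpha)*(w-u)^2).
quad_gaps = [convexity_gap(lambda w: w**2,
                           rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(0, 1))
             for _ in range(10_000)]
print(min(quad_gaps) >= -1e-12)   # True

# sin(w) is non-convex: w = 0, u = 1.5*pi, alpha = 0.5 gives a negative gap.
print(convexity_gap(np.sin, 0.0, 1.5 * np.pi, 0.5))  # about -1.21 -> inequality violated
```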

Benefits of Convex Functions: If we want to minimize a differentiable convex function, any point where the gradient is zero (or, with constraints, where suitable optimality conditions hold) is guaranteed to be a global optimum. If constraints are present and the feasible region is also convex, the entire problem becomes a convex optimization problem, for which there is a mature theory and a well-developed toolbox of algorithms. Typical algorithms include gradient descent and Newton's method.
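As a quick sketch of this behavior (the quadratic, step size, and starting points below are arbitrary choices), plain gradient descent on a convex quadratic reaches the same global minimum from every starting point:

```python
import numpy as np

# Convex quadratic f(w) = 0.5 * w^T A w - b^T w with A positive definite.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

w_star = np.linalg.solve(A, b)                # unique global minimizer (gradient = 0)

for start in ([10.0, 10.0], [-7.0, 3.0]):
    w = np.array(start)
    for _ in range(500):
        w -= 0.1 * grad(w)                    # fixed step size, small enough for this A
    print(np.allclose(w, w_star, atol=1e-6))  # True from both starting points
```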

Convex Optimization and Machine Learning: In traditional machine learning models (such as linear regression and logistic regression), optimization is usually convex, and solutions can often be found using classical optimizers or even direct analytical methods. Thus, there is less concern about getting trapped in local optima. However, this is not the case with neural networks: due to complex network structures and nonlinear activation functions, their objective functions are often non-convex, leading to non-convex optimization challenges.
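For example, ordinary least-squares linear regression is a convex problem whose solution can be written down in closed form; the sketch below (on synthetic data invented for illustration) recovers the true coefficients directly, with no risk of landing in a bad local optimum:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = 2*x1 - 3*x2 + 1 + noise.
X = rng.normal(size=(200, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.1 * rng.normal(size=200)

# Append an intercept column and solve the (convex) least-squares problem analytically.
X_aug = np.hstack([X, np.ones((200, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(theta)  # close to [2, -3, 1]
```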

Non-Convex Optimization and Neural Networks

Understanding Non-Convex Functions: Non-convex functions may feature multiple local optima as well as saddle points, i.e., points where the gradient is zero but the surface curves upward in some directions and downward in others. These features mean that finding a global optimum takes more than simply "going downhill": a purely downhill method can easily get stuck in a poor local minimum and never reach the global minimum.
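A standard two-dimensional illustration (the function is chosen here purely as an example) is $f(x, y) = x^2 - y^2$: the gradient vanishes at the origin, yet the Hessian has one positive and one negative eigenvalue, so the origin is a saddle point rather than a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2 has zero gradient at the origin, but the origin is not a minimum.
def grad(x, y):
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])

print(grad(0.0, 0.0))               # [0. 0.] -> a stationary point
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> mixed signs: a saddle, not a minimum
```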

Plateaus pose a related but slightly different problem. Consider the Gaussian $f(x, y) = e^{-(x^2 + y^2)}$: as we move away from the origin $(0,0)$, the function approaches zero and so does its gradient. Such a region is often referred to as a "plateau." If the initial values of $x$ and $y$ are large, each step of a gradient-based method is tiny, so the algorithm makes very slow progress.

https://yyhtbs-yye.github.io/#/plotlyrender?data=https://raw.githubusercontent.com/yyhtbs-yye/plotly_json/refs/heads/main/subplot_4_function_plot_px800.json
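The short sketch below (the sample distances are arbitrary) evaluates the gradient norm of this Gaussian at increasing distances from the origin; far from $(0, 0)$ the gradient is numerically negligible, so a gradient-based method starting there makes almost no progress per step:

```python
import numpy as np

# f(x, y) = exp(-(x^2 + y^2)); its gradient is -2 * (x, y) * f(x, y).
def grad_norm(x, y):
    f = np.exp(-(x**2 + y**2))
    return np.linalg.norm(-2 * f * np.array([x, y]))

for r in [0.5, 1.0, 2.0, 4.0, 8.0]:
    print(f"distance {r}: |grad| = {grad_norm(r, 0.0):.2e}")
# The gradient norm collapses from roughly 1e0 down to about 1e-27:
# on the plateau, each gradient step barely moves the parameters.
```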

In the high-dimensional parameter space of neural networks, the "terrain" can be unimaginably complex: there are thousands of "valleys," along with countless twisted hills or cliffs. Finding a relatively good solution in such complex terrain is a major challenge in neural network training.

How to Prove a Neural Network Is Non-Convex

A typical MLP consists of multiple layers of linear transformations followed by nonlinear activations. Suppose we use a loss function $\mathcal{L}(\theta)$ to measure the error of network parameters $\theta$ on the training data. Because of the activation functions (such as ReLU, Sigmoid, or Tanh), the network output is no longer a linear function of the parameters, and $\mathcal{L}(\theta)$ is generally not convex. There are two approaches to proving this, the first simpler and the second more involved:

  1. (Definition) Identify two specific sets of parameters $\mathbf{w}^{(a)}$ and $\mathbf{w}^{(b)}$ such that

    $$ f\!\left(\frac{\mathbf{w}^{(a)} + \mathbf{w}^{(b)}}{2}\right) > \frac{f(\mathbf{w}^{(a)}) + f(\mathbf{w}^{(b)})}{2}. $$

    If we can find $\mathbf{w}^{(a)}$ and $\mathbf{w}^{(b)}$ satisfying this condition, then Jensen's inequality fails for this pair, so $f$ is not a convex function. A concrete construction for a tiny network is sketched below.
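As a concrete (purely illustrative) construction, take a one-hidden-unit network $f_\theta(x) = v \tanh(wx)$ with squared loss on the single example $(x, y) = (1, 1)$. Because of the sign symmetry $(w, v) \mapsto (-w, -v)$, the two parameter settings below achieve the same near-zero loss, while their midpoint is $(0, 0)$, where the loss equals 1; this violates the midpoint inequality above, so the loss is non-convex.

```python
import numpy as np

def loss(w, v, x=1.0, y=1.0):
    """Squared loss of the tiny network f(x) = v * tanh(w * x) on one example."""
    return (v * np.tanh(w * x) - y) ** 2

# Two sign-flipped parameter settings that produce the same output (and the same loss).
w_a = np.array([2.0, 1.0 / np.tanh(2.0)])   # (w, v) chosen so the loss is ~0
w_b = -w_a                                  # tanh is odd, so the output is unchanged

mid = (w_a + w_b) / 2                       # the midpoint is exactly (0, 0)

avg_loss = (loss(*w_a) + loss(*w_b)) / 2    # ~0
mid_loss = loss(*mid)                       # = 1, since the network outputs 0
print(mid_loss > avg_loss)                  # True -> the convexity inequality is violated
```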