The two loss surfaces mentioned earlier represent different types of functions: convex and non-convex. Let's clarify their definitions.
To begin, let's define a convex function. A function $f(x)$ is considered convex if it satisfies the following inequality for any two points $x_1$ and $x_2$ and any value of $\lambda$ between 0 and 1:
$$ f(\lambda x_1 + (1-\lambda) x_2) \leq \lambda f(x_1) + (1-\lambda) f(x_2) $$
Intuitively, this means that any chord (a straight line segment connecting two points on the function's graph) lies on or above the graph between those two points.
A consequence is that a convex function cannot have two separate minima with a rise between them: the chord connecting those two minimum points would pass below the bump, so the function would lie above the chord there, violating the inequality above. Every local minimum of a convex function is therefore a global minimum, and a strictly convex function has at most one minimum point.
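As a quick illustration (not part of the formal argument), the chord inequality can be spot-checked numerically. The sketch below, with function choices and sample ranges of our own, draws random pairs $(x_1, x_2)$ and values of $\lambda$, then tests the condition for a convex function ($x^2$) and for a non-convex one ($x^4 - 3x^2$, which has two separate minima). Passing such a test does not prove convexity, but a single failure disproves it.

```python
import numpy as np

def chord_condition_holds(f, lo=-3.0, hi=3.0, n_trials=10_000, seed=0):
    """Randomly test f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2)."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(lo, hi, n_trials)
    x2 = rng.uniform(lo, hi, n_trials)
    lam = rng.uniform(0.0, 1.0, n_trials)
    lhs = f(lam * x1 + (1 - lam) * x2)
    rhs = lam * f(x1) + (1 - lam) * f(x2)
    return bool(np.all(lhs <= rhs + 1e-12))  # small tolerance for rounding error

print(chord_condition_holds(lambda x: x**2))           # True: x^2 is convex
print(chord_condition_holds(lambda x: x**4 - 3*x**2))  # False: two separate minima
```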
(Optional) Second-Order Criterion for Determining Convex Functions
For twice-differentiable functions there is a convenient second-order criterion: the function is convex if and only if its second derivative, or in the multivariate case its Hessian matrix, is non-negative (positive semi-definite) everywhere. In one dimension this reads $f''(x) \geq 0$ for all $x$: the curvature never bends downward, which guarantees the bowl-like shape of a convex graph.
More specifically, we can determine the convexity of a multivariate function by checking whether its Hessian matrix is positive semi-definite (a positive definite Hessian implies strict convexity). The Hessian matrix $H$ is a square matrix composed of all second partial derivatives of the function $f$. For a bivariate function $f(w_1,w_2)$, its Hessian matrix is:
$$ H = \begin{bmatrix}\frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} \\\frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2^2}\end{bmatrix} $$
A common method for determining whether a matrix is positive (semi-)definite is to compute its eigenvalues. If all eigenvalues of the Hessian matrix $H$ are non-negative at every point, then $H$ is positive semi-definite and the function $f$ is convex; if they are all strictly positive, $H$ is positive definite and $f$ is strictly convex. This is particularly significant for optimization problems involving multivariate functions, because it lets us check whether a loss surface has a single minimum that is also the global minimum.
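As a minimal sketch of this eigenvalue test (the quadratic below is simply an example we picked), consider $f(w_1, w_2) = w_1^2 + w_1 w_2 + 2 w_2^2$: its Hessian is constant, and both eigenvalues are positive, so the function is strictly convex.

```python
import numpy as np

# Hessian of f(w1, w2) = w1^2 + w1*w2 + 2*w2^2 (constant everywhere):
# d^2f/dw1^2 = 2, d^2f/dw1 dw2 = 1, d^2f/dw2^2 = 4
H = np.array([[2.0, 1.0],
              [1.0, 4.0]])

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh is suited to symmetric matrices
print(eigenvalues)                   # approximately [1.586, 4.414]

if np.all(eigenvalues > 0):
    print("H is positive definite -> f is strictly convex")
elif np.all(eigenvalues >= 0):
    print("H is positive semi-definite -> f is convex")
```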
Proof: The NLL Loss of Logistic Regression Is a Convex Function
In logistic regression, the model predicts the probability of a binary outcome $y \in \{0, 1\}$ given input $x$. The predicted probability is modeled using the logistic (sigmoid) function:
$$ \hat{y}=p(y = 1 | x; \theta) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)} $$
where $\theta$ represents the weights (parameters) of the model, and $x$ is the input feature vector. The probability of $y = 0$ is simply $1 - p(y = 1 | x; \theta)$.
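In code, the prediction is just a dot product followed by the sigmoid. The sketch below is a minimal NumPy version; the particular values of $\theta$ and $x$ are arbitrary placeholders.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2, 0.3])  # example parameters (arbitrary values)
x = np.array([1.0, 2.0, -0.5])      # example feature vector (arbitrary values)

p_y1 = sigmoid(theta @ x)           # p(y = 1 | x; theta)
print(p_y1, 1.0 - p_y1)             # probabilities of y = 1 and y = 0
```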
The Negative Log-Likelihood (NLL) loss function for logistic regression is defined as follows for a single data point $(x^{(i)}, y^{(i)})$:
$$ L(\theta) = - y^{(i)} \log(\sigma(\theta^T x^{(i)})) - (1 - y^{(i)}) \log(1 - \sigma(\theta^T x^{(i)})) $$
where $\sigma(z)$ is the logistic sigmoid function.
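The per-example loss can be evaluated directly from this formula. The following sketch is again only illustrative, reusing the same arbitrary $\theta$ and $x$ as above.

```python
import numpy as np

def nll_single(theta, x, y):
    """Negative log-likelihood of one example (x, y) under logistic regression."""
    p = 1.0 / (1.0 + np.exp(-(theta @ x)))         # sigma(theta^T x)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

theta = np.array([0.5, -1.2, 0.3])  # example parameters (arbitrary)
x = np.array([1.0, 2.0, -0.5])      # example feature vector (arbitrary)

print(nll_single(theta, x, y=1))    # loss if the true label is 1
print(nll_single(theta, x, y=0))    # loss if the true label is 0
```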
To prove that this loss function is convex, we need to show that the second derivative (the Hessian matrix in the multivariate case) of the loss function is positive semi-definite.
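Before the analytic derivation, the claim can be sanity-checked numerically: approximate the Hessian of the single-example NLL with central finite differences and inspect its eigenvalues. The sketch below uses arbitrarily chosen $\theta$, $x$, and $y$; because the single-example Hessian has rank one, its zero eigenvalues may show up as tiny values of either sign due to numerical error.

```python
import numpy as np

def nll(theta, x, y):
    p = 1.0 / (1.0 + np.exp(-(theta @ x)))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def numerical_hessian(f, theta, eps=1e-4):
    """Approximate the Hessian of f at theta with central finite differences."""
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i, e_j = np.eye(d)[i], np.eye(d)[j]
            H[i, j] = (f(theta + eps*e_i + eps*e_j) - f(theta + eps*e_i - eps*e_j)
                       - f(theta - eps*e_i + eps*e_j) + f(theta - eps*e_i - eps*e_j)) / (4 * eps**2)
    return H

theta = np.array([0.5, -1.2, 0.3])    # arbitrary parameters
x, y = np.array([1.0, 2.0, -0.5]), 1  # arbitrary data point

H = numerical_hessian(lambda t: nll(t, x, y), theta)
print(np.linalg.eigvalsh(H))          # all eigenvalues >= 0 (up to numerical error)
```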
Step 1: First Derivative of NLL
The first step is to compute the gradient (first derivative) of the NLL loss with respect to the parameter vector $\theta$.
Recall that: