We have previously learned about the 'Gradient Descent' algorithm. However, it seems that neural networks commonly use a similar algorithm called 'Stochastic Gradient Descent' (SGD). What are the differences between them?

Our prior discussion focused on using Gradient Descent to solve the following problem:

$$ \min_{w_1, w_2} \frac{1}{N}\sum_{i=1}^{N}{L\left(w_2\sigma\left(w_1x^{(i)}\right), y^{(i)}\right)} $$

In this objective, all $N$ samples in the training dataset are used to compute the gradient. Gradient descent applied to this objective is known as 'Batch Gradient Descent', where 'batch' refers to using the entire dataset for each gradient computation.
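To make this concrete, below is a minimal NumPy sketch of batch gradient descent on this objective, with $w_1$ and $w_2$ treated as scalars (as in the visualization example later in this post). The choice of sigmoid for $\sigma$, cross-entropy for $L$, and the learning rate and starting point are assumptions for illustration, not something fixed by the discussion above.

```python
# A minimal sketch of batch gradient descent on the objective above.
# Assumptions: sigma = sigmoid, L = binary cross-entropy, scalar w1, w2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w1, w2, x, y):
    """Average loss over the given samples and its gradients w.r.t. w1, w2."""
    h = sigmoid(w1 * x)                      # hidden activation sigma(w1 * x)
    p = sigmoid(w2 * h)                      # predicted probability
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    dz2 = p - y                              # dL/d(w2*h) for sigmoid + cross-entropy
    dw2 = np.mean(dz2 * h)
    dw1 = np.mean(dz2 * w2 * h * (1 - h) * x)
    return loss, dw1, dw2

def batch_gradient_descent(x, y, eta=0.5, steps=500):
    w1, w2 = 0.1, 0.1                        # assumed starting point
    for _ in range(steps):
        _, dw1, dw2 = loss_and_grads(w1, w2, x, y)  # gradient over ALL samples
        w1 -= eta * dw1                      # theta := theta - eta * grad
        w2 -= eta * dw2
    return w1, w2
```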

A Severe Issue with Batch Gradient Descent: Because neural network training is not a convex optimization problem, a greedy approach will, in theory, most likely end up in a suboptimal local minimum. If the loss surface and the algorithm are fixed, the starting point fully determines the endpoint. Neural networks have many local minima, and while restarting from different initial points to avoid them is theoretically possible, it is not practical, especially when each training run of the network takes a long time.
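As a toy illustration (not taken from the original discussion) of how the starting point determines the endpoint, plain gradient descent on a simple non-convex one-dimensional function converges to whichever local minimum lies downhill from its start:

```python
# f(x) = x**4 - 3*x**2 + x has two local minima; plain gradient descent
# ends in a different one depending only on where it starts.
def f_grad(x):
    return 4 * x**3 - 6 * x + 1    # derivative of x**4 - 3*x**2 + x

def gradient_descent(x0, eta=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= eta * f_grad(x)
    return x

print(gradient_descent(-2.0))   # converges to the left (deeper) minimum
print(gradient_descent(+2.0))   # converges to the right (shallower) minimum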


Stochastic: Imperfect Loss Surface

A straightforward way to introduce variability into the optimization process is to add some randomness to gradient descent. This is exactly what stochastic gradient descent (SGD) does: the fundamental mechanism (the learning rate $\eta$ and the update rule $\theta := \theta - \eta\nabla$) remains intact, but unlike batch gradient descent, which computes gradients over the entire dataset, SGD perturbs the loss surface, and consequently the gradient $\nabla$, by using a different subset of the dataset at each iteration.
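A minimal sketch of this stochastic variant, reusing the hypothetical `loss_and_grads` helper from the batch sketch above; the batch size of 64 and the other hyperparameters are assumptions:

```python
# Same update rule theta := theta - eta * grad, but each step estimates the
# gradient from a random mini-batch instead of the full dataset.
import numpy as np

def sgd(x, y, eta=0.5, steps=500, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    w1, w2 = 0.1, 0.1
    for _ in range(steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # random subset
        _, dw1, dw2 = loss_and_grads(w1, w2, x[idx], y[idx])      # noisy gradient
        w1 -= eta * dw1
        w2 -= eta * dw2
    return w1, w2
```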

This variation in the loss surface arises because each subset of the dataset approximates the overall distribution differently. When only a part of the dataset is employed for gradient computation, the resulting loss surface reflects the characteristics of that subset rather than the entire data distribution. This can lead to a different gradient estimation in each iteration, introducing noise into the optimization process. For instance, employing just 64 samples for gradient calculation means the loss surface and gradient are estimated based on the information provided by these 64 samples only.

This stochastic nature of the gradient approximation introduces variability, which can help in escaping local minima and provides a form of regularization, making the model less likely to overfit to the training data. The cumulative effect of these variations over many iterations can lead to a more robust and generalized model by exploring a broader range of solutions in the parameter space.

This design proves advantageous because gradient descent is greedy by nature, aiming for the steepest descent based on the immediate gradient without regard for longer-term outcomes. Given the highly irregular loss surface typical of neural networks, a purely greedy approach starting from an inopportune initial point often converges to a suboptimal solution. By computing gradients on random subsets, SGD introduces imperfect, noisy gradients into the optimization process. This randomness, rather than being purely disruptive, can be beneficial: it occasionally lets the algorithm bypass poor local minima that would ensnare a more deterministic approach, potentially guiding it toward better solutions. Leveraging stochasticity to escape suboptimal attractors reflects the balance between exploration and exploitation in the search for good minima on the complex loss landscape of neural network training.

Visualization Example

In this example, we generate a one-dimensional binary classification dataset $(x, y)$ in code, where some samples $x$ belong to class 0 and others to class 1. The histogram of these data is shown below:

[Figure: histogram of the generated one-dimensional, two-class data]
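The exact data-generating code is not shown here; the sketch below produces a comparable dataset of 2,000 one-dimensional samples drawn from two overlapping Gaussians (the means, spreads, and even split between classes are assumptions):

```python
# Generate 2,000 one-dimensional samples: class 0 and class 1 drawn from
# two overlapping Gaussians, then plot their histograms.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.0, scale=1.0, size=1000)   # class 0
x1 = rng.normal(loc=+1.5, scale=1.0, size=1000)   # class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

plt.hist(x0, bins=40, alpha=0.6, label="class 0")
plt.hist(x1, bins=40, alpha=0.6, label="class 1")
plt.xlabel("x")
plt.legend()
plt.show()
```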

Note that we generate one-dimensional data because higher-dimensional data would require more parameters, and we would then be unable to visualize them. Within the limits of human vision and a naïve visualization design, we can show at most two parameters plus the loss value in a three-dimensional plot.

We simplify $w_1$ and $w_2$ to scalars and directly enumerate different parameter combinations to construct the loss surface, as shown in the following figure. The left plot shows the loss surface computed over all 2,000 samples, while the right plot shows the loss surface computed over a randomly selected subset of 64 samples.

[Figure: loss surfaces. Left: computed over all 2,000 samples. Right: computed over a random subset of 64 samples]
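A sketch of how such surfaces could be enumerated, reusing the hypothetical `loss_and_grads` helper and the $(x, y)$ data from the earlier sketches (the grid range and resolution are assumptions):

```python
# Evaluate the average loss on a grid of (w1, w2) values, once on all
# samples and once on a single random 64-sample subset.
import numpy as np

w1_grid = np.linspace(-4, 4, 81)
w2_grid = np.linspace(-4, 4, 81)

def loss_surface(x, y):
    surface = np.zeros((len(w1_grid), len(w2_grid)))
    for i, w1 in enumerate(w1_grid):
        for j, w2 in enumerate(w2_grid):
            surface[i, j] = loss_and_grads(w1, w2, x, y)[0]  # keep only the loss
    return surface

full_surface = loss_surface(x, y)                             # all 2,000 samples
idx = np.random.default_rng(1).choice(len(x), 64, replace=False)
subset_surface = loss_surface(x[idx], y[idx])                 # one 64-sample subset
```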

Comparing the gradient obtained from the 64-sample loss surface with the ideal gradient derived from the entire dataset, the former can be viewed as the optimal gradient direction plus a random perturbation, because the subset is an imperfect representation of the full dataset.

[Figure: gradient direction from the full dataset versus the gradient direction from a 64-sample subset]

The top-left region has a lower loss on the full (perfect) loss surface, yet the perfect gradient computed from the entire dataset would steer the parameters $w_1$ and $w_2$ in the wrong direction at this point. Selecting 64 random samples for each gradient computation can instead be viewed as taking the optimal adjustment under the full-dataset loss and adding a random deviation; at this step, the stochastic gradient yields a better update of the two parameters.
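The sketch below illustrates this comparison numerically: at one fixed point $(w_1, w_2)$, it contrasts the full-data gradient with the gradients from several random 64-sample subsets (the chosen point and random seed are assumptions, and it reuses `loss_and_grads` and the $(x, y)$ data from the earlier sketches):

```python
# At the same point (w1, w2), each mini-batch gradient is roughly the
# full-data gradient plus random noise; its direction can differ from
# (and occasionally improve on) the full-data one.
import numpy as np

w1, w2 = 0.5, -0.5                      # an arbitrary current parameter value
_, g1_full, g2_full = loss_and_grads(w1, w2, x, y)

rng = np.random.default_rng(2)
for _ in range(5):
    idx = rng.choice(len(x), 64, replace=False)
    _, g1_b, g2_b = loss_and_grads(w1, w2, x[idx], y[idx])
    print("full:", (g1_full, g2_full), "batch:", (g1_b, g2_b))
```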