We have previously learned about the 'Gradient Descent' algorithm. However, it seems that neural networks commonly use a similar algorithm called 'Stochastic Gradient Descent' (SGD). What are the differences between them?

Our prior discussion focused on using Gradient Descent to solve the following problem:

$$ \min_{w_1, w_2} \frac{1}{N}\sum_{i=1}^{N}{L\left(w_2\sigma\left(w_1x^{(i)}\right), y^{(i)}\right)} $$

In this objective, all $N$ samples in the training dataset are used to compute the gradient. Gradient descent applied to this objective is known as 'Batch Gradient Descent', where 'batch' refers to using the entire dataset for each gradient computation.
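To make this concrete, below is a minimal NumPy sketch of batch gradient descent on this objective, with $w_1$ and $w_2$ treated as scalars (as in the visualization example later in this post). The choice of sigmoid for $\sigma$, cross-entropy for $L$, and the learning rate and starting point are assumptions for illustration, not something fixed by the discussion above.

```python
# A minimal sketch of batch gradient descent on the objective above.
# Assumptions: sigma = sigmoid, L = binary cross-entropy, scalar w1, w2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(w1, w2, x, y):
    """Average loss over the given samples and its gradients w.r.t. w1, w2."""
    h = sigmoid(w1 * x)                      # hidden activation sigma(w1 * x)
    p = sigmoid(w2 * h)                      # predicted probability
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    dz2 = p - y                              # dL/d(w2*h) for sigmoid + cross-entropy
    dw2 = np.mean(dz2 * h)
    dw1 = np.mean(dz2 * w2 * h * (1 - h) * x)
    return loss, dw1, dw2

def batch_gradient_descent(x, y, eta=0.5, steps=500):
    w1, w2 = 0.1, 0.1                        # assumed starting point
    for _ in range(steps):
        _, dw1, dw2 = loss_and_grads(w1, w2, x, y)  # gradient over ALL samples
        w1 -= eta * dw1                      # theta := theta - eta * grad
        w2 -= eta * dw2
    return w1, w2
```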

A Severe Issue with Batch Gradient Descent: Because neural network training is not a convex optimization problem, a greedy approach will, in theory, most likely end up in a suboptimal local minimum. If the loss surface and the algorithm are fixed, the starting point fully determines the endpoint. Neural networks have many local minima, and while restarting from different initial points to avoid them is theoretically possible, it is not practical, especially when each training run of the network takes a long time.
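As a toy illustration (not taken from the original discussion) of how the starting point determines the endpoint, plain gradient descent on a simple non-convex one-dimensional function converges to whichever local minimum lies downhill from its start:

```python
# f(x) = x**4 - 3*x**2 + x has two local minima; plain gradient descent
# ends in a different one depending only on where it starts.
def f_grad(x):
    return 4 * x**3 - 6 * x + 1    # derivative of x**4 - 3*x**2 + x

def gradient_descent(x0, eta=0.01, steps=2000):
    x = x0
    for _ in range(steps):
        x -= eta * f_grad(x)
    return x

print(gradient_descent(-2.0))   # converges to the left (deeper) minimum
print(gradient_descent(+2.0))   # converges to the right (shallower) minimum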


Stochastic: Imperfect Loss Surface

A straightforward way to introduce variability into the optimization process is to add some randomness to gradient descent. This is exactly what stochastic gradient descent (SGD) does: the fundamental mechanism (the learning rate $\eta$ and the update rule $\theta := \theta - \eta\nabla$) remains intact, but unlike batch gradient descent, which computes gradients over the entire dataset, SGD perturbs the loss surface, and consequently the gradient $\nabla$, by using a different subset of the dataset at each iteration.
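A minimal sketch of this stochastic variant, reusing the hypothetical `loss_and_grads` helper from the batch sketch above; the batch size of 64 and the other hyperparameters are assumptions:

```python
# Same update rule theta := theta - eta * grad, but each step estimates the
# gradient from a random mini-batch instead of the full dataset.
import numpy as np

def sgd(x, y, eta=0.5, steps=500, batch_size=64, seed=0):
    rng = np.random.default_rng(seed)
    w1, w2 = 0.1, 0.1
    for _ in range(steps):
        idx = rng.choice(len(x), size=batch_size, replace=False)  # random subset
        _, dw1, dw2 = loss_and_grads(w1, w2, x[idx], y[idx])      # noisy gradient
        w1 -= eta * dw1
        w2 -= eta * dw2
    return w1, w2
```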

This variation in the loss surface arises because each subset of the dataset approximates the overall distribution differently. When only a part of the dataset is employed for gradient computation, the resulting loss surface reflects the characteristics of that subset rather than the entire data distribution. This can lead to a different gradient estimation in each iteration, introducing noise into the optimization process. For instance, employing just 64 samples for gradient calculation means the loss surface and gradient are estimated based on the information provided by these 64 samples only.

This stochastic nature of the gradient approximation introduces variability, which can help in escaping local minima and provides a form of regularization, making the model less likely to overfit to the training data. The cumulative effect of these variations over many iterations can lead to a more robust and generalized model by exploring a broader range of solutions in the parameter space.

This design proves advantageous because gradient descent is greedy by nature, aiming for the steepest descent based on the immediate gradient without regard for longer-term outcomes. Given the highly irregular loss surface typical of neural networks, a purely greedy approach starting from an inopportune initial point often converges to a suboptimal solution. By computing gradients on random subsets, SGD introduces imperfect, noisy gradients into the optimization process. This randomness, rather than being purely disruptive, can be beneficial: it occasionally lets the algorithm bypass poor local minima that would ensnare a more deterministic approach, potentially guiding it toward better solutions. Leveraging stochasticity to escape suboptimal attractors reflects the balance between exploration and exploitation in the search for good minima on the complex loss landscape of neural network training.

Visualization Example

In this example, we generate a one-dimensional binary classification dataset $(x, y)$ in code, where some samples $x$ belong to class 0 and others to class 1. The histogram of these data is shown below:

[Figure: histogram of the generated one-dimensional, two-class data]
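The exact data-generating code is not shown here; the sketch below produces a comparable dataset of 2,000 one-dimensional samples drawn from two overlapping Gaussians (the means, spreads, and even split between classes are assumptions):

```python
# Generate 2,000 one-dimensional samples: class 0 and class 1 drawn from
# two overlapping Gaussians, then plot their histograms.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x0 = rng.normal(loc=-1.0, scale=1.0, size=1000)   # class 0
x1 = rng.normal(loc=+1.5, scale=1.0, size=1000)   # class 1
x = np.concatenate([x0, x1])
y = np.concatenate([np.zeros(1000), np.ones(1000)])

plt.hist(x0, bins=40, alpha=0.6, label="class 0")
plt.hist(x1, bins=40, alpha=0.6, label="class 1")
plt.xlabel("x")
plt.legend()
plt.show()
```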

Note that we generate one-dimensional data because higher-dimensional data would require more parameters, and we would then be unable to visualize them. Within the limits of human vision and a naïve visualization design, we can show at most two parameters plus the loss value in a three-dimensional plot.

We simplify $w_1$ and $w_2$ to scalars and directly enumerate different parameter combinations to construct the loss surface, as shown in the following figure. The left plot shows the loss surface computed over all 2,000 samples, while the right plot shows the loss surface computed over a randomly selected subset of 64 samples.

[Figure: loss surfaces. Left: computed over all 2,000 samples. Right: computed over a random subset of 64 samples]
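A sketch of how such surfaces could be enumerated, reusing the hypothetical `loss_and_grads` helper and the $(x, y)$ data from the earlier sketches (the grid range and resolution are assumptions):

```python
# Evaluate the average loss on a grid of (w1, w2) values, once on all
# samples and once on a single random 64-sample subset.
import numpy as np

w1_grid = np.linspace(-4, 4, 81)
w2_grid = np.linspace(-4, 4, 81)

def loss_surface(x, y):
    surface = np.zeros((len(w1_grid), len(w2_grid)))
    for i, w1 in enumerate(w1_grid):
        for j, w2 in enumerate(w2_grid):
            surface[i, j] = loss_and_grads(w1, w2, x, y)[0]  # keep only the loss
    return surface

full_surface = loss_surface(x, y)                             # all 2,000 samples
idx = np.random.default_rng(1).choice(len(x), 64, replace=False)
subset_surface = loss_surface(x[idx], y[idx])                 # one 64-sample subset
```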

Comparing the gradient obtained from the 64-sample loss surface with the ideal gradient derived from the entire dataset, the former can be viewed as the optimal gradient direction plus a random perturbation, because the subset is an imperfect representation of the full dataset.

[Figure: gradient direction from the full dataset versus the gradient direction from a 64-sample subset]

The top-left region has a lower loss on the full (perfect) loss surface, yet the perfect gradient computed from the entire dataset would steer the parameters $w_1$ and $w_2$ in the wrong direction at this point. Selecting 64 random samples for each gradient computation can instead be viewed as taking the optimal adjustment under the full-dataset loss and adding a random deviation; at this step, the stochastic gradient yields a better update of the two parameters.
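The sketch below illustrates this comparison numerically: at one fixed point $(w_1, w_2)$, it contrasts the full-data gradient with the gradients from several random 64-sample subsets (the chosen point and random seed are assumptions, and it reuses `loss_and_grads` and the $(x, y)$ data from the earlier sketches):

```python
# At the same point (w1, w2), each mini-batch gradient is roughly the
# full-data gradient plus random noise; its direction can differ from
# (and occasionally improve on) the full-data one.
import numpy as np

w1, w2 = 0.5, -0.5                      # an arbitrary current parameter value
_, g1_full, g2_full = loss_and_grads(w1, w2, x, y)

rng = np.random.default_rng(2)
for _ in range(5):
    idx = rng.choice(len(x), 64, replace=False)
    _, g1_b, g2_b = loss_and_grads(w1, w2, x[idx], y[idx])
    print("full:", (g1_full, g2_full), "batch:", (g1_b, g2_b))
```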