The standard gradient descent algorithm, while foundational for optimizing mathematical functions in machine learning, faces challenges in efficiently navigating the high-dimensional loss landscapes inherent in complex models. Its primary limitation lies in its susceptibility to oscillations or slow convergence in steep, narrow valleys and its tendency to become trapped in local minima, impeding its ability to find the global minimum.

This inefficiency stems from its reliance on the local gradient at each iteration, without considering the context of previous steps. This can lead to suboptimal update directions and step sizes, causing erratic or protracted paths towards convergence.

The introduction of momentum in optimization algorithms addresses these issues by incorporating a memory of past gradients. This modification not only accelerates the convergence process but also enhances stability in descent direction, mitigating the erratic movements associated with steep gradients and aiding in the escape from local optima.

Benefits of Momentum

Momentum in stochastic gradient descent (SGD) can be viewed as a noise filter, aggregating stochastic gradients over iterations to smooth out erratic updates from small, noisy mini-batches. This process, akin to averaging out noise, ensures a steadier direction towards the minimum by diluting the impact of any single, misleading gradient.
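This smoothing effect can be sketched numerically. The snippet below is an illustrative toy: the "true" gradient, the noise level, and γ = 0.9 are arbitrary choices, and it uses a (1 − γ)-weighted exponential moving average purely to expose the averaging behaviour. It compares the variance of raw noisy gradient samples with that of their momentum-style running average:

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad = 1.0
# Simulated stochastic gradients: the true gradient plus mini-batch noise.
noisy = true_grad + rng.normal(0.0, 1.0, size=1000)

gamma = 0.9
v = 0.0
smoothed = []
for g in noisy:
    # Momentum-style exponential moving average of past gradients.
    v = gamma * v + (1 - gamma) * g
    smoothed.append(v)

# Skip the first 100 samples so the running average has warmed up.
raw_var = np.var(noisy[100:])
ema_var = np.var(np.array(smoothed)[100:])
```

After warm-up, the averaged sequence still points at the true gradient on average, but with a fraction of the variance of the raw samples.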


It also addresses the issue of progress negation in ravine-like areas of the loss landscape, where gradients can frequently point in nearly opposite directions. By integrating past gradients, momentum reduces these directional cancellations, promoting a more consistent and swift progression towards the goal.
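To make this concrete, the following sketch runs plain gradient descent and momentum on an illustrative ravine-shaped quadratic, f(x, y) = ½(100x² + y²); the objective, learning rate, and γ = 0.9 are all hypothetical choices, not prescriptions. Both methods start from the same point, and we compare how close each gets to the minimum at the origin:

```python
import numpy as np

def grad(p):
    # Gradient of f(x, y) = 0.5 * (100*x**2 + y**2):
    # very steep across the ravine (x), very shallow along it (y).
    return np.array([100.0 * p[0], p[1]])

lr, gamma, steps = 0.015, 0.9, 150
start = np.array([1.0, 1.0])

# Plain gradient descent: oscillates across the steep direction,
# crawls along the shallow one.
p = start.copy()
for _ in range(steps):
    p = p - lr * grad(p)

# Gradient descent with momentum: successive opposing gradients cancel
# in the steep direction while speed accumulates along the ravine floor.
q, v = start.copy(), np.zeros(2)
for _ in range(steps):
    v = gamma * v + lr * grad(q)
    q = q - v

dist_gd = np.linalg.norm(p)
dist_momentum = np.linalg.norm(q)
```

With these settings, the momentum run ends far closer to the origin than plain gradient descent, which is still inching along the shallow y direction.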


Furthermore, momentum introduces an inertia-like quality, enabling the optimization to bypass shallow local minima more effectively. This inertia, imagined as giving a ball more weight to roll past minor dips, helps the algorithm to seek out deeper, more significant minima, thus preventing premature convergence to less optimal points. This streamlined approach significantly enhances SGD's efficiency, offering a more reliable path through complex loss landscapes.


How to Add Momentum

Momentum in gradient descent enhances the update rule by incorporating a momentum term, $v_t$ (also known as velocity), which combines the current gradient with a portion of the previous update, allowing past gradients to influence the current direction and speed of descent. This is represented mathematically as:

$$ \begin{aligned} & \nabla_{\theta} J(\theta) && \leftarrow \text{gradient} \\ v_{t} &= \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta) && \leftarrow \text{add momentum} \quad (v_{t-1}) \\ \theta &= \theta - v_{t} && \leftarrow \text{gradient descent} \end{aligned} $$

The equations outlined illustrate the process of updating the gradient and parameter during a single iteration, broken down into three steps:

  1. Calculate the gradient: The current gradient of the loss function with respect to the parameters, $\theta$, is denoted by $\nabla_{\theta} J(\theta)$.

  2. Update the velocity: The new velocity, $v_{t}$, is updated according to the equation $v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)$, where $\gamma$ (the momentum coefficient that ranges between 0 and 1) scales the previous velocity $v_{t-1}$, determining the extent to which historical gradients are incorporated, and $\eta$ represents the learning rate. This equation can be further expanded as:

    $$ v_{t} = \eta (\gamma^{t-1} \nabla_{\theta, 1} J(\theta) +\cdots + \gamma^{1} \nabla_{\theta, t-1} J(\theta) +\gamma^{0} \nabla_{\theta, t} J(\theta)) $$

    This expansion (which assumes the velocity is initialized to $v_0 = 0$) illustrates that the current velocity $v_t$ reflects the accumulation of gradients from all previous steps, with earlier gradients exponentially diminishing in influence.

  3. Update the parameters: The model parameters are updated using the equation $\theta = \theta - v_{t}$, applying the velocity to adjust the parameters.
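The three steps above can be sketched directly in code. This is a minimal illustration, not a production optimizer: the quadratic objective and the hyperparameter values are arbitrary choices.

```python
import numpy as np

def momentum_sgd(grad_fn, theta0, lr=0.1, gamma=0.9, steps=200):
    """Gradient descent with momentum, following the three steps above."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                # v_0 = 0
    path = [theta.copy()]
    for _ in range(steps):
        g = grad_fn(theta)                  # 1. calculate the gradient
        v = gamma * v + lr * g              # 2. update the velocity
        theta = theta - v                   # 3. update the parameters
        path.append(theta.copy())
    return theta, path

# Toy objective J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, path = momentum_sgd(lambda t: t, theta0=[3.0, -2.0])
```

Setting `gamma=0` recovers plain gradient descent, which makes the velocity term easy to ablate when experimenting.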