Standard gradient descent, a cornerstone of machine learning optimization, often oscillates or converges slowly in steep regions of the loss surface and can become trapped in local minima because it relies on the local gradient alone, leading to inefficient paths toward convergence. Adding momentum to the optimization helps overcome these challenges by keeping a memory of past gradients, which not only speeds up convergence but also stabilizes the descent direction, reducing erratic movements and aiding escape from local optima.
Momentum in gradient descent can be viewed as a noise filter: by aggregating stochastic gradients over iterations, it smooths out the erratic updates produced by small, noisy data subsets. Much like averaging out noise, this dilutes the impact of any single misleading gradient and yields a steadier direction toward the minimum.
It also addresses the issue of progress negation in ravine-like areas of the loss landscape, where successive gradients can point in nearly opposite directions. By integrating past gradients, momentum reduces these directional cancellations, promoting more consistent and swift progress toward the minimum.
Furthermore, momentum introduces an inertia-like quality, enabling the optimization to bypass shallow local minima more effectively. This inertia, imagined as giving a ball more weight to roll past minor dips, helps the algorithm to seek out deeper, more significant minima, thus preventing premature convergence to less optimal points. This streamlined approach significantly enhances gradient descent's efficiency, offering a more reliable path through complex loss landscapes.
Momentum in gradient descent enhances the update rule by incorporating a momentum term, $v_t$ (also known as velocity), which combines the current gradient with a portion of the previous update, allowing past gradients to influence the current direction and speed of descent. This is represented mathematically as:
$$ \begin{aligned} \nabla_{\theta} J(\theta) &\leftarrow \text{Calculate the gradient} \\ v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta) &\leftarrow \text{Accumulate momentum} \\ \theta = \theta - v_{t} &\leftarrow \text{Update the parameters} \end{aligned} $$
These equations describe how the velocity and parameters are updated during a single iteration, broken down into three steps:
Calculate the gradient: The current gradient of the loss function with respect to the parameters, $\theta$, is denoted by $\nabla_{\theta} J(\theta)$.
Update the velocity: The new velocity, $v_{t}$, is updated according to the equation $v_{t} = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta)$, where $\gamma$ (the momentum coefficient that ranges between 0 and 1) scales the previous velocity $v_{t-1}$, determining the extent to which historical gradients are incorporated, and $\eta$ represents the learning rate. This equation can be further expanded as:
$$ v_{t} = \eta (\gamma^{t-1} \nabla_{\theta, 1} J(\theta) +\cdots + \gamma^{1} \nabla_{\theta, t-1} J(\theta) +\gamma^{0} \nabla_{\theta, t} J(\theta)) $$
This expansion, which assumes $v_{0} = 0$, illustrates that the current velocity $v_t$ accumulates the gradients from all previous steps, with earlier gradients' influence diminishing exponentially by a factor of $\gamma$ per step.
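To make this equivalence concrete, the short sketch below (a minimal NumPy check; the gradient values, $\gamma = 0.9$, and $\eta = 0.1$ are arbitrary choices for illustration) unrolls the recurrence and compares it against the expanded sum:

```python
import numpy as np

gamma, eta = 0.9, 0.1            # momentum coefficient and learning rate (illustrative values)
grads = [np.array([1.0, -2.0]),  # made-up gradients for steps 1..3
         np.array([0.5, 0.5]),
         np.array([-1.0, 1.5])]

# Unroll the recurrence v_t = gamma * v_{t-1} + eta * g_t, starting from v_0 = 0.
v = np.zeros(2)
for g in grads:
    v = gamma * v + eta * g

# Closed-form expansion: v_t = eta * sum over k of gamma^(t-k) * g_k.
t = len(grads)
v_expanded = eta * sum(gamma ** (t - k) * g for k, g in enumerate(grads, start=1))

print(np.allclose(v, v_expanded))  # True: both forms give the same velocity
```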
Update the parameters: The model parameters are updated using the equation $\theta = \theta - v_{t}$, applying the velocity to adjust the parameters.
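Putting the three steps together, here is a minimal sketch of gradient descent with momentum in NumPy. The toy loss $J(\theta) = \frac{1}{2}\lVert\theta\rVert^2$ (whose gradient is simply $\theta$), the hyperparameters, and the starting point are illustrative assumptions, not prescribed values:

```python
import numpy as np

def grad_J(theta):
    """Gradient of the toy loss J(theta) = 0.5 * ||theta||^2 (illustrative choice)."""
    return theta

eta, gamma = 0.1, 0.9          # learning rate and momentum coefficient (assumed values)
theta = np.array([5.0, -3.0])  # arbitrary starting point
v = np.zeros_like(theta)       # velocity starts at zero (v_0 = 0)

for step in range(100):
    g = grad_J(theta)          # 1. calculate the gradient
    v = gamma * v + eta * g    # 2. accumulate momentum (velocity update)
    theta = theta - v          # 3. update the parameters

print(theta)  # approaches the minimum at the origin
```

Note how the velocity carries information across iterations: even when a single gradient points the wrong way, the accumulated term keeps the update moving along the prevailing direction.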
The Adam optimization algorithm, short for "Adaptive Moment Estimation," builds upon the ideas of momentum and adaptive learning rates to provide efficient and effective optimization, especially for large-scale machine learning problems. Adam combines the benefits of two other extensions of stochastic gradient descent, namely momentum and RMSprop.
Adam updates a model's weights by tracking both an exponentially decaying average of recent gradients (the first moment) and an exponentially decaying average of recent squared gradients (the second moment). Here's how it is derived and implemented:
$$ \begin{aligned} \nabla_{\theta} J(\theta) &\leftarrow \text{Calculate the gradient} \\ m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_{\theta} J(\theta) &\leftarrow \text{Estimate the First Moment} \\ v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_{\theta} J(\theta))^2 &\leftarrow \text{Estimate the Second Moment} \\ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} &\leftarrow \text{Bias Correction} \\ \theta = \theta - \frac{\eta \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} &\leftarrow \text{Update the Parameters} \end{aligned} $$
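The update above translates directly into a NumPy sketch. The hyperparameters below ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the defaults commonly quoted for Adam; the learning rate, toy loss, and starting point are again illustrative assumptions:

```python
import numpy as np

def grad_J(theta):
    """Gradient of the toy loss J(theta) = 0.5 * ||theta||^2 (illustrative choice)."""
    return theta

eta = 0.01                            # learning rate (assumed value)
beta1, beta2, eps = 0.9, 0.999, 1e-8  # commonly quoted Adam defaults

theta = np.array([5.0, -3.0])  # arbitrary starting point
m = np.zeros_like(theta)       # first-moment estimate
v = np.zeros_like(theta)       # second-moment estimate

for t in range(1, 1001):       # t starts at 1 so the bias correction is well-defined
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g      # estimate the first moment
    v = beta2 * v + (1 - beta2) * g**2   # estimate the second moment
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # update the parameters

print(theta)  # approaches the minimum at the origin
```

The bias correction matters most in early iterations: because $m_0$ and $v_0$ start at zero, the raw moment estimates are biased toward zero, and dividing by $1 - \beta^t$ compensates until the averages warm up.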