AdaGrad addresses the challenge of varying feature frequencies in datasets (common in natural language processing, where some words are observed far less often than others) by tailoring the effective step size for each parameter, in contrast to Gradient Descent's uniform learning rate.
By accumulating the squares of past gradients for each parameter, AdaGrad assigns larger effective steps to parameters whose features appear rarely and therefore receive small or infrequent gradients, preventing underfitting on sparse features. Conversely, parameters with frequently large gradients receive smaller updates, thereby mitigating overshooting. Through this adaptive strategy, where each parameter's step size depends on its own history of squared gradients, AdaGrad promotes more stable and efficient convergence on datasets with diverse and uneven feature distributions.
The AdaGrad method can be described by the following mathematical formula:
$$ \begin{aligned}&\nabla_{\theta} J(\theta) &\leftarrow &\text{gradient computation}\\ &G_{t,ii} = G_{t-1,ii} + (\nabla_{\theta} J(\theta))^2 & \leftarrow &\text{sq. gradient accumulation}\\ &\theta = \theta - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot \nabla_{\theta} J(\theta) & \leftarrow &\text{parameter update} \end{aligned} $$
In this expression, $\nabla_{\theta} J(\theta)$ denotes the gradient of the loss $J$ with respect to the parameters $\theta$. The term $G_{t,ii}$ is the $i$-th diagonal entry of the matrix of accumulated squared gradients up to time step $t$. The adaptive learning rate is then obtained by dividing the step size $\eta$ by the square root of $G_{t,ii}$ (the accumulated gradient energy of each parameter), with a small constant $\epsilon$ added to prevent division by zero.
The strategy: parameters with large historical gradients receive smaller adjustments, promoting cautious optimization in steep regions, while parameters with small past gradients receive amplified updates, encouraging larger moves in flatter regions. This balances adaptive learning by scaling each parameter's update by the inverse square root of its accumulated squared gradients, aiding convergence across diverse gradient landscapes.
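To make this strategy concrete, here is a minimal NumPy sketch of a single AdaGrad step; the parameter values, accumulated gradients, and learning rate in the usage lines are illustrative assumptions rather than values from the text.

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.1, eps=1e-8):
    """One AdaGrad update.

    G holds the per-parameter sum of squared gradients (the diagonal
    G_{t,ii}); each coordinate's step is scaled by 1 / sqrt(G + eps).
    """
    G = G + grad ** 2                              # sq. gradient accumulation
    theta = theta - lr / np.sqrt(G + eps) * grad   # parameter update
    return theta, G

# Illustrative (assumed) state: one parameter with a large gradient history,
# one with a small, sparse history.
theta = np.array([1.0, 1.0])
G = np.array([100.0, 0.04])    # accumulated squared gradients so far
grad = np.array([10.0, 0.2])   # current gradient
theta, G = adagrad_step(theta, grad, G)
print(theta)  # both coordinates move by ~0.07 despite a 50x gap in gradient scale
```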
The provided figure shows a 3D surface and a contour plot of a log-scaled loss function, with paths for two optimization algorithms: gradient descent and AdaGrad. The red path represents the trajectory of gradient descent, while the blue path represents the trajectory of AdaGrad.
The left plot shows the 3D loss surface with log-scaled loss on the z-axis over parameters Theta 1 and Theta 2, while the right plot offers a contour view of the loss levels with color gradients. The surface is steep along Theta 1 and much flatter along Theta 2: on such an uneven landscape, gradient descent's uniform learning rate makes slow progress along the flat Theta 2 direction, whereas AdaGrad's per-parameter scaling keeps progress more balanced across both directions.
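The qualitative contrast in the figure can be reproduced with a short sketch; the quadratic loss below (steep along Theta 1, nearly flat along Theta 2), the learning rates, and the step count are assumptions chosen to mimic such a surface, not the exact function plotted.

```python
import numpy as np

# Assumed anisotropic loss: J(theta) = 0.5 * (10 * theta_1^2 + 0.1 * theta_2^2),
# steep along Theta 1 and nearly flat along Theta 2.
curvature = np.array([10.0, 0.1])
grad_fn = lambda theta: curvature * theta

def run(step_fn, steps=200, start=(1.0, 1.0)):
    theta, G = np.array(start), np.zeros(2)
    for _ in range(steps):
        theta, G = step_fn(theta, grad_fn(theta), G)
    return theta

def gd_step(theta, grad, G, lr=0.09):
    return theta - lr * grad, G                     # uniform learning rate; G unused

def adagrad_step(theta, grad, G, lr=0.5, eps=1e-8):
    G = G + grad ** 2                               # same AdaGrad rule as above
    return theta - lr / np.sqrt(G + eps) * grad, G

print("GD:     ", run(gd_step))       # Theta 2 has barely moved toward the minimum
print("AdaGrad:", run(adagrad_step))  # both coordinates make balanced progress
```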
RMSProp, an evolution of AdaGrad, improves adaptive learning by using a decay factor to prioritize recent gradients. It maintains an exponential moving average of squared gradients, $E_t$, governed by a decay rate $\gamma$ that balances the influence of new and old gradients. This prevents the monotonic learning-rate decay of AdaGrad and keeps the optimizer responsive to recent gradient changes. The update rule:
$$ \begin{aligned}&\nabla_{\theta} J(\theta) &\leftarrow &\text{gradient computation}\\ &E_t = \gamma E_{t-1} + (1 - \gamma)(\nabla_{\theta} J(\theta))^2 & \leftarrow &\text{exponential sq. gradient average}\\ &\theta = \theta - \frac{\eta}{\sqrt{E_t + \epsilon}} \cdot \nabla_{\theta} J(\theta) & \leftarrow &\text{parameter update} \end{aligned} $$
In the RMSProp update formula, $E_t$ represents the exponential moving average of squared gradients at time step $t$, distinguishing it from AdaGrad's simple sum of squared gradients. The decay rate $\gamma$ modulates how strongly new gradients influence this average, making the method responsive to changes in gradient magnitude. Like AdaGrad, RMSProp includes a small constant $\epsilon$ to prevent division by zero. This update mechanism allows RMSProp to adapt the learning rate based on recent gradient activity, overcoming AdaGrad's limitation of a continually decreasing learning rate, which can extend training unnecessarily.
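As a companion to the AdaGrad sketch above, here is a minimal NumPy version of the RMSProp step; the default values of lr, gamma, and eps, and the quadratic loss in the usage lines, are common illustrative choices rather than values prescribed by the text.

```python
import numpy as np

def rmsprop_step(theta, grad, E, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp update.

    E is the exponential moving average of squared gradients; the decay
    rate gamma down-weights old gradients, so the effective step size
    tracks recent gradient magnitudes instead of shrinking monotonically
    as AdaGrad's does.
    """
    E = gamma * E + (1.0 - gamma) * grad ** 2       # exponential sq. gradient average
    theta = theta - lr / np.sqrt(E + eps) * grad    # parameter update
    return theta, E

# Illustrative usage on the same assumed quadratic loss as above.
grad_fn = lambda theta: np.array([10.0, 0.1]) * theta
theta, E = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, E = rmsprop_step(theta, grad_fn(theta), E)
print(theta)  # both coordinates end close to zero; the step size has not decayed away
```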