Newton’s Method is a second-order optimization technique that refines an estimate by leveraging both the function's first derivative (gradient) and its second derivative (curvature).
Unlike simple gradient descent, which only considers the slope at a given point to guide the next step, Newton’s Method uses the second derivative (the Hessian in higher dimensions) to scale each step, effectively choosing a locally optimal step size. As a result, on convex functions Newton’s Method often reaches the minimum in far fewer iterations than gradient-based methods. The key intuition is that the method approximates the function locally by a quadratic model and jumps directly to the optimum of that approximation at each step.
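To make the quadratic-model intuition concrete, consider the second-order Taylor expansion of $f$ around the current iterate $x_n$:
$$ f(x) \approx f(x_n) + f'(x_n)\,(x - x_n) + \tfrac{1}{2} f''(x_n)\,(x - x_n)^2 $$
Setting the derivative of this quadratic model to zero, $f'(x_n) + f''(x_n)(x - x_n) = 0$, and solving for $x$ gives $x = x_n - f'(x_n)/f''(x_n)$, which is exactly the update rule derived again below from the root-finding perspective.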
Newton’s Method was originally designed for root finding, utilizing derivatives to iteratively converge to a function's zeros.
The method employs the formula $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$ to calculate the point where the tangent line at $x_n$ intersects the x-axis. By iteratively updating $x$ with this intersection point, the method moves closer to a point where $f(x) = 0$, effectively finding a root of the function.
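For example, applying this root-finding iteration to $f(x) = x^2 - 2$ converges to $\sqrt{2} \approx 1.4142$ within a handful of iterations. The short snippet below is an illustrative aside with hand-coded $f$ and $f'$, separate from the optimization solver developed later.
# Newton's Method as a root finder: solve x^2 - 2 = 0
x = 1.0                      # initial guess
for i in range(5):
    f_val = x**2 - 2         # f(x)
    f_prime = 2 * x          # f'(x)
    x = x - f_val / f_prime  # tangent-line intersection with the x-axis
    print(i, x)              # approaches sqrt(2) = 1.41421356...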
This principle was extended to optimization by targeting the zeros of the first derivative ($\frac{df}{dx} = f'(x) = 0$), effectively seeking points where the slope is zero, i.e. potential minima or maxima.
In this adaptation, the method uses both the first and second derivatives to update the guess,
$$ x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)} $$
where the second derivative both scales the step and indicates the nature of the critical point (minimum, maximum, or saddle point) through the local curvature.
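As a quick sanity check of the update rule, the illustrative snippet below (separate from the solver built later) applies one Newton step to the quadratic $f(x) = (x - 3)^2 + 1$, whose derivatives $f'(x) = 2(x - 3)$ and $f''(x) = 2$ are hand-coded; because the quadratic model is exact here, a single step lands on the minimizer $x = 3$ from any starting point.
# One Newton optimization step on f(x) = (x - 3)^2 + 1 with hand-coded derivatives
def newton_step(x):
    f_prime = 2 * (x - 3)    # f'(x)
    f_double_prime = 2.0     # f''(x)
    return x - f_prime / f_double_prime

for x0 in [-10.0, 0.0, 7.5]:
    print(x0, "->", newton_step(x0))  # each starting point maps directly to 3.0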
Because Newton's Method is fundamentally a root-finding algorithm applied to the derivative, the root it finds can correspond to a minimum, a maximum, or a saddle point. It therefore does not inherently guarantee convergence to a minimum, especially for complex, non-convex functions. As a result, Newton's Method is less suitable for domains like neural network training, where the loss surface $J$ is highly non-linear and non-convex.
Instead, it is more commonly employed in convex optimization, where the predictable curvature ensures that any critical point found is the global minimum, so its faster convergence relative to methods like gradient descent can be exploited safely. For example, it remains effective for simpler models like logistic regression, as sketched below.
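As an illustrative sketch of this convex use case (the synthetic dataset, the small ridge penalty, and all variable names below are assumptions made for demonstration), Newton's Method for logistic regression generalizes the scalar update to $w_{n+1} = w_n - H^{-1} \nabla L(w_n)$, where $H$ is the Hessian of the convex loss; PyTorch's torch.autograd.functional.jacobian and torch.autograd.functional.hessian can supply both derivatives.
import torch

# Hypothetical sketch: Newton's Method for (L2-regularized) logistic regression.
torch.manual_seed(0)
X = torch.randn(100, 3)                          # 100 samples, 3 features (synthetic)
true_w = torch.tensor([1.5, -2.0, 0.5])
y = torch.bernoulli(torch.sigmoid(X @ true_w))   # synthetic 0/1 labels

def loss(w):
    logits = X @ w
    nll = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    return nll + 1e-3 * w.dot(w)                 # small ridge keeps the Hessian invertible

w = torch.zeros(3)
for _ in range(10):
    grad = torch.autograd.functional.jacobian(loss, w)   # gradient (first derivative)
    hess = torch.autograd.functional.hessian(loss, w)    # 3x3 Hessian (second derivative)
    w = w - torch.linalg.solve(hess, grad)               # multivariate Newton step
print(w)                                                  # roughly recovers true_w, up to sampling noise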
import torch
import matplotlib.pyplot as plt
The `NewtonSolver` class implements Newton's optimization method using PyTorch's automatic differentiation to compute first and second derivatives dynamically. It is initialized with a function and provides methods to compute derivatives at a given point using `torch.autograd.grad()`, ensuring computational efficiency and adaptability for differentiable functions. The `step` method executes a single Newton iteration, updating `x` using the Newton-Raphson formula: $x_{\text{new}} = x - f'(x) / f''(x)$. A stopping condition prevents division by zero when the second derivative is near zero, making the solver robust for numerical optimization and root-finding applications.
class NewtonSolver:
    def __init__(self, func):
        """
        Initializes the NewtonSolver with a function to optimize.

        Args:
            func: The scalar, differentiable function to minimize.
        """
        self.func = func

    def first_derivative(self, x):
        """Computes the first derivative of the function at x."""
        x_tensor = torch.tensor(float(x), requires_grad=True)  # float dtype is required for gradients
        y = self.func(x_tensor)
        # autograd.grad returns a tuple of gradients; [0] extracts the gradient w.r.t. x_tensor.
        grad = torch.autograd.grad(y, x_tensor)[0]
        return grad

    def second_derivative(self, x):
        """Computes the second derivative of the function at x."""
        x_tensor = torch.tensor(float(x), requires_grad=True)
        y = self.func(x_tensor)
        # create_graph=True keeps the graph of the first derivative so it can be differentiated again.
        grad = torch.autograd.grad(y, x_tensor, create_graph=True)[0]
        grad2 = torch.autograd.grad(grad, x_tensor)[0]
        return grad2

    def step(self, x):
        """Performs one iteration step of Newton's Method."""
        x = float(x)  # accept plain numbers or 0-d tensors
        x_grad = self.first_derivative(x).item()
        x_grad2 = self.second_derivative(x).item()
        if abs(x_grad2) < 1e-6:
            return x, self.func(torch.tensor(x)), True  # stop: near-zero second derivative would blow up the update
        x_new = x - x_grad / x_grad2
        return x_new, self.func(torch.tensor(x_new)), False
Cubic Function:
Below is the definition of the function we aim to optimize in order to find a local minimum. It is the cubic polynomial $f(x) = x^3 - 6x^2 + 9x + 1$, which has two turning points, a local maximum at $x = 1$ and a local minimum at $x = 3$, making it an interesting candidate for Newton's Method, which locates such local minima or maxima efficiently.
# Define the function to minimize
def f(x):
    return x**3 - 6*x**2 + 9*x + 1

# Define the derivative of the function for visualization
def df(x):
    return 3*x**2 - 12*x + 9
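With the solver and the cubic in place, a minimal usage sketch looks as follows (the starting point $x_0 = 5$, the iteration cap, and the tolerance are illustrative choices). Starting to the right of the inflection point at $x = 2$, the iterates converge to the local minimum at $x = 3$ within a few steps.
# Run Newton's Method on the cubic f, starting in the basin of the local minimum at x = 3
solver = NewtonSolver(f)
x = 5.0                      # illustrative starting point
for i in range(20):
    x_new, fx, stop = solver.step(x)
    print(f"iter {i}: x = {x_new:.6f}, f(x) = {float(fx):.6f}")
    if stop or abs(x_new - x) < 1e-8:
        break                # curvature check fired or the update stopped changing
    x = x_new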