Reverse automatic differentiation (AD) is a powerful technique used in the computation of derivatives in machine learning models, particularly in the training of neural networks. Unlike symbolic differentiation, reverse AD computes derivatives numerically in a "step by step" fashion, ensuring efficiency, especially useful in implementing backpropagation.
Reverse AD works by traversing a computational graph from the outputs back to the inputs. This technique is vital in neural networks, where the computation graph represents the sequence of operations performed during the forward pass. By reversing this sequence, reverse AD efficiently computes gradients of loss L
with respect to each parameter w
and b
.
Below is a simplified NumPy-based implementation of autograd for the operations addition and multiplication. First, we define a class Variable
to represent nodes in our computation graph:
Then, let's use these definitions to compute the derivatives of the previous example neural network for optimization:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
n_samples = 30 # This hand written univariate framework is very slow.
# Create two Gaussian clusters: one around (0,0) and one around (3,3)
X1 = np.random.randn(n_samples, 2) + np.array([0, 0])
X2 = np.random.randn(n_samples, 2) + np.array([3, 3])
X = np.vstack([X1, X2])
y_data = np.array([0] * n_samples + [1] * n_samples)
n_samples = X.shape[0]
# Initialize network parameters for a one hidden layer network with 2 neurons:
# Hidden layer parameters
w11 = Variable(np.random.randn(), requires_grad=True)
w12 = Variable(np.random.randn(), requires_grad=True)
b1 = Variable(np.random.randn(), requires_grad=True)
w21 = Variable(np.random.randn(), requires_grad=True)
w22 = Variable(np.random.randn(), requires_grad=True)
b2 = Variable(np.random.randn(), requires_grad=True)
# Output layer parameters
v1 = Variable(np.random.randn(), requires_grad=True)
v2 = Variable(np.random.randn(), requires_grad=True)
b3 = Variable(np.random.randn(), requires_grad=True)
learning_rate = 0.1
n_epochs = 1000
parameters = [w11, w12, b1, w21, w22, b2, v1, v2, b3]
acc_grad = {}
for epoch in range(n_epochs):
total_loss = 0.0
# Process each sample in the dataset
for param in parameters:
acc_grad[param.id] = 0
for i in range(n_samples):
# Create input Variables for the current sample.
# (These are non-leaf nodes, so they don't require gradients.)
x1 = Variable(X[i, 0], requires_grad=False)
x2 = Variable(X[i, 1], requires_grad=False)
target = Variable(y_data[i], requires_grad=False)
# --- Forward Pass ---
# Hidden layer:
z1 = w11 * x1 + w12 * x2 + b1
z2 = w21 * x1 + w22 * x2 + b2
h1 = sigmoid(z1)
h2 = sigmoid(z2)
# Output layer:
z_out = v1 * h1 + v2 * h2 + b3
o = sigmoid(z_out)
# Compute binary cross-entropy loss:
# loss = -[target * log(o) + (1 - target) * log(1 - o)]
one = Variable(1.0, requires_grad=False)
loss = - (target * log(o) + (one - target) * log(one - o))
loss.backward()
total_loss += loss.data
# Accumulate gradient for each parameter
for param in parameters:
acc_grad[param.id] += param.grad
# --- Parameter Update (averaging gradients over the batch) ---
for param in parameters:
# Update: new_param = old_param - learning_rate * (avg gradient)
param.data -= learning_rate * (acc_grad[param.id] / n_samples)
# Optionally, print the average loss every 100 epochs.
if epoch % 100 == 0:
print(f"Epoch {epoch}, Average Loss: {total_loss / n_samples}")
# Reset gradients in all parameters for the next epoch.
for param in parameters:
param.zero_grad()
# After training, you can print out the final parameters or test on new data.
print("Training complete.")
# Define a helper function to compute the network's output for given input values.
def network_output(x1_val, x2_val, w11, w12, b1, w21, w22, b2, v1, v2, b3):
# Create input Variables (no gradient needed for inference)
x1_var = Variable(x1_val, requires_grad=False)
x2_var = Variable(x2_val, requires_grad=False)
# Forward pass through the network
# Hidden layer
z1 = w11 * x1_var + w12 * x2_var + b1
z2 = w21 * x1_var + w22 * x2_var + b2
h1 = sigmoid(z1)
h2 = sigmoid(z2)
# Output layer
z_out = v1 * h1 + v2 * h2 + b3
o = sigmoid(z_out)
return o.data # Return the computed output (a probability)
# Create a grid over the input space based on the dataset X.
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
np.linspace(y_min, y_max, 100))
# Since our AD framework is not vectorized, we wrap the network_output function using np.vectorize.
vec_network_output = np.vectorize(
lambda a, b: network_output(a, b, w11, w12, b1, w21, w22, b2, v1, v2, b3)
)
# Compute the network's output over the entire grid.
Z = vec_network_output(xx, yy)
# Plot the decision boundary and the data.
plt.figure(figsize=(6, 5))
# Use contourf to fill regions where the output is below or above 0.5.
plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.2, colors=['blue', 'red'])
plt.scatter(X[:, 0], X[:, 1], c=y_data, cmap='bwr', edgecolor='k')
plt.title("Decision Boundary after Training")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
You can check the correctness yourself.
This is a basic implementation of reverse-mode automatic differentiation. Actual automatic differentiation frameworks are far more complex, as they need to handle matrices and multi-dimensional arrays, not just scalar variables.