Gradient Descent and Variants — Section 18: Calculus & Optimization

Gradient descent minimizes $f(x)$ by stepping in the direction of steepest decrease:

x_{k+1} = x_k - \eta\, \nabla f(x_k)

The step size $\eta$ — the learning rate — is the main hyperparameter. Too small means slow convergence; too large can diverge. For convex $f$ with Lipschitz gradient, $\eta = 1/L$ guarantees convergence.

Variants address GD's weaknesses. Stochastic gradient descent (SGD) replaces the full gradient with a single-sample estimate — noisy but cheap, and the noise actually helps escape saddle points. Momentum adds memory of past gradients to smooth out oscillations. Adam adapts the learning rate per parameter using exponential moving averages of gradients and squared gradients.

In quant ML, Adam and its cousins are the default for training neural networks. For convex problems with small datasets, second-order methods (Newton, quasi-Newton like L-BFGS) often converge in far fewer iterations.

Gradient descent minimizes $f(x)$ by stepping in the direction of steepest decrease:

x_{k+1} = x_k - \eta\, \nabla f(x_k)