Optimizers — Section 6: Training Modern Models

An optimizer turns gradients into parameter updates. Three matter in practice.

SGD (with momentum)

v_{t+1} = \mu v_t - \eta \nabla L, \quad w_{t+1} = w_t + v_{t+1}

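Here $\eta$ is the learning rate and $\mu$ the momentum coefficient. The momentum term $\mu v_t$ accumulates past gradients, smoothing out noise and accelerating progress along directions where successive gradients agree. SGD with momentum (not plain SGD) is the workhorse for CNNs and supervised learning; paired with a cosine learning-rate schedule, it is the standard recipe.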
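The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `sgd_momentum_step` and `quadratic_grad`, the toy quadratic loss, and the values `lr=0.05` and `mu=0.9` are all assumptions chosen for the example.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One update: v <- mu * v - lr * grad, then w <- w + v."""
    v_new = [mu * vi - lr * gi for vi, gi in zip(v, grad)]
    w_new = [wi + vi for wi, vi in zip(w, v_new)]
    return w_new, v_new


def quadratic_grad(w):
    # Gradient of the toy loss L(w) = 0.5 * (w0^2 + 10 * w1^2) (assumed for illustration).
    return [w[0], 10.0 * w[1]]


# Start away from the minimum at the origin, with zero initial velocity.
w, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, v, quadratic_grad(w), lr=0.05, mu=0.9)
print(w)  # both coordinates end up close to 0
```

The velocity `v` is carried between calls, which is what lets earlier gradients influence the current step.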

Adam

m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t

v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = m_t / (1 - \beta_1^t), \quad \hat{v}_t = v_t / (1 - \beta_2^t)

w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

Here $g_t$ is the gradient at step $t$. Adam maintains a per-parameter step size that adapts to recent gradient magnitudes. Dividing by $\sqrt{\hat{v}_t}$ makes the update invariant to rescaling of the gradient: a parameter with consistently large gradients and one with consistently small gradients both take steps of roughly size $\eta$. The bias-correction terms compensate for the moving averages being biased toward zero early in training, since $m_0$ and $v_0$ are initialized to zero.

Adam usually works well out of the box. The default hyperparameters are $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
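The four Adam equations can be sketched directly in plain Python. This is a minimal illustration under stated assumptions: the function name `adam_step`, the list-of-floats parameter layout, and the learning rate `lr=1e-3` are choices made for the example, not something this section fixes. The example feeds in two gradients that differ by six orders of magnitude to show the effect of the normalization and the bias correction on the very first step.

```python
import math


def adam_step(w, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of scalar parameters; t counts from 1."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for the zero initialization of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


# Two parameters whose gradients differ by six orders of magnitude.
w, m, v = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                    grad=[1000.0, 0.001], t=1)
print(w)  # both entries are approximately -lr
```

At $t = 1$ the corrected averages reduce to $\hat{m}_1 = g_1$ and $\hat{v}_1 = g_1^2$, so each parameter moves by about $\eta$ regardless of its gradient's scale. Without the bias correction, the first step would be about $\eta (1-\beta_1)/\sqrt{1-\beta_2}$ instead, roughly $3.2\eta$ at the default settings.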

AdamW

AdamW is Adam with weight decay applied directly to the weights, rather than added to the gradient as an L2 penalty:

w_{t+1} = w_t - \eta \left(\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda w_t\right)

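The weight decay is *decoupled* from the gradient: the $\lambda w_t$ term is added after the adaptive normalization, so it is not scaled by $1/\sqrt{\hat{v}_t}$. With L2 regularization in plain Adam, $\lambda w_t$ would instead be folded into $g_t$ and rescaled along with it. In practice the decoupled form tends to generalize better, and it is now standard for transformer training.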
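The difference between the two ways of applying the decay is easiest to see on a single scalar weight. The sketch below is illustrative only: the function `adam_like_step`, its `decoupled` flag, and the values `wd=1e-2`, `w0=2.0`, and `g0=100.0` are assumptions made for the example. With `decoupled=True` it follows the AdamW rule above; with `decoupled=False` it adds `wd * w` to the gradient before the moving averages, which is Adam with L2 regularization.

```python
import math


def adam_like_step(w, m, v, grad, t, lr=1e-3, wd=1e-2,
                   beta1=0.9, beta2=0.999, eps=1e-8, decoupled=True):
    """One scalar update, with either decoupled decay or an L2 penalty."""
    # L2 variant: the decay term enters the gradient and is normalized with it.
    g = grad if decoupled else grad + wd * w
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # AdamW variant: the decay term bypasses the 1/sqrt(v_hat) scaling.
        update += wd * w
    return w - lr * update, m, v


# Same weight and gradient, two ways of applying the decay.
w0, g0 = 2.0, 100.0
w_adamw, _, _ = adam_like_step(w0, 0.0, 0.0, g0, t=1, decoupled=True)
w_l2, _, _ = adam_like_step(w0, 0.0, 0.0, g0, t=1, decoupled=False)
print(w0 - w_adamw, w0 - w_l2)
```

On this first step the decoupled update moves the weight by about `lr * (1 + wd * w0)`, while the L2 variant moves it by about `lr`: the normalization absorbs the decay term along with the large gradient.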

When to pick what

Vision CNNs: SGD with momentum plus a cosine schedule. It often still generalizes better than Adam on ImageNet-scale classification.
Transformers / language: AdamW.
Sparse data, embedding tables: Adagrad (older), or Adam with careful tuning.
Quasi-Newton / second-order: L-BFGS for small models with full-batch gradients (rare in deep learning).
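The cosine schedule recommended for SGD is not defined in this section. A common form, given here as an assumption and not as this section's definition, decays the learning rate from a maximum to a minimum along half a cosine period. The function name `cosine_lr` and the values `lr_max=0.1` and `lr_min=0.0` are illustrative.

```python
import math


def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    # Clamp so that steps past the end of training stay at lr_min.
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, total_steps=100) for s in (0, 50, 100)]
print(schedule)
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`.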