An optimizer turns gradients into parameter updates. Three matter in practice.
SGD (with momentum)
The momentum term accumulates past gradients, smoothing out noise and accelerating along consistent directions. SGD with momentum is the workhorse for CNNs and supervised learning. Pair it with a cosine LR schedule and you have the standard recipe.
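To make the accumulation concrete, here is a minimal sketch of one momentum update in plain Python. It assumes the "velocity" convention `v = momentum * v + g` followed by `p = p - lr * v`; other texts fold the learning rate into the velocity instead. The function names, the toy quadratic objective, and the hyperparameter values are illustrative assumptions, not part of any library.

```python
def sgd_momentum_step(params, grads, velocity, lr=0.1, momentum=0.9):
    """One SGD-with-momentum update over plain lists of floats.

    Assumed convention: v <- momentum * v + g, then p <- p - lr * v.
    """
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_params = [p - lr * v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


def minimize_quadratic(steps=300):
    """Run the update on the toy objective f(x, y) = 0.5 * (x**2 + 10 * y**2)."""
    params, velocity = [5.0, 5.0], [0.0, 0.0]
    for _ in range(steps):
        # Analytic gradient of the toy objective: (x, 10 * y).
        grads = [params[0], 10.0 * params[1]]
        params, velocity = sgd_momentum_step(params, grads, velocity, lr=0.02)
    return params


final_params = minimize_quadratic()
```

On this toy problem both coordinates shrink toward zero even though their curvatures differ by a factor of ten; the velocity carries progress along the shallow direction while the small learning rate keeps the steep direction stable.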
Adam
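Adam maintains per-parameter learning rates that adapt to the recent gradient magnitude. With gradient $g_t$ at step $t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

Dividing by $\sqrt{\hat v_t}$ makes the update invariant to the scale of the gradient (up to $\epsilon$): a parameter with large gradients and one with small gradients both take steps of roughly size $\alpha$. The bias-correction terms $1 - \beta_1^t$ and $1 - \beta_2^t$ compensate for the moving averages, which start at zero, being biased toward zero early in training.

Adam usually works well out of the box. Defaults: $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.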
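The scale invariance is easiest to see in code. The sketch below is a plain-Python version of a single Adam update, assuming the standard formulation with bias correction and the commonly used defaults (learning rate 1e-3, beta1 0.9, beta2 0.999, eps 1e-8). `adam_step` is a hypothetical helper written for illustration, not a library function.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over plain lists of floats; t is the 1-based step count."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1 - beta1) * g
        v_i = beta2 * v_i + (1 - beta2) * g * g
        # Bias correction: both averages start at zero, so early values are too small.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Two parameters whose gradients differ by four orders of magnitude.
stepped, m_state, v_state = adam_step(
    [0.0, 0.0], [100.0, 0.01], [0.0, 0.0], [0.0, 0.0], t=1
)
```

On the first step both entries of `stepped` move by very nearly `lr`, even though one gradient is 10,000 times larger than the other: the division by the root of the second moment cancels the gradient's scale.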
AdamW
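Standard Adam with weight decay applied differently:

$$\theta_t = \theta_{t-1} - \alpha \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_{t-1} \right)$$

Here $\hat m_t$ and $\hat v_t$ are Adam's bias-corrected moment estimates, $\alpha$ is the learning rate, and $\lambda$ is the weight decay coefficient. The weight decay is *decoupled* from the gradient: it is applied directly to the parameters, so it doesn't get scaled by $1 / (\sqrt{\hat v_t} + \epsilon)$. In practice this often gives better generalization than adding an L2 penalty to the gradient, and it is now standard for transformer training.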
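The difference between decoupled decay and an L2 penalty folded into the gradient can be sketched on a single scalar parameter. The helper below and its `decoupled` flag are hypothetical, written only for this comparison, and the hyperparameter values are assumptions. With a zero loss gradient, any movement of the parameter comes from the weight decay alone.

```python
import math


def adam_like_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """One scalar update: AdamW-style if decoupled, else Adam with an L2 gradient term."""
    if not decoupled:
        # Coupled variant: the decay passes through Adam's normalization.
        g = g + weight_decay * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    step = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled variant: the decay is added after normalization, unscaled.
        step = step + weight_decay * p
    return p - lr * step, m, v


# Zero loss gradient, so only the weight decay moves the parameter.
p_decoupled, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
p_coupled, _, _ = adam_like_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
```

In the decoupled case the parameter shrinks by `lr * weight_decay * p`, so the decay coefficient controls the shrinkage directly. In the coupled case the decay term is normalized like any other gradient, and on this first step the parameter moves by roughly `lr` regardless of how small `weight_decay` is.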
When to pick what
- Vision CNNs: SGD with momentum + cosine schedule. Often still generalizes better than Adam on ImageNet-class problems.
- Transformers / language: AdamW.
- Sparse data, embedding tables: Adagrad (older), or Adam with careful tuning.
- Second-order (quasi-Newton): L-BFGS for small models with full-batch gradients (rare in deep learning).
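The cosine schedule recommended above for SGD with momentum is simple enough to write by hand. This is a minimal sketch assuming a single decay from `base_lr` to `min_lr` with no warmup or restarts; the function name and default values are illustrative assumptions.

```python
import math


def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Learning rate at `step`, following half a cosine from base_lr down to min_lr."""
    # Clamp so that steps past the end of the schedule stay at min_lr.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [cosine_lr(s, total_steps=100) for s in range(101)]
```

The rate starts at `base_lr`, passes through the midpoint between `base_lr` and `min_lr` halfway through training, and ends at `min_lr`. The curve is flat near both ends, so the model spends more steps at high and low rates than a linear decay would give it.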