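Vanilla SGD updates each parameter by subtracting its gradient scaled by the learning rate. It works, but convergence can be slow when curvature varies widely across parameters: a single learning rate has to be small enough for the steepest direction, which makes it too small for the flattest.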
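The slowdown is easy to reproduce on a toy problem. The sketch below is only an illustration: the quadratic loss, the curvature values, the learning rate, and the helper name `sgd_step` are all assumptions chosen for the example.

```python
import numpy as np

def sgd_step(params, grads, lr):
    """One vanilla SGD update: params <- params - lr * grads."""
    return params - lr * grads

# Toy loss 0.5 * sum(curvature * x**2); its gradient is curvature * x.
curvature = np.array([100.0, 1.0])  # steep in one direction, flat in the other
x = np.array([1.0, 1.0])
lr = 0.015                          # must stay below 2 / 100 to remain stable

for _ in range(100):
    x = sgd_step(x, curvature * x, lr)

# The steep coordinate has converged; the flat one is still well away from zero.
print(x)
```

The learning rate is capped by the steepest direction, so the flat direction shrinks by only 1.5% per step.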
Momentum smooths updates by stepping along an exponential moving average (EMA) of past gradients. RMSProp instead divides each update by the square root of an EMA of squared gradients, effectively giving each parameter its own adaptive learning rate; Adam combines both ideas and adds a bias correction for the early steps. Adam is the de facto default for deep nets, though SGD with momentum and a tuned schedule can match or beat it on some problems.
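A minimal NumPy sketch of the two update rules may make the difference concrete. The function names and default hyperparameters here are illustrative assumptions, and the momentum rule is written in the EMA form described above; some libraries use a slightly different formulation that omits the `(1 - beta)` factor.

```python
import numpy as np

def momentum_step(param, grad, velocity, lr=0.01, beta=0.9):
    """SGD with momentum: step along an EMA of past gradients."""
    velocity = beta * velocity + (1.0 - beta) * grad
    return param - lr * velocity, velocity

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad       # EMA of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # EMA of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for zero init
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by four orders of magnitude.
param = np.zeros(2)
grad = np.array([100.0, 0.01])
param, m, v = adam_step(param, grad, m=np.zeros(2), v=np.zeros(2), t=1)
print(param)  # both entries move by roughly lr, regardless of gradient scale
```

Because the step is divided by the root of the squared-gradient EMA, the first Adam update has nearly the same size for both parameters, which is the per-parameter adaptive rate in action.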
Learning rate schedules matter as much as the optimizer:
- Warmup ramps the rate up from a small initial value to its target over the first steps of training (often a few thousand), preventing early instability.
- Cosine or step decay reduces the rate over time as the network converges.
In practice, chaining these regimes (warmup → high LR → decay) is more robust than picking a single rate and sticking with it.
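One common way to combine warmup and decay is linear warmup followed by cosine decay. The sketch below assumes that particular shape; the function name `lr_at` and every default value (`peak_lr`, `warmup_steps`, `total_steps`, `min_lr`) are hypothetical and would need tuning for a real run.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: rise linearly, reaching peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Decay: progress runs from 0 at the end of warmup to 1 at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr past total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1000, 2000, 50_000, 100_000):
    print(step, lr_at(step))
```

The returned value would be passed to the optimizer at each step, for example as the `lr` argument of an update function.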