The learning rate sets the size of each parameter update. Get it wrong and nothing trains. Get the *schedule* wrong and accuracy can drop several points.
Why a schedule
- Early training: random initialization → gradients can be erratic. Too large an LR diverges; too small wastes compute. A short warmup ramps the LR from 0 to peak over the first few hundred to few thousand steps.
- Middle training: stable, model has caught the main signal. Keep the LR large for fast progress.
- Late training: fine-tuning the basin. Drop the LR to take small steps and converge.
Common schedules
- Step decay: divide the LR by 10 every N epochs, with N fixed in advance. Simple, robust. Sharp drops can disrupt training; smooth schedules are usually better.
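- Cosine with warmup: ramps up, then follows half a cosine wave from peak down to a minimum (often 0): LR = min + 0.5 · (peak − min) · (1 + cos(π · t / T)), where t counts steps since warmup ended and T is the number of decay steps. Smooth, no manual tuning of decay points. Standard for vision and large-scale training.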
- Linear with warmup: ramps up, then decays linearly to 0. Very common for transformers; close to optimal in many setups.
- Inverse square root: LR ∝ 1/√step after warmup. From the original Transformer paper; still used in some NLP setups.
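
The four schedules above can be sketched as plain functions of the step (or epoch) counter. This is a minimal illustration, not any particular library's API: the function names, the `min_lr` floor, and the choice of a linear warmup ramp are all assumptions made for the example.

```python
import math

def step_decay(epoch, peak_lr, drop_every, factor=0.1):
    """Multiply the LR by `factor` once every `drop_every` epochs."""
    return peak_lr * factor ** (epoch // drop_every)

def cosine_with_warmup(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then half-cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def linear_with_warmup(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

def inverse_sqrt(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Scaled so the value equals peak_lr exactly when warmup ends.
    return peak_lr * math.sqrt(max(1, warmup_steps) / max(1, step))
```

Each function is stateless, so the same value can be recomputed at any step, which makes resuming from a checkpoint straightforward.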
Warmup matters more than people think
Skipping warmup is the most common cause of "my model diverged in the first 100 steps." Adam and LayerNorm-heavy architectures especially need it. A few thousand warmup steps cost nothing relative to total training time.
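
A minimal sketch of a linear warmup ramp, together with the arithmetic behind the cost claim. The name `warmup_lr` and every number below are hypothetical values chosen for illustration, not recommendations.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    """Ramp linearly from 0 at step 0 to peak_lr at warmup_steps, then hold."""
    if warmup_steps <= 0:
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)

# Hypothetical run configuration, purely illustrative.
peak_lr = 3e-4
warmup_steps = 2_000
total_steps = 100_000

# LR at a few points during and after the ramp.
for step in (0, 500, 1_000, 2_000, 50_000):
    print(step, warmup_lr(step, peak_lr, warmup_steps))

# Share of the run spent below peak LR because of warmup.
warmup_fraction = warmup_steps / total_steps
print(f"warmup is {warmup_fraction:.1%} of training")
```

With these assumed numbers the ramp occupies 2% of the run, which is the sense in which warmup is cheap.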
Cyclical schedules
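Some training recipes raise the LR again instead of only decaying it: SGDR restarts a cosine decay from the peak several times during training, and 1cycle runs a single large ramp up and back down. A higher LR lets the optimizer escape sharp minima it might have settled into. Less standard, but occasionally produces noticeable gains.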
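
As a rough sketch of the restart idea, the function below repeats a cosine decay every `cycle_steps` steps. It assumes fixed-length cycles and no warmup; SGDR as published also allows cycle lengths to grow after each restart, which is left out here. The name `cosine_with_restarts` and its parameters are made up for this example.

```python
import math

def cosine_with_restarts(step, peak_lr, cycle_steps, min_lr=0.0):
    """Cosine decay from peak_lr toward min_lr, restarting every cycle_steps."""
    # Position inside the current cycle, in [0, 1).
    position = (step % cycle_steps) / cycle_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * position))
```

At every multiple of `cycle_steps` the position wraps to 0, so the LR jumps back to `peak_lr` and the decay starts again.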
What I do in practice
For most problems: linear warmup (3–10% of total steps) → cosine decay to ~10% of peak. Sweep peak LR over 4 values on a small fraction of training and pick the best.
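
One way this recipe could look in code. The defaults (5% warmup, 10% floor) sit inside the ranges above; the helper names and the geometric spacing of the sweep grid (a factor of 3 between neighbors) are assumptions for illustration, since the text only says to try 4 values.

```python
import math

def recipe_lr(step, peak_lr, total_steps, warmup_frac=0.05, floor_frac=0.10):
    """Linear warmup to peak_lr, then cosine decay to floor_frac * peak_lr."""
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    floor_lr = floor_frac * peak_lr
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return floor_lr + 0.5 * (peak_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))

def sweep_candidates(center_lr, num=4, ratio=3.0):
    """Geometrically spaced peak LRs centered (in log space) on center_lr."""
    offsets = [i - (num - 1) / 2 for i in range(num)]
    return [center_lr * ratio ** offset for offset in offsets]

# Hypothetical sweep: four candidate peak LRs around 1e-3.
for peak in sweep_candidates(1e-3):
    print(f"peak_lr={peak:.2e}, final_lr={recipe_lr(10_000, peak, 10_000):.2e}")
```

Each candidate would then be trained on a short run with its own schedule, and the best-performing peak carried over to the full run.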