Learning Rate Schedules — Section 6: Training Modern Models

The learning rate sets the size of each parameter update. Get it wrong and nothing trains. Get the *schedule* wrong and accuracy can drop several points.

Why a schedule

Early training: random initialization → gradients can be erratic. Too-large $\eta$ diverges; too-small wastes compute. A short warmup ramps $\eta$ from 0 to its peak over the first few hundred to few thousand steps.
Middle training: stable, the model has picked up the main signal. Keep $\eta$ large for fast progress.
Late training: refining the solution inside the basin the optimizer has reached. Drop $\eta$ to take small steps and converge.
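
A minimal sketch of the warmup ramp described above, in plain Python. The function name `warmup_lr` and its parameters are hypothetical, chosen for illustration; holding $\eta$ at the peak once warmup ends is an assumption standing in for whatever decay follows.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear ramp from 0 to peak_lr over warmup_steps, then hold at peak_lr.

    Assumption: the post-warmup phase is a constant peak; a real schedule
    would hand over to a decay rule at this point.
    """
    if warmup_steps <= 0:
        # No warmup requested: start at the peak immediately.
        return peak_lr
    return peak_lr * min(1.0, step / warmup_steps)
```

For example, with `peak_lr=1e-3` and `warmup_steps=1000`, step 500 gets half the peak and every step from 1000 onward gets the full peak.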

Common schedules

Step decay: divide $\eta$ by 10 every $N$ epochs. Simple, robust. Sharp drops can disrupt training; smooth schedules are usually better.
Cosine with warmup: ramp up, then $\eta_t = \frac{1}{2}\eta_{\max}(1 + \cos(\pi t/T))$, where $t$ counts steps since the end of warmup and $T$ is the length of the decay phase. Smooth, no manual tuning of decay points. A common default for vision and large-scale training.
Linear with warmup: ramp up, then decay $\eta$ linearly to 0 by the end of training. Very common for transformers; often close to optimal in practice.
Inverse square root: $\eta_t \propto 1/\sqrt{t}$ after warmup. From the original Transformer paper; still used in some NLP setups.
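
The four schedules above can be sketched as plain Python functions. All names and signatures here are hypothetical. For the cosine and linear rules, `t` is assumed to count steps since the end of warmup and `total_steps` is the length of the decay phase; the inverse-square-root rule is scaled so that it equals the peak exactly when warmup ends, which is one reasonable normalization rather than the only one.

```python
import math


def step_decay(epoch: int, peak_lr: float, every_n_epochs: int,
               factor: float = 0.1) -> float:
    """Multiply the rate by `factor` (0.1 = divide by 10) every N epochs."""
    return peak_lr * factor ** (epoch // every_n_epochs)


def cosine_decay(t: int, peak_lr: float, total_steps: int) -> float:
    """0.5 * peak * (1 + cos(pi * t / T)); t is clamped to the decay length."""
    t = min(t, total_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t / total_steps))


def linear_decay(t: int, peak_lr: float, total_steps: int) -> float:
    """Straight line from peak_lr at t=0 down to 0 at t=total_steps."""
    return peak_lr * max(0.0, 1.0 - t / total_steps)


def inverse_sqrt(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Linear warmup, then decay proportional to 1/sqrt(step).

    Assumes warmup_steps >= 1. `step` counts from the start of training,
    so the two branches meet at peak_lr when step == warmup_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)
```

Note the difference in tails: cosine and linear reach 0 at a known final step, while inverse square root never does, so it needs no fixed training length in advance.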

Warmup matters more than people think

Skipping warmup is one of the most common causes of "my model diverged in the first 100 steps." Adam-style optimizers and LayerNorm-heavy architectures especially need it. A few thousand warmup steps cost almost nothing relative to total training time.

Cyclical schedules

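Some training recipes raise $\eta$ again instead of decaying it monotonically. SGDR (cosine annealing with warm restarts) resets $\eta$ to its peak at the start of each cycle; 1cycle uses a single long ramp up followed by a ramp down. The larger steps can let the optimizer escape a sharp minimum it has settled into. Less standard, but occasionally produces noticeable gains.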
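
A rough sketch of cosine annealing with restarts, assuming every cycle has the same fixed length (SGDR also allows cycles that grow over time, which this sketch leaves out). The name `cosine_with_restarts` and the `min_lr` floor parameter are hypothetical.

```python
import math


def cosine_with_restarts(step: int, peak_lr: float, cycle_steps: int,
                         min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr toward min_lr, restarting every cycle_steps.

    Assumption: fixed cycle length, no warmup, no shrinking of the peak
    from one cycle to the next.
    """
    t = step % cycle_steps  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t / cycle_steps))
    return min_lr + (peak_lr - min_lr) * cosine
```

At `step == cycle_steps` the position inside the cycle wraps to 0, so the rate jumps back to `peak_lr`; that jump is the restart.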

What I do in practice

For most problems: linear warmup (3–10% of total steps) → cosine decay to ~10% of peak. Sweep peak LR over 4 values on a small fraction of training and pick the best.
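
The recipe above, as a hedged sketch in plain Python. The names `warmup_cosine` and `peak_lr_candidates` are hypothetical, and the defaults (5% warmup, 10% floor) are just one point inside the ranges given above. Spacing the sweep values logarithmically is an assumption; the recipe only says to try 4 values.

```python
import math


def warmup_cosine(step: int, total_steps: int, peak_lr: float,
                  warmup_frac: float = 0.05, floor_frac: float = 0.1) -> float:
    """Linear warmup to peak_lr, then cosine decay to floor_frac * peak_lr."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    floor = floor_frac * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


def peak_lr_candidates(low: float, high: float, n: int = 4) -> list[float]:
    """n log-spaced peak learning rates from low to high, inclusive.

    Assumption: log spacing; the values to sweep are a judgment call.
    """
    if n == 1:
        return [low]
    ratio = (high / low) ** (1.0 / (n - 1))
    return [low * ratio ** i for i in range(n)]
```

A sweep would then run a short training job for each value in, say, `peak_lr_candidates(1e-4, 1e-1)`, passing it as `peak_lr` to `warmup_cosine`, and keep the one with the best result.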