Mixed-Precision Training — Section 6: Training Modern Models

Modern GPUs (V100 and later) have Tensor Cores that run half-precision matrix arithmetic significantly faster than fp32. Mixed-precision training runs the forward and backward passes with fp16 or bf16 weights and activations, but keeps an fp32 master copy of the weights so that small updates are not rounded away.

FP16 vs BF16

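FP16: 1 sign + 5 exponent + 10 mantissa bits. Narrow dynamic range: normal values span roughly $[6 \times 10^{-5}, 65504]$, with subnormals reaching down to about $6 \times 10^{-8}$. Small gradients easily underflow during the backward pass.
BF16: 1 sign + 8 exponent + 7 mantissa bits. Same exponent range as fp32 (roughly $[10^{-38}, 10^{38}]$) but less precision: about 2 to 3 significant decimal digits, versus about 7 for fp32. Much easier to use safely.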
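
The range and precision trade-off between the two formats can be checked without a GPU. The following is a minimal sketch that emulates the rounding with Python's `struct` module; the helper names `to_fp16` and `to_bf16` and the three probe values are illustrative assumptions, not part of any library.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range; hardware would give Inf.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (finite inputs only, for brevity)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 is the top 16 bits of fp32; round to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical probe values: one tiny, one large, one close to 1.
tiny, big, near_one = 1e-8, 70000.0, 1.001

print(to_fp16(tiny), to_bf16(tiny))          # fp16 flushes to 0.0; bf16 keeps roughly 1e-8
print(to_fp16(big), to_bf16(big))            # fp16 overflows to inf; bf16 stays finite
print(to_fp16(near_one), to_bf16(near_one))  # fp16 keeps 1.0009765625; bf16 rounds to 1.0
```

The first two lines show BF16's wider range, the last shows FP16's finer precision near 1.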

For training: use BF16 if your hardware supports it (Ampere, Hopper, TPU); otherwise use FP16 with loss scaling.

Loss scaling (FP16)

Multiply the loss by a large constant $S$ before the backward pass; by the chain rule, this scales every gradient by $S$. Divide the gradients by $S$ (in fp32) before applying the optimizer update. This shifts small gradient values up into FP16's representable range, preventing underflow. PyTorch's `GradScaler` does this automatically.
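
The effect of scaling on a single gradient value can be sketched as follows. This is an emulation using Python's `struct` module rather than real training code: `to_fp16`, the scale `S = 2**16`, and the gradient value `1e-8` are illustrative assumptions, and Python floats stand in for fp32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (in-range inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


S = 2.0 ** 16       # illustrative loss scale; a power of two adds no rounding error itself
true_grad = 1e-8    # hypothetical gradient, below fp16's smallest subnormal

unscaled = to_fp16(true_grad)     # what an fp16 backward pass would store without scaling
scaled = to_fp16(true_grad * S)   # what it stores when the loss was multiplied by S
recovered = scaled / S            # unscale in higher precision before the update

print(unscaled)    # 0.0: the gradient is lost
print(recovered)   # close to 1e-8, up to fp16 rounding error
```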

BF16 doesn't need loss scaling: its exponent range matches fp32, so gradient magnitudes seen in practice stay representable.

Mixed-precision recipe

1. Keep an fp32 master copy of weights.
2. Cast to half-precision for forward pass; activations are half-precision.
3. Compute loss; apply loss scaling (FP16 only).
4. Backward pass produces half-precision gradients.
5. Cast gradients to fp32; divide out the loss scale.
6. Apply optimizer update to the fp32 master weights.
7. Cast updated weights back to half-precision for the next forward pass.
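
To see why the recipe keeps an fp32 master copy, the sketch below emulates the update loop for a single scalar weight, with and without the master copy. It is a toy under stated assumptions: there is no real forward pass, the gradient is a fixed hypothetical constant, Python floats stand in for fp32, and `to_fp16` and `run` are illustrative names.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (in-range inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def run(steps: int, use_master: bool, lr: float = 0.01,
        grad: float = 0.01, scale: float = 1024.0) -> float:
    """Return the final half-precision weight after `steps` updates."""
    master = 1.0              # step 1: master weight (Python float stands in for fp32)
    w16 = to_fp16(master)     # step 2: half-precision copy used by the forward pass
    for _ in range(steps):
        g16 = to_fp16(grad * scale)   # steps 3-4: scaled gradient stored in fp16
        g32 = g16 / scale             # step 5: cast up and divide out the loss scale
        if use_master:
            master -= lr * g32        # step 6: update the master weight
            w16 = to_fp16(master)     # step 7: refresh the half-precision copy
        else:
            w16 = to_fp16(w16 - lr * g32)   # naive: update the fp16 weight directly
    return w16


print(run(1000, use_master=False))  # 1.0: every update of about 1e-4 is rounded away
print(run(1000, use_master=True))   # about 0.9
```

Near 1.0 the spacing between adjacent fp16 values is about $5 \times 10^{-4}$, so an update of $10^{-4}$ applied directly to an fp16 weight rounds back to the old value every time. The master copy accumulates the same updates without loss.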

What you gain

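Roughly half the activation memory: bigger batches, longer sequences. (The fp32 master weights and optimizer state still take full-precision storage, so total memory does not halve.)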
1.5–3x throughput on supported hardware.
Practically no accuracy loss on most workloads (with BF16; FP16 occasionally needs hyperparameter retuning).

What you have to watch

Numerically sensitive operations: layer norm statistics, softmax in attention, and log-softmax in the loss. Keep these in fp32 explicitly.
Loss scaling can overflow gradients to Inf/NaN if $S$ is too large. Dynamic scalers such as PyTorch's `GradScaler` handle this by skipping the affected step and lowering the scale.
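
Both hazards can be illustrated in plain Python. The sketch below first shows a naive fp16 reduction stalling (the kind of error that would corrupt a mean or variance), then a simple dynamic scale policy. Everything here is an assumption for illustration: `to_fp16`, `naive_fp16_sum`, `wide_sum`, and `DynamicLossScaler` are hypothetical names, and the class is not PyTorch's implementation, although its default values follow the defaults documented for `GradScaler`.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (in-range inputs only)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def naive_fp16_sum(values):
    """Sequential sum where the accumulator itself is rounded to fp16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def wide_sum(values):
    """Same fp16 inputs, but accumulated in a wider type."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return acc


class DynamicLossScaler:
    """Back off on Inf/NaN gradients, grow after a run of clean steps."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads) -> bool:
        """Return True if the optimizer step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            self.scale *= self.backoff_factor   # overflow: shrink the scale, skip the step
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            self.scale *= self.growth_factor    # stable for a while: try a larger scale
            self._clean_steps = 0
        return True


ones = [1.0] * 4096
print(naive_fp16_sum(ones), wide_sum(ones))   # 2048.0 4096.0

scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=2)
print(scaler.update([math.inf, 0.5]), scaler.scale)   # False 512.0
print(scaler.update([0.1, 0.5]), scaler.scale)        # True 512.0
print(scaler.update([0.1, 0.5]), scaler.scale)        # True 1024.0
```

The fp16 sum stalls at 2048 because adjacent fp16 values there are 2 apart, so adding 1 rounds back down. Real kernels are usually more careful than this sequential loop, but the example shows why reductions are the operations to keep in fp32.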