Modern GPUs (NVIDIA V100 and later) have specialized hardware for half-precision arithmetic that is significantly faster than fp32. Mixed-precision training stores activations and a working copy of the weights in fp16 or bf16 but maintains a master copy of the weights in fp32, so that small updates are not lost to rounding.
FP16 vs BF16
- FP16: 1 sign + 5 exponent + 10 mantissa bits. Narrow dynamic range (largest finite value 65504, smallest positive subnormal about 6 x 10^-8). Small gradients easily underflow during the backward pass.
- BF16: 1 sign + 8 exponent + 7 mantissa bits. Same exponent range as fp32 (10^-38, 10^38) but less precision. Much easier to use safely.
For training: BF16 if your hardware supports it (Ampere, Hopper, TPU). FP16 with loss scaling otherwise.
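The range and precision trade-off can be checked without a GPU. The sketch below uses only Python's standard `struct` module; the helper names `to_fp16` and `to_bf16` are made up for this illustration, and the bf16 conversion is a simplified bit-level emulation (no NaN handling), not what any particular framework does internally.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (fp16) value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range; hardware would give +/-inf.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to bf16 by keeping the top 16 bits of its fp32 encoding.

    Rounds to nearest, ties to even, on the 16 dropped bits.
    NaN inputs are not handled; this is an illustration only.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1e-8, 1.001, 70000.0):
    print(f"{value:>10g}  fp16={to_fp16(value)!r:<22} bf16={to_bf16(value)!r}")
```

Under these assumptions, 1e-8 is flushed to 0.0 in fp16 but survives in bf16, 70000 overflows to inf in fp16 but rounds to 70144.0 in bf16, and 1.001 keeps its fractional part in fp16 while bf16 rounds it to 1.0. That is the range-versus-precision trade described in the list above.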
Loss scaling (FP16)
Multiply the loss by a large constant S before backward, scaling all gradients up by S. Divide by S before applying the update. This shifts gradient values into FP16's representable range, preventing underflow in small gradients. PyTorch's `GradScaler` does this automatically.
BF16 doesn't need loss scaling: its exponent range matches fp32, so gradients that are representable in fp32 do not underflow in practice.
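A small numeric sketch of why scaling helps. It simulates fp16 storage with the standard `struct` module rather than running a real backward pass, so the multiplication by the scale stands in for what the chain rule does when the loss is scaled. The constant `LOSS_SCALE` and the helper `to_fp16` are names chosen for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


LOSS_SCALE = 2.0 ** 16  # illustrative value; a power of two keeps scaling exact

true_grad = 1e-8  # smaller than fp16's smallest positive value

# Without scaling, the gradient is flushed to zero when stored in fp16.
unscaled_fp16 = to_fp16(true_grad)

# With scaling, the stored value is LOSS_SCALE times larger and survives.
scaled_fp16 = to_fp16(true_grad * LOSS_SCALE)

# Unscale in full precision before the optimizer uses the gradient.
recovered = scaled_fp16 / LOSS_SCALE

# The flip side: a gradient that was already large now overflows.
overflowed = to_fp16(1.0 * LOSS_SCALE)

print(unscaled_fp16, recovered, overflowed)
```

The unscaled gradient comes out as exactly 0.0, the scaled-then-unscaled one is close to 1e-8, and a gradient of 1.0 overflows to inf at this scale, which is why the scale has to be chosen (or adapted) with care.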
Mixed-precision recipe
1. Keep an fp32 master copy of weights.
2. Cast to half-precision for forward pass; activations are half-precision.
3. Compute loss; apply loss scaling (FP16 only).
4. Backward pass produces half-precision gradients.
5. Cast gradients to fp32; divide out the loss scale.
6. Apply optimizer update to the fp32 master weights.
7. Cast updated weights back to half-precision for the next forward pass.
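The recipe can be traced end to end on a toy problem. The sketch below fits a single weight `w` so that `w * x` matches `y`, emulating fp16 and fp32 rounding with the standard `struct` module. Everything here (`train_step`, the squared-error loss, plain SGD, the chosen learning rate and scale) is an assumption made for illustration; in a real framework the casts and the scaling are handled for you.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to fp16; values beyond the fp16 range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def train_step(master_w, x, y, lr, loss_scale):
    # Step 1: master_w is the fp32 master copy.
    # Step 2: cast to half precision for the forward pass.
    w16, x16, y16 = to_fp16(master_w), to_fp16(x), to_fp16(y)
    pred16 = to_fp16(w16 * x16)
    # Step 3: loss = (pred - y)^2, multiplied by loss_scale.
    err16 = to_fp16(pred16 - y16)
    # Step 4: backward pass in half precision (chain rule on the scaled loss).
    dpred16 = to_fp16(loss_scale * 2.0 * err16)
    grad16 = to_fp16(dpred16 * x16)
    # Step 5: cast the gradient to fp32 and divide out the loss scale.
    grad32 = to_fp32(grad16) / loss_scale
    if not math.isfinite(grad32):
        return master_w  # overflow: skip this update
    # Step 6: SGD update on the fp32 master weight.
    master_w = to_fp32(master_w - lr * grad32)
    # Step 7: the next call recasts master_w to fp16 for its forward pass.
    return master_w


w = 0.0
for _ in range(200):
    w = train_step(w, x=1.0, y=2.0, lr=0.1, loss_scale=1024.0)
print(w)  # close to 2.0, limited by fp16 resolution in the forward pass

# Why the master copy matters: a 1e-4 update to a weight of 1.0
# disappears in fp16 but is retained in fp32.
lost = to_fp16(1.0 - 1e-4)
kept = to_fp32(1.0 - 1e-4)
print(lost, kept)
```

The last two lines show the motivation for step 1: applied directly to fp16 weights, an update much smaller than the fp16 spacing near the weight rounds away entirely, so training would stall long before the fp32 copy stops improving.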
What you gain
- Up to 2x memory headroom on activations: bigger batches, longer sequences. The fp32 master weights and optimizer state do not shrink.
- 1.5–3x throughput on supported hardware.
- Practically no accuracy loss on most workloads (with BF16; FP16 occasionally needs hyperparameter retuning).
What you have to watch
- Numerically sensitive operations: layer norm statistics, softmax in attention, log-softmax in the loss. Keep these in fp32 explicitly.
- Loss scaling can push gradients to Inf/NaN if the scale is too large. Modern dynamic scalers (PyTorch's `GradScaler`) handle this by skipping the affected step and lowering the scale.
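The dynamic-scaling policy is simple enough to sketch in a few lines. The class below is a simplified stand-in written for this illustration, not PyTorch's implementation; its parameter names and default values are modelled on those documented for `GradScaler`, and the `update` method is a hypothetical interface.

```python
import math


class DynamicLossScaler:
    """Simplified dynamic loss-scale policy (illustrative sketch only)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0  # consecutive steps without Inf/NaN

    def update(self, grads) -> bool:
        """Inspect this step's gradients; return True if the step should be applied."""
        if any(not math.isfinite(g) for g in grads):
            # Overflow: shrink the scale and tell the caller to skip the step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # A long clean run: try a larger scale to protect small gradients.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(growth_interval=3)  # short interval for the demo
for grads in ([float("inf"), 1.0], [0.1], [0.2], [0.3]):
    applied = scaler.update(grads)
    print(applied, scaler.scale)
```

In this run the first step is skipped and the scale halves; after three clean steps it doubles back to its starting value. The asymmetry (back off immediately, grow slowly) keeps the scale near the largest value that does not overflow.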