Training a deep network end to end is harder than the math suggests. A collection of engineering techniques, most of them discovered empirically, is what makes modern deep learning work.
Initialization
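Draw the initial weights at random from a distribution whose scale depends on the layer's width. Glorot/Xavier initialization suits tanh; He initialization suits ReLU. Bad initialization (e.g., all zeros, or all the same value) makes every neuron in a layer compute the same function and receive the same gradient, so the neurons never differentiate and learning stalls.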
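To make the scaling concrete, here is a minimal sketch of both schemes in plain Python. It assumes the common formulations (Glorot uniform with limit sqrt(6 / (fan_in + fan_out)), He normal with standard deviation sqrt(2 / fan_in)); the function names and layer sizes are illustrative, not taken from any particular library.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

def variance(matrix):
    """Empirical variance over all entries of a weight matrix."""
    flat = [w for row in matrix for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)

rng = random.Random(0)               # fixed seed so the sketch is repeatable
w_tanh = glorot_uniform(256, 128, rng)  # hypothetical 256 -> 128 tanh layer
w_relu = he_normal(256, 128, rng)       # hypothetical 256 -> 128 ReLU layer
```

The empirical variances should land near 2 / (fan_in + fan_out) for Glorot and 2 / fan_in for He, which is the point of both schemes: keep the scale of activations roughly constant from layer to layer.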
Batch normalization
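Normalize each feature's activations within a mini-batch to zero mean and unit variance, then apply a learnable scale and shift. This often accelerates training substantially and acts as a mild regularizer. Side effect: each example's output depends on the other examples in its batch, so the statistics become noisy in small-batch settings, and inference has to rely on running averages collected during training.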
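The training-time computation can be sketched as follows. This is a simplified illustration for a batch of flat feature vectors; the running statistics used at inference are left out, and the names `gamma` (scale) and `beta` (shift) follow common convention rather than anything stated above.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature across the batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: per-feature learnable scale and shift.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [example[j] for example in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)  # eps guards against zero variance
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Both features end up on the same scale despite differing by a factor of ten in the input. Note that the statistics are computed over the batch axis, which is why a batch of one or two examples gives unreliable estimates.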
Layer normalization
Like batch norm, but normalizes across the features of each example instead of across the batch. Used in transformers and most modern NLP models. Doesn't depend on batch size, so it works even for batch size 1.
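For comparison with batch norm, a minimal sketch of layer normalization on a single example. The statistics come from that example's own features, so no batch is involved; `gamma` and `beta` are again illustrative names for the learnable scale and shift.

```python
import math

def layer_norm(example, gamma, beta, eps=1e-5):
    """Normalize one example across its features, then scale and shift."""
    n = len(example)
    mean = sum(example) / n
    var = sum((x - mean) ** 2 for x in example) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * inv_std + b
            for x, g, b in zip(example, gamma, beta)]

# A "batch" of size 1 is enough: nothing depends on other examples.
single = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```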
Dropout
Randomly zero out a fraction of activations during training. This forces the network not to rely on any single feature, which acts as regularization. Disabled at inference.
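A minimal sketch of the idea, assuming the widely used "inverted dropout" variant: surviving activations are scaled by 1 / (1 - p) during training so that their expected value is unchanged and inference needs no correction. The rescaling is an assumption of this sketch, not something specified above.

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: zero each activation with probability p when training.

    Survivors are divided by (1 - p) to keep the expected value unchanged.
    At inference (training=False) the input passes through untouched.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000
train_out = dropout(acts, 0.5, rng)                  # roughly half are zeroed
eval_out = dropout(acts, 0.5, rng, training=False)   # identity at inference
```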
Learning rate scheduling
Constant LR works but isn't optimal. Common schedules:
- Step decay: divide LR by 10 every K epochs
- Cosine: smooth decay to a minimum, sometimes with warm restarts
- Warmup: start small, ramp up, then decay — helps transformers and large-batch training
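The three schedules above can be sketched as plain functions of the epoch or step index. The parameter names (`k`, `warmup_steps`, `total_steps`, `min_lr`) are illustrative, and the linear warmup followed by cosine decay is one common combination among several.

```python
import math

def step_decay(base_lr, epoch, k, factor=0.1):
    """Multiply the LR by `factor` (default: divide by 10) every k epochs."""
    return base_lr * factor ** (epoch // k)

def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    """Smoothly decay from base_lr to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def warmup_cosine(base_lr, step, warmup_steps, total_steps, min_lr=0.0):
    """Linear ramp-up for warmup_steps, then cosine decay for the remainder.

    Assumes total_steps > warmup_steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_decay(base_lr, step - warmup_steps,
                        total_steps - warmup_steps, min_lr)

# Example: a 1000-step run with 100 warmup steps.
schedule = [warmup_cosine(3e-4, s, 100, 1000) for s in range(1000)]
```

Warm restarts are not shown; they would reset `step` to zero at chosen points and rerun the cosine curve.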
Gradient clipping
Cap the gradient's L2 norm at some threshold by rescaling the whole gradient whenever its norm exceeds it. Prevents exploding gradients, especially in RNNs and language models.
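A minimal sketch of clipping by global norm, with the gradient flattened into a single list of floats for simplicity. The function name and the choice to return the pre-clip norm (handy for logging) are illustrative.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    The direction of the gradient is preserved; only its length changes.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm or total_norm == 0.0:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```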
Early stopping
Monitor validation loss; stop training when it stops improving (with some patience). A free regularizer.
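The patience logic can be sketched over a precomputed list of validation losses. In a real training loop the losses would arrive one epoch at a time and the best checkpoint would be saved as it is found; the function name and the `min_delta` parameter are illustrative.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve on
    the best loss by more than min_delta.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # ran to completion

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With a patience of 3, this run stops at epoch 5 and keeps the weights from epoch 2. It never sees the later improvement at epoch 6, which is the trade-off that the patience value controls.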
Mixed precision
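Compute most operations in fp16 (or bf16) while keeping a master copy of the weights in fp32. Activation memory is roughly halved, and throughput can improve by up to about 2x on GPUs with tensor cores, depending on the model. The accuracy cost is usually negligible, though fp16 typically needs loss scaling to keep small gradients from underflowing.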
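Why keep the master copy in fp32? A small sketch that emulates the rounding with the standard library's `struct` module (format `e` is IEEE half precision, `f` is single precision). The weight value, update size, and step count are made up for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

update = to_fp16(1e-4)  # a small per-step weight update, computed in fp16
steps = 100

# Weights stored in fp16: near 1.0 the spacing between fp16 values is about
# 1e-3, so adding 1e-4 rounds straight back to the old value every time.
w_half = to_fp16(1.0)
for _ in range(steps):
    w_half = to_fp16(w_half + update)

# Master weights in fp32: the same updates accumulate as expected.
w_master = to_fp32(1.0)
for _ in range(steps):
    w_master = to_fp32(w_master + update)

w_cast = to_fp16(w_master)  # fp16 copy used for the next forward pass
```

After 100 steps the fp16-only weight has not moved at all, while the fp32 master has advanced by about 0.01 and its fp16 cast reflects that progress.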
When training stalls
Check the obvious before tweaking: learning rate too high (loss explodes) or too low (training crawls), labels wrong, gradients NaN, no data augmentation when overfitting. Plot loss curves and gradient norms — they reveal most failure modes.