Training Deep Networks — Section 9: Neural Networks

Training a deep network end-to-end is harder than the math suggests. A collection of engineering techniques, most of them discovered empirically, is what makes modern deep learning work.

Initialization

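Draw random weights from a distribution whose variance is scaled to the layer's width: Glorot/Xavier (variance 2 / (fan_in + fan_out)) for tanh, He (variance 2 / fan_in) for ReLU. Bad initialization (e.g., all zeros, or every weight set to the same value) makes all neurons in a layer compute the same function and receive the same gradient, so they never differentiate and learning stalls.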
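
As a rough illustration of the two scaling rules, here is a minimal pure-Python sketch. The function names he_normal and glorot_uniform and the layer sizes are chosen for this example only; real frameworks ship their own initializers.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean Gaussian with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier initialization: uniform on [-a, a] with
    a = sqrt(6 / (fan_in + fan_out)), i.e. variance 2 / (fan_in + fan_out)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]

rng = random.Random(0)                   # fixed seed so the sketch is repeatable
w_relu = he_normal(256, 128, rng)        # weights feeding a ReLU layer
w_tanh = glorot_uniform(256, 128, rng)   # weights feeding a tanh layer
```

Because every weight is drawn independently, no two neurons start out identical, which is what breaks the symmetry that an all-zeros or all-constant initialization would create.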

Batch normalization

Normalize each feature's activations within a mini-batch to zero mean and unit variance, then apply a learnable scale and shift. Dramatically accelerates training and acts as a mild regularizer. Side effect: each example's output depends on the other examples in its batch, so batch statistics get noisy at small batch sizes, and inference has to use running averages collected during training instead.
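
The training-time forward pass can be sketched in a few lines. This is a simplified illustration, not a full layer: the names batch_norm, gamma, beta and eps are assumptions of this example, and the running averages that a real implementation keeps for inference are omitted.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    batch: list of examples, each a list of feature values.
    gamma, beta: per-feature learnable scale and shift.
    """
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n  # biased batch variance
        inv_std = 1.0 / math.sqrt(var + eps)            # eps avoids division by zero
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out

# Two features on very different scales end up on the same scale.
x = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
y = batch_norm(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Note that the statistics are computed down each column, so changing one example changes the output for every other example in the batch.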

Layer normalization

Like batch norm, but normalizes across the features of each example instead of across the batch. Used in transformers and most modern NLP models. Doesn't depend on batch size, so it works even at batch size 1 and behaves identically in training and inference.
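
For contrast with batch norm, a minimal sketch of layer norm on a single example follows. The names layer_norm, gamma, beta and eps are assumptions of this example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one example across its own features, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# A single example is enough: no other batch members are involved.
y = layer_norm([1.0, 2.0, 3.0, 4.0], gamma=[1.0] * 4, beta=[0.0] * 4)
```

The function never sees the rest of the batch, which is why the result is the same at batch size 1 as at batch size 1024.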

Dropout

Randomly zero out a fraction p of activations during training and scale the survivors by 1 / (1 - p) so the expected activation is unchanged. Forces the network not to rely on any single feature, which regularizes it. Disabled at inference.
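
A minimal sketch of the common "inverted dropout" convention, in which surviving activations are scaled up by 1 / (1 - p) during training so that inference needs no adjustment. The function name and signature are assumptions of this example.

```python
import random

def dropout(activations, p, rng, training=True):
    """Inverted dropout: drop each activation with probability p while training."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)       # inference: pass activations through unchanged
    keep = 1.0 - p
    # Survivors are divided by the keep probability so the expected value is preserved.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
train_out = dropout([1.0] * 10000, p=0.5, rng=rng)
eval_out = dropout([1.0, 2.0], p=0.5, rng=rng, training=False)
```

With p = 0.5 each training output is either 0.0 or 2.0, and the average stays close to the original activation of 1.0.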

Learning rate scheduling

A constant LR works but is rarely optimal: a large rate helps early progress, a small one helps final convergence. Common schedules:

Step decay: divide LR by 10 every K epochs
Cosine: smooth decay to a minimum, sometimes with warm restarts
Warmup: start small, ramp up, then decay — helps transformers and large-batch training
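
The three schedules above can be sketched as plain functions of the epoch or step index. The function names, the linear shape of the warmup, and the default values are assumptions of this example; warm restarts are not shown.

```python
import math

def step_decay(base_lr, epoch, drop_every, factor=0.1):
    """Multiply the LR by `factor` (0.1 = divide by 10) every `drop_every` epochs."""
    return base_lr * factor ** (epoch // drop_every)

def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine wave."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def warmup_then_cosine(base_lr, step, warmup_steps, total_steps, min_lr=0.0):
    """Linear ramp up to base_lr over warmup_steps, then cosine decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return cosine_decay(base_lr, step - warmup_steps,
                        total_steps - warmup_steps, min_lr)

# Sample the combined schedule at a few points of a 100-step run.
samples = [warmup_then_cosine(0.1, s, warmup_steps=10, total_steps=100)
           for s in (0, 9, 10, 55, 100)]
```

The sampled values rise from a tenth of the base rate to the full rate during warmup, then fall back toward the minimum by the final step.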

Gradient clipping

If the gradient's global L2 norm exceeds a threshold, rescale the whole gradient so its norm equals the threshold; the direction is preserved. Prevents exploding gradients, especially in RNNs and language models.
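
A minimal sketch of clipping by global norm, where the norm is taken over all parameters together rather than per tensor. Gradients are represented as plain lists of floats, and the name clip_by_global_norm is an assumption of this example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their joint L2 norm is at most max_norm.

    grads: list of per-parameter gradients, each a flat list of floats.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if total_norm <= max_norm:
        return [list(t) for t in grads], total_norm
    scale = max_norm / total_norm   # same factor everywhere keeps the direction
    return [[g * scale for g in t] for t in grads], total_norm

clipped, norm_before = clip_by_global_norm([[3.0, 4.0], [12.0]], max_norm=1.0)
```

Here the joint norm is 13, so every component is multiplied by 1/13. Logging the returned pre-clip norm is a cheap way to see how often clipping actually fires.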

Early stopping

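Monitor validation loss; stop training once it has not improved for a set number of epochs (the patience), and keep the checkpoint with the best validation loss. A nearly free regularizer: it costs only the validation passes.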
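
The patience logic fits in a small helper. This is a sketch: the class name EarlyStopping, the min_delta parameter, and the loss values are made up for illustration, and checkpoint saving is left out.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # smallest change that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0         # improvement: reset the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69]   # illustrative validation losses
stopped_at = None
for epoch, loss in enumerate(history):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With a patience of 2 the loop stops at epoch 4, after two epochs without beating 0.7, and never sees the later 0.69. That trade-off is why the patience value matters.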

Mixed precision

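Compute most operations in fp16 (or bf16) while keeping a master copy of the weights in fp32. Roughly halves activation memory and can be about 2x faster on GPUs with tensor cores, with essentially no loss in accuracy. fp16 usually needs loss scaling to keep small gradients from underflowing; bf16 has the same exponent range as fp32 and usually does not.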
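
One practical wrinkle with fp16 is that very small gradients underflow to zero. The usual remedy is loss scaling: multiply the loss by a large constant before the backward pass, then divide the gradients by the same constant in higher precision. The sketch below imitates fp16 rounding with the standard library's half-precision struct format; the gradient value and the scale of 2**16 are assumptions chosen for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

true_grad = 1e-8          # below the smallest positive fp16 value (about 6e-8)
loss_scale = 65536.0      # 2**16, an illustrative scale factor

unscaled = to_fp16(true_grad)               # stored directly: underflows to zero
scaled = to_fp16(true_grad * loss_scale)    # scaled first: representable in fp16
recovered = scaled / loss_scale             # unscale in higher precision
```

Stored directly, the gradient is lost; scaled first, it survives the round trip through fp16 with only a small rounding error.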

When training stalls

Check the obvious before tweaking: learning rate too high (loss explodes) or too low (training crawls), wrong or misaligned labels, NaN gradients, missing data augmentation when the model overfits. Plot loss curves and gradient norms; they reveal most failure modes.
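
Some of these checks can be automated on the logged curves. The helper below is a hypothetical heuristic, not a standard tool: the function name, the thresholds (a 10x blow-up, a 1% plateau), and the sample values are all assumptions of this example.

```python
import math

def check_training_signals(losses, grad_norms, explode_factor=10.0):
    """Return a list of warnings based on logged losses and gradient norms."""
    issues = []
    if any(not math.isfinite(v) for v in losses + grad_norms):
        issues.append("non-finite loss or gradient norm")
    finite = [v for v in losses if math.isfinite(v)]
    # Loss far above its best value suggests divergence.
    if finite and finite[-1] > explode_factor * min(finite):
        issues.append("loss has blown up: learning rate may be too high")
    # Loss within 1% of its starting value suggests no real progress.
    if len(finite) >= 2 and abs(finite[0] - finite[-1]) < 0.01 * abs(finite[0]):
        issues.append("loss barely moved: learning rate may be too low")
    return issues

nan_run = check_training_signals([2.3, 2.1, float("nan")], [1.0, 5.0, float("inf")])
flat_run = check_training_signals([2.3, 2.299, 2.298], [0.001, 0.001, 0.001])
diverged_run = check_training_signals([2.3, 1.0, 50.0], [1.0, 2.0, 100.0])
```

Checks like these only flag candidates; label problems and overfitting still need a look at the data and at the gap between training and validation loss.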