Modern networks are massively overparameterized. Without regularization they overfit aggressively. Five techniques cover most of what you'll see in practice.
Weight decay
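Add $\frac{\lambda}{2}\|w\|_2^2$ to the loss. Equivalent to L2 regularization. In SGD, the update becomes $w \leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla L(w)$, so every step shrinks weights toward zero. The standard regularizer. Almost always on; the coefficient $\lambda$ depends on architecture and dataset size.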
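To make the shrinkage concrete, here is a minimal sketch in plain Python. It assumes the penalty is (λ/2)‖w‖² and a learning rate η; the function names `sgd_step` and `sgd_step_l2` and the parameters `lr` and `weight_decay` are illustrative, not taken from any library. The two functions are meant to show that, for plain SGD, decaying the weights and adding the L2 gradient are the same step.

```python
def sgd_step(w, grad, lr, weight_decay):
    """One SGD step written as 'shrink, then descend'."""
    # Each weight is first scaled by (1 - lr * weight_decay), then moved
    # along the negative data gradient.
    return [(1.0 - lr * weight_decay) * wi - lr * gi for wi, gi in zip(w, grad)]


def sgd_step_l2(w, grad, lr, weight_decay):
    """The same step written as explicit L2 regularization in the gradient."""
    # The gradient of (weight_decay / 2) * ||w||^2 is weight_decay * w.
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]


w = [1.0, -2.0, 0.5]
zero_grad = [0.0, 0.0, 0.0]
# With a zero data gradient only the decay term acts, so every
# weight moves toward zero by a factor of 0.95 per step.
for _ in range(3):
    w = sgd_step(w, zero_grad, lr=0.1, weight_decay=0.5)
print(w)
```

The equivalence shown here holds for plain SGD; optimizers with adaptive per-parameter step sizes generally treat the two forms differently.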
Dropout
During training, randomly zero each unit's output with probability $p$. At test time, scale outputs by $1 - p$ (or scale the surviving activations by $1/(1 - p)$ at train time and leave test-time outputs untouched: "inverted dropout"). Forces the network not to rely on any specific unit; approximately equivalent to training an ensemble of subnetworks. Best on fully-connected layers with lots of parameters; less useful inside convolutional layers (use only on the final classifier head).
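A minimal sketch of the inverted variant, in plain Python on a list of activations. The function name `inverted_dropout` and its `training` and `rng` arguments are hypothetical names chosen for illustration. Units are dropped with probability p and survivors are divided by 1 - p, so the expected activation matches what the network sees at test time.

```python
import random


def inverted_dropout(activations, p, training, rng=random):
    """Zero each unit with probability p; rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Test time: identity, because the rescaling already happened in training.
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
x = [1.0] * 10000
y = inverted_dropout(x, p=0.5, training=True, rng=rng)
print(sum(y) / len(y))  # the mean stays near 1.0 in expectation
print(inverted_dropout([1.0, 2.0], p=0.5, training=False))  # passed through unchanged
```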
Batch normalization
Normalize each layer's pre-activations across the batch: $\hat{x} = (x - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}$, then apply learnable scale and shift: $y = \gamma \hat{x} + \beta$. Stabilizes training, lets you use higher learning rates, and acts as a mild regularizer (the batch statistics inject noise). At test time, use running averages of $\mu_B$ and $\sigma_B^2$.
Limitations: dependent on batch size; behaves badly in distributed training with very small per-device batches; complicates online learning. Layer norm and group norm compute their statistics within each example rather than across the batch, so they avoid these issues; layer norm is the standard choice in transformers.
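The train/test asymmetry of batch normalization is easier to see in code. The sketch below handles a single scalar feature per example in plain Python; the class name `BatchNorm1D`, the `momentum` default of 0.1, and the use of the biased batch variance for the running estimate are all assumptions made for illustration. Real implementations differ in such details.

```python
import math


class BatchNorm1D:
    """Batch norm over one scalar feature, for illustration only."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0         # learnable scale (left at its initial value here)
        self.beta = 0.0          # learnable shift (left at its initial value here)
        self.running_mean = 0.0  # running estimate of the batch mean
        self.running_var = 1.0   # running estimate of the batch variance
        self.momentum = momentum
        self.eps = eps

    def forward(self, batch, training):
        if training:
            # Statistics come from the current batch, which is the noise source.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Exponential moving averages, used later at test time.
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:
            mean, var = self.running_mean, self.running_var
        std = math.sqrt(var + self.eps)
        return [self.gamma * (x - mean) / std + self.beta for x in batch]


bn = BatchNorm1D()
train_out = bn.forward([1.0, 2.0, 3.0, 4.0], training=True)  # uses batch statistics
test_out = bn.forward([2.0], training=False)                 # uses running averages
print(train_out, test_out)
```

Note that the test-time call works with a batch of one, while the training branch with a batch of one would produce zero variance; that is the batch-size dependence in miniature.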
Data augmentation
Generate transformed versions of training examples that preserve the label. For images: crops, flips, color jitter, cutout. For text: back-translation, random insertion. This is often the single biggest regularizer. It increases effective training set size and forces the model to learn invariances directly.
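As a rough sketch of label-preserving image transforms, the code below applies a random horizontal flip and a random crop to an image stored as a nested list of pixel values. The helper names `random_flip`, `random_crop`, and `augment` are hypothetical, and the sketch assumes that flipping and cropping do not change the label for the task at hand, which is not true for every dataset (for example, digits or text in images).

```python
import random


def random_flip(image, rng):
    """Mirror the image left-to-right with probability 0.5."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image]
    return [row[:] for row in image]


def random_crop(image, size, rng):
    """Cut a size x size window at a random position."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, label, rng, crop_size):
    """Return a transformed copy of the image with the label unchanged."""
    return random_crop(random_flip(image, rng), crop_size, rng), label


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
aug, label = augment(image, "cat", rng, crop_size=3)
print(aug, label)
```

Calling `augment` afresh each epoch means the model rarely sees the exact same pixels twice, which is where the larger effective training set comes from.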
Early stopping
Monitor validation loss; stop when it starts increasing, and keep the checkpoint with the lowest validation loss. Effectively a regularizer (training longer = more capacity used = more risk of overfitting). Free and almost always worth doing.
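One common way to implement this is sketched below. Because validation loss is noisy, the sketch assumes a `patience` parameter: training stops only after the loss has failed to improve for that many consecutive epochs. The function name `early_stopping_epoch` and the default patience of 3 are illustrative choices, not prescribed values.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: a real training loop would save a checkpoint here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            return best_epoch, epoch
    # Ran out of epochs without triggering the stop condition.
    return best_epoch, len(val_losses) - 1


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
best, stopped = early_stopping_epoch(losses, patience=3)
print(best, stopped)  # 2 5
```

The model to keep is the one from `best`, not the one from the epoch where training stopped.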
When to use what
- Tiny dataset → augmentation, dropout, heavy weight decay, early stopping
- Large dataset, big model → weight decay + early stopping is often enough
- CNN on images → augmentation is non-negotiable
- Transformer on text → dropout + layer norm + early stopping