Regularization for Deep Networks — Section 4: Neural Networks

Modern networks are massively overparameterized. Without regularization they overfit aggressively. Five techniques cover most of what you'll see in practice.

Weight decay

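Add $\frac{\lambda}{2} \sum_w w^2$ to the loss; this is $L_2$ regularization. Under plain SGD the update becomes $w \leftarrow (1 - \eta \lambda) w - \eta \nabla L$, so every step shrinks the weights toward zero before applying the gradient. (Without the factor $\frac{1}{2}$ the shrink factor would be $1 - 2\eta\lambda$.) With adaptive optimizers such as Adam, adding the penalty to the loss and shrinking the weights directly are no longer equivalent. This is the standard regularizer and is almost always on, with $\lambda$ typically in $[10^{-5}, 10^{-3}]$ depending on architecture and dataset size.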
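As a rough illustration of the SGD update rule above, here is a minimal sketch in plain Python. The function name `sgd_step_with_weight_decay` and its default values are hypothetical, and `grads` is assumed to hold the gradient of the unregularized loss.

```python
def sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=1e-4):
    """One plain-SGD step: w <- (1 - lr * weight_decay) * w - lr * grad."""
    shrink = 1.0 - lr * weight_decay
    return [shrink * w - lr * g for w, g in zip(weights, grads)]


# With a zero data gradient, the weights shrink geometrically toward zero.
w = [1.0, -2.0]
for _ in range(3):
    w = sgd_step_with_weight_decay(w, [0.0, 0.0], lr=0.1, weight_decay=0.5)
```

The unusually large `weight_decay=0.5` in the usage lines is chosen only to make the shrinkage visible after a few steps; it is not a recommended setting.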

Dropout

During training, randomly zero each unit's output with probability $p$. At test time, scale outputs by $1-p$; alternatively, scale the surviving activations by $1/(1-p)$ at train time and leave test time untouched ("inverted dropout"). This forces the network not to rely on any specific unit and is approximately equivalent to training an ensemble of subnetworks. It works best on fully-connected layers with many parameters and is less useful inside convolutional layers, where it is usually applied only to the final classifier head.
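The inverted-dropout variant can be sketched as follows. This is an illustrative stand-alone function on a list of floats, not a framework API; the name `inverted_dropout` and the `rng` parameter are assumptions made for the example.

```python
import random


def inverted_dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p; rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # At test time the layer is the identity: no masking, no rescaling.
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Because survivors are rescaled during training, the expected value of each output matches the input, which is why nothing needs to change at test time.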

Batch normalization

Normalize each layer's pre-activations across the batch: $\hat{z} = (z - \mu_B)/\sigma_B$, then apply a learnable scale and shift, $\gamma \hat{z} + \beta$. (In practice a small constant is added to the variance before taking the square root, for numerical stability.) This stabilizes training, lets you use higher learning rates, and acts as a mild regularizer because the batch statistics inject noise. At test time, use running averages of $\mu_B$ and $\sigma_B$ collected during training.
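To make the train/test distinction concrete, here is a minimal sketch for a single scalar feature. The class name `BatchNorm1D`, the `momentum` used for the running averages, and the stabilizing constant `eps` are assumptions for illustration; real implementations operate on tensors and learn `gamma` and `beta` by gradient descent.

```python
import math


class BatchNorm1D:
    """Batch norm for one scalar feature over a batch given as a list of floats."""

    def __init__(self, gamma=1.0, beta=0.0, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = gamma, beta
        self.momentum, self.eps = momentum, eps  # assumed hyperparameters
        self.running_mean, self.running_var = 0.0, 1.0

    def __call__(self, z, training=True):
        if training:
            # Use statistics of the current batch and update the running averages.
            mean = sum(z) / len(z)
            var = sum((v - mean) ** 2 for v in z) / len(z)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # At test time, rely on the running averages instead of the batch.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (v - mean) + self.beta for v in z]
```

In training mode the output of a batch depends on the other examples in that batch, which is the source of both the regularizing noise and the batch-size sensitivity.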

Limitations: the statistics depend on batch size, so it behaves badly in distributed training with very small per-device batches, and it complicates online learning. Layer norm and group norm compute their statistics within each example and so avoid these issues; layer norm is the standard choice in transformers.
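For contrast, a layer-norm sketch normalizes across the features of one example, so the result does not depend on what else is in the batch. The function name `layer_norm` and the constant `eps` are assumptions for this illustration.

```python
import math


def layer_norm(features, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one example's feature vector using its own mean and variance."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    scale = gamma / math.sqrt(var + eps)
    return [scale * (v - mean) + beta for v in features]
```

There are no running averages to maintain here, so the computation is identical at train and test time.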

Data augmentation

Generate transformed versions of training examples that preserve the label. For images: crops, flips, color jitter, cutout. For text: back-translation, random insertion. This is often the single biggest regularizer. It increases effective training set size and forces the model to learn invariances directly.
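Two of the image transforms mentioned above, random crop and horizontal flip, can be sketched on an image stored as a list of rows. The function name `random_flip_and_crop`, the square crop, and the 50% flip probability are assumptions for the example; the label is left unchanged by construction.

```python
import random


def random_flip_and_crop(image, crop_size, rng=random):
    """Return a random crop_size x crop_size patch of image, mirrored half the time."""
    height, width = len(image), len(image[0])
    # randint is inclusive on both ends, so every valid offset can be chosen.
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    patch = [row[left:left + crop_size] for row in image[top:top + crop_size]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal flip
    return patch
```

Calling this afresh each epoch means the model rarely sees exactly the same input twice, which is one way to read the "effective training set size" argument.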

Early stopping

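Monitor validation loss and stop once it is no longer improving, keeping the checkpoint with the lowest validation loss. Because validation loss is noisy, it is common to wait a few epochs before stopping rather than halting at the first uptick. This is effectively a regularizer: training longer lets the model use more of its capacity, which raises the risk of overfitting. Free and almost always worth doing.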
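A minimal sketch of the stopping rule, applied to a recorded list of per-epoch validation losses. The `patience` and `min_delta` parameters are assumptions (a common refinement that tolerates a few noisy epochs), and the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: this is the checkpoint to keep.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

In a real training loop the same bookkeeping would run once per epoch, and the weights saved at `best_epoch` would be the ones restored.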

When to use what

Tiny dataset → augmentation, dropout, heavy weight decay, early stopping
Large dataset, big model → weight decay + early stopping is often enough
CNN on images → augmentation is non-negotiable
Transformer on text → dropout + layer norm + early stopping