Regularization in DL — Section 20: Deep Learning

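Deep nets often have enough parameters to fit the training set almost perfectly, so regularization isn't optional: it is what keeps that capacity from turning into overfitting.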

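Weight decay shrinks the weights toward zero at every step. Under plain SGD this is equivalent to adding an $L_2$ penalty on $\|\theta\|^2$ to the loss (with the coefficient rescaled by the learning rate). With adaptive optimizers such as Adam the two are no longer equivalent, because the penalty's gradient gets rescaled along with the loss gradient; AdamW decouples the decay from the gradient update for cleaner behavior. It's almost always worth using.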
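To make the SGD-versus-Adam distinction concrete, here is a minimal pure-Python sketch. The function names and the penalty convention $(\lambda/2)\|\theta\|^2$ are assumptions chosen for illustration, and `adam_first_step` only models the very first Adam update (where the bias-corrected moments reduce to $g$ and $g^2$), not a full optimizer.

```python
import math

def sgd_l2_step(theta, grad, lr, lam):
    """SGD on loss + (lam/2)*||theta||^2: the penalty adds lam*theta to the gradient."""
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]

def sgd_decoupled_step(theta, grad, lr, lam):
    """SGD with decoupled decay: shrink the weights, then apply the raw gradient."""
    return [t * (1.0 - lr * lam) - lr * g for t, g in zip(theta, grad)]

def adam_first_step(theta, grad, lr, lam, decoupled, eps=1e-8):
    """First Adam update only (illustrative): m_hat = g and v_hat = g^2."""
    out = []
    for t, g in zip(theta, grad):
        if not decoupled:
            g = g + lam * t              # L2: penalty folded into the gradient
        step = lr * g / (math.sqrt(g * g) + eps)
        if decoupled:
            step += lr * lam * t         # AdamW-style: decay bypasses the adaptive scaling
        out.append(t - step)
    return out
```

For SGD the two functions return the same weights up to rounding. For the Adam-style step they differ: with the penalty folded into the gradient, the adaptive normalization largely cancels it, which is the behavior decoupled decay is meant to avoid.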

Dropout randomly zeros out a fraction of activations during training, forcing the network not to rely on any single neuron; at inference time it is turned off. It can be viewed as approximately training an ensemble of many subnetworks that share weights. Use lower dropout rates in convolutional layers and higher ones (e.g. $0.5$) in fully-connected layers.
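The mechanism can be sketched in a few lines of pure Python. This version uses the common "inverted" formulation, which is an assumption beyond the text above: surviving activations are scaled by $1/(1-p)$ during training so the expected activation is unchanged and nothing needs rescaling at inference. The function name `dropout` and its parameters are illustrative, not a library API.

```python
import random

def dropout(activations, p, training, rng=None):
    """Inverted dropout on a flat list of activations.

    p is the probability of zeroing each activation. Survivors are
    scaled by 1/(1-p) so the expected value of each output matches
    its input. In evaluation mode the input is returned unchanged.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    if rng is None:
        rng = random
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

A call such as `dropout(hidden, 0.5, training=True)` returns a list in which each entry is either zero or twice its original value, while `training=False` passes the activations through untouched.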

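Early stopping monitors validation loss and stops training once it has not improved for a set number of epochs (the patience), usually keeping the checkpoint with the best validation loss. It's a nearly free regularizer that also saves compute.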
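The stopping rule can be sketched as a function over a recorded sequence of validation losses. The `patience` and `min_delta` parameters and the function name are assumptions for illustration; in a real training loop the same bookkeeping would run once per epoch, alongside saving the best checkpoint.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch where the loss has failed to beat
    the best value so far (by more than min_delta) for `patience` epochs
    in a row. If that never happens, stop_epoch is the last epoch.
    """
    best = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch
```

The returned `best_epoch` is the one whose weights would normally be restored. A larger `patience` tolerates noisy validation curves at the cost of extra epochs.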

Data augmentation and label smoothing are also common regularizers in production deep-learning pipelines. Batch normalization is used mainly to stabilize training, but it tends to have a mild regularizing side effect as well.
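As one small illustration from this list, label smoothing replaces a one-hot target with a slightly softened distribution. The sketch below assumes the usual formulation that mixes the one-hot vector with a uniform distribution over the $K$ classes using a coefficient $\epsilon$; the function name `smooth_labels` is hypothetical.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot(target) + eps / num_classes as a list."""
    if not 0 <= target < num_classes:
        raise ValueError("target must be a valid class index")
    off_value = eps / num_classes
    return [
        (1.0 - eps) + off_value if k == target else off_value
        for k in range(num_classes)
    ]
```

The result still sums to one, but the true class no longer has probability exactly one, which discourages the network from producing extremely confident logits.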
