Two design choices that get glossed over but determine whether a deep network trains at all.
Activation functions
- Sigmoid $\sigma(x) = 1/(1 + e^{-x})$. Saturates at both ends, which kills gradients. Use only on the output layer for binary classification.
- Tanh $\tanh(x)$. Zero-centered (better for gradient flow than sigmoid) but still saturates. Mostly historical.
- ReLU $\max(0, x)$. Zero gradient for $x < 0$, identity for $x > 0$. Dirt simple, no saturation on the positive side, sparse activations (often credited with a mild regularizing effect). The default hidden-layer activation since roughly 2012.
- Leaky ReLU / PReLU: small slope for $x < 0$ (fixed in Leaky ReLU, learned in PReLU) to prevent "dead" neurons (units stuck at 0 with zero gradient).
- GELU $x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. A smooth relative of ReLU, widely used in transformers, where it tends to optimize slightly better in practice.
- Swish/SiLU $x\,\sigma(x)$. Similar in shape to GELU.
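To make the saturation claims concrete, here is a minimal NumPy sketch that evaluates the gradient of each activation at a few points. The helper names (`grad`, `leaky_relu`, the 0.01 slope, the probe points) are illustrative choices, not taken from any particular library, and the gradient is a finite-difference estimate rather than an analytic derivative.

```python
import math
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # slope=0.01 is a common choice, used here only for illustration
    return np.where(x > 0, x, slope * x)

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def silu(x):
    return x * sigmoid(x)

def grad(f, x, eps=1e-5):
    # Central finite difference; good enough to see saturation
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

xs = np.array([-10.0, -1.0, 1.0, 10.0])
activations = [
    ("sigmoid", sigmoid),
    ("tanh", np.tanh),
    ("relu", relu),
    ("leaky_relu", leaky_relu),
    ("gelu", gelu),
    ("silu", silu),
]
for name, f in activations:
    print(f"{name:>10}: {np.round(grad(f, xs), 4)}")
```

At the extreme points the sigmoid and tanh gradients should be close to zero on both sides, while ReLU and its relatives keep a gradient near 1 for large positive inputs. Leaky ReLU is the only one here that keeps a fixed nonzero gradient for large negative inputs.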
Initialization
Activations propagate forward. Gradients propagate backward. Both should have stable variance across layers — otherwise you get vanishing or exploding values.
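For a layer with $n$ inputs and weights drawn from $\mathcal{N}(0, \sigma^2)$, the output variance is $n\sigma^2\,\mathrm{Var}(x)$ (assuming zero-mean inputs that are independent of the weights). To preserve variance, set $\sigma^2 = 1/n$.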
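The variance rule can be derived in a few lines. This is a sketch under the usual simplifying assumptions, which are stated here rather than taken from the text: a single pre-activation $y = \sum_{i=1}^{n} w_i x_i$, weights $w_i$ that are i.i.d. with mean zero and variance $\sigma^2$, inputs $x_i$ that are identically distributed with mean zero, and weights independent of inputs.

```latex
\begin{aligned}
\mathrm{Var}(y)
  &= \mathrm{Var}\!\Big(\sum_{i=1}^{n} w_i x_i\Big)
   = \sum_{i=1}^{n} \mathrm{Var}(w_i x_i)
   && \text{(terms uncorrelated since } \mathbb{E}[w_i] = 0\text{)} \\
  &= \sum_{i=1}^{n} \mathbb{E}[w_i^2]\,\mathbb{E}[x_i^2]
   = n\,\sigma^2\,\mathbb{E}[x^2] \\
  &= n\,\sigma^2\,\mathrm{Var}(x)
   && \text{(since } \mathbb{E}[x] = 0\text{)} \\[4pt]
\mathrm{Var}(y) = \mathrm{Var}(x)
  &\iff \sigma^2 = \frac{1}{n}.
\end{aligned}
```

The intermediate form $n\,\sigma^2\,\mathbb{E}[x^2]$ is the one that matters after a ReLU, where the inputs to the next layer are no longer zero-mean.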
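- Xavier/Glorot ($\sigma^2 = 2/(n_\text{in} + n_\text{out})$): a compromise between the forward and backward passes that approximately preserves variance for tanh.
- Kaiming/He ($\sigma^2 = 2/n_\text{in}$): preserves variance for ReLU (which zeros half the inputs, so the weight variance has to double to compensate).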
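A quick way to see why the factor of 2 matters is to push random inputs through a stack of ReLU layers and watch the activation scale. The sketch below is a forward-pass-only simulation in NumPy. The function name `final_second_moment`, the square layers, and the depth/width values are assumptions chosen for illustration. Weights are drawn from $\mathcal{N}(0, \text{scale}/n_\text{in})$, so `scale=2.0` corresponds to Kaiming/He and, for square layers, `scale=1.0` corresponds to Xavier/Glorot.

```python
import numpy as np

def final_second_moment(scale, depth=30, width=256, batch=128, seed=0):
    """Mean squared activation after `depth` ReLU layers.

    Weights are N(0, scale / width); biases are omitted.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))  # unit-variance inputs
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(scale / width)
        h = np.maximum(0.0, h @ w)  # linear layer followed by ReLU
    return float(np.mean(h ** 2))

for label, scale in [("scale=1 (Xavier, square layers)", 1.0),
                     ("scale=2 (Kaiming/He)", 2.0),
                     ("scale=4 (too large)", 4.0)]:
    print(f"{label:<34} {final_second_moment(scale):.3e}")
```

With these settings the `scale=2` run should stay within an order of magnitude of 1, while `scale=1` shrinks and `scale=4` grows by roughly a factor of 2 per layer. The same argument applies to gradients in the backward pass, which this sketch does not simulate.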
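Framework defaults vary, so check rather than assume. PyTorch's linear/conv layers default to a Kaiming-uniform variant (called with `a=sqrt(5)`, which works out to uniform bounds of $\pm 1/\sqrt{n_\text{in}}$ and is smaller than true He init), while Keras `Dense` defaults to Glorot uniform. For deep ReLU networks without normalization, set the initializer explicitly.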
Interaction with normalization
Batch norm, layer norm, and similar techniques re-normalize activations at runtime, making the network more robust to bad initialization. They don't replace good initialization — they reduce the cost of getting it slightly wrong. Without normalization, a sloppy init can make a deep network fail to train at all.
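The robustness claim can be illustrated on the forward pass alone. The sketch below compares a deliberately oversized init with and without a parameter-free layer norm before each ReLU. The function names, the missing learnable gain and bias, and the norm-before-activation placement are simplifying assumptions for illustration. It says nothing about gradients or actual training dynamics.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learnable gain/bias)
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def forward_scale(scale, use_norm, depth=30, width=256, batch=128, seed=0):
    """Mean squared activation after `depth` layers, weights N(0, scale / width)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(scale / width)
        z = h @ w
        if use_norm:
            z = layer_norm(z)  # rescale before the nonlinearity
        h = np.maximum(0.0, z)
    return float(np.mean(h ** 2))

bad_scale = 4.0  # twice the Kaiming/He value, chosen to be "sloppy"
print("no norm:   ", f"{forward_scale(bad_scale, use_norm=False):.3e}")
print("layer norm:", f"{forward_scale(bad_scale, use_norm=True):.3e}")
```

Without normalization the activations blow up with depth. With layer norm they are rescaled at every layer, so the final scale stays of order 1 regardless of `bad_scale`.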