Activations and Initialization — Section 4: Neural Networks

Two design choices that get glossed over but determine whether a deep network trains at all.

Activation functions

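Sigmoid $\sigma(z) = 1/(1+e^{-z})$. Saturates at both ends, where its gradient $\sigma(z)(1-\sigma(z))$ goes to zero; even at its peak ($z = 0$) the gradient is only 0.25, so stacked sigmoid layers shrink gradients. Use it mainly on the output layer for binary classification.
Tanh $\tanh(z)$. Zero-centered (better for gradient flow than sigmoid) but still saturates. Mostly historical as a hidden-layer default.
ReLU $\max(0, z)$. Zero gradient for $z < 0$, identity (gradient 1) for $z > 0$. Dirt simple, no saturation on the positive side, sparse activations (sometimes credited with a mild regularizing effect). The common default for hidden layers since roughly 2012.
Leaky ReLU / PReLU: small slope for $z < 0$ (fixed in Leaky ReLU, learned in PReLU) to prevent "dead" neurons (units stuck at 0 with zero gradient).
GELU $z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF. A smooth relative of ReLU, widely used in transformers (e.g. BERT, GPT-2), where it is reported to train slightly better than ReLU.
Swish/SiLU $z \cdot \sigma(z)$. Similar in shape to GELU: smooth, with a small negative dip for $z < 0$.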
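
The functions above are short enough to write out directly. The following is a minimal NumPy sketch, not any framework's implementation; the helper names (numeric_grad, the 0.01 leaky slope, the sample points in z) are illustrative assumptions. It prints a finite-difference gradient for each activation so the saturation and dead-unit behavior described above can be inspected.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    # slope=0.01 is an illustrative choice, not a canonical value
    return np.where(z > 0, z, slope * z)

def gelu(z):
    # Exact form z * Phi(z), with Phi computed from the error function
    erf = np.vectorize(math.erf)
    return z * 0.5 * (1.0 + erf(z / math.sqrt(2.0)))

def silu(z):
    return z * sigmoid(z)

def numeric_grad(f, z, eps=1e-5):
    # Central finite difference; adequate away from the ReLU kink at 0
    return (f(z + eps) - f(z - eps)) / (2 * eps)

# Sample points: deep negative, mild negative, mild positive, deep positive
z = np.array([-6.0, -1.0, 0.5, 6.0])
activations = [("sigmoid", sigmoid), ("tanh", np.tanh), ("relu", relu),
               ("leaky_relu", leaky_relu), ("gelu", gelu), ("silu", silu)]
for name, f in activations:
    print(f"{name:>10}", np.round(numeric_grad(f, z), 4))
```

In the printed rows, the sigmoid and tanh gradients should be close to zero at both extremes, while ReLU's gradient is exactly zero on the negative side and Leaky ReLU's stays at its small slope.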

Initialization

Activations propagate forward. Gradients propagate backward. Both should have stable variance across layers — otherwise you get vanishing or exploding values.

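For a layer with $n_{\text{in}}$ inputs and independent weights drawn from $\mathcal{N}(0, \sigma^2)$, each output $y = \sum_i w_i x_i$ has variance $n_{\text{in}} \sigma^2 \cdot \text{Var}(x)$, assuming the inputs are zero-mean and independent of the weights (for inputs that are not zero-mean, replace $\text{Var}(x)$ with $E[x^2]$). To preserve variance, set $\sigma^2 = 1/n_{\text{in}}$.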
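
The variance rule is easy to check numerically. The sketch below is an illustration under assumed sizes (512 inputs, 256 outputs, a batch of 4096 zero-mean Gaussian inputs); the function name measured_output_variance is hypothetical. It compares the predicted $n_{\text{in}} \sigma^2 \cdot \text{Var}(x)$ against the empirical variance of one linear layer's outputs.

```python
import numpy as np

def measured_output_variance(x, sigma2, n_out, rng):
    """Empirical variance of x @ W for W ~ N(0, sigma2), no bias."""
    n_in = x.shape[1]
    W = rng.normal(0.0, np.sqrt(sigma2), size=(n_in, n_out))
    return float((x @ W).var())

rng = np.random.default_rng(0)
n_in, n_out, batch = 512, 256, 4096           # illustrative sizes
x = rng.normal(0.0, 1.5, size=(batch, n_in))  # zero-mean inputs
var_x = float(x.var())

# sigma^2 = 1/n_in should preserve the variance; 0.01 should inflate it
for sigma2 in (1.0 / n_in, 0.01):
    predicted = n_in * sigma2 * var_x
    measured = measured_output_variance(x, sigma2, n_out, rng)
    print(f"sigma^2={sigma2:.5f}  predicted={predicted:.3f}  measured={measured:.3f}")
```

With $\sigma^2 = 1/n_{\text{in}}$ the measured output variance should land close to the input variance; any other choice scales it by $n_{\text{in}} \sigma^2$, and that factor compounds with depth.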

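Xavier/Glorot ( $\sigma^2 = 2/(n_{\text{in}} + n_{\text{out}})$ ): a compromise between the forward condition $\sigma^2 = 1/n_{\text{in}}$ and the backward condition $\sigma^2 = 1/n_{\text{out}}$. Approximately preserves variance for tanh and other activations that are roughly linear near zero.
Kaiming/He ( $\sigma^2 = 2/n_{\text{in}}$ ): preserves variance for ReLU. ReLU zeros roughly half of the pre-activations, which halves the second moment passed to the next layer, so the weight variance is doubled to compensate.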
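
The factor of 2 looks minor for one layer but compounds with depth. The following sketch pushes random inputs through a stack of ReLU layers under each scaling and reports the mean squared activation at the end. The depth, width, batch size, and function names are assumptions chosen for illustration; biases are omitted and all layers have equal width.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def xavier_sigma2(n_in, n_out):
    return 2.0 / (n_in + n_out)

def kaiming_sigma2(n_in, n_out):
    return 2.0 / n_in

def final_second_moment(sigma2_fn, depth=20, width=512, batch=1024, seed=0):
    """Mean of h**2 after `depth` ReLU layers with N(0, sigma2) weights."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))  # unit-variance inputs
    for _ in range(depth):
        std = np.sqrt(sigma2_fn(width, width))
        W = rng.normal(0.0, std, size=(width, width))
        h = relu(h @ W)
    return float(np.mean(h ** 2))

print("Xavier + ReLU :", final_second_moment(xavier_sigma2))
print("Kaiming + ReLU:", final_second_moment(kaiming_sigma2))
```

Under these assumptions the Xavier-scaled stack should lose about half of its signal per layer and end up many orders of magnitude below where it started, while the Kaiming-scaled stack should stay within an order of magnitude of 1.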

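Framework defaults vary, so check rather than assume. Keras dense layers default to Glorot uniform. PyTorch's linear and conv layers call a Kaiming-uniform routine with a nonstandard slope parameter, which works out to $U(-1/\sqrt{n_{\text{in}}}, 1/\sqrt{n_{\text{in}}})$, i.e. $\sigma^2 = 1/(3 n_{\text{in}})$ rather than the ReLU-tuned $2/n_{\text{in}}$. Set the initializer explicitly when it matters.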

Interaction with normalization

Batch norm, layer norm, and similar techniques re-normalize activations at runtime, making the network more robust to bad initialization. They don't replace good initialization — they reduce the cost of getting it slightly wrong. Without normalization, a sloppy init can make a deep network fail to train at all.
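
To illustrate the robustness claim, here is a hedged sketch that deliberately under-scales the weights of a ReLU stack and compares the final activation scale with and without a layer norm step. It uses a simplified layer norm with no learnable gain or bias, applied to the pre-activations; that placement, the sigma2_scale value, and the function names are all illustrative assumptions rather than a recommended architecture.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    # Normalize each sample across its features (no learnable gain/bias)
    mean = h.mean(axis=1, keepdims=True)
    var = h.var(axis=1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def final_second_moment(use_norm, sigma2_scale=0.25, depth=20,
                        width=256, batch=256, seed=0):
    """Mean of h**2 after a ReLU stack with weights N(0, sigma2_scale/width)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    for _ in range(depth):
        std = np.sqrt(sigma2_scale / width)  # deliberately too small
        z = h @ rng.normal(0.0, std, size=(width, width))
        if use_norm:
            z = layer_norm(z)
        h = np.maximum(0.0, z)
    return float(np.mean(h ** 2))

print("bad init, no norm  :", final_second_moment(use_norm=False))
print("bad init, layer norm:", final_second_moment(use_norm=True))
```

Without normalization the activations should collapse toward zero within a few layers; with the normalization step each layer's pre-activations are rescaled to unit variance, so the final scale stays of order one regardless of the weight scale.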
