Backpropagation — Section 4: Neural Networks

Backpropagation computes the gradient of a scalar loss with respect to every parameter in a network at a cost that is a small constant multiple of one forward pass. It's not magic; it's the chain rule, executed in the right order.

The chain rule in reverse

For a composition $z = f_L(f_{L-1}(\dots f_1(x)))$, the chain rule gives:

$\frac{\partial z}{\partial x} = \frac{\partial f_L}{\partial f_{L-1}} \cdot \frac{\partial f_{L-1}}{\partial f_{L-2}} \cdot \dots \cdot \frac{\partial f_1}{\partial x}$

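You can multiply these Jacobians in either order. The product is written with the output-side factor on the left, so multiplying right-to-left starts from the input (forward mode) and multiplying left-to-right starts from the output (reverse mode). For a function $\mathbb{R}^n \to \mathbb{R}^1$ (scalar loss, many parameters), reverse mode is dramatically more efficient: the running product stays a row vector, so each step is a vector-Jacobian product and no full Jacobian is ever formed.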
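To make the cost difference concrete, here is a minimal sketch that multiplies the same three Jacobians in both orders and counts scalar multiplications. The dense random matrices, the size `n = 20`, and the helper names `matmul` and `random_matrix` are assumptions chosen for illustration; real frameworks never materialize Jacobians like this.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows.

    Returns the product and the number of scalar multiplications used.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]
    return out, rows * inner * cols


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for a layer Jacobian."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n = 20
J1 = random_matrix(n, n, rng)  # df1/dx
J2 = random_matrix(n, n, rng)  # df2/df1
J3 = random_matrix(1, n, rng)  # df3/df2: scalar output, so a single row

# Reverse mode: start at the output, (J3 J2) J1.
# Every intermediate result is a 1 x n row vector.
row, cost_a = matmul(J3, J2)
grad_reverse, cost_b = matmul(row, J1)
reverse_cost = cost_a + cost_b

# Forward mode: start at the input, J3 (J2 J1).
# The intermediate result is a full n x n matrix.
full, cost_c = matmul(J2, J1)
grad_forward, cost_d = matmul(J3, full)
forward_cost = cost_c + cost_d

print("reverse mode multiplications:", reverse_cost)
print("forward mode multiplications:", forward_cost)
```

Both orders give the same gradient up to floating-point rounding. Under these assumptions the output-first order needs on the order of $n^2$ multiplications per layer, while the input-first order needs on the order of $n^3$.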

What gets stored

Forward pass: compute and cache every intermediate activation. Backward pass: traverse the computation graph in reverse, combining each incoming gradient with the cached activations to get parameter gradients. Memory cost scales with the number of activations stored, which is why gradient checkpointing (store only some activations and recompute the rest during the backward pass) saves memory at the cost of extra compute.
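The role of the cache can be sketched with a toy scalar network. The chain $h_{i+1} = \tanh(w_i h_i)$, the function names `forward` and `backward`, and the numeric values are all assumptions made for this example, not part of any framework.

```python
import math


def forward(x, weights):
    """Run h_{i+1} = tanh(w_i * h_i) and cache every activation, including the input."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations


def backward(activations, weights):
    """Walk the chain in reverse using only the cached activations.

    Returns d(out)/d(w_i) for every weight and d(out)/dx.
    """
    grad = 1.0  # d(out)/d(out)
    weight_grads = [0.0] * len(weights)
    for i in reversed(range(len(weights))):
        h_in, h_out = activations[i], activations[i + 1]
        local = 1.0 - h_out ** 2               # tanh' written via the cached output
        weight_grads[i] = grad * local * h_in  # needs the cached layer input
        grad = grad * local * weights[i]       # gradient handed to the preceding layer
    return weight_grads, grad


weights = [0.5, -1.2, 0.8]  # hypothetical parameters
activations = forward(0.3, weights)
weight_grads, input_grad = backward(activations, weights)
print(weight_grads, input_grad)
```

The list `activations` holds one entry per layer plus the input, which is the memory cost described above. A checkpointing variant would keep only some of those entries and rerun parts of `forward` to regenerate the rest during `backward`.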

Per-layer formulas

For a linear layer $y = Wx + b$ and incoming gradient $\frac{\partial L}{\partial y}$ (here $L$ is the loss, not the layer count used earlier, and all vectors are columns):

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \cdot x^T$
$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}$
$\frac{\partial L}{\partial x} = W^T \cdot \frac{\partial L}{\partial y}$

The first two are what you use to update parameters; the third is what you pass back to the preceding layer.
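The three formulas above can be written out directly. This is a sketch in plain Python lists to keep it dependency-free; the function names `linear_forward` and `linear_backward` and all numeric values, including the upstream gradient `grad_y`, are made up for illustration.

```python
def linear_forward(W, b, x):
    """y = W x + b, with W as a list of rows and x, b as flat lists."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


def linear_backward(W, x, grad_y):
    """Given dL/dy, return (dL/dW, dL/db, dL/dx)."""
    # dL/dW = (dL/dy) x^T: an outer product with the same shape as W.
    grad_W = [[g_i * x_j for x_j in x] for g_i in grad_y]
    # dL/db = dL/dy.
    grad_b = list(grad_y)
    # dL/dx = W^T (dL/dy): one entry per input component.
    grad_x = [
        sum(W[i][j] * grad_y[i] for i in range(len(W)))
        for j in range(len(x))
    ]
    return grad_W, grad_b, grad_x


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [0.5, -0.5]
x = [1.0, 0.0, -1.0]
grad_y = [1.0, 2.0]  # hypothetical gradient arriving from the layer after this one

y = linear_forward(W, b, x)
grad_W, grad_b, grad_x = linear_backward(W, x, grad_y)
print(y)       # [-1.5, -2.5]
print(grad_W)  # [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0]]
print(grad_b)  # [1.0, 2.0]
print(grad_x)  # [9.0, 12.0, 15.0]
```

Note that `grad_W` needs the layer input `x`, which is why the forward pass has to cache it, while `grad_x` needs only the weights.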

Vanishing and exploding gradients

When many Jacobians multiply, their spectral properties compound. If the per-layer factors are consistently smaller than 1 in magnitude (the sigmoid derivative never exceeds 0.25 and approaches 0 in saturation), gradients shrink exponentially with depth and early layers receive almost no signal. If they are consistently larger than 1 (for example, with large weights), gradients explode. ReLU, residual connections, batch normalization, and careful initialization were all developed to fight this.
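A scalar chain is enough to see both effects. The sketch below assumes a toy network $h_{i+1} = \sigma(w h_i)$ with one shared weight, plus an activation-free variant $h_{i+1} = w h_i$; the function name `chain_gradient` and the chosen weights and depths are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(x, weight, depth, use_sigmoid=True):
    """Return d(h_depth)/dx for a scalar chain of `depth` layers starting at h_0 = x."""
    h, grad = x, 1.0
    for _ in range(depth):
        if use_sigmoid:
            h = sigmoid(weight * h)
            grad *= weight * h * (1.0 - h)  # local factor: w * sigmoid'(w * h_prev)
        else:
            h = weight * h
            grad *= weight                  # local factor of a purely linear layer
    return grad


# Each sigmoid layer contributes a factor of at most 0.25 when weight = 1.
for depth in (5, 10, 20):
    print(depth, chain_gradient(0.5, 1.0, depth))

# With no squashing and weight 1.5, the same product grows instead.
print(chain_gradient(0.5, 1.5, 20, use_sigmoid=False))
```

In a real network the scalar factors become Jacobian matrices and the relevant quantities are their singular values, but the exponential dependence on depth is the same.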

What automatic differentiation buys you

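You rarely write backprop by hand anymore. Frameworks such as PyTorch and JAX build the backward computation for you: PyTorch records the graph of operations during the forward pass and traverses it backward, while JAX traces the function and transforms it into one that computes gradients. Knowing what's happening underneath lets you debug gradient bugs and understand memory costs.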
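As a rough picture of the record-then-traverse idea, here is a toy scalar reverse-mode engine. The `Value` class and everything in it is hypothetical and written for this example; it is not how PyTorch or JAX are implemented and does not mirror their APIs.

```python
import math


class Value:
    """A scalar that remembers which values produced it (a toy graph node)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes this node's gradient to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward_fn = backward_fn
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))

        def backward_fn():
            self.grad += (1.0 - t * t) * out.grad

        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the graph so that a node's gradient is
        # fully accumulated before it is propagated to its parents.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()


w, x, b = Value(0.5), Value(2.0), Value(-0.5)
loss = (w * x + b).tanh()  # the forward pass records the graph as a side effect
loss.backward()            # the reverse traversal fills in .grad on every node
print(w.grad, x.grad, b.grad)
```

Each operation stores a small closure holding the values it needs for its local derivative, which is the same activation cache discussed earlier under a different name. Gradients are accumulated with `+=` so that a value used in several places receives the sum of all its contributions.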