Backpropagation computes the gradient of a scalar loss with respect to every parameter in a network at a cost within a small constant factor of one forward pass. It's not magic; it's the chain rule, executed in the right order.
The chain rule in reverse
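For a composition $y = f_3(f_2(f_1(x)))$, with intermediates $h_1 = f_1(x)$ and $h_2 = f_2(h_1)$, the chain rule gives:
$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial h_2} \, \frac{\partial h_2}{\partial h_1} \, \frac{\partial h_1}{\partial x}$$
You can multiply these Jacobians starting from the input side, right-to-left in the product above (forward mode), or starting from the output side, left-to-right (reverse mode). For a function $f: \mathbb{R}^n \to \mathbb{R}$ (scalar loss, many parameters), reverse mode is dramatically more efficient: the running product stays a single row vector, so you only need vector-Jacobian products, never full Jacobians.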
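The reverse-mode ordering can be sketched in pure Python. This is a minimal illustration, not a reference implementation: the three-stage chain (elementwise tanh, a fixed matrix `A`, a sum of squares) and all function names are assumptions chosen for the example. It applies one vector-Jacobian product per stage, starting from the scalar output, and compares the result against a finite-difference estimate.

```python
import math

# Hypothetical 3-stage chain: x in R^3 -> tanh -> A h in R^2 -> sum of squares (scalar).
A = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.5]]

def forward(x):
    h = [math.tanh(v) for v in x]                              # f1: elementwise tanh
    z = [sum(a * hv for a, hv in zip(row, h)) for row in A]    # f2: linear map
    loss = sum(v * v for v in z)                               # f3: sum of squares
    return h, z, loss

def reverse_mode_grad(x):
    h, z, _ = forward(x)
    # Start from dL/dL = 1 and apply one vector-Jacobian product per stage.
    g_z = [2.0 * v for v in z]                                 # VJP through f3
    g_h = [sum(A[i][j] * g_z[i] for i in range(len(A)))        # VJP through f2: A^T g_z
           for j in range(len(h))]
    g_x = [g * (1.0 - hv * hv) for g, hv in zip(g_h, h)]       # VJP through f1: tanh' = 1 - tanh^2
    return g_x

def finite_diff_grad(x, eps=1e-6):
    # Needs two forward passes per input coordinate, so cost grows with len(x).
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((forward(xp)[2] - forward(xm)[2]) / (2 * eps))
    return grad

x0 = [0.3, -0.7, 1.2]
g_rev = reverse_mode_grad(x0)
g_fd = finite_diff_grad(x0)
print(g_rev)
print(g_fd)
```

The reverse-mode function never builds a full Jacobian matrix: each stage only maps one gradient vector to another, which is why a single backward sweep yields the gradient for every input at once.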
What gets stored
Forward pass: compute and cache every intermediate activation. Backward pass: traverse the computation graph in reverse, combining each incoming gradient with the cached activations to get parameter gradients. Memory cost scales with the number of activations stored, which is why gradient checkpointing (keep only some activations and recompute the rest during the backward pass) saves memory at the cost of extra compute.
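The trade-off can be sketched with a toy chain of scalar layers. This is an illustrative sketch under stated assumptions: the layer `a_next = tanh(w * a)`, the loss being the final activation, and the names `grad_full_cache`, `grad_checkpointed`, and `seg` are all hypothetical. Both functions return the same weight gradients, but the checkpointed version holds fewer activations at any one time.

```python
import math

def layer(w, a):
    return math.tanh(w * a)

def backward_segment(ws, acts, g_out):
    """Backprop through layers ws, given acts = [input, out_1, ..., out_k]."""
    grads = [0.0] * len(ws)
    g = g_out
    for k in reversed(range(len(ws))):
        local = 1.0 - acts[k + 1] ** 2     # tanh' expressed via the cached output
        grads[k] = g * local * acts[k]     # dL/dw_k needs the cached input activation
        g = g * local * ws[k]              # gradient handed to the previous layer
    return grads, g

def grad_full_cache(ws, x):
    acts = [x]
    for w in ws:
        acts.append(layer(w, acts[-1]))    # cache every activation
    grads, _ = backward_segment(ws, acts, 1.0)
    return grads, len(acts)

def grad_checkpointed(ws, x, seg):
    # Forward: keep only the input to each segment of `seg` layers.
    checkpoints = {}
    a = x
    for k, w in enumerate(ws):
        if k % seg == 0:
            checkpoints[k] = a
        a = layer(w, a)
    grads = [0.0] * len(ws)
    g = 1.0
    peak = len(checkpoints)
    # Backward: recompute one segment's activations at a time, last segment first.
    for start in sorted(checkpoints, reverse=True):
        seg_ws = ws[start:start + seg]
        acts = [checkpoints[start]]
        for w in seg_ws:
            acts.append(layer(w, acts[-1]))
        seg_grads, g = backward_segment(seg_ws, acts, g)
        grads[start:start + seg] = seg_grads
        # Checkpoints plus the recomputed values of the current segment.
        peak = max(peak, len(checkpoints) + len(acts) - 1)
    return grads, peak

ws = [0.9, -1.1, 0.7, 1.3, -0.8, 1.05, 0.6, -1.2, 0.95, 1.1, -0.75, 0.85]
g_full, n_full = grad_full_cache(ws, 0.5)
g_ckpt, n_ckpt = grad_checkpointed(ws, 0.5, seg=4)
print(n_full, n_ckpt)
```

The price of the smaller peak is that every layer's forward computation runs twice: once in the original forward pass and once when its segment is recomputed.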
Per-layer formulas
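For a linear layer $y = Wx + b$ and incoming gradient $\delta = \partial L / \partial y$:
$$\frac{\partial L}{\partial W} = \delta x^\top, \qquad \frac{\partial L}{\partial b} = \delta, \qquad \frac{\partial L}{\partial x} = W^\top \delta$$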
The first two are what you use to update parameters; the third is what you pass to the next layer back.
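As a sanity check, the three gradients can be written out with plain Python lists, assuming the layer is y = Wx + b and writing `delta` for the incoming gradient dL/dy. The squared-error loss, the target `t`, and the helper names here are assumptions made only so there is a concrete `delta` to feed in.

```python
def linear_forward(W, b, x):
    # y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(w * xv for w, xv in zip(row, x)) + bi for row, bi in zip(W, b)]

def linear_backward(W, x, delta):
    dW = [[d * xv for xv in x] for d in delta]             # outer product: delta x^T
    db = list(delta)                                       # bias gradient is delta itself
    dx = [sum(W[i][j] * delta[i] for i in range(len(W)))   # W^T delta, sent to the layer below
          for j in range(len(x))]
    return dW, db, dx

def loss(W, b, x, t):
    # Hypothetical loss for the check: 0.5 * ||y - t||^2, so dL/dy = y - t.
    y = linear_forward(W, b, x)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, t))

W = [[0.2, -0.5, 1.0],
     [0.7, 0.1, -0.3]]
b = [0.05, -0.1]
x = [1.0, 2.0, -1.5]
t = [0.5, -0.25]

y = linear_forward(W, b, x)
delta = [yi - ti for yi, ti in zip(y, t)]
dW, db, dx = linear_backward(W, x, delta)
print(dW, db, dx)
```

Note the shapes: `dW` matches `W`, `db` matches `b`, and `dx` matches `x`. A shape mismatch is usually the first sign that a transpose has gone missing.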
Vanishing and exploding gradients
When many Jacobians multiply, their spectral properties compound. If activation derivatives are consistently $< 1$ in magnitude (sigmoid in saturation), gradients shrink exponentially with depth, and early layers receive almost no signal. If the per-layer factors are consistently $> 1$, gradients explode. ReLU, residual connections, batch normalization, and careful initialization were all developed to fight this.
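The shrinking case is easy to see numerically. The sketch below assumes a hypothetical scalar chain `a_next = sigmoid(w * a)` with `w = 1`, far simpler than a real network, and tracks the derivative of the final activation with respect to the input as depth grows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, w=1.0, x=0.5):
    """Derivative of a_depth with respect to x for the chain a_next = sigmoid(w * a)."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)   # sigmoid'(z) = s(z) * (1 - s(z)), never above 0.25
    return grad

for depth in (1, 5, 10, 20):
    print(depth, input_gradient(depth))
```

Because each factor is at most 0.25 when `w = 1`, the gradient after `depth` layers is bounded by `0.25 ** depth`, which is the exponential decay described above.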
What automatic differentiation buys you
You don't write backprop by hand anymore. Frameworks (PyTorch, JAX) record the computation graph during the forward pass and traverse it backward automatically. Knowing what's happening underneath lets you debug gradient bugs and understand memory costs.
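To show roughly what "record the graph, then traverse it backward" means, here is a toy scalar autodiff class in plain Python. It is a sketch for intuition only: the `Value` class and its methods are hypothetical and are not how PyTorch or JAX are implemented internally.

```python
import math

class Value:
    """Scalar node that remembers which nodes produced it (hypothetical minimal design)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None      # pushes self.grad into the parents' grads

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out.backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out.backward_fn = backward_fn
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward_fn():
            self.grad += (1.0 - t * t) * out.grad
        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topologically order the recorded graph, then visit it output-first.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()

w, x, b = Value(0.8), Value(-1.5), Value(0.3)
y = (w * x + b).tanh()
y.backward()
print(w.grad, x.grad, b.grad)
```

Each operation stores the values its local derivative needs (here via closures), which is the same reason real frameworks keep activations alive until the backward pass has consumed them.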