Backpropagation — Section 20: Deep Learning

Backpropagation computes gradients of a loss with respect to all network parameters by repeatedly applying the chain rule, starting from the output and moving backward through the layers.

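For a network $L = f_n(\dots f_2(f_1(x))\dots)$ with layer outputs $h_l = f_l(h_{l-1}; \theta_l)$ and $h_0 = x$, the gradient with respect to the parameters $\theta_l$ of layer $l$ uses the gradient already computed at layer $l+1$:

$$\frac{\partial L}{\partial \theta_l} = \frac{\partial L}{\partial h_l} \cdot \frac{\partial h_l}{\partial \theta_l}, \qquad \frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_{l+1}} \cdot \frac{\partial h_{l+1}}{\partial h_l}$$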
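As a minimal sketch of this chain-rule recursion, assume a hypothetical chain of scalar layers of the form tanh(theta * h) followed by a squared-error loss. The layer form, the loss, and all function names are illustrative choices, not something the text above prescribes. The backward loop carries the gradient with respect to each layer's output down to the layer below, and a finite-difference estimate serves as a sanity check.

```python
import math

def forward(thetas, x):
    """Return all activations [hs[0], ..., hs[n]] with hs[0] = x."""
    hs = [x]
    for theta in thetas:
        hs.append(math.tanh(theta * hs[-1]))
    return hs

def loss(thetas, x, y):
    """Squared-error loss on the final activation (an assumed choice)."""
    return 0.5 * (forward(thetas, x)[-1] - y) ** 2

def backward(thetas, x, y):
    """Return the loss gradient for every entry of thetas via the chain rule."""
    hs = forward(thetas, x)                 # activations stored by the forward pass
    grads = [0.0] * len(thetas)
    dL_dh = hs[-1] - y                      # gradient of the loss w.r.t. hs[-1]
    for l in reversed(range(len(thetas))):
        # hs[l + 1] = tanh(thetas[l] * hs[l]) and tanh'(z) = 1 - tanh(z)^2
        dtanh = 1.0 - hs[l + 1] ** 2
        grads[l] = dL_dh * dtanh * hs[l]    # dL/d hs[l+1] * d hs[l+1]/d thetas[l]
        dL_dh = dL_dh * dtanh * thetas[l]   # now holds dL/d hs[l]
    return grads

def numeric_grad(thetas, x, y, l, eps=1e-6):
    """Central finite-difference estimate of the gradient for thetas[l]."""
    up, down = list(thetas), list(thetas)
    up[l] += eps
    down[l] -= eps
    return (loss(up, x, y) - loss(down, x, y)) / (2 * eps)

thetas = [0.5, -1.2, 0.8]
grads = backward(thetas, x=0.7, y=0.3)
checks = [numeric_grad(thetas, 0.7, 0.3, l) for l in range(len(thetas))]
```

Each entry of `grads` should agree with the matching entry of `checks` up to finite-difference error, which is a cheap way to test a hand-written backward pass.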

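The backward pass reuses the activations stored during the forward pass, and each layer reuses the gradient already computed for the layer above it. Without that reuse, every layer's gradient would require recomputing the whole downstream chain, so the total cost would grow as $O(\text{depth}^2)$. With it, the backward pass costs a small constant multiple of a single forward pass.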
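To make the reuse argument concrete, the sketch below counts multiplications for a chain of scalar local derivatives. This is a deliberately simplified cost model, and both function names are assumptions for illustration. `naive_products` rebuilds each layer's downstream product from scratch, while `reused_products` carries one running product backward.

```python
def naive_products(local_derivs):
    """For each layer, recompute the product of all downstream derivatives."""
    results, mults = [], 0
    for l in range(len(local_derivs)):
        prod = 1.0
        for d in local_derivs[l:]:
            prod *= d
            mults += 1
        results.append(prod)
    return results, mults

def reused_products(local_derivs):
    """Sweep backward once, extending a running product by one factor per layer."""
    n = len(local_derivs)
    results, mults = [0.0] * n, 0
    running = 1.0
    for l in reversed(range(n)):
        running *= local_derivs[l]
        mults += 1
        results[l] = running
    return results, mults

derivs = [0.9, 1.1, 0.8, 1.2, 0.95, 1.05]
naive, naive_mults = naive_products(derivs)
reused, reused_mults = reused_products(derivs)
```

For a chain of n layers the naive count is n(n+1)/2 and the reused count is n, while both produce the same products up to floating-point rounding.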

Modern frameworks (PyTorch, JAX, TensorFlow) handle backpropagation through automatic differentiation. You specify only the forward pass; the framework obtains the gradients by reverse-mode differentiation through the resulting computation graph.
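To show what reverse-mode differentiation through a graph involves, here is a toy scalar sketch in plain Python. The `Value` class and its methods are hypothetical and far simpler than what PyTorch, JAX, or TensorFlow actually do. Only the forward expression is written by hand; `backward()` walks the recorded graph to fill in the gradients.

```python
import math

class Value:
    """A scalar node that remembers how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of this operation
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Build a topological order (inputs first) by depth-first search.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local

# Forward pass only: y = tanh(w * x + b)
w, x, b = Value(0.5), Value(2.0), Value(-0.3)
y = (w * x + b).tanh()
y.backward()
```

After `y.backward()`, the `grad` fields of `w`, `x`, and `b` hold the derivatives of `y` with respect to each input. Accumulating with `+=` matters when a node feeds into more than one operation.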

Practical issues: vanishing gradients in deep nets (largely mitigated by ReLU activations and residual connections), exploding gradients in RNNs (commonly handled with gradient clipping), and numerical instability with poorly scaled inputs.
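Gradient clipping is often done by rescaling the whole gradient when its norm exceeds a threshold. A minimal sketch, assuming the gradients are held as a flat list of floats and using a hypothetical `max_norm` parameter:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm; direction is kept."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)              # already small enough, return a copy
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [30.0, -40.0]               # L2 norm is 50
clipped = clip_by_global_norm(exploding, max_norm=5.0)
small = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
```

Because every component is scaled by the same factor, the update direction is unchanged and only the step size shrinks. In practice the frameworks ship their own versions, for example `torch.nn.utils.clip_grad_norm_` in PyTorch and `tf.clip_by_global_norm` in TensorFlow.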
