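Recurrent networks process sequences by maintaining a hidden state $h_t$ that gets updated at each timestep. In the standard tanh form, with input $x_t$:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$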
The same parameters are applied at every step — parameter sharing across time, analogous to a CNN's parameter sharing across space.
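As a minimal sketch of this recurrence, assuming the common tanh form $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$: the names `rnn_forward`, `W_xh`, `W_hh`, and `b_h` are illustrative choices, not taken from any particular library.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over a sequence and return every hidden state."""
    h = h0
    hs = []
    for x in xs:  # one update per timestep, strictly in order
        # The same W_xh, W_hh, b_h are reused at every step (parameter sharing).
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

# Illustrative sizes and random weights (assumptions for the demo only).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
xs = rng.normal(size=(T, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

hs = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, b_h)
print(hs.shape)  # (5, 4)
```

The parameter count does not depend on the sequence length `T`: only the number of loop iterations does.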
Why vanilla RNNs failed
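Backprop through time multiplies many copies of the recurrent weight matrix $W_{hh}$ together (interleaved with activation derivatives). If the spectral radius of $W_{hh}$ is $< 1$, gradients vanish; if it is $> 1$, they can explode. Plain RNNs can't reliably learn dependencies more than ~10 steps apart.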
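The effect of the spectral radius can be seen in a small numerical experiment. The sketch below keeps only the linear part of the backward pass (repeated multiplication by the transposed recurrent matrix) and deliberately omits the activation derivatives, so it is a simplification rather than full backprop through time. The helper names and the 16-unit size are assumptions made for the demo.

```python
import numpy as np

def scaled_to_spectral_radius(W, rho):
    """Rescale W so that its largest eigenvalue magnitude equals rho."""
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))

def backprop_norms(W_hh, steps):
    """Norm of a gradient vector after k multiplications by W_hh^T, k = 1..steps."""
    n = W_hh.shape[0]
    g = np.ones(n) / np.sqrt(n)  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W_hh.T @ g  # linear part of one backward step (activation term omitted)
        norms.append(np.linalg.norm(g))
    return np.array(norms)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))

results = {}
for rho in (0.9, 1.1):
    results[rho] = backprop_norms(scaled_to_spectral_radius(W, rho), steps=200)
    print(f"rho={rho}: norm after 10 steps {results[rho][9]:.3e}, "
          f"after 200 steps {results[rho][-1]:.3e}")
```

With a real tanh RNN the activation derivatives are at most 1, which pushes further toward vanishing and can mask explosion when units saturate.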
LSTM
The Long Short-Term Memory cell (Hochreiter & Schmidhuber 1997) replaces the simple hidden state with two streams (cell state $c_t$ and hidden state $h_t$) plus three gates:
- Forget gate $f_t$: how much of $c_{t-1}$ to keep.
- Input gate $i_t$: how much new information to write.
- Output gate $o_t$: how much of $c_t$ to expose as $h_t$.
The cell state creates a (mostly) linear path through time. Gradients flow back through this path largely unattenuated as long as the forget gate is open. Empirically, LSTMs can capture dependencies hundreds of steps long.
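A minimal sketch of one LSTM step in the now-standard formulation follows. The packed weight layout (one matrix `W` of shape `(4*H, D+H)` holding all four blocks) and the names `lstm_step`, `W`, and `b` are assumptions chosen for compactness. The demo zeroes the weights and saturates the gates through the biases, purely to isolate the cell-state path: with the forget gate open and the input gate closed, the cell state survives 200 steps almost unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H) and b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0:H])          # forget gate: how much of c_prev to keep
    i = sigmoid(z[H:2 * H])      # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much of c to expose as h
    g = np.tanh(z[3 * H:4 * H])  # candidate values to write
    c = f * c_prev + i * g       # additive update: the (mostly) linear path
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4
W = np.zeros((4 * H, D + H))  # zero weights so the gates depend only on the biases
b = np.zeros(4 * H)
b[0:H] = 20.0       # forget gate saturated open (f close to 1)
b[H:2 * H] = -20.0  # input gate saturated closed (i close to 0)

h, c = np.zeros(H), rng.normal(size=H)
c_start = c.copy()
for _ in range(200):
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)

print(np.max(np.abs(c - c_start)))  # tiny: the stored value persisted
```

Because $\partial c_t / \partial c_{t-1}$ is just $f_t$ elementwise along this path, the same mechanism that preserves the value forward also preserves the gradient backward.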
GRU
The Gated Recurrent Unit (Cho et al. 2014) fuses the forget and input gates into a single update gate and drops the separate cell state. It is slightly faster, with often comparable accuracy. Use either; people don't fight about this.
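For comparison, here is a sketch of one GRU step. Sign conventions for the update gate differ between papers and libraries; this version assumes that `z` close to 1 means "overwrite with the candidate" and `z` close to 0 means "keep the old state". The function and parameter names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. Each W_* has shape (H, D+H); each b_* has shape (H,)."""
    xh = np.concatenate([x, h_prev])
    z = sigmoid(W_z @ xh + b_z)  # update gate: plays the forget and input roles at once
    r = sigmoid(W_r @ xh + b_r)  # reset gate: how much of h_prev feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([x, r * h_prev]) + b_h)
    # Single interpolation; there is no separate cell state to maintain.
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
D, H = 3, 4
W_z, W_r, W_h = (rng.normal(scale=0.5, size=(H, D + H)) for _ in range(3))
b_r, b_h = np.zeros(H), np.zeros(H)
x, h_prev = rng.normal(size=D), rng.normal(scale=0.5, size=H)

# Saturate the update gate through its bias to show both extremes.
h_keep = gru_step(x, h_prev, W_z, W_r, W_h, np.full(H, -20.0), b_r, b_h)
h_write = gru_step(x, h_prev, W_z, W_r, W_h, np.full(H, 20.0), b_r, b_h)
print(np.max(np.abs(h_keep - h_prev)))  # near zero: old state kept
```

Since the keep and write weights are tied as `1 - z` and `z`, the GRU needs three weight blocks where the LSTM needs four.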
What replaced them
For most sequence problems, transformers replaced LSTMs after 2017. Why:
- Parallelism: RNNs are inherently sequential — each step depends on the previous. Training is slow.
- Length scaling: attention attends to all positions in parallel; effective context length is set by compute and architecture, not by gradient flow.
- Empirical performance: at scale, transformers win on language, vision, audio, biology.
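The parallelism point can be made concrete with a bare-bones single-head self-attention sketch: every position is compared with every other position in one matrix product, with no loop over timesteps. This is a simplified illustration (no masking, no multiple heads, no positional encoding), and the names and sizes are assumptions.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])               # (T, T): all pairs at once
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 6, 8
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (6, 8) (6, 6)
```

The path between any two positions is a single step regardless of their distance, at the price of a `(T, T)` score matrix whose cost grows quadratically with sequence length.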
RNNs/LSTMs still appear in resource-constrained settings (mobile, streaming inference) and in time-series forecasting where their bias toward Markovian structure helps.