RNNs and LSTMs — Section 5: Deep Learning Architectures

Recurrent networks process sequences by maintaining a hidden state $h_t$ that gets updated at each timestep:

$$h_t = \phi(W_h h_{t-1} + W_x x_t + b)$$

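Here $\phi$ is an elementwise nonlinearity (typically $\tanh$) and $x_t$ is the input at step $t$. The same parameters $W_h, W_x, b$ are applied at every step. This is parameter sharing across time, analogous to a CNN's parameter sharing across space.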
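The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `rnn_forward`, the choice of $\tanh$ for $\phi$, and all shapes and random initial values are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_x, b, phi=np.tanh):
    """Apply h_t = phi(W_h h_{t-1} + W_x x_t + b) over a sequence.

    xs: array of shape (T, input_dim); h0: array of shape (hidden_dim,).
    Returns an array of shape (T, hidden_dim) holding h_1 .. h_T.
    """
    h = h0
    hs = []
    for x_t in xs:  # the same W_h, W_x, b are reused at every timestep
        h = phi(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
xs = rng.normal(size=(T, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)

hs = rnn_forward(xs, np.zeros(hidden_dim), W_h, W_x, b)
print(hs.shape)  # (5, 4)
```

Note that the loop cannot be reordered: each `h` depends on the one before it.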

Why vanilla RNNs failed

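Backpropagation through time multiplies together one Jacobian per timestep, and each Jacobian contains a copy of $W_h$. Ignoring the nonlinearity, if the spectral radius of $W_h$ is $< 1$, gradients shrink exponentially with the number of steps and vanish; if it is $> 1$, they grow exponentially and explode. A saturating $\phi$ such as $\tanh$ has derivative at most 1, which tends to shrink the factors further, so vanishing is the more common failure. In practice, plain RNNs can't reliably learn dependencies more than roughly 10 steps apart.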
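A small numerical check makes the exponential behaviour concrete. The sketch below assumes a linear RNN ($\phi$ = identity), so that $\partial h_T / \partial h_0$ is exactly a power of $W_h$, and it builds $W_h$ as a scaled orthogonal matrix so that every eigenvalue has modulus `rho`. Both simplifications, and the helper name `jacobian_norm`, are assumptions chosen to keep the example exact.

```python
import numpy as np

def jacobian_norm(W_h, steps):
    """Spectral norm of W_h**steps, i.e. of dh_T/dh_0 for a linear RNN."""
    return np.linalg.norm(np.linalg.matrix_power(W_h, steps), ord=2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # random orthogonal matrix

for rho in (0.9, 1.0, 1.1):
    W_h = rho * Q  # spectral radius (and every singular value) equals rho
    norms = [jacobian_norm(W_h, k) for k in (1, 10, 50)]
    print(rho, [f"{n:.3g}" for n in norms])
```

With this construction the norm after $k$ steps is `rho**k`, so 50 steps at `rho = 0.9` scale the gradient by about 0.005, while `rho = 1.1` scales it by more than 100.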

LSTM

The Long Short-Term Memory cell (Hochreiter & Schmidhuber 1997) replaces the simple hidden state with two streams (cell state $c_t$ and hidden state $h_t$) plus three gates:

Forget gate $f_t$: how much of $c_{t-1}$ to keep.
Input gate $i_t$: how much new information to write.
Output gate $o_t$: how much of $c_t$ to expose as $h_t$.

The cell state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, where $\tilde{c}_t$ is the candidate value computed from $h_{t-1}$ and $x_t$, is additive and so creates a (mostly) linear path through time. Gradients flow back through this path largely unattenuated as long as the forget gate stays open ($f_t \approx 1$). Empirically, LSTMs can capture dependencies hundreds of steps long.
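One LSTM step can be sketched as follows. The text above gives only the cell-state update; the gate parameterization here (sigmoid gates and a $\tanh$ candidate computed from the concatenation of $h_{t-1}$ and $x_t$, with $h_t = o_t \odot \tanh(c_t)$) is the commonly used formulation and is an assumption of this example, as are the name `lstm_step`, the stacked weight layout, and the forget-gate bias of 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W: shape (4 * hidden_dim, hidden_dim + input_dim), with rows stacked
       in the order [forget, input, output, candidate].
    b: shape (4 * hidden_dim,).
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])              # forget gate: how much of c_{t-1} to keep
    i_t = sigmoid(z[H:2 * H])          # input gate: how much new information to write
    o_t = sigmoid(z[2 * H:3 * H])      # output gate: how much of c_t to expose
    c_tilde = np.tanh(z[3 * H:4 * H])  # candidate cell value
    c_t = f_t * c_prev + i_t * c_tilde # additive cell-state update from the text
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_dim, hidden_dim + input_dim))
b = np.zeros(4 * hidden_dim)
b[:hidden_dim] = 1.0  # common heuristic (an assumption here): start with the forget gate mostly open

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

When `f_t` is close to 1 and `i_t` is close to 0, `c_t` is essentially a copy of `c_prev`, which is the unattenuated path described above.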

GRU

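The Gated Recurrent Unit (Cho et al. 2014) fuses the forget and input gates into a single update gate, adds a reset gate, and drops the separate cell state, keeping only $h_t$. It has fewer parameters than an LSTM, so it is slightly faster, and its accuracy is often comparable. Use either; people don't fight about this.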
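For comparison, here is a sketch of one GRU step in the same style. The equations are the commonly used formulation, not something stated in the text above: an update gate `z_t`, a reset gate `r_t`, and a single state `h_t`. The function name, argument layout, and the convention that `z_t` weights the new candidate (some references use `1 - z_t` instead) are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_c, b_z, b_r, b_c):
    """One GRU step; each W_* has shape (hidden_dim, hidden_dim + input_dim)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)  # update gate: plays the role of forget and input gates
    r_t = sigmoid(W_r @ hx + b_r)  # reset gate: how much of h_prev feeds the candidate
    h_tilde = np.tanh(W_c @ np.concatenate([r_t * h_prev, x_t]) + b_c)
    # One gate interpolates between keeping the old state and writing the new one.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
shape = (hidden_dim, hidden_dim + input_dim)
W_z, W_r, W_c = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_z = b_r = b_c = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_c, b_z, b_r, b_c)
print(h.shape)  # (4,)
```

The step uses three weight matrices where the LSTM sketch needs four, which is where the modest speed and parameter savings come from.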

What replaced them

For most sequence problems, transformers replaced LSTMs after 2017. Why:

Parallelism: RNNs are inherently sequential, since each step depends on the previous one, so training cannot be parallelized across timesteps and is slow.
Length scaling: attention attends to all positions in parallel; effective context length is set by compute and architecture, not by gradient flow.
Empirical performance: at scale, transformers win on language, vision, audio, biology.
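To illustrate the parallelism point, the sketch below computes single-head scaled dot-product self-attention for every position of a sequence with a few matrix products and no loop over time. It is a minimal, unmasked version written for this comparison; the function name `self_attention`, the projection matrices, and all sizes are assumptions, and it omits everything else a transformer layer contains.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Unmasked single-head self-attention over a sequence X of shape (T, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T): every position vs every position
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

# Illustrative sizes and random parameters (assumptions, not from the text).
rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (6, 4) (6, 6)
```

All `T` outputs come from the same batched matrix operations, whereas a recurrent network must finish step $t-1$ before it can start step $t$.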

RNNs/LSTMs still appear in resource-constrained settings (mobile, streaming inference) and in time-series forecasting where their bias toward Markovian structure helps.