Transformers are the dominant architecture for sequence modeling — language, music, code, even images. The breakthrough was attention, a mechanism that lets each output position selectively read from all input positions.
Self-attention
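Given a sequence of vectors $X \in \mathbb{R}^{n \times d}$, compute query ($Q$), key ($K$), and value ($V$) projections with learned weight matrices: $Q = XW_Q$, $K = XW_K$, $V = XW_V$. The attention output is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Each output position is a weighted sum of values, with weights determined by how well each query "matches" each key. The $1/\sqrt{d_k}$ scaling, where $d_k$ is the key dimension, prevents the softmax from saturating in high dimensions.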
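To make the weighted-sum picture concrete, here is a minimal sketch of scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, in plain Python. The helper names (`matmul`, `softmax`, `attention`) and the toy inputs are illustrative assumptions, and identity projections are used so that Q, K, and V all equal X. Real implementations would use a tensor library rather than nested lists.

```python
import math

def matmul(a, b):
    # Product of an (n x k) and a (k x m) matrix stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # q: (n x d_k), k: (m x d_k), v: (m x d_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]           # transpose K to (d_k x m)
    scores = matmul(q, k_t)                        # (n x m) query-key dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights             # outputs are weighted sums of values

# Toy input: three 2-dimensional vectors, with identity projections (an assumption
# made purely to keep the example small), so Q = K = V = X.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
Q, K, V = matmul(X, identity), matmul(X, identity), matmul(X, identity)

out, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the first query puts more weight on the keys it has a larger dot product with.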
Multi-head attention
Run multiple attention computations in parallel with different learned projections. Concatenate and project. Each "head" can attend to different patterns — short-range, long-range, syntactic, semantic.
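The "run in parallel, concatenate, project" recipe can be sketched as follows. This is a minimal illustration under stated assumptions: the weights are random rather than learned, the sizes (`d_model = 4`, two heads) are arbitrary, and the names `multi_head_attention`, `random_matrix`, and `w_o` are hypothetical. The heads are computed in a loop here for readability; practical implementations batch them.

```python
import math
import random

def matmul(a, b):
    # Product of an (n x k) and a (k x m) matrix stored as lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head_attention(x, heads, w_o):
    # heads: list of (W_q, W_k, W_v) tuples, one set of projections per head.
    head_outputs = []
    for w_q, w_k, w_v in heads:
        head_outputs.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the head outputs along the feature dimension, position by position.
    concat = [[value for head in head_outputs for value in head[i]] for i in range(len(x))]
    # Final output projection mixes information across heads.
    return matmul(concat, w_o)

def random_matrix(rows, cols, rng):
    # Stand-in for learned weights (assumption: random values for illustration).
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads = 4, 2
d_head = d_model // n_heads
x = random_matrix(3, d_model, rng)                       # 3 positions
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
w_o = random_matrix(n_heads * d_head, d_model, rng)

out = multi_head_attention(x, heads, w_o)
print(len(out), len(out[0]))
```

Splitting `d_model` across heads keeps the total cost close to that of one full-width attention while letting each head learn its own projections.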
Position information
Self-attention is permutation-equivariant — it has no notion of order without positional encodings. Add fixed sinusoidal or learned positional embeddings to the input.
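As one possible illustration of the fixed sinusoidal option, the sketch below builds the encoding table used in the original transformer, where dimensions 2i and 2i+1 of position `pos` hold the sine and cosine of `pos / 10000^(2i / d_model)`. The function names and the toy sizes are assumptions for the example; the base of 10000 follows the original formulation.

```python
import math

def sinusoidal_positions(n_positions, d_model, base=10000.0):
    table = []
    for pos in range(n_positions):
        row = []
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency: base ** (-2i / d_model).
            angle = pos / base ** ((j - j % 2) / d_model)
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(x, table):
    # Element-wise sum of token embeddings and their positional encodings.
    return [[a + b for a, b in zip(row, pe)] for row, pe in zip(x, table)]

table = sinusoidal_positions(n_positions=3, d_model=4)
embeddings = [[0.5, 0.5, 0.5, 0.5] for _ in range(3)]    # identical toy tokens
with_positions = add_positions(embeddings, table)
for row in with_positions:
    print([round(v, 3) for v in row])
```

The three input tokens are identical, yet the rows differ after the addition, which is what gives attention a way to tell positions apart.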
Encoder, decoder, both
- Encoder-only (BERT): bidirectional attention, good for understanding tasks (classification, embedding)
- Decoder-only (GPT): causal attention (can't see future tokens), good for generation
- Encoder-decoder (T5, original transformer): translation, summarization
Modern LLMs are almost all decoder-only: it turns out that a single architecture handles both understanding and generation when scaled up.
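The difference between bidirectional and causal attention comes down to which scores the softmax is allowed to see. A minimal sketch of the causal case, assuming raw query-key scores are already computed (the function name and the toy scores are illustrative):

```python
import math

def causal_attention_weights(scores):
    # scores: (n x n) raw query-key scores; position i may only attend to j <= i.
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # drop scores for future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get weight 0, which matches setting their
        # scores to -infinity before the softmax.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

scores = [
    [0.0, 1.0, 2.0],
    [1.0, 1.0, 5.0],
    [0.5, 0.5, 0.5],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The large score of 5.0 in the second row has no effect because it belongs to a future token. A bidirectional (encoder-style) variant would simply apply the softmax to each full row.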
Why transformers won
- Parallelism: unlike RNNs, all positions are processed simultaneously
- Long-range dependencies: any output can read any input directly (no information loss through intermediate steps)
- Scaling: empirically, transformer quality scales smoothly with parameters, data, and compute — predictable returns on investment
- Generality: works for text, images (Vision Transformer), audio, code, protein structures
Drawbacks
- Quadratic memory and compute in sequence length — context windows are expensive to scale (active research area)
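- Data hungry: transformers build in few assumptions about the input, so they need large datasets; on small datasets RNNs or CNNs are often the better choice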
- Position handling is bolted on: order has to be injected through positional encodings, whereas RNNs get it for free from sequential processing
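The quadratic cost listed above is easy to see with back-of-the-envelope arithmetic. The sketch below counts the bytes needed for the attention score matrix alone, assuming a naive implementation that materializes the full matrix, 16-bit values, and a single head in a single layer; the function name and defaults are hypothetical.

```python
def attention_matrix_bytes(seq_len, n_heads=1, bytes_per_value=2):
    # One (seq_len x seq_len) score matrix per head; 2 bytes assumes 16-bit floats.
    return seq_len * seq_len * n_heads * bytes_per_value

for n in (1024, 2048, 4096):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"seq_len={n}: {mib:.0f} MiB per head per layer")
```

Doubling the sequence length quadruples the figure, and the total grows further with the number of heads and layers.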