Transformers and Attention — Section 9: Neural Networks

Transformers are the dominant architecture for sequence modeling — language, music, code, even images. The breakthrough was attention, a mechanism that lets each output position selectively read from all input positions.

Self-attention

Given a sequence of $n$ vectors $X$, compute query ($Q$), key ($K$), and value ($V$) projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$. The attention output is:

$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

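Each output position is a weighted sum of the values, with weights determined by how well that position's query "matches" each key. Dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, keeps the dot products from growing with dimension; without it, large scores would push the softmax into saturation and shrink its gradients.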
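The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not an efficient implementation: matrices are lists of rows, and the toy input `X` and the identity projection `I2` are assumptions chosen only to make the weights easy to inspect.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])                                # key dimension
    K_T = [list(col) for col in zip(*K)]           # transpose of K
    scores = matmul(Q, K_T)                        # n x n query-key dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input (assumed): n = 3 vectors of dimension 2, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(X, I2, I2, I2)

for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, and the first position puts more weight on keys that share its direction than on the orthogonal one.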

Multi-head attention

Run several attention computations in parallel, each with its own learned projections, then concatenate the head outputs and apply a final learned projection. Each "head" can attend to different patterns: short-range, long-range, syntactic, semantic.
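A rough sketch of the concatenate-and-project step, again in plain Python. The names `heads` (a list of per-head projection triples) and `W_O` (the output projection) are illustrative assumptions, and the toy weights are hand-picked rather than learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head (assumed layout)."""
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate along the feature axis: row i joins row i of every head.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

# Toy setup (assumed): two heads, each projecting the 2-d input to 1 dimension.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first = [[1.0], [0.0]]    # head 1 looks only at the first coordinate
second = [[0.0], [1.0]]   # head 2 looks only at the second coordinate
heads = [(first, first, first), (second, second, second)]
W_O = [[1.0, 0.0], [0.0, 1.0]]

Y = multi_head_attention(X, heads, W_O)
```

Because each head sees a different projection of the same input, the two output columns here are computed from entirely separate attention patterns.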

Position information

Self-attention is permutation-equivariant: reordering the input vectors simply reorders the outputs the same way, so the mechanism has no notion of order on its own. To supply one, add fixed sinusoidal or learned positional embeddings to the input.
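As one concrete option, the fixed sinusoidal scheme from the original transformer can be sketched as follows. The function names and the default `base=10000.0` parameter name are assumptions for illustration; learned embeddings would replace `sinusoidal_positions` with a trainable table.

```python
import math

def sinusoidal_positions(n, d_model, base=10000.0):
    """Return an n x d_model table: sin on even indices, cos on odd indices."""
    pe = [[0.0] * d_model for _ in range(n)]
    for pos in range(n):
        for i in range(0, d_model, 2):
            # Each pair of dimensions (i, i + 1) shares one wavelength.
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_positions(X, pe):
    """Add the positional table to the input, element by element."""
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]

pe = sinusoidal_positions(n=4, d_model=4)

# Two identical input vectors become distinguishable once positions are added.
X = [[0.5, 0.5, 0.5, 0.5]] * 4
X_pos = add_positions(X, pe)
```

After the addition, identical tokens at different positions no longer produce identical queries and keys, which is what lets attention take order into account.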

Encoder, decoder, both

Encoder-only (BERT): bidirectional attention, good for understanding tasks (classification, embedding)
Decoder-only (GPT): causal attention (can't see future tokens), good for generation
Encoder-decoder (T5, original transformer): translation, summarization

Modern LLMs are almost all decoder-only: it turns out that a single architecture handles both understanding and generation when scaled up.
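The difference between bidirectional and causal attention comes down to which scores each position is allowed to use. Below is a minimal sketch of the causal case; `causal_softmax` is a hypothetical helper name. Implementations commonly set masked scores to negative infinity before the softmax, which gives the same result as dropping them as done here.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                     # drop scores for future tokens
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)        # future positions get zero weight
        weights.append([e / total for e in exps] + masked)
    return weights

# Uniform toy scores (assumed) make the mask's effect easy to see.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
weights = causal_softmax(scores)

for row in weights:
    print([round(w, 3) for w in row])
```

The resulting weight matrix is lower triangular: the first token sees only itself, the last token sees the whole prefix. A bidirectional encoder would apply the softmax to every full row instead.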

Why transformers won

Parallelism: unlike RNNs, all positions are processed simultaneously
Long-range dependencies: any output can read any input directly (no information loss through intermediate steps)
Scaling: empirically, transformer quality scales smoothly with parameters, data, and compute — predictable returns on investment
Generality: works for text, images (Vision Transformer), audio, code, protein structures

Drawbacks

Quadratic memory and compute in sequence length — context windows are expensive to scale (active research area)
Data hungry: with little training data, RNNs and CNNs, whose built-in structure acts as a stronger prior, often do better
Position handling is bolted on through positional encodings, whereas RNNs get order for free from their sequential computation
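To give a feel for the quadratic cost listed first, here is a back-of-the-envelope calculation of the memory needed for the $n \times n$ attention score matrix alone. It assumes an implementation that materializes the full matrix and stores each score in 2 bytes; the function name and defaults are illustrative, and real models multiply this by the number of heads and layers.

```python
def attention_matrix_bytes(seq_len, num_heads=1, bytes_per_value=2):
    """Bytes needed to store one n x n score matrix per head (assumed 2-byte values)."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (1_024, 4_096, 16_384):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:>6}: {mib:,.0f} MiB per head")
```

Doubling the sequence length quadruples this figure, which is why long context windows are costly.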