Attention — Section 5: Deep Learning Architectures

Attention is a soft, content-based lookup. Given a query $q$, a set of keys $\{k_i\}$, and corresponding values $\{v_i\}$, the output is a weighted average of the values, where the weights are computed from query-key similarity:

\text{Attention}(q, K, V) = \text{softmax}\left(\frac{q K^T}{\sqrt{d_k}}\right) V
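
The formula can be sketched for a single query in NumPy. This is a minimal illustration, not a reference implementation: the function names `softmax` and `attention`, the row-vector convention for $q$, and the toy values of `q`, `K`, `V` are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(q, K, V):
    """Single-query attention: q is (d_k,), K is (n, d_k), V is (n, d_v)."""
    d_k = K.shape[-1]
    scores = K @ q / np.sqrt(d_k)  # one similarity score per key, shape (n,)
    weights = softmax(scores)      # non-negative, sums to 1
    return weights @ V, weights    # weighted average of the value rows

# Toy data (hypothetical): the first key matches the query best.
q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
out, w = attention(q, K, V)
```

Here `w` is largest for the first key and smallest for the third, so `out` is pulled toward the first value row without ignoring the others.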

Why "soft"

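Hard attention picks a single key. Soft attention places a probability distribution over all keys and averages the values under it, which keeps the operation differentiable end-to-end. The softmax makes the weights non-negative and sum to 1, and it assigns exponentially more mass to keys with higher similarity.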
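
The contrast can be made concrete with a small sketch. The scores and values below are made-up numbers chosen only for illustration.

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])  # illustrative query-key similarities
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Hard attention: select the value of the single best-matching key.
# argmax is piecewise constant, so it passes no gradient to the scores.
hard_out = values[np.argmax(scores)]

# Soft attention: softmax weights, then a weighted average of all values.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_out = weights @ values
```

With these numbers the soft weights come out to roughly 0.66, 0.24, and 0.10: the best match dominates, but every key still contributes.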

Why scale by $\sqrt{d_k}$

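Without it, the dot products $q k_i$ grow in magnitude with the dimension: if the components of $q$ and $k_i$ are independent with mean 0 and variance 1, then $q k_i$ has variance $d_k$. Large scores push the softmax into saturation, where one weight is close to 1 and gradients vanish. Dividing by $\sqrt{d_k}$ keeps the variance of the scaled scores around 1 regardless of dimension.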
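
A quick simulation illustrates the variance argument. It assumes the components of $q$ and $k_i$ are independent standard normals, which is an idealization rather than a property of trained models. The helper `score_variances` and the sample sizes are hypothetical choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_variances(d_k, n_samples=10000):
    # Assumption: components of q and k are i.i.d. with mean 0, variance 1.
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)   # unscaled dot products
    scaled = raw / np.sqrt(d_k)   # scores after dividing by sqrt(d_k)
    return raw.var(), scaled.var()

for d_k in (4, 64, 256):
    raw_var, scaled_var = score_variances(d_k)
    print(d_k, round(raw_var, 1), round(scaled_var, 2))
```

Under these assumptions the raw variance should track $d_k$, while the scaled variance should stay close to 1 for every dimension.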

Self-attention

Set $Q, K, V$ all to projections of the same input sequence $X$:

Q = X W_Q, \quad K = X W_K, \quad V = X W_V

Each position queries every position, including itself. The output at position $i$ is a mixture of values at all positions, weighted by content similarity. Crucially, every position attends to every position in a single layer, no matter how far apart. An RNN cannot do this: it has to pass information through a number of sequential steps that grows with the distance between the two positions.
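
Self-attention over a whole sequence can be sketched as follows. The shapes, the random weights, and the names `softmax_rows` and `self_attention` are assumptions for illustration; a real layer would learn $W_Q, W_K, W_V$.

```python
import numpy as np

def softmax_rows(S):
    # Row-wise softmax with max-subtraction for numerical stability.
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    # A has shape (n, n); row i holds the weights position i puts on all positions.
    A = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return A @ V, A

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4  # illustrative sizes
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
out, A = self_attention(X, W_Q, W_K, W_V)
```

The matrix `A` makes the "every position attends to every position" claim visible: entry `A[i, j]` is computed directly from positions `i` and `j`, however far apart they are.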

Multi-head attention

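Project $Q, K, V$ into $h$ different subspaces and apply attention in each. Concatenate the outputs and project back. Different heads can attend to different patterns (syntactic position, coreference, content). In the common setup, where each head works in a subspace of dimension $d/h$ for model width $d$, the parameter count is about the same as for a single full-width head; the gain is that one layer can hold several distinct attention patterns at once.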
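
A compact sketch of the project, split, attend, concatenate, project-back sequence is given here. It assumes the common convention that the model width is divisible by $h$ and each head gets an equal slice. The output projection name `W_O` and the function names are assumptions of this example.

```python
import numpy as np

def softmax_rows(S):
    # Softmax over the last axis with max-subtraction for stability.
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h  # assumption: d_model is divisible by h

    def split(M):
        # (n, d_model) -> (h, n, d_head): one slice of features per head.
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax_rows(scores) @ V                     # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                                  # project back

rng = np.random.default_rng(0)
n, d_model, h = 6, 8, 2  # illustrative sizes
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

In this particular sketch the four weight matrices are `d_model` by `d_model` whatever the value of `h`, so changing the number of heads changes how the features are partitioned, not how many parameters there are.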

What attention is and isn't

Attention is not a replacement for compute; it spreads compute over the input. Its big weakness is time and memory quadratic in sequence length, because every position is scored against every other position. Long-context architectures (sparse attention, linear attention, state-space models) are still active research.
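
To get a feel for the quadratic cost, the following back-of-the-envelope sketch counts the bytes in one score matrix. It assumes the full $n \times n$ matrix is materialized in float32 for a single head in a single layer; the helper `score_matrix_bytes` is hypothetical.

```python
def score_matrix_bytes(n, bytes_per_entry=4):
    # One n-by-n score matrix (float32 by default), one head, one layer.
    return n * n * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    print(n, score_matrix_bytes(n) / 1e9, "GB")
```

Under those assumptions, going from 10,000 to 100,000 positions takes the matrix from 0.4 GB to 40 GB: ten times the length costs a hundred times the memory.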
