Attention is a soft, content-based lookup. Given a query , a set of keys , and corresponding values , the output is a weighted average of values where weights are computed from query-key similarity:
Why "soft"
Hard attention picks one key. Soft attention puts a probability over keys and averages — differentiable end-to-end. The softmax makes the weights sum to 1 and concentrates mass on the highest-similarity key.
Why scale by $\sqrt{d_k}$
Without it, dot products grow with dimension, pushing softmax into saturation. Inside saturation, gradients vanish. Dividing by keeps the variance of around 1 regardless of dimension.
Self-attention
Set all to projections of the same input sequence :
Each position queries every other position. The output at position is a mixture of values at all positions, weighted by content similarity. Crucially: every position attends to every position in a single layer, no matter how far apart. This is what RNNs can't do without depth blowing up.
Multi-head attention
Project into different subspaces and apply attention in each. Concatenate the outputs and project back. Different heads can attend to different patterns (syntactic position, coreference, content). Strictly more expressive than single-head, at the cost of slightly more parameters.
What attention is and isn't
Attention is not a replacement for compute — it spreads compute over input. Quadratic time and memory in sequence length is its big weakness. Long-context architectures (sparse attention, linear attention, state-space models) are still active research.