Transformers — Section 5: Deep Learning Architectures

A transformer block applies self-attention followed by a per-position feedforward network, with residual connections and layer normalization around each. Stack many blocks and you have the architecture that runs almost all of modern NLP, much of vision, and increasingly other modalities.

Block structure

y = x + \text{Attention}(\text{LN}(x))

z = y + \text{FFN}(\text{LN}(y))

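The residual + LN pattern matters for two reasons: the residual connection gives gradients an identity path through every block, and layer normalization keeps the scale of each sublayer's input stable, which makes deep stacks much easier to optimize.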
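The two equations above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes single-head attention, a ReLU feedforward network, and layer normalization without learned scale and shift. The names `transformer_block`, `init_params`, and the parameter keys are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    # Learned scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv, Wo):
    # Single-head scaled dot-product self-attention (an assumption; real
    # models split the features across several heads).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ Wo

def ffn(x, W1, W2):
    # Per-position feedforward network with a ReLU nonlinearity.
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(x, p):
    # y = x + Attention(LN(x))
    y = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    # z = y + FFN(LN(y))
    z = y + ffn(layer_norm(y), p["W1"], p["W2"])
    return z

def init_params(d_model, d_ff, rng):
    # Small random weights, for illustration only.
    shapes = {
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
        "W1": (d_model, d_ff), "W2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))            # 6 positions, 16 features each
params = init_params(16, 64, rng)
z = transformer_block(x, params)
print(z.shape)
```

The output has the same shape as the input, which is what lets blocks be stacked.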

Positional encoding

Self-attention is permutation-equivariant: shuffling the inputs shuffles the outputs identically. To inject sequence order, add (or otherwise mix in) positional encodings:

Sinusoidal (original transformer): fixed sinusoidal functions of position at multiple frequencies.
Learned absolute: per-position vectors trained alongside the model.
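Relative (T5, ALiBi): add a bias to attention scores that depends on the offset between query and key positions. Tends to generalize better to longer sequences than learned absolute encodings, which have no trained vector for positions beyond the training length.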
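RoPE (LLaMA family): rotate query/key vectors by an angle proportional to position, so that their dot product depends only on the relative offset. Handles long contexts well in practice, though extending far beyond the training length typically relies on rescaling the rotation frequencies or further fine-tuning.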
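As a rough sketch of the sinusoidal and RoPE entries above, the snippet below builds a sinusoidal table and applies a rotary transform to single query/key vectors. It assumes an even feature dimension, the commonly used base of 10000, and an interleaved pairing of features for the rotation. Real implementations differ in layout, and the function names here are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    # PE[pos, 2i]   = sin(pos / base^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / base^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / base ** (2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, positions, base=10000.0):
    # Rotate each feature pair (x[2i], x[2i+1]) by the angle pos * theta_i.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # one frequency per pair
    angles = positions[:, None] * theta[None, :]     # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Dot product of a query placed at position m and a key at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T)[0, 0]

pe = sinusoidal_encoding(10, 16)
# Same offset (n - m = 4) at different absolute positions.
print(np.isclose(score(3, 7), score(13, 17)))
```

The final check illustrates the property that makes RoPE a relative scheme: shifting both positions by the same amount leaves the attention score unchanged.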

Encoder, decoder, encoder-decoder

Encoder-only (BERT): bidirectional attention. Used for understanding tasks — classification, NER, retrieval.
Decoder-only (GPT family): causal attention (each position attends only to itself and earlier positions). Used for generation. The standard for LLMs.
Encoder-decoder (T5, original Transformer): encoder processes input, decoder generates output attending to both its own prefix and the encoder output. Used for translation, summarization.
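The difference between bidirectional and causal attention comes down to a mask on the score matrix. The sketch below is illustrative only: it computes attention weights for a toy sequence, and the function name `attention_weights` and its `causal` flag are hypothetical.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention_weights(queries, keys, causal=False):
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    if causal:
        # Entry (i, j) with j > i is a future position: set its score to
        # -inf so it receives exactly zero weight after the softmax.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                           # 5 positions, 8 features

bidirectional = attention_weights(h, h)               # encoder-style
causal = attention_weights(h, h, causal=True)         # decoder-style

print(np.round(causal, 2))  # lower-triangular: no weight on future positions
```

In an encoder-decoder model, the decoder would use the causal form over its own prefix and an unmasked form when attending to the encoder output.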

Pre-LN vs Post-LN

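The original paper applied LN *after* the residual addition ("post-LN"). Modern practice applies it *before* the sublayer, inside the residual branch ("pre-LN"), as in the block equations above. Pre-LN trains more stably in deep stacks and is less sensitive to learning-rate warmup, which is why most modern LLMs use it.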
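A small sketch of the two placements, assuming a generic `sublayer` callable standing in for attention or the FFN (the helper names are hypothetical). With a sublayer that outputs zeros, pre-LN returns its input untouched, while post-LN still renormalizes the residual stream, which shows that only pre-LN keeps a clean identity path.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Per-position normalization, without learned scale and shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln(x, sublayer):
    # LN sits inside the residual branch; x itself passes through unchanged.
    return x + sublayer(layer_norm(x))

def post_ln(x, sublayer):
    # LN is applied after the residual addition, so it rescales x as well.
    return layer_norm(x + sublayer(x))

def zero_sublayer(h):
    # Stand-in sublayer that contributes nothing (illustration only).
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 8))

print(np.array_equal(pre_ln(x, zero_sublayer), x))   # identity path preserved
print(np.allclose(post_ln(x, zero_sublayer), x))     # input was renormalized
```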

Why transformers won

Parallelism: during training, every position is computed in parallel across the sequence rather than step by step as in a recurrent network; a natural fit for GPUs.
Scaling: loss keeps falling smoothly, roughly as a power law, as parameters and data grow. Scaling laws (Kaplan et al., Hoffmann et al.) gave a recipe for spending compute well.
Generality: the same architecture handles text, code, images, audio, proteins, and more. A pretrained transformer is the new universal feature extractor.
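To make the "spending compute well" point concrete, here is a back-of-the-envelope sketch. It rests on two rules of thumb that are assumptions here, not results from this section: training compute is approximately 6 FLOPs per parameter per token, and a compute-optimal run uses on the order of 20 tokens per parameter (a figure often quoted from Hoffmann et al.). The function name and the budget value are hypothetical.

```python
import math

def compute_optimal_split(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    # Assumed relations (rules of thumb, not exact laws):
    #   flops ~= flops_per_param_token * N * D
    #   D     ~= tokens_per_param * N
    # Solving for N gives N = sqrt(flops / (flops_per_param_token * tokens_per_param)).
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical training budget, in FLOPs.
n, d = compute_optimal_split(5.76e23)
print(f"{n:.2e} parameters, {d:.2e} tokens")
```

Under these assumptions the budget splits into roughly 7e10 parameters and 1.4e12 tokens. Changing either rule of thumb shifts the split, so treat the numbers as an order-of-magnitude guide.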