Distributed Training — Section 6: Training Modern Models

Training large models requires splitting work across many devices. There are three orthogonal strategies (data, tensor, and pipeline parallelism), and they are often combined.

Data parallelism

Each GPU holds a full copy of the model and processes a different mini-batch. After the backward pass, gradients are all-reduced across GPUs so every replica sees the same averaged gradient. Each replica then applies an identical optimizer step, so the copies stay in sync.

Effective batch size = per-GPU batch × number of GPUs.
Scaling: close to linear until communication bandwidth becomes the limit. Beyond roughly 16–64 GPUs, the all-reduce tends to become the bottleneck unless you use high-speed interconnects (NVLink, InfiniBand).
Memory: parameters, gradients, and optimizer state are replicated on every GPU; activations are held per GPU for its local mini-batch. No savings in model-state memory.
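
The synchronous step described above can be sketched in plain Python. This is a toy simulation on one process, not a real multi-GPU implementation: the scalar model, the helper names (grad, all_reduce_mean, data_parallel_step), and the data are all assumptions made for illustration. It shows that averaging per-replica gradients over equal-sized shards gives the same update as one step on the full effective batch.

```python
# Toy model: scalar linear regression y_hat = w * x with squared-error loss.
def grad(w, batch):
    """Mean gradient of (w*x - y)^2 with respect to w over one mini-batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the same average."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

def data_parallel_step(weights, shards, lr):
    """One synchronous step: local backward, all-reduce, identical local update."""
    local_grads = [grad(w, shard) for w, shard in zip(weights, shards)]
    synced = all_reduce_mean(local_grads)
    return [w - lr * g for w, g in zip(weights, synced)]

# 4 simulated GPUs with a per-GPU batch of 2, so the effective batch size is 8.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_gpus, per_gpu_batch = 4, 2
shards = [data[i * per_gpu_batch:(i + 1) * per_gpu_batch] for i in range(num_gpus)]
weights = [0.0] * num_gpus  # every replica starts from the same parameters

new_weights = data_parallel_step(weights, shards, lr=0.01)
single_gpu = 0.0 - 0.01 * grad(0.0, data)  # the same step on the full batch of 8
```

Here every entry of new_weights matches single_gpu up to floating-point rounding. The equivalence relies on the shards being the same size; with uneven shards a plain average of local means would weight examples unequally.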

ZeRO / FSDP (sharded data parallelism)

Shard the optimizer state, gradients, and (in ZeRO-3 / FSDP) parameters across GPUs. Each GPU stores only a slice. Parameters are gathered just-in-time before each layer's forward pass (and again for its backward pass) and freed afterwards. Trades extra communication for memory.
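
A rough single-process sketch of the gather-then-free pattern may help. The elementwise "layer", the helper names (shard, all_gather, sharded_forward), and the peak-memory bookkeeping are assumptions for illustration only; real ZeRO-3 / FSDP implementations overlap communication with compute and handle the backward pass as well.

```python
def shard(params, world_size):
    """Split a flat parameter list into equal contiguous slices, one per rank."""
    assert len(params) % world_size == 0, "sketch assumes an even split"
    n = len(params) // world_size
    return [params[r * n:(r + 1) * n] for r in range(world_size)]

def all_gather(shards):
    """Stand-in for an all-gather: concatenate every rank's slice."""
    return [p for s in shards for p in s]

def sharded_forward(layers, x, rank=0):
    """layers[l][r] is rank r's slice of layer l.

    Returns the output and the peak number of parameters resident on `rank`.
    """
    resident = sum(len(layer[rank]) for layer in layers)  # own slices, always kept
    peak = resident
    for layer in layers:
        full = all_gather(layer)  # materialise the whole layer just in time
        peak = max(peak, resident + len(full) - len(layer[rank]))
        x = [w * xi for w, xi in zip(full, x)]  # toy elementwise layer
        del full  # free the gathered copy before moving to the next layer
    return x, peak

world_size = 4
layers = [shard(w, world_size) for w in ([1, 2, 3, 4], [2, 2, 2, 2], [1, 0, 1, 0])]
out, peak = sharded_forward(layers, [1, 1, 1, 1])
```

In this toy setup each rank permanently owns 3 of the 12 parameters and peaks at 6 while one layer is gathered, compared with 12 under full replication.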

ZeRO-3 / FSDP lets you train models that do not fit on a single GPU while keeping the programming model of plain data parallelism. A common default for LLM training as of 2024.
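
To make the memory savings concrete, here is a back-of-the-envelope calculator. It assumes mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state (master weights, momentum, variance); those byte counts are an assumption of this sketch, and activations and buffers are ignored. The function name zero_bytes_per_gpu is made up for this example.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stages 0 to 3.

    Stage 0 is plain data parallelism (everything replicated).
    """
    PARAM, GRAD, OPT = 2, 2, 12  # assumed bytes per parameter, see lead-in
    if stage == 0:
        per_param = PARAM + GRAD + OPT
    elif stage == 1:  # shard optimizer state only
        per_param = PARAM + GRAD + OPT / num_gpus
    elif stage == 2:  # shard optimizer state and gradients
        per_param = PARAM + (GRAD + OPT) / num_gpus
    elif stage == 3:  # shard parameters as well
        per_param = (PARAM + GRAD + OPT) / num_gpus
    else:
        raise ValueError("stage must be 0, 1, 2, or 3")
    return num_params * per_param

# Example: a 7.5B-parameter model on 64 GPUs, reported in GB (1e9 bytes).
gb = {s: zero_bytes_per_gpu(7.5e9, 64, s) / 1e9 for s in range(4)}
```

Under these assumptions the per-GPU model state drops from 120 GB with plain data parallelism to roughly 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.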

Model (tensor) parallelism

Split a single layer's parameters across GPUs. For a linear layer $y = Wx$, partition $W$ column-wise across GPUs and split $x$ to match: GPU $i$ computes the partial product $W_i x_i$, and an all-reduce sums the partials to give $y = \sum_i W_i x_i$. Megatron-LM style.
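
The column-wise split can be checked with a small pure-Python sketch. The GPUs are simulated in a loop and the all-reduce is an ordinary sum; the function names and the example matrix are illustrative assumptions, not Megatron-LM code.

```python
def matvec(W, x):
    """Dense y = W x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def column_parallel_matvec(W, x, num_gpus):
    """Compute y = W x with W split column-wise across simulated GPUs."""
    assert len(x) % num_gpus == 0, "sketch assumes an even split"
    k = len(x) // num_gpus
    partials = []
    for g in range(num_gpus):
        W_g = [row[g * k:(g + 1) * k] for row in W]  # column block held by GPU g
        x_g = x[g * k:(g + 1) * k]                   # matching slice of the input
        partials.append(matvec(W_g, x_g))            # full-length partial sum
    # All-reduce (sum): add the partial outputs elementwise.
    return [sum(vals) for vals in zip(*partials)]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 0, 2, 1]
y_parallel = column_parallel_matvec(W, x, num_gpus=2)
y_reference = matvec(W, x)
```

Each simulated GPU produces a vector of the full output length, which is why the partials must be summed. Splitting $W$ by rows instead would give each GPU a disjoint slice of $y$, and the results would be concatenated rather than summed.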

Useful for very wide layers (attention, large MLPs). Communication is heavier than in data parallelism because collectives run inside every layer's forward and backward pass, so it is typically limited to GPUs within one high-bandwidth node (NVLink).

Pipeline parallelism

Split the model into sequential stages, each living on a different GPU. The forward pass moves through the stages like an assembly line; the backward pass runs through them in reverse. Without micro-batching, only one GPU is active at a time; the resulting idle time is the "pipeline bubble." Splitting each batch into micro-batches shrinks the bubble but adds scheduling complexity.
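
The size of the bubble can be estimated with a simple schedule simulation. This sketch assumes a forward-only, GPipe-style schedule in which every stage takes one time step per micro-batch; the function names are made up for illustration, and real schedules (for example interleaved ones) behave differently.

```python
def pipeline_schedule(num_stages, num_microbatches):
    """grid[t][s] is the micro-batch index stage s works on at step t, or None if idle."""
    steps = num_stages + num_microbatches - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None for s in range(num_stages)]
        for t in range(steps)
    ]

def bubble_fraction(num_stages, num_microbatches):
    """Fraction of (stage, time step) slots in which a stage sits idle."""
    grid = pipeline_schedule(num_stages, num_microbatches)
    slots = [cell for row in grid for cell in row]
    return slots.count(None) / len(slots)

no_microbatching = bubble_fraction(num_stages=4, num_microbatches=1)
with_microbatching = bubble_fraction(num_stages=4, num_microbatches=8)
```

With 4 stages and a single batch, three quarters of the slots are idle, which matches the "one GPU active at a time" picture. With 8 micro-batches the idle share falls to 3/11, in line with the closed form $(p - 1) / (m + p - 1)$ for $p$ stages and $m$ micro-batches under the same assumptions.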

Used for the depth dimension: when each individual layer fits on a GPU but the full model doesn't.

3D parallelism

Combine all three: data parallel across nodes, model (tensor) parallel within a node, pipeline parallel across nodes. This is the approach used to train GPT-3-class models. Frameworks: Megatron-LM, DeepSpeed.
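
One way to picture the combination is as a mapping from a global GPU rank to a coordinate on each parallelism axis. The layout below is a hypothetical one chosen for illustration (tensor-parallel rank varies fastest so that a tensor-parallel group occupies consecutive ranks on one node); it is not claimed to match the ordering used by Megatron-LM or DeepSpeed.

```python
def rank_to_coords(rank, tp, pp, dp):
    """Map a global rank to (tensor, data, pipeline) coordinates.

    Assumed layout: tp varies fastest, then dp, then pp.
    """
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range for this configuration")
    return {
        "tp": rank % tp,
        "dp": (rank // tp) % dp,
        "pp": rank // (tp * dp),
    }

# Example: 64 GPUs as 8-way tensor, 4-way pipeline, 2-way data parallel.
tp, pp, dp = 8, 4, 2
coords = [rank_to_coords(r, tp, pp, dp) for r in range(tp * pp * dp)]
```

With 8 GPUs per node, ranks 0 to 7 share one node and form a single tensor-parallel group, while ranks that differ only in "dp" or "pp" sit on different nodes. The product tp × pp × dp must equal the total number of GPUs.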