Training large models requires splitting work across many devices. There are three orthogonal strategies (data, tensor, and pipeline parallelism), and they are often combined.
Data parallelism
Each GPU has a full copy of the model and processes a different mini-batch. After the backward pass, gradients are all-reduced across GPUs so every replica sees the average. Each replica then applies the same optimizer step to the same gradient, so the copies stay in sync.
- Effective batch size = per-GPU batch × number of GPUs.
- Scaling: near-linear until communication bandwidth becomes the limit. Beyond roughly 16–64 GPUs, all-reduce tends to become the bottleneck unless you're using high-speed interconnects (NVLink, InfiniBand).
- Memory: parameters, gradients, and optimizer state are replicated in full on every GPU, and each GPU also holds the activations for its own mini-batch. No savings.
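The gradient averaging and batch-size arithmetic above can be sketched in plain Python. This is a toy simulation, not real multi-GPU code: the 1-D model, the data, and the helper names `grad_mse` and `all_reduce_mean` are assumptions made for illustration, and the "all-reduce" is an ordinary function call.

```python
# Toy simulation of synchronous data parallelism for the model y = w * x
# with a mean-squared-error loss. No GPUs or real collectives are involved.

def grad_mse(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0),
        (5.0, 10.0), (6.0, 12.0), (7.0, 14.0), (8.0, 16.0)]
num_replicas = 4
per_replica_batch = len(data) // num_replicas

# Each replica gets its own equally sized mini-batch.
shards = [data[i * per_replica_batch:(i + 1) * per_replica_batch]
          for i in range(num_replicas)]

w = 0.5  # every replica starts from the same weights
local_grads = [grad_mse(w, shard) for shard in shards]
synced_grads = all_reduce_mean(local_grads)

effective_batch = per_replica_batch * num_replicas
full_batch_grad = grad_mse(w, data)

print(effective_batch, synced_grads[0], full_batch_grad)
```

With equally sized shards, the averaged gradient matches the gradient computed on the combined batch, which is why the replicas behave like one model trained with the effective batch size.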
ZeRO / FSDP (sharded data parallelism)
Shard the optimizer state, gradients, and (in ZeRO-3 / FSDP) parameters across GPUs. Each GPU stores only a slice. Parameters are gathered just-in-time before each layer's forward pass (and again for its backward pass) and freed afterwards. This trades extra communication for memory.
ZeRO-3 / FSDP lets you train models that don't fit on a single GPU while remaining, logically, plain data parallelism: every GPU still runs the whole model on its own mini-batch. It is a standard choice for LLM training as of 2024.
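To make the memory trade-off concrete, here is a rough per-GPU estimate for each sharding stage. It assumes mixed-precision Adam at 16 bytes per parameter (2 for fp16 parameters, 2 for fp16 gradients, 12 for fp32 optimizer state) and ignores activations and temporary buffers. The function name and the stage numbering (0 for plain data parallelism) are conventions chosen for this sketch.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for parameters, gradients, and optimizer state.

    Assumption: mixed-precision Adam, i.e. per parameter
      2 B fp16 weights + 2 B fp16 gradients + 12 B fp32 optimizer state
      (master weights, momentum, variance).
    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer state sharded
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return num_params * (params + grads + optim)

# Example: a 7.5B-parameter model on 64 GPUs.
for stage in range(4):
    gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, only stage 3 makes the per-GPU model state shrink in proportion to the number of GPUs; stages 1 and 2 leave the full parameter copy on every device.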
Model (tensor) parallelism
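Split a single layer's parameters across GPUs. For a linear layer Y = XW, partition W column-wise across GPUs; each GPU computes a slice of the output columns, and the slices are combined with an all-gather. Megatron-LM pairs this with a row-wise split of the following layer, whose partial sums are combined with a single all-reduce.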
Useful for very wide layers (attention, large MLPs). Communication is heavier than in data parallelism, so it is typically confined to a single high-bandwidth node (NVLink).
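The two ways of splitting a weight matrix can be checked with a small pure-Python sketch. The matrices, the two-"GPU" split, and the `matmul` helper are illustrative assumptions; list concatenation and elementwise summation stand in for the all-gather and all-reduce collectives.

```python
def matmul(a, b):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

X = [[1, 2, 3, 4],
     [5, 6, 7, 8]]                 # 2 x 4 input
W = [[1, 0, 2, 1],
     [0, 1, 1, 2],
     [3, 1, 0, 0],
     [1, 2, 1, 3]]                 # 4 x 4 weight
num_gpus = 2
full = matmul(X, W)                # single-device reference result

# Column-parallel: GPU g holds a block of W's columns and produces the
# matching block of output columns. Concatenating the blocks plays the
# role of an all-gather.
cols = len(W[0]) // num_gpus
col_shards = [[row[g * cols:(g + 1) * cols] for row in W]
              for g in range(num_gpus)]
col_outputs = [matmul(X, shard) for shard in col_shards]
col_result = [sum((out[i] for out in col_outputs), [])
              for i in range(len(X))]

# Row-parallel: GPU g holds a block of W's rows plus the matching block of
# X's columns and produces a full-shaped partial sum. Adding the partials
# plays the role of an all-reduce.
rows = len(W) // num_gpus
row_partials = [matmul([r[g * rows:(g + 1) * rows] for r in X],
                       W[g * rows:(g + 1) * rows])
                for g in range(num_gpus)]
row_result = [[sum(p[i][j] for p in row_partials)
               for j in range(len(W[0]))]
              for i in range(len(X))]

print(col_result == full, row_result == full)
```

Both splits reproduce the single-device product; they differ in which collective is needed to reassemble the output.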
Pipeline parallelism
Split the model into sequential stages, each living on a different GPU. The forward pass moves through the stages like an assembly line; the backward pass runs through them in reverse. Without micro-batching, only one GPU is active at a time, and the resulting idle time is the "pipeline bubble." Splitting each batch into micro-batches shrinks the bubble but complicates scheduling.
Used for the depth dimension: when each individual layer's parameters fit on a GPU but the full model doesn't.
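The size of the bubble can be estimated with a short sketch. It assumes a simple schedule in which every stage takes one time step per micro-batch and all micro-batches flow through before the pipeline drains; under that assumption the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. The function names are illustrative, and only the forward pass is simulated.

```python
def bubble_fraction(num_stages, num_microbatches):
    """Closed-form idle fraction, assuming equal-cost stages: (p-1)/(m+p-1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

def forward_timeline(num_stages, num_microbatches):
    """One row per stage; each slot holds the micro-batch index processed
    at that time step, or None when the stage is idle."""
    total_steps = num_microbatches + num_stages - 1
    timeline = []
    for stage in range(num_stages):
        row = []
        for step in range(total_steps):
            mb = step - stage  # stage s sees micro-batch k at step s + k
            row.append(mb if 0 <= mb < num_microbatches else None)
        timeline.append(row)
    return timeline

def idle_fraction(timeline):
    """Fraction of (stage, time step) slots in which nothing runs."""
    slots = [slot for row in timeline for slot in row]
    return slots.count(None) / len(slots)

for m in (1, 8, 32):
    print(m, round(bubble_fraction(4, m), 3),
          round(idle_fraction(forward_timeline(4, m)), 3))
```

With 4 stages and a single micro-batch, three quarters of the slots are idle; raising the micro-batch count to 32 brings the idle share below a tenth in this simplified model.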
3D parallelism
Combine all three: data parallel across nodes, tensor (model) parallel within a node, pipeline parallel across nodes. This is the setup used to train GPT-3-class models. Frameworks: Megatron-LM, DeepSpeed.
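One way to picture the combination is as a 3D grid of ranks. The sketch below maps a global rank to (data, pipeline, tensor) coordinates, with tensor-parallel ranks varying fastest so that consecutive ranks, which usually share a node, form a tensor-parallel group. The ordering and the function name are assumptions made for illustration; real frameworks may lay out their groups differently.

```python
def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a global rank to (dp, pp, tp) indices.

    Hypothetical layout: tensor-parallel index varies fastest, then the
    pipeline stage, then the data-parallel replica.
    """
    world_size = tp_size * pp_size * dp_size
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp

# Example: 8-way tensor, 4-way pipeline, 2-way data parallel = 64 GPUs.
tp_size, pp_size, dp_size = 8, 4, 2
world_size = tp_size * pp_size * dp_size
for rank in (0, 7, 8, 32, 63):
    print(rank, rank_to_coords(rank, tp_size, pp_size, dp_size))
```

The total GPU count is the product of the three degrees, so raising any one of them multiplies the hardware requirement.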