Training optimizes for accuracy. Serving optimizes for accuracy *under* latency, throughput, and cost constraints. Different problem.
Batch vs online
- Batch scoring: precompute predictions for many entities ahead of time (every hour, every night). Latency at prediction time = single lookup. Use when inputs are predictable and don't change often (per-user daily recommendations, churn scores).
- Online (request-time) scoring: compute predictions on demand. Required when inputs include real-time signals (current session, current cart, dynamic context).
Hybrid: precompute embeddings for users and items offline; combine them at request time with a cheap operation such as a dot product. Many large-scale recommendation systems are built this way.
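A minimal sketch of the hybrid pattern, assuming embeddings have already been written to an in-memory lookup by an offline job. The tables `USER_EMB` and `ITEM_EMB`, the function `recommend`, and all vectors are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical output of an offline batch job: precomputed embeddings.
USER_EMB: Dict[str, List[float]] = {
    "u1": [1.0, 0.0, 2.0],
}
ITEM_EMB: Dict[str, List[float]] = {
    "a": [0.5, 1.0, 0.0],
    "b": [1.0, 0.0, 1.0],
    "c": [0.0, 3.0, 0.5],
}


def dot(x: List[float], y: List[float]) -> float:
    """Cheap request-time operation: inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def recommend(user_id: str, candidate_ids: List[str], k: int = 2) -> List[Tuple[str, float]]:
    """Score candidates online using only lookups and dot products."""
    user_vec = USER_EMB[user_id]  # lookup, no model call
    scored = [(item_id, dot(user_vec, ITEM_EMB[item_id])) for item_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


top = recommend("u1", ["a", "b", "c"])
```

The expensive model runs only offline; the request path is a lookup plus arithmetic, which is why this pattern fits tight latency budgets. The candidate list can still depend on real-time context.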
Latency budget
For a 100ms total budget:
- 10ms: network round trip
- 30ms: feature fetch + preprocessing
- 50ms: model inference
- 10ms: post-processing + serialization
These stages sum to exactly 100ms, so there is no slack. A model that takes 200ms on its own is a non-starter for many real-time use cases.
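The arithmetic can be kept honest with a small check. This is only a sketch: the stage names and the helper functions `headroom_ms` and `fits` are made up for illustration, and the numbers are the ones from the breakdown above.

```python
from typing import Dict

BUDGET_MS = 100

# Per-stage allocations taken from the breakdown above.
stages_ms: Dict[str, int] = {
    "network_round_trip": 10,
    "feature_fetch_and_preprocessing": 30,
    "model_inference": 50,
    "postprocessing_and_serialization": 10,
}


def headroom_ms(stages: Dict[str, int], budget_ms: int) -> int:
    """Budget left over after all stages; negative means over budget."""
    return budget_ms - sum(stages.values())


def fits(stages: Dict[str, int], budget_ms: int) -> bool:
    return headroom_ms(stages, budget_ms) >= 0


# Same pipeline, but with a model that needs 200ms on its own.
slow_stages = dict(stages_ms, model_inference=200)

print(headroom_ms(stages_ms, BUDGET_MS))    # 0: fits, with no slack
print(headroom_ms(slow_stages, BUDGET_MS))  # -150: over budget
```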
Dynamic batching
Group concurrent requests into a single forward pass. On GPUs this often doubles or triples throughput, because at small batch sizes the bottleneck is per-call overhead (kernel launches, memory transfers), not per-example compute. The trade is a small latency increase (each request waits up to ~10ms for batchmates) for a large throughput gain. Standard in inference servers (Triton, TorchServe, vLLM).
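The core loop can be sketched with `asyncio`. This is an illustrative toy, not how any particular inference server implements it: `DynamicBatcher`, `model_forward`, `MAX_BATCH_SIZE`, and `MAX_WAIT_S` are hypothetical names, and the "model" just doubles its inputs.

```python
import asyncio
from typing import List

MAX_BATCH_SIZE = 8   # assumed cap on requests per forward pass
MAX_WAIT_S = 0.010   # wait up to ~10ms for batchmates


def model_forward(batch: List[float]) -> List[float]:
    """Stand-in for a real batched model call."""
    return [2.0 * x for x in batch]


class DynamicBatcher:
    def __init__(self, max_batch_size: int = MAX_BATCH_SIZE, max_wait_s: float = MAX_WAIT_S):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection only

    async def predict(self, x: float) -> float:
        """Called once per request; resolves when its batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            items = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Collect batchmates until the batch is full or the wait expires.
            while len(items) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = model_forward([x for x, _ in items])
            self.batch_sizes.append(len(items))
            for (_, fut), y in zip(items, outputs):
                if not fut.done():
                    fut.set_result(y)


async def demo():
    batcher = DynamicBatcher()
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(float(i)) for i in range(5)))
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return results, batcher.batch_sizes
```

Running `asyncio.run(demo())` sends five concurrent requests through the batcher. On an idle machine they would typically be served by a single forward pass, but the exact grouping depends on timing. The two knobs mirror the trade-off in the text: a larger `MAX_WAIT_S` raises per-request latency but gives batches more time to fill.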
Quantization and distillation
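- Quantization: cast weights and/or activations to int8 or int4. Typical gains are 2–4x in throughput and 2–4x in memory, depending on the starting precision and on hardware support for low-precision kernels. Calibrate on a small representative dataset to limit the accuracy drop.
- Distillation: train a small "student" model to mimic a large "teacher" model. A good student can often retain around 90% of the teacher's accuracy at roughly 1/10th the size, though the ratio is task-dependent. Common for putting LLMs behind real-time products.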
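To make the quantization step concrete, here is a sketch of symmetric per-tensor int8 quantization in plain Python. It is one common scheme, chosen here as an assumption; real toolchains offer several (per-channel scales, asymmetric zero points). The function names and the sample values are hypothetical.

```python
from typing import List


def calibrate_scale(calibration_values: List[float]) -> float:
    """Pick a scale so the largest observed magnitude maps to 127.

    For weights, the values are the weights themselves; for activations,
    they would be collected by running a small calibration dataset.
    """
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Map floats to integers in [-127, 127], clamping outliers."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the int8 representation."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.01, 1.0]
scale = calibrate_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

For values inside the calibrated range, the rounding error per element is at most half the scale. Values outside the range are clamped, which is why the calibration data should be representative of what the model sees in production.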
Failure modes
- Cold start: the first requests after model load are slow because weights aren't paged in yet and caches are cold. Warm up at deploy time with synthetic requests before taking traffic.
- Memory leaks: long-running serving processes accumulate small allocations. Restart on a schedule.
- Tail latency: 99th percentile is often 5–10x the median. Set timeouts and degrade gracefully.
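- Version skew between feature pipeline and model: a model served with features from a different pipeline version than it was trained on can degrade silently. Version them together and deploy them atomically.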
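Two of these mitigations, warm-up and timeouts with graceful degradation, can be sketched with the standard library. All names here (`warm_up`, `predict_with_fallback`, `fast_model`, `slow_model`) are hypothetical, and the fallback value stands in for whatever cheap default the product can tolerate, such as a cached or popularity-based result.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_POOL = ThreadPoolExecutor(max_workers=4)


def warm_up(model_fn, synthetic_requests) -> int:
    """Run synthetic requests at deploy time so the first real one is not cold."""
    count = 0
    for request in synthetic_requests:
        model_fn(request)
        count += 1
    return count


def predict_with_fallback(model_fn, features, timeout_s: float, fallback):
    """Return (prediction, source); degrade to `fallback` if the model is too slow."""
    future = _POOL.submit(model_fn, features)
    try:
        return future.result(timeout=timeout_s), "model"
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return fallback, "fallback"


def fast_model(x: float) -> float:
    return 2.0 * x


def slow_model(x: float) -> float:
    time.sleep(0.2)  # simulates a tail-latency outlier
    return 2.0 * x


warmed = warm_up(fast_model, [0.0, 1.0, 2.0])
ok = predict_with_fallback(fast_model, 3.0, timeout_s=1.0, fallback=0.0)
degraded = predict_with_fallback(slow_model, 3.0, timeout_s=0.05, fallback=0.0)
```

Note that the timed-out call keeps occupying a worker thread until it finishes, so a timeout alone does not protect capacity; in practice it is usually paired with load shedding or a bounded queue.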