Model Serving — Section 8: Production ML

Training optimizes for accuracy. Serving optimizes for accuracy *under* latency, throughput, and cost constraints. Different problem.

Batch vs online

Batch scoring: precompute predictions for many entities ahead of time (every hour, every night). Latency at prediction time is just a single lookup. Use when inputs are predictable and change infrequently (per-user daily recommendations, churn scores).
Online (request-time) scoring: compute predictions on demand. Required when inputs include real-time signals (current session, current cart, dynamic context).

Hybrid: precompute embeddings for users and items offline; combine them at request time with cheap operations (e.g., a dot product). Many large-scale recommendation systems are built this way.
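
A minimal sketch of the hybrid pattern, assuming the embeddings were already produced by an offline job. The tables USER_EMBEDDINGS and ITEM_EMBEDDINGS and the function recommend are hypothetical names for illustration; a real system would likely read from a feature store or a vector index instead of in-memory dicts.

```python
from typing import Dict, List, Tuple

# Hypothetical embeddings, standing in for the output of an offline batch job.
USER_EMBEDDINGS: Dict[str, List[float]] = {
    "user_1": [0.9, 0.1, 0.0],
    "user_2": [0.0, 0.2, 0.8],
}
ITEM_EMBEDDINGS: Dict[str, List[float]] = {
    "item_a": [1.0, 0.0, 0.0],
    "item_b": [0.0, 1.0, 0.0],
    "item_c": [0.0, 0.0, 1.0],
}


def dot(u: List[float], v: List[float]) -> float:
    """The cheap request-time operation: a plain dot product."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_id: str, candidate_ids: List[str], k: int = 2) -> List[Tuple[str, float]]:
    """Score candidates online using only precomputed vectors, return the top k."""
    user_vec = USER_EMBEDDINGS[user_id]  # lookup, no model call
    scored = [(item_id, dot(user_vec, ITEM_EMBEDDINGS[item_id])) for item_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The expensive model runs only offline; the request path is two lookups plus arithmetic, which is why this pattern fits tight latency budgets.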

Latency budget

For a 100ms total budget:

10ms: network round trip
30ms: feature fetch + preprocessing
50ms: model inference
10ms: post-processing + serialization

This adds up to exactly 100ms, so there is no slack: any stage that overruns pushes the whole request over budget. A model that takes 200ms on its own is a non-starter for many real-time use cases.
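
The arithmetic above can be checked mechanically. The sketch below uses the stage budgets from this section; the stage names and the helper functions remaining_budget and over_budget_stages are hypothetical, and the measured timings in the usage note are made up for illustration.

```python
from typing import Dict, List

BUDGET_MS = 100
# Per-stage budgets, taken from the breakdown above.
STAGE_BUDGET_MS: Dict[str, int] = {
    "network": 10,
    "features": 30,
    "inference": 50,
    "postprocess": 10,
}


def remaining_budget(total_ms: int, stages: Dict[str, int]) -> int:
    """Slack left after all stages; negative means the budget is blown."""
    return total_ms - sum(stages.values())


def over_budget_stages(measured_ms: Dict[str, int], stage_budgets: Dict[str, int]) -> List[str]:
    """Names of stages whose measured time exceeds their allotted budget."""
    return sorted(
        name for name, ms in measured_ms.items() if ms > stage_budgets.get(name, 0)
    )
```

With the budgets as listed, remaining_budget(BUDGET_MS, STAGE_BUDGET_MS) is 0. Swapping in a 200ms model, remaining_budget(BUDGET_MS, dict(STAGE_BUDGET_MS, inference=200)) is -150, which makes the "non-starter" point concrete.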

Dynamic batching

Group concurrent requests into a single forward pass. This often doubles or triples throughput on GPUs, because at small batch sizes the bottleneck is per-call overhead, not per-example compute. It trades a small latency increase (waiting up to ~10ms for batchmates) for a large throughput gain. Standard in inference servers (Triton, TorchServe, vLLM).
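
A toy illustration of the idea using only asyncio, not how any of the servers named above implement it. The class DynamicBatcher, the stand-in model_forward, and the constants MAX_BATCH_SIZE and MAX_WAIT_S are all assumptions made for this sketch.

```python
import asyncio
from typing import List

MAX_BATCH_SIZE = 8   # assumed cap on requests per forward pass
MAX_WAIT_S = 0.010   # wait up to ~10ms for batchmates


def model_forward(batch: List[float]) -> List[float]:
    """Hypothetical stand-in for one batched forward pass."""
    return [2.0 * x for x in batch]


class DynamicBatcher:
    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: List[int] = []  # recorded for inspection only

    async def predict(self, x: float) -> float:
        """Called per request: enqueue the input and wait for its result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((x, fut))
        return await fut

    async def run(self) -> None:
        """Worker loop: collect requests until the batch is full or time is up."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request
            deadline = loop.time() + MAX_WAIT_S
            while len(batch) < MAX_BATCH_SIZE:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = model_forward([x for x, _ in batch])  # one call for all
            self.batch_sizes.append(len(batch))
            for (_, fut), y in zip(batch, outputs):
                if not fut.done():
                    fut.set_result(y)


async def demo():
    batcher = DynamicBatcher()
    worker = asyncio.create_task(batcher.run())
    # Five concurrent requests share forward passes instead of making five calls.
    results = await asyncio.gather(*(batcher.predict(float(i)) for i in range(5)))
    worker.cancel()
    return results, batcher.batch_sizes
```

Running asyncio.run(demo()) returns the five predictions along with the batch sizes the worker actually formed. The two knobs mirror the trade-off described above: a larger MAX_WAIT_S fills batches better at the cost of added latency per request.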

Quantization and distillation

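Quantization: cast weights and/or activations to int8 or int4. Roughly 2–4x throughput and 2–4x less memory against a 16-bit floating-point baseline (int8 halves the weight footprint, int4 quarters it); actual speedups depend on hardware support for low-precision math. Calibrate on a small representative dataset to minimize the accuracy drop.
Distillation: train a small "student" model to mimic a large "teacher" model. As a rule of thumb, the student often retains around 90% of the teacher's accuracy at about 1/10th the size, though the ratio varies by task. Common for putting LLMs behind real-time products.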
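
To make the quantization step concrete, here is a minimal sketch of symmetric per-tensor int8 quantization of a weight vector, in pure Python. It is one simple scheme among several, and the function names quantize_int8 and dequantize are hypothetical. The scale here is derived from the weights themselves; for activations, the range would instead come from the calibration dataset mentioned above.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with q an integer in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [scale * v for v in q]


def max_abs_error(weights: List[float]) -> float:
    """Worst-case round-trip error, useful as a quick sanity check."""
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte plus a single shared scale, and the round-trip error per weight is bounded by half the scale. A wider weight range means a larger scale and therefore a coarser grid, which is why outliers hurt quantized accuracy.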

Failure modes

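Cold start: the first request after a model load is slow because the weights are not yet paged into memory. Warm up at deploy time by sending synthetic requests before taking real traffic.
Memory leaks: long-running serving processes accumulate small allocations that are never freed. Restart on a schedule.
Tail latency: the 99th percentile is often 5–10x the median. Set timeouts and degrade gracefully (e.g., fall back to a cached or default prediction).
Version skew between feature pipeline and model: if the two are deployed separately, the model can receive features computed differently from the ones it was trained on. Deploy them atomically.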
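
For the tail-latency item, "set timeouts and degrade gracefully" can be sketched with asyncio.wait_for. The model call slow_model, the timeout TIMEOUT_S, and the default FALLBACK_SCORE are hypothetical; in practice the fallback might be a cached batch score or a popularity prior.

```python
import asyncio
from typing import Tuple

TIMEOUT_S = 0.05       # hypothetical per-request inference timeout
FALLBACK_SCORE = 0.0   # hypothetical default returned when the model is too slow


async def slow_model(x: float, delay_s: float) -> float:
    """Stand-in for model inference with a controllable latency."""
    await asyncio.sleep(delay_s)
    return 2.0 * x


async def predict_with_fallback(x: float, delay_s: float) -> Tuple[float, str]:
    """Return the model score if it arrives in time, else a degraded answer."""
    try:
        score = await asyncio.wait_for(slow_model(x, delay_s), timeout=TIMEOUT_S)
        return score, "model"
    except asyncio.TimeoutError:
        # The caller still gets a response within the budget.
        return FALLBACK_SCORE, "fallback"
```

Returning the source label alongside the score makes it possible to track how often requests are served degraded, which is itself a useful signal for tail latency.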