A/B Testing ML Models — Section 7: Evaluation and Model Selection

Offline metrics (AUC, RMSE, etc.) don't directly answer "will this model make the business better." For that you need an online experiment.

Unit of randomization

Decide what gets randomized. This is almost always users (or some other persistent identifier), not requests. Randomizing per request exposes the same user to both treatment and control, so their experience is a mix of the two, user-level outcomes such as retention or revenue can no longer be attributed to either arm, and the test is uninterpretable.
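
One common way to get stable per-user assignment is to hash the identifier together with an experiment name. The sketch below is illustrative only: `assign_arm`, the experiment label, and the 50/50 split are assumptions, not part of any particular experimentation framework.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control' (hypothetical helper)."""
    # Salt with the experiment name so separate experiments get independent splits.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same arm, on every request.
arm_first_request = assign_arm("user-123", "ranker-v2")
arm_later_request = assign_arm("user-123", "ranker-v2")
```

Because the arm is a pure function of the identifier, no assignment table is needed, and passing a cluster identifier (for example a city) instead of a user ID gives cluster-level assignment with the same code.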

Cluster correctly. Geographic experiments cluster by city. Marketplace experiments may need both sides (buyer and seller) handled carefully to avoid interference.

Power

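How long do you need to run? Start from the required sample size. For a two-arm comparison of means, users per arm is roughly 2 × variance × (sum of $z$-scores)² / (absolute effect size)², that is, $n \approx 2 (z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2$. If you expect a 1% relative lift on a metric whose standard deviation is 30% of its baseline mean, that comes to about 14,000 users per arm at $\alpha = 0.05$ and 80% power, so tens of thousands of users overall. Divide by the number of eligible users per day to get the run time. Compute this *before* launching, not after.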
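
The rule of thumb above can be turned into a quick calculation. This is a minimal sketch assuming a two-sided test on a difference in means with equal-sized arms, $\alpha = 0.05$, and 80% power; the function names and the `daily_eligible_users` figure are hypothetical. The "30% baseline standard deviation" is read here as a standard deviation equal to 30% of the metric's mean, so `sigma` and `delta` are both expressed relative to that mean.

```python
import math
from statistics import NormalDist


def users_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm to detect an absolute difference `delta`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2)


def days_to_run(n_per_arm: int, daily_eligible_users: int) -> int:
    """Days needed to fill both arms, assuming each user is counted once."""
    return math.ceil(2 * n_per_arm / daily_eligible_users)


# 1% relative lift, standard deviation equal to 30% of the mean.
n = users_per_arm(sigma=0.30, delta=0.01)
days = days_to_run(n, daily_eligible_users=5000)  # hypothetical traffic level
```

Halving the detectable effect quadruples the sample size, which is why small expected lifts on noisy metrics can make a test impractically long.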

What to measure

Primary metric: the business outcome — revenue, conversion, retention. Often noisy.
Guardrails: latency, errors, complaints. The new model must not degrade these.
Counter-metrics: things you fear the model is gaming (engagement at the cost of quality, clicks at the cost of revenue).

Pre-register all of these. Adding metrics after seeing results is multiple-comparison hell: every extra metric you check is another chance of a false positive.
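
To put a number on the multiple-comparison problem, the sketch below computes the chance of at least one false positive across several metrics. It assumes the metrics are independent, which real metrics rarely are, so treat the result as a rough upper-end illustration. The Bonferroni threshold is shown as one simple, conservative correction; it is an addition here, not something the text above prescribes.

```python
def familywise_error_rate(n_metrics: int, alpha: float = 0.05) -> float:
    """P(at least one false positive) across independent metrics with no true effect."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_threshold(n_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_metrics


fwer_one = familywise_error_rate(1)     # a single pre-registered metric
fwer_ten = familywise_error_rate(10)    # ten metrics checked after the fact
per_metric = bonferroni_threshold(10)   # threshold each of the ten would need to meet
```

With ten independent metrics at a 5% threshold, the chance that at least one looks significant by luck alone is about 40%.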

Common failure modes

Novelty effects: people interact more with anything new. Run for at least 2 weeks and check whether the effect persists.
Selection bias: the experimentation framework wasn't truly random. Always check that pre-experiment metrics match between arms, within noise (A/A test).
Sequential testing without correction: peeking at results and stopping when significance hits inflates the false positive rate. Either commit to a fixed sample size or use sequential tests (mSPRT, alpha spending).
Network effects / interference: in marketplaces and social networks, treatment users affect control users. Use cluster randomization or switchback designs.
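
The peeking problem in the list above is easy to demonstrate by simulation. The sketch below runs repeated A/A tests (no true difference between arms) and compares two decision rules: test once at the planned sample size, or stop at the first of several interim looks that crosses the usual threshold. All parameters are illustrative assumptions, and the known-variance z-test is a simplification.

```python
import math
import random
from statistics import NormalDist


def simulate_peeking(n_sims=500, n_per_arm=1000, n_looks=10, alpha=0.05, seed=0):
    """Return (fixed_rate, peeking_rate) of false positives when there is no true effect.

    Assumes n_per_arm is a multiple of n_looks so the last look is the final sample.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    look_every = n_per_arm // n_looks
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_sims):
        sum_a = 0.0
        sum_b = 0.0
        significant_at_any_look = False
        significant_at_end = False
        for i in range(1, n_per_arm + 1):
            # Both arms draw from the same distribution, so any "win" is a false positive.
            sum_a += rng.gauss(0.0, 1.0)
            sum_b += rng.gauss(0.0, 1.0)
            if i % look_every == 0:
                # z-statistic for a difference in means with known unit variance.
                z = (sum_a - sum_b) / math.sqrt(2 * i)
                significant_now = abs(z) > z_crit
                significant_at_any_look = significant_at_any_look or significant_now
                if i == n_per_arm:
                    significant_at_end = significant_now
        fixed_hits += significant_at_end
        peeking_hits += significant_at_any_look
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate_peeking()
print(f"fixed sample size: {fixed_rate:.3f}, stop at first significant look: {peeking_rate:.3f}")
```

The fixed-sample rule should land near the nominal 5%, while stopping at the first significant look should come out well above it. Sequential methods such as mSPRT or alpha spending adjust the threshold at each look to bring that rate back down.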

Beyond A/B

Interleaving (for ranking systems): mix items from both rankers in one list, see which side gets more clicks. Much higher power per user-day than separate-arm A/B.
Multi-armed bandits: useful when exploration is expensive and you want to converge faster. They are less interpretable, and their adaptive allocation biases the estimate of the average treatment effect.
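
For the interleaving item above, one widely used scheme is team-draft interleaving: the two rankers take turns "drafting" their best remaining item into a shared list, and each click is credited to the ranker that placed the clicked item. The text does not name a specific scheme, so this choice, along with the function names and toy rankings, is an assumption made for illustration.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng):
    """Merge two rankings; return (interleaved list, item -> team that placed it)."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    team_of = {}
    interleaved = []
    while len(interleaved) < length:
        # The team with fewer picks drafts next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        other = "B" if side == "A" else "A"
        # Each team places its highest-ranked item that is not already shown;
        # if the drafting team has nothing left, the other team fills the slot.
        for team in (side, other):
            item = next((x for x in rankings[team] if x not in team_of), None)
            if item is not None:
                break
        if item is None:
            break  # both rankings are exhausted
        interleaved.append(item)
        team_of[item] = team
        picks[team] += 1
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the ranker with more credited clicks wins the impression."""
    credit = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            credit[team_of[item]] += 1
    return credit


rng = random.Random(0)
shown, team_of = team_draft_interleave(["a", "b", "c", "d"], ["c", "a", "e", "f"], length=4, rng=rng)
credit = credit_clicks([shown[0]], team_of)  # pretend the user clicked the top result
```

Because every user sees both rankers in the same list, between-user variance largely cancels out, which is where the extra power per user-day comes from. The result is a preference between rankers, not an estimate of the lift in a business metric.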