A/B Testing — Section 10: Practical Considerations

A/B testing randomly assigns users to a treatment group or a control group and compares outcomes. It's how product teams measure causal effects in production. Simple in principle, full of pitfalls in practice.
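
As an illustration of the assignment step, here is a minimal sketch of deterministic, hash-based bucketing, which is one common way to get a stable random split. The function name `assign_group`, the experiment name, and the user ID format are hypothetical and chosen only for this example.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hypothetical helper: hashing (experiment, user_id) gives every user a
    stable bucket, so the same user always sees the same variant.
    """
    # Salt with the experiment name so splits in different experiments
    # are not correlated with each other.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    groups = [assign_group(f"user-{i}", "checkout-button-color") for i in range(10_000)]
    print(groups.count("treatment"), groups.count("control"))
```

Hashing rather than calling a random number generator at request time keeps assignment reproducible without storing a lookup table.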

The setup

Random assignment; equal-sized groups (usually 50/50); a pre-defined success metric; a pre-defined sample size based on power analysis. Run until the target sample is reached, then analyze once.
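
For a conversion-rate metric, the "analyze once" step could look like the sketch below: a two-sided two-proportion z-test with a pooled standard error. This is one standard choice, not the only valid analysis, and the function name and input counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Two-sided z-test for a difference in conversion rates.

    conv_c / n_c: conversions and users in control.
    conv_t / n_t: conversions and users in treatment.
    Returns (z, p_value). Assumes the pooled rate is strictly between 0 and 1.
    """
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Illustrative counts only, not real data.
    z, p = two_proportion_ztest(conv_c=1000, n_c=10_000, conv_t=1100, n_t=10_000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```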

Power analysis

Decide ahead of time how big an effect you want to detect, at what significance level (typically α = 0.05, i.e. 95% confidence) and with what power (typically 80%). Compute the required sample size from those three inputs. Underpowered tests miss real effects; overpowered tests waste user exposure.

For comparing two proportions, sample size per group is approximately:

$$n = \frac{2 \bar{p} (1 - \bar{p}) (z_{\alpha/2} + z_{\beta})^2}{(\text{MDE})^2}$$

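where MDE is the minimum detectable effect expressed as an absolute difference in rates, $\bar{p}$ is the average of the two groups' rates (approximately the baseline rate when the effect is small), and $z_{\alpha/2}$ and $z_{\beta}$ are the standard normal critical values for the chosen significance level and power (about 1.96 and 0.84 for 95% confidence and 80% power).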
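
The formula above can be evaluated directly with the standard library. The sketch below is a minimal version; the function name `sample_size_per_group` and the example inputs are assumptions for illustration, and the result inherits the approximation in the formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_bar: float, mde: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per group to compare two proportions.

    p_bar: rate plugged into the variance term (roughly the baseline rate).
    mde:   minimum detectable effect as an absolute difference (0.01 = 1 point).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # quantile matching the power
    n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / mde ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical scenario: 10% baseline, detect a 1-point absolute change.
    print(sample_size_per_group(p_bar=0.10, mde=0.01))
```

Because MDE is squared in the denominator, halving the effect you want to detect roughly quadruples the required sample.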

Common pitfalls

Peeking: checking the test repeatedly and stopping as soon as it is significant. Every extra look is another chance for a false positive, so the Type I error rate climbs well above the nominal 5%. Either pre-set the sample size, or use sequential testing methods designed for repeated looks.
Multiple comparisons: testing many metrics and claiming significance on whichever wins. Correct with Bonferroni (divide α by the number of metrics tested), or accept that you're hypothesis-generating, not confirming.
Sample ratio mismatch (SRM): the observed group sizes differ from the planned split by more than chance would explain. Suggests broken randomization or selection bias; check with a chi-squared test on the group counts before trusting any results.
Novelty effect: users react to anything new, so early results regress once the novelty fades. Run tests long enough to capture steady state.
Network effects / interference: in social products, treatment users influence control users, so the two groups are no longer independent and standard A/B assumptions break. Use cluster randomization or network analysis.
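
As an illustration of the SRM check, here is a sketch of a chi-squared goodness-of-fit test on the two group counts, using only the standard library. The function name `srm_check` and the 0.001 alert threshold are assumptions for this example; the threshold is a common convention, not a rule from these notes.

```python
from math import erfc, sqrt


def srm_check(n_control: int, n_treatment: int,
              expected_treatment_share: float = 0.5,
              threshold: float = 0.001):
    """Chi-squared test (1 degree of freedom) for sample ratio mismatch.

    Returns (chi2, p_value, mismatch_flag). The threshold is an assumed
    default; a strict value is used because SRM is checked routinely.
    """
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


if __name__ == "__main__":
    # Illustrative counts only: a planned 50/50 split that came out lopsided.
    chi2, p, mismatch = srm_check(n_control=50_000, n_treatment=51_200)
    print(f"chi2 = {chi2:.2f}, p = {p:.5f}, mismatch = {mismatch}")
```

A flagged mismatch says nothing about which variant is better; it says the experiment data should not be analyzed until the cause is found.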

What to measure

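Pre-register the primary metric and base the decision on it. Secondary metrics should move in a direction consistent with the primary; if you start picking among them after the fact, you are back to the multiple-comparisons problem. Always check that guardrails (revenue, latency, error rates) didn't break.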
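
If several secondary metrics are tested alongside the primary, the Bonferroni correction mentioned under the pitfalls can be sketched as follows. The function name, metric names, and p-values are hypothetical and exist only to show the mechanics.

```python
from typing import Dict


def bonferroni_significant(p_values: Dict[str, float], alpha: float = 0.05) -> Dict[str, bool]:
    """Flag which metrics stay significant after a Bonferroni correction.

    Each p-value is compared against alpha divided by the number of metrics,
    which keeps the chance of any false positive at or below alpha.
    """
    if not p_values:
        return {}
    per_test_alpha = alpha / len(p_values)
    return {metric: p < per_test_alpha for metric, p in p_values.items()}


if __name__ == "__main__":
    # Made-up p-values for three hypothetical secondary metrics.
    results = bonferroni_significant({
        "add_to_cart_rate": 0.004,
        "session_length": 0.03,
        "pages_per_visit": 0.20,
    })
    for metric, significant in results.items():
        print(metric, significant)
```

Bonferroni is conservative when metrics are correlated, so treat it as a simple default rather than the only option.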

When NOT to A/B test

Effect too small to detect with feasible sample size
Ethical issues with random assignment (medical, financial harm)
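Long-term effects (years): a standard test cannot run that long; consider long-running holdouts or quasi-experiments
Network effects violate the no-interference assumption (one user's outcome depends on other users' assignments): consider cluster randomization or switchback designs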