Models degrade silently. By the time someone notices accuracy fell, you've lost weeks of value. Monitoring catches degradation early — but only if you're watching the right things.
What to monitor (in order of priority)
1. Input distribution drift: are the features themselves changing? Compare current feature distributions to the training distributions. Common measures: PSI (Population Stability Index), KL divergence, or simpler quantile shifts.
2. Prediction distribution drift: are the *predictions* changing? Useful when ground-truth labels are delayed. If predictions are stable while inputs drift, the drifting features may simply carry little weight in the model; that is reassuring, but it is not proof that accuracy has held.
3. Performance: when you have labels, compare live accuracy/AUC/RMSE to the held-out baseline measured at training time. Hard for many problems where labels are delayed (churn: 30 days; loan default: 12 months).
4. Operational: latency, error rate, request volume. These have nothing to do with model accuracy but everything to do with whether your service works.
PSI: the most-used drift metric
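For continuous features, split the training data into 10 quantile bins (deciles) and reuse those bin edges for the current data. PSI compares the current proportion in each bin, $c_i$, to the training proportion, $t_i$:

$$\mathrm{PSI} = \sum_{i=1}^{10} (c_i - t_i)\,\ln\frac{c_i}{t_i}$$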
Rule of thumb: PSI < 0.1 is fine; 0.1–0.25 mild drift; > 0.25 significant. Calibrate the threshold to your problem — many "drifting" features don't actually hurt performance.
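The PSI calculation described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `psi`, the `eps` floor used for empty bins, and the toy data are all assumptions made for the example.

```python
import bisect
import math
import statistics


def psi(train, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` relative to `train`."""
    # Interior bin edges (deciles by default) come from the training data only.
    edges = statistics.quantiles(train, n=n_bins, method="inclusive")

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # bisect_right maps v to a bin index in 0..n_bins-1; values outside
            # the training range fall into the first or last bin.
            counts[bisect.bisect_right(edges, v)] += 1
        # eps is an assumed floor that avoids log(0) when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    t = proportions(train)
    c = proportions(current)
    # Standard PSI sum: (current - training) * ln(current / training) per bin.
    return sum((ci - ti) * math.log(ci / ti) for ci, ti in zip(c, t))


# Toy data: a uniform training feature and a copy shifted by half its range.
train_values = [i / 1000 for i in range(1000)]
shifted_values = [0.5 + v for v in train_values]

print(psi(train_values, train_values))    # identical data, so no drift
print(psi(train_values, shifted_values))  # far above the 0.25 rule of thumb
```

The choice of `eps` matters when bins are empty: a smaller floor inflates PSI for those bins, so it is worth fixing one value and keeping it constant across runs.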
What to alert on, what to ignore
- Alert on: performance metrics if you have them, predictions versus a historical baseline (e.g., today's predicted fraud rate is 3× the long-term average, so something changed), and high PSI on critical features.
- Don't alert on: minor PSI on every feature (alert fatigue), distribution differences that don't translate to accuracy loss.
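One way to encode the alert rules above is a small function that returns the reasons for paging, if any. This is a sketch under stated assumptions: the name `alert_reasons`, its parameters, and the default thresholds (3× ratio, PSI of 0.25) are hypothetical, and flagging a rate that drops to a third of the baseline is an added assumption rather than something the rules above require.

```python
def alert_reasons(today_rate, baseline_rate, psi_by_feature, critical_features,
                  rate_ratio=3.0, psi_threshold=0.25):
    """Return human-readable alert reasons; an empty list means no alert."""
    reasons = []

    # Prediction rate versus the historical baseline.
    if baseline_rate > 0:
        ratio = today_rate / baseline_rate
        # Assumption: a collapse to 1/rate_ratio is as suspicious as a spike.
        if ratio >= rate_ratio or ratio <= 1 / rate_ratio:
            reasons.append(f"prediction rate is {ratio:.1f}x baseline")

    # Only critical features can trigger an alert; PSI on every other
    # feature is deliberately ignored to avoid alert fatigue.
    for name in critical_features:
        value = psi_by_feature.get(name, 0.0)
        if value > psi_threshold:
            reasons.append(f"PSI on {name} is {value:.2f}")

    return reasons


# Hypothetical example: fraud rate jumped and one critical feature drifted.
reasons = alert_reasons(
    today_rate=0.07,
    baseline_rate=0.02,
    psi_by_feature={"amount": 0.31, "browser": 0.40},
    critical_features=["amount"],
)
print(reasons)
```

Here `browser` has the highest PSI but is not listed as critical, so it never produces an alert on its own.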
Acting on drift
Drift alone isn't a problem — it's a signal. Always check whether drift translates into performance loss before retraining. If labels are slow, use shadow models, holdout slices, or proxy metrics.
When you do retrain, retain the old model and A/B the new one (or shadow it) before flipping traffic.
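Shadowing can be sketched as scoring every request with both models while serving only the old one. The function `shadow_compare`, the `tolerance` parameter, and the two toy models below are hypothetical; a real deployment would log the shadow scores and compare them against labels once those arrive.

```python
def shadow_compare(requests, live_model, candidate_model, tolerance=0.05):
    """Serve live_model; score candidate_model on the same inputs without serving it."""
    served = []
    disagreements = 0
    for x in requests:
        live_score = live_model(x)
        shadow_score = candidate_model(x)  # recorded only, never returned to callers
        served.append(live_score)
        if abs(live_score - shadow_score) > tolerance:
            disagreements += 1
    rate = disagreements / len(requests) if requests else 0.0
    return served, rate


# Toy models (assumed for illustration): the candidate scores high inputs higher.
def old_model(x):
    return 0.5 * x


def new_model(x):
    return 0.5 * x + (0.2 if x > 0.8 else 0.0)


served, disagreement_rate = shadow_compare([0.1, 0.5, 0.9, 1.0], old_model, new_model)
print(served, disagreement_rate)
```

A high disagreement rate is not a verdict on either model; it only tells you where to look before flipping traffic.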