Data leakage = information from the future or from the target leaking into training features. Symptoms: test accuracy is suspiciously high; production model performs nothing like the offline evaluation. Always investigate when results look too good.
Train/test contamination
The classic. Test data influences training: feature normalization computed on combined data; rows duplicated across splits; data augmentation that crosses split boundaries; clustering or PCA fit on combined data.
Fix: do ALL preprocessing inside the cross-validation loop, treating each fold's test partition as completely unseen.
Target leakage
A feature that wouldn't be available at prediction time, OR a feature that's a function of the target.
Examples:
- Predicting whether a customer will churn next month, using their "last login date" — which won't update until they actually log in
- Predicting hospital readmission using post-discharge fields that aren't available at discharge time
- Predicting credit default using fees charged after the default occurred
Fix: think causally about the timeline. For each feature, ask "would this be observable at the time of the prediction?"
Temporal leakage
For time series, using future data to predict past data. Subtle versions:
- Computing a "trailing mean" on the full series instead of a backward-looking window
- Joining external data without time alignment (yesterday's prediction enriched with tomorrow's metadata)
Fix: always use forward-chaining splits and explicit point-in-time joins.
Identity leakage
A feature that uniquely identifies the row or the target. Patient ID predicting their condition; transaction ID predicting fraud. Models exploit these as "memorization" with no generalization.
Group leakage
Same user / patient / device appears in both train and test. The model can memorize per-entity patterns rather than learning generalizable signal.
Fix: split BY GROUP (group K-fold) so all rows for a given entity are in the same fold.
Detection
- Suspiciously high accuracy (90%+ on a hard problem)
- Single feature with huge feature importance — check its definition
- Performance gap between cross-validation and production
- Try fitting just on the top feature — if it alone achieves near-perfect accuracy, you have leakage