Data Leakage — Section 10: Practical Considerations

Data leakage = information from the future or from the target leaking into training features. Symptoms: test accuracy is suspiciously high; production model performs nothing like the offline evaluation. Always investigate when results look too good.

Train/test contamination

The classic. Test data influences training: feature normalization computed on combined data; rows duplicated across splits; data augmentation that crosses split boundaries; clustering or PCA fit on combined data.

Fix: do ALL preprocessing inside the cross-validation loop, treating each fold's test partition as completely unseen.

Target leakage

A feature that wouldn't be available at prediction time, OR a feature that's a function of the target.

Examples:

Predicting whether a customer will churn next month, using their "last login date" — which won't update until they actually log in
Predicting hospital readmission using post-discharge fields that aren't available at discharge time
Predicting credit default using fees charged after the default occurred

Fix: think causally about the timeline. For each feature, ask "would this be observable at the time of the prediction?"

Temporal leakage

For time series, using future data to predict past data. Subtle versions:

Computing a "trailing mean" on the full series instead of a backward-looking window
Joining external data without time alignment (yesterday's prediction enriched with tomorrow's metadata)

Fix: always use forward-chaining splits and explicit point-in-time joins.

Identity leakage

A feature that uniquely identifies the row or the target. Patient ID predicting their condition; transaction ID predicting fraud. Models exploit these as "memorization" with no generalization.

Group leakage

Same user / patient / device appears in both train and test. The model can memorize per-entity patterns rather than learning generalizable signal.

Fix: split BY GROUP (group K-fold) so all rows for a given entity are in the same fold.

Detection

Suspiciously high accuracy (90%+ on a hard problem)
Single feature with huge feature importance — check its definition
Performance gap between cross-validation and production
Try fitting just on the top feature — if it alone achieves near-perfect accuracy, you have leakage