Fitting a model gives you training error. What you actually care about is GENERALIZATION error — how well it does on data it hasn't seen. Honest evaluation requires never letting test data touch the model selection process.
The three-way split
- Training set (~60%): fit model parameters.
- Validation set (~20%): tune hyperparameters, pick between models.
- Test set (~20%): final unbiased evaluation. Touch it ONCE.
If you tune against the test set, you're overfitting to it, and the number you report will be optimistically biased. The test set is for the final number you'd report in a paper or to a stakeholder; every step before that uses validation.
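A minimal sketch of the 60/20/20 split described above, using only the standard library. The function name `three_way_split` and its parameters are illustrative assumptions, not part of any particular library.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test (hypothetical helper)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)      # fixed seed keeps the split reproducible
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]                   # set aside; used once at the very end
    val = rows[n_test:n_test + n_val]      # used for tuning and model selection
    train = rows[n_test + n_val:]          # used for fitting parameters
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Fixing the seed matters here: if the split changes between runs, rows that were once in the test set can drift into training or validation.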
Stratified splitting
For classification with imbalanced classes, stratify the split so each set has the same class proportions. Otherwise a rare class might end up underrepresented, or missing entirely, in the validation set, making the metric estimate noisy.
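One way to sketch stratification is to split each class separately and then combine the pieces. The helper `stratified_split` and the toy labels below are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, holdout_frac=0.2, seed=0):
    """Return (train_idx, holdout_idx), splitting each class on its own (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, holdout_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = round(len(idxs) * holdout_frac)   # same fraction taken from every class
        holdout_idx.extend(idxs[:k])
        train_idx.extend(idxs[k:])
    return train_idx, holdout_idx

labels = ["neg"] * 90 + ["pos"] * 10          # toy data: 10% rare class
train_idx, holdout_idx = stratified_split(labels)
holdout_counts = Counter(labels[i] for i in holdout_idx)
print(holdout_counts)
```

With a plain random split of the same data, the number of "pos" rows in the holdout would vary from run to run and could occasionally be zero.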
Cross-validation
When data is limited, holding out 40% is wasteful. K-fold CV: split the data into K folds; for each fold, train on the other K-1 folds and evaluate on the held-out fold; average the K scores. K = 5 or 10 is standard. Averaging over folds reduces the variance of your performance estimate compared with a single small validation split.
Use stratified K-fold for classification, so each fold keeps the overall class proportions. Use group K-fold when observations cluster (multiple readings per patient: keep all of one patient's readings in the same fold).
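The fold mechanics can be sketched in a few lines of plain Python. Both generators below are illustrative assumptions (names, signatures, and the toy patient IDs are made up), meant to show the index bookkeeping rather than replace a library implementation.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, test_idx); every index lands in a test fold exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]

def group_k_fold_indices(groups, k=5, seed=0):
    """Like K-fold, but folds are built from group ids, so no group straddles train and test."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    group_folds = [set(unique[i::k]) for i in range(k)]
    for held_out in group_folds:
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx

plain_splits = list(k_fold_indices(10, k=5))

# Toy data: two readings per patient, six patients.
patients = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4", "p5", "p5", "p6", "p6"]
group_splits = list(group_k_fold_indices(patients, k=3))
```

To get a cross-validated score, fit and evaluate the model once per `(train_idx, test_idx)` pair and average the resulting scores.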
Time series
NEVER use random splits for time series: they leak future data into training. Use a forward-chaining split instead: train on the first 6 months, test on month 7; train on the first 7 months, test on month 8; and so on. This is walk-forward validation.
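The forward-chaining scheme above can be sketched as a generator over period indices. The name `walk_forward_splits` and its parameters are assumptions for illustration; periods are 0-indexed internally and printed as 1-indexed months.

```python
def walk_forward_splits(n_periods, initial_train=6, horizon=1):
    """Yield (train_periods, test_periods) with an expanding training window."""
    train_end = initial_train
    while train_end + horizon <= n_periods:
        train = list(range(train_end))                         # everything before the cut
        test = list(range(train_end, train_end + horizon))     # the next period(s) only
        yield train, test
        train_end += horizon

splits = list(walk_forward_splits(n_periods=9))
for train, test in splits:
    print(f"train months 1-{train[-1] + 1}, test month {test[0] + 1}")
# train months 1-6, test month 7
# train months 1-7, test month 8
# train months 1-8, test month 9
```

Every training period precedes every test period in each split, which is the property a random split cannot guarantee.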
Leakage
The cardinal sin. Leakage = test information sneaking into training. Sources:
- Test data used in feature engineering (means, normalizations computed on combined data)
- Identifying fields that perfectly predict the target (customer ID with a one-row-per-customer prediction)
- Time-related leakage (using future values to predict the past)
- Duplicate or near-duplicate rows split across sets
When your test accuracy is suspiciously high, suspect leakage before celebrating.
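As an illustration of the first leakage source (normalization computed on combined data), here is a small sketch that learns standardization statistics from the training values only and then applies them to the test values. The helper names and the toy numbers are assumptions for the example.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn mean and standard deviation from the given values (hypothetical helper)."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0     # guard against a constant column
    return mu, sigma

def standardize(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy feature values
test_x = [10.0, 11.0]

mu, sigma = fit_standardizer(train_x)         # statistics come from train only
train_z = standardize(train_x, mu, sigma)
test_z = standardize(test_x, mu, sigma)       # test is transformed, never fitted on

# Leaky version, for contrast: the statistics now depend on the test rows.
leaky_mu, leaky_sigma = fit_standardizer(train_x + test_x)
print(mu, leaky_mu)
```

The same rule applies inside cross-validation: any statistic a preprocessing step learns should be refit on the training portion of each fold.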