Train/Test/Validation Splits — Section 6: Model Evaluation

Fitting a model gives you training error. What you actually care about is GENERALIZATION error — how well it does on data it hasn't seen. Honest evaluation requires never letting test data touch the model selection process.

The three-way split

Training set (~60%): fit model parameters.
Validation set (~20%): tune hyperparameters, pick between models.
Test set (~20%): final unbiased evaluation. Touch it ONCE.

If you tune to the test set, you're overfitting to it. The test set is for the final number you'd report in a paper or to a stakeholder — every step before that uses validation.
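
A minimal sketch of the 60/20/20 split described above, using only the standard library. The function name three_way_split, its parameters, and the fixed seed are illustrative assumptions, not part of any particular library.

```python
import random

def three_way_split(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows once, then cut them into train / validation / test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains (~20% by default)
    return train, val, test

train, val, test = three_way_split(range(100))
```

Tune and compare models using only train and val; set test aside until the final evaluation.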

Stratified splitting

For classification with imbalanced classes, stratify the split so each set has the same class proportions as the full dataset. Otherwise a "rare" class might end up underrepresented in, or missing entirely from, the validation or test set, which makes the metric estimate noisy.
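
One way to sketch stratification by hand: split each class separately, then combine. The helper name stratified_split and the rule of holding out at least one example per class are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, held_out_frac=0.2, seed=0):
    """Return (train_idx, held_out_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, held_out_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        # Assumption: hold out at least one example of every class.
        n_held = max(1, round(len(idx) * held_out_frac))
        held_out_idx.extend(idx[:n_held])
        train_idx.extend(idx[n_held:])
    return sorted(train_idx), sorted(held_out_idx)

labels = ["neg"] * 90 + ["pos"] * 10  # 10% positive class
train_idx, held_out_idx = stratified_split(labels)
held_out_counts = Counter(labels[i] for i in held_out_idx)
```

With these labels the held-out set keeps the 9:1 ratio of the full data, which a purely random 20% draw does not guarantee.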

Cross-validation

When data is limited, holding out 40% is wasteful. K-fold CV instead splits the data into K folds; for each fold, train on the other K-1 folds and evaluate on that fold, then average the K scores. K = 5 or 10 is standard. Averaging over folds reduces the variance of your performance estimate compared with a single held-out split.
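
The K-fold procedure can be sketched as an index generator plus an averaging loop. Here k_fold_indices and average_cv_score are illustrative names, and score_fn is a hypothetical callback that trains on one set of indices and returns a score on the other.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every index is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def average_cv_score(score_fn, n, k=5):
    """Average score_fn(train_idx, val_idx) over the k folds.

    score_fn is a hypothetical callback: fit on train_idx, score on val_idx.
    """
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

cv_splits = list(k_fold_indices(10, k=5))
```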

Use stratified K-fold for classification. Use group K-fold when observations cluster: with multiple readings per patient, keep all of one patient's readings in the same fold.
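
A rough sketch of the group constraint: assign whole groups to folds, never individual rows. The name group_k_fold, the round-robin assignment, and the patient IDs are assumptions for illustration.

```python
from collections import defaultdict

def group_k_fold(groups, k=3):
    """Yield (train_idx, val_idx) so that no group appears on both sides."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    # Assign whole groups to folds round-robin, largest first,
    # so fold sizes come out roughly balanced.
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    fold_groups = [ordered[i::k] for i in range(k)]
    for i in range(k):
        val_idx = sorted(j for g in fold_groups[i] for j in members[g])
        val_set = set(val_idx)
        train_idx = [j for j in range(len(groups)) if j not in val_set]
        yield train_idx, val_idx

# One entry per reading; several readings share a patient.
patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5"]
group_splits = list(group_k_fold(patients, k=3))
```

Because folds are built from groups, fold sizes can be uneven when group sizes differ.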

Time series

NEVER use random splits for time series: they leak future data into training. Use a forward-chaining split instead: train on the first 6 months, test on month 7; train on the first 7 months, test on month 8; and so on. This is walk-forward validation.
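
The forward-chaining scheme above, sketched with an expanding training window. The function name walk_forward_splits and its parameters min_train and horizon are illustrative assumptions; periods are 0-indexed, so index 0 is month 1.

```python
def walk_forward_splits(n_periods, min_train=6, horizon=1):
    """Yield (train_periods, test_periods) with an expanding training window."""
    for end in range(min_train, n_periods - horizon + 1):
        train = list(range(end))                # periods 0 .. end-1
        test = list(range(end, end + horizon))  # the next `horizon` periods
        yield train, test

# 9 months of data: train on months 1-6 and test on month 7, then
# train on months 1-7 and test on month 8, and so on.
wf_splits = list(walk_forward_splits(9))
```

Every test period comes strictly after every training period in its split, so nothing from the future reaches the model.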

Leakage

The cardinal sin. Leakage = test information sneaking into training. Sources:

Test data used in feature engineering (means, normalizations computed on combined data)
Identifier fields that perfectly predict the target (for example, a customer ID when there is one prediction row per customer)
Time-related leakage (using future values to predict the past)
Duplicate or near-duplicate rows split across sets

When your test accuracy is suspiciously high, suspect leakage before celebrating.
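
To illustrate the first source in the list (normalizations computed on combined data), here is a small sketch that fits the statistics on the training set alone and reuses them for the test set. The helper names fit_scaler and apply_scaler and the toy values are assumptions for illustration.

```python
from statistics import mean, pstdev

def fit_scaler(train_values):
    """Compute normalization statistics from the training set only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mu, sigma

def apply_scaler(values, mu, sigma):
    """Standardize values using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]

train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
test_x = [10.0, 12.0]

mu, sigma = fit_scaler(train_x)                # test_x plays no part here
train_scaled = apply_scaler(train_x, mu, sigma)
test_scaled = apply_scaler(test_x, mu, sigma)  # reuse the training statistics

# The leaky variant would compute the mean over train_x + test_x instead,
# letting test values shift the features the model is trained on.
leaky_mu = mean(train_x + test_x)
```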