Cross-validation estimates how a model will generalize without burning through your test set. You train on a subset, evaluate on the rest, rotate.
K-fold
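Split the data into k equal folds. Train on k-1 folds, evaluate on the held-out fold. Repeat k times so every fold is held out once. Average the metrics. Standard is k = 5 or 10. More folds put more data in each training set, so the estimate is less pessimistically biased, but they cost more compute, and the variance of the estimate does not necessarily go down.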
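A minimal sketch of the procedure above in plain Python, to make the rotation concrete. The names `k_fold_indices` and `cross_validate` and the `fit` / `score` callables are hypothetical placeholders, not part of any library; in practice you would more likely reach for scikit-learn's `KFold`.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_validate(fit, score, X, y, k=5, seed=0):
    """Average `score` over k folds. `fit` and `score` are caller-supplied (hypothetical)."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / len(scores)
```

Here `fit(X_train, y_train)` is assumed to return a fitted model and `score(model, X_val, y_val)` a single number; a fresh model is trained for every fold so nothing learned on one validation fold carries over to the next.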
Stratified K-fold
For classification with class imbalance: ensure each fold has roughly the same class proportions as the full dataset. Without stratification, a fold can end up with few or no minority-class examples, and the per-fold metrics (minority-class recall in particular) become noisy or even undefined.
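One way to get stratified folds, sketched in plain Python: shuffle the rows of each class separately and deal them round-robin into the folds. The function name `stratified_k_fold_indices` is hypothetical; scikit-learn's `StratifiedKFold` is the usual off-the-shelf choice and may assign rows differently.

```python
import random
from collections import defaultdict

def stratified_k_fold_indices(labels, k, seed=0):
    """Yield (train_idx, val_idx) pairs whose class mix mirrors the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    folds = [[] for _ in range(k)]
    offset = 0  # carry the round-robin position across classes to keep fold sizes even
    for members in by_class.values():
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)

    for f in range(k):
        val_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, val_idx
```

With 90 negatives and 10 positives and k = 5, every validation fold gets 18 negatives and 2 positives. If a class has fewer than k rows, some folds will still contain none of it, so the sketch does not remove the need to check minority counts.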
Group K-fold
When rows aren't independent — multiple visits per patient, multiple frames per video, multiple events per user — all rows from the same entity should be in the same fold. Otherwise the model sees the same entity in train and validation, and the model "remembers" entity-specific quirks instead of generalizing. This is the most common leakage source I see in real datasets.
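A rough sketch of group-aware splitting in plain Python, assuming each row carries an entity ID (patient, video, user). Whole groups are assigned to folds greedily, largest first, so fold sizes stay roughly balanced. The name `group_k_fold_indices` is hypothetical; scikit-learn ships a `GroupKFold` class for the same purpose.

```python
from collections import defaultdict

def group_k_fold_indices(groups, k):
    """Yield (train_idx, val_idx) pairs where no group appears on both sides."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    if len(by_group) < k:
        raise ValueError("need at least as many groups as folds")

    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest remaining group into the smallest fold.
    for members in sorted(by_group.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)

    for f in range(k):
        val_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, val_idx
```

A quick sanity check worth keeping in any pipeline: for each split, the set of entity IDs in the training rows and the set in the validation rows should have an empty intersection.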
Time-series CV
Random splits don't work when the data has temporal order. Two options:
- Forward-chaining: train on the first n months, validate on month n+1. Slide forward. Models the realistic "what would this have predicted at the time" scenario.
- Block CV: like K-fold but with contiguous time blocks. Useful when you have many years and seasonality.
The hard rule: training-set timestamps should always precede validation-set timestamps. Otherwise you're using the future to predict the past.
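The forward-chaining option can be sketched in a few lines of plain Python. This assumes each row has a sortable period label such as `"2023-04"`; the function name `forward_chaining_splits` and the `min_train_periods` parameter are hypothetical. scikit-learn's `TimeSeriesSplit` offers a similar expanding-window scheme based on row order rather than period labels.

```python
def forward_chaining_splits(periods, min_train_periods=1):
    """Yield (train_idx, val_idx): train on all earlier periods, validate on the next one.

    periods[i] is the sortable period label (e.g. "2023-04") of row i.
    """
    unique = sorted(set(periods))
    for n in range(min_train_periods, len(unique)):
        val_period = unique[n]
        # Strictly earlier rows only, so training never sees the validation period or later.
        train_idx = [i for i, p in enumerate(periods) if p < val_period]
        val_idx = [i for i, p in enumerate(periods) if p == val_period]
        yield train_idx, val_idx
```

Because the training window only ever contains periods strictly before the validation period, every split satisfies the hard rule by construction. The window here expands over time; a fixed-length sliding window would be a small variation.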
When NOT to use CV
- Genuinely huge datasets (): a single train/val/test split is fine and dramatically cheaper.
- Deep learning with single-epoch budgets: you can't afford to retrain 5x. Use a held-out validation set instead.
- Highly noisy small datasets: CV folds become unstable; consider nested CV with caution.