Cross-Validation — Section 7: Evaluation and Model Selection

Cross-validation estimates how a model will generalize without burning through your test set. You train on a subset, evaluate on the rest, rotate.

K-fold

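Split the data into $K$ folds of roughly equal size. Train on $K-1$ folds, evaluate on the held-out fold. Repeat $K$ times so every fold is held out once, then average the metrics. Standard $K$ is 5 or 10. More folds mean each model trains on a larger share of the data, so the estimate is less pessimistically biased, but you pay for $K$ fits and the variance of the estimate does not necessarily go down.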
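A minimal sketch of the split-and-average loop, in plain Python. The names `k_fold_indices` and `cross_val_mean`, and the caller-supplied `fit` and `score` functions, are made up for illustration; in practice scikit-learn's `KFold` and `cross_val_score` cover the same ground.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold CV over n rows."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # shuffle once so folds are random
    # The first n % k folds get one extra row, so sizes differ by at most 1.
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, val
        start += size


def cross_val_mean(xs, ys, k, fit, score):
    """Return (mean score, per-fold scores); fit and score are hypothetical callables."""
    scores = []
    for train, val in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / len(scores), scores
```

Keeping the per-fold scores around, not just the mean, is deliberate: the spread across folds is a rough signal of how much to trust the average.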

Stratified K-fold

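For classification with class imbalance: ensure each fold has roughly the same class proportions as the full dataset. Without stratification, a fold can end up with few or no minority-class examples, so per-class metrics such as recall become noisy or undefined on that fold, and scores swing from fold to fold for reasons that have nothing to do with the model.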
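One way to get there, sketched in plain Python: deal each class's rows round-robin across the folds. The function name `stratified_k_fold` is made up for illustration, and this is not how any particular library does it; scikit-learn's `StratifiedKFold` is the usual off-the-shelf choice.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, val_idx) with class proportions roughly preserved."""
    if k < 2:
        raise ValueError("need k >= 2")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    offset = 0
    for members in by_class.values():
        rng.shuffle(members)
        # Deal this class round-robin, continuing from where the previous
        # class stopped so that total fold sizes stay balanced too.
        for j, i in enumerate(members):
            folds[(offset + j) % k].append(i)
        offset += len(members)

    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val
```

With 90 negatives and 10 positives and $K = 5$, every validation fold gets 18 negatives and 2 positives. A class with fewer than $K$ rows will still be missing from some folds; no splitter can fix that.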

Group K-fold

When rows aren't independent (multiple visits per patient, multiple frames per video, multiple events per user), all rows from the same entity should be in the same fold. Otherwise the model sees the same entity in train and validation, "remembers" entity-specific quirks instead of generalizing, and the validation score overstates how well it will do on entities it has never seen. This is the most common leakage source I see in real datasets.
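A sketch of the grouping constraint in plain Python. The name `group_k_fold` and the greedy balancing heuristic (largest group first, into the currently smallest fold) are my own choices for illustration; scikit-learn's `GroupKFold` is the library equivalent.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) so that no group appears on both sides.

    groups[i] is the entity id (patient, video, user, ...) of row i.
    """
    rows = defaultdict(list)
    for i, g in enumerate(groups):
        rows[g].append(i)
    if not 2 <= k <= len(rows):
        raise ValueError("need 2 <= k <= number of distinct groups")

    # Greedy balancing: place the largest groups first, each into
    # whichever fold currently holds the fewest rows.
    folds = [[] for _ in range(k)]
    for g in sorted(rows, key=lambda g: len(rows[g]), reverse=True):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(rows[g])

    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for h in range(k) if h != f for i in folds[h])
        yield train, val
```

A cheap sanity check worth keeping in any pipeline: after splitting, assert that the set of group ids in train and the set in validation do not intersect.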

Time-series CV

Random splits don't work when the data has temporal order. Two options:

Forward-chaining: train on the first $k$ months, validate on month $k+1$. Slide forward. Models the realistic "what would this have predicted at the time" scenario.
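Block CV: like K-fold but with contiguous time blocks as the folds. Useful when you have many years and seasonality. Some training blocks necessarily come after the validation block, so treat it as a check on stability across periods, not as an estimate of forecasting performance.

The hard rule for forecasting: every training-set timestamp should precede every validation-set timestamp. Otherwise you're using the future to predict the past.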
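Forward-chaining is easy to sketch over period indices (months numbered from 0). The function name and the `min_train` and `horizon` parameters are assumptions for illustration; scikit-learn's `TimeSeriesSplit` offers a row-based version of the same idea.

```python
def forward_chaining_splits(n_periods, min_train=1, horizon=1):
    """Yield (train_periods, val_periods) over periods 0 .. n_periods - 1.

    Training always uses every period before the validation window, so
    all training timestamps precede all validation timestamps.
    """
    if min_train < 1 or horizon < 1:
        raise ValueError("min_train and horizon must be >= 1")
    for end in range(min_train, n_periods - horizon + 1):
        train = list(range(0, end))
        val = list(range(end, end + horizon))
        yield train, val


def rows_in(period_of_row, wanted_periods):
    """Map a set of periods back to row indices (period_of_row[i] is row i's period)."""
    wanted = set(wanted_periods)
    return [i for i, p in enumerate(period_of_row) if p in wanted]


# list(forward_chaining_splits(5, min_train=2)) gives
# [([0, 1], [2]), ([0, 1, 2], [3]), ([0, 1, 2, 3], [4])]
```

The training window here grows with each step. Capping it to the most recent few periods (a sliding window) is a reasonable variant when old data stops being representative.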

When NOT to use CV

Genuinely huge datasets ($n > 10^6$): a single train/val/test split is fine and dramatically cheaper.
Deep learning with single-epoch budgets: you can't afford to retrain 5x. Use a held-out validation set instead.
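Highly noisy small datasets: scores vary a lot from fold to fold, so a single K-fold run is unstable. Repeat it with different shuffles and report the spread; consider nested CV, with caution, if you are also tuning hyperparameters.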
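For the first two cases, the single split can be sketched as below. The function name and the 10%/10% default fractions are assumptions for illustration, not a recommendation, and the random shuffle only makes sense when rows are independent and unordered.

```python
import random


def train_val_test_split(n, val_frac=0.1, test_frac=0.1, seed=0):
    """Return disjoint (train_idx, val_idx, test_idx) covering rows 0 .. n - 1."""
    if val_frac <= 0 or test_frac <= 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and sum to less than 1")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test
```

Tune on the validation rows and touch the test rows once, at the end. With grouped or time-ordered data, split by entity or by date instead of shuffling rows.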