Empirical Risk Minimization (ERM) — pick the that minimizes training loss — is the workhorse of machine learning. But minimizing training loss isn't the same as minimizing the true expected loss on the distribution.
The generalization gap
Let be the true risk and the empirical risk. Two questions:
1. How close is to for any fixed ? Easy — the law of large numbers gives us concentration. 2. How close is to when was *chosen using the training data*? Much harder — depends on the same data we're using to evaluate.
The gap grows with the expressiveness of . VC dimension and Rademacher complexity formalize this. Roughly:
Why this matters in practice
You don't ever use VC bounds to pick a model — they're vacuous for modern networks (often >1). But the intuition is correct: bigger hypothesis class + same data = bigger generalization gap. Two practical implications:
1. Train/test contamination is catastrophic. Any leakage of test info into training breaks the iid assumption your generalization argument rested on. 2. Model selection costs degrees of freedom. If you try 100 hyperparameter combinations on validation data and pick the best, your validation score is biased upward. Use a separate held-out test set for final evaluation.
Structural risk minimization
Add a penalty that prefers simpler functions:
Ridge (), lasso (), and weight decay are all SRM in disguise. Pick on a validation set — but only once, or you're back to (2) above.