Generalization and ERM — Section 1: Supervised Learning Foundations

Empirical Risk Minimization (ERM), which picks the $f \in \mathcal{H}$ that minimizes training loss, is the workhorse of machine learning. But minimizing training loss isn't the same as minimizing the expected loss under the data distribution.

The generalization gap

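Let $R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(f(x), y)]$ be the true risk and $\hat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i)$ the empirical risk on $n$ iid training examples. Two questions: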

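1. How close is $\hat{R}(f)$ to $R(f)$ for any fixed $f$? Easy: for a bounded loss, Hoeffding's inequality (or the central limit theorem) gives $O(1/\sqrt{n})$ concentration. The law of large numbers alone guarantees convergence but not a rate.
2. How close is $\hat{R}(\hat{f})$ to $R(\hat{f})$ when $\hat{f}$ was *chosen using the training data*? Much harder: $\hat{f}$ depends on the same data we're using to evaluate it, so the fixed-$f$ argument no longer applies.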
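The difference between the two questions can be seen in a small simulation. This is only a sketch under artificial assumptions: labels are fair coin flips, so every classifier has true 0-1 risk exactly 0.5, and the "hypothesis class" is a set of random prediction vectors. The function name `gap_fixed_vs_selected` and its parameters are made up for illustration.

```python
import numpy as np

def gap_fixed_vs_selected(n=200, num_hypotheses=500, seed=0):
    """Compare R(f) - R_hat(f) for a fixed f and for the ERM pick.

    Assumption: labels are fair coin flips, so every classifier has true
    0-1 risk exactly 0.5 no matter what it predicts.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                        # training labels
    # Each row holds one hypothesis's predictions on the n training points.
    preds = rng.integers(0, 2, size=(num_hypotheses, n))
    emp_risk = (preds != y).mean(axis=1)                  # empirical 0-1 risk per hypothesis
    true_risk = 0.5
    gap_fixed = true_risk - emp_risk[0]                   # hypothesis 0, chosen before seeing data
    gap_selected = true_risk - emp_risk.min()             # ERM pick, chosen using the data
    return gap_fixed, gap_selected

gap_fixed, gap_selected = gap_fixed_vs_selected()
print(f"fixed f gap: {gap_fixed:+.3f}   ERM-selected gap: {gap_selected:+.3f}")
```

The fixed hypothesis's gap fluctuates around zero at the $1/\sqrt{n}$ scale, while the ERM pick looks better on the training data than its true risk of 0.5, purely because it was selected on that data.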

The gap grows with the expressiveness of $\mathcal{H}$. VC dimension and Rademacher complexity formalize this. Roughly, with high probability over the draw of the training set:

$$
R(\hat{f}) - \hat{R}(\hat{f}) \leq O\left(\sqrt{\frac{\text{complexity}(\mathcal{H})}{n}}\right)
$$
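For one concrete instance of the complexity term, consider the standard special case of a finite class $\mathcal{H}$ and a loss bounded in $[0, 1]$. This is not the general VC or Rademacher statement, and the symbols $\epsilon$ and $\delta$ (deviation and failure probability) are introduced here for the sketch. Hoeffding's inequality handles a fixed $f$:

$$
\Pr\left[\,|\hat{R}(f) - R(f)| \geq \epsilon\,\right] \leq 2e^{-2n\epsilon^2}
$$

A union bound over all $f \in \mathcal{H}$ then covers whichever $\hat{f}$ the data selects:

$$
\Pr\left[\,\exists f \in \mathcal{H} : |\hat{R}(f) - R(f)| \geq \epsilon\,\right] \leq 2|\mathcal{H}|e^{-2n\epsilon^2}
$$

Setting the right-hand side to $\delta$ and solving for $\epsilon$ gives, with probability at least $1 - \delta$,

$$
R(\hat{f}) - \hat{R}(\hat{f}) \leq \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}}
$$

so in this case $\ln|\mathcal{H}|$ plays the role of the complexity.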

Why this matters in practice

You don't ever use VC bounds to pick a model: they're vacuous for modern networks (the bound on the 0-1 risk gap often exceeds 1, which is trivially true). But the intuition is still useful: with the same data, a bigger hypothesis class allows a bigger generalization gap. Two practical implications:

1. Train/test contamination is catastrophic. Any leakage of test information into training makes the test data dependent on the fitted model, which breaks the iid assumption your generalization argument rested on.
2. Model selection costs degrees of freedom. If you try 100 hyperparameter combinations on validation data and pick the best, the winner's validation score is biased upward. Use a separate held-out test set for final evaluation.
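The selection bias in the second point can be sketched with a simulation. The setup is an assumption chosen to isolate the effect: all 100 configurations have the same true accuracy (0.7), so any difference in validation score is pure noise. The function name `selection_bias` and its parameters are hypothetical.

```python
import numpy as np

def selection_bias(num_configs=100, n_val=500, n_test=500,
                   true_acc=0.7, trials=200, seed=0):
    """Mean validation and test accuracy of the config picked on validation.

    Assumption: every config has the same true accuracy `true_acc`, so the
    validation winner is no better than any other config.
    """
    rng = np.random.default_rng(seed)
    # Accuracy measured on a finite set is Binomial(n, true_acc) / n, drawn
    # independently per trial and per config for validation and test.
    val_acc = rng.binomial(n_val, true_acc, size=(trials, num_configs)) / n_val
    test_acc = rng.binomial(n_test, true_acc, size=(trials, num_configs)) / n_test
    best = val_acc.argmax(axis=1)          # winner by validation score in each trial
    rows = np.arange(trials)
    return val_acc[rows, best].mean(), test_acc[rows, best].mean()

val_mean, test_mean = selection_bias()
print(f"winner's validation accuracy: {val_mean:.3f}")
print(f"winner's test accuracy:       {test_mean:.3f}")
```

The winner's validation accuracy comes out above 0.7 even though no configuration is actually better, while its accuracy on the untouched test set stays close to 0.7.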

Structural risk minimization

Add a penalty $\Omega(f)$ that prefers simpler functions:

$$
\hat{f} = \arg\min_{f \in \mathcal{H}} \hat{R}(f) + \lambda \Omega(f)
$$

Ridge ($L_2$), lasso ($L_1$), and weight decay are all SRM in disguise. Pick $\lambda$ on a validation set, but do it in a single pass rather than repeatedly re-tuning against the same set, or you're back to the model-selection bias described above.
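A minimal sketch of this recipe for ridge regression, assuming synthetic linear data and a small hand-picked grid of $\lambda$ values (both are illustrative choices, not recommendations). The penalty is applied to the sum of squared errors rather than the mean, so $\lambda$ here differs from the one in the formula above by a factor of $n$.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(X, y, w):
    """Mean squared error of the linear predictor w."""
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (assumption): 40 features, only the first 5 matter.
rng = np.random.default_rng(0)
n, d = 150, 40
w_true = np.zeros(d)
w_true[:5] = 1.0
X = rng.normal(size=(n, d))
y = X @ w_true + rng.normal(scale=2.0, size=n)

# Three disjoint splits: train to fit, validation to pick lambda, test to report.
X_tr, y_tr = X[:50], y[:50]
X_val, y_val = X[50:100], y[50:100]
X_te, y_te = X[100:], y[100:]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
val_scores = [mse(X_val, y_val, ridge_fit(X_tr, y_tr, lam)) for lam in lambdas]
best_lam = lambdas[int(np.argmin(val_scores))]   # one pass over the grid

w_hat = ridge_fit(X_tr, y_tr, best_lam)
test_mse = mse(X_te, y_te, w_hat)                # test set touched only here
print(f"chosen lambda: {best_lam}, test MSE: {test_mse:.3f}")
```

The test split is used exactly once, after $\lambda$ is fixed, so the reported error is not contaminated by the selection step.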