The Learning Problem — Section 1: Supervised Learning Foundations

Supervised learning asks: given pairs $(x_i, y_i)$ drawn from some distribution $\mathcal{D}$, find a function $f$ such that $f(x) \approx y$ for new pairs drawn from the same $\mathcal{D}$.

The four pieces

1. Input space $\mathcal{X}$: feature vectors, images, text, anything.
2. Output space $\mathcal{Y}$: real values (regression), discrete labels (classification), structured outputs (sequences, sets).
3. Hypothesis class $\mathcal{H}$: the set of candidate functions we're willing to consider. Linear models, trees, and neural nets each define a different $\mathcal{H}$.
4. Loss function $L(\hat{y}, y)$: how badly we're penalized when $\hat{y} \neq y$.
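
To make the four pieces concrete, here is a minimal sketch in Python for one-dimensional regression. Everything specific in it is an assumption chosen for illustration: the names `make_linear`, `hypothesis_class`, and `squared_loss` are hypothetical, the inputs and outputs are plain floats, and the grid of slopes and intercepts is arbitrary.

```python
from typing import Callable, List

# Input space X and output space Y: here both are real numbers
# (an illustrative choice; X could equally be images or text).
X = float
Y = float
Hypothesis = Callable[[X], Y]


def make_linear(w: float, b: float) -> Hypothesis:
    """Return one member of the hypothesis class H = {x -> w*x + b}."""
    return lambda x: w * x + b


# A small, finite slice of H built from an arbitrary grid of (w, b) values.
hypothesis_class: List[Hypothesis] = [
    make_linear(w, b) for w in (0.0, 1.0, 2.0) for b in (-1.0, 0.0, 1.0)
]


def squared_loss(y_hat: Y, y: Y) -> float:
    """Loss L(y_hat, y): zero only when the prediction is exact."""
    return (y_hat - y) ** 2


f = make_linear(2.0, 1.0)
print(f(3.0))                     # 7.0
print(squared_loss(f(3.0), 6.0))  # 1.0
```

Swapping `make_linear` for a different family of functions changes $\mathcal{H}$ without touching the other three pieces, which is the sense in which each model family defines its own hypothesis class.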

What we actually optimize

We can't compute the expected risk $R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[L(f(x), y)]$ directly, because we don't have access to the true distribution $\mathcal{D}$. Instead we minimize the empirical risk over the training set of $n$ examples:

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$$
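
The formula above translates almost directly into code. The sketch below computes the empirical risk of a candidate function and then picks the minimizer by brute force over a small finite set of linear functions. The toy data, the candidate grid, and the names `empirical_risk`, `candidates`, and `best` are assumptions made for illustration; real hypothesis classes are usually searched with an optimizer rather than enumerated.

```python
def empirical_risk(f, data, loss):
    """Average loss of f over the training pairs: (1/n) * sum L(f(x_i), y_i)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def squared_loss(y_hat, y):
    return (y_hat - y) ** 2


# Toy training set generated by y = 2x + 1 with no noise (an assumption).
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

# A finite hypothesis class: linear functions on a small grid of (w, b).
# Default arguments pin w and b for each lambda.
candidates = {
    (w, b): (lambda x, w=w, b=b: w * x + b)
    for w in (0.0, 1.0, 2.0, 3.0)
    for b in (0.0, 1.0)
}

# Empirical risk minimization: evaluate every candidate, keep the best.
risks = {params: empirical_risk(f, train, squared_loss)
         for params, f in candidates.items()}
best = min(risks, key=risks.get)

print(best, risks[best])  # (2.0, 1.0) 0.0
```

Because the data here is noise-free and the generating function sits inside the candidate set, the minimizer reaches zero empirical risk. With noisy data or a misspecified class, the minimum would be strictly positive.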

The hope: if our hypothesis class isn't too expressive and our training set is large enough, the function that minimizes empirical risk will also have low expected risk on unseen data. The difference between the two, the generalization gap, is what regularization techniques try to control.
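
The gap between empirical and true risk can be seen in a small experiment. The sketch below fits two models to the same noisy sample: a straight line (a restricted hypothesis class) and a 1-nearest-neighbour memorizer (a very expressive one). The true risk is approximated by the average loss on a large held-out sample, which is itself only an estimate. The toy distribution, the sample sizes, the seed, and all function names are assumptions for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers


def sample(n):
    """Draw n pairs from a toy D: x ~ Uniform(0, 1), y = 2x + Gaussian noise."""
    return [(x, 2.0 * x + rng.gauss(0.0, 0.5))
            for x in (rng.random() for _ in range(n))]


def risk(f, data):
    """Mean squared loss of f over a sample."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def fit_linear(data):
    """Ordinary least squares for a single feature."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    w = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: w * x + (my - w * mx)


def fit_memorizer(data):
    """1-nearest-neighbour: predict the label of the closest training point."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]


train, test = sample(50), sample(1000)

train_risks, gaps = {}, {}
for name, fit in [("linear", fit_linear), ("memorizer", fit_memorizer)]:
    f = fit(train)
    r_train, r_test = risk(f, train), risk(f, test)
    train_risks[name], gaps[name] = r_train, r_test - r_train
    print(f"{name:10s} train={r_train:.3f} held-out={r_test:.3f} "
          f"gap={r_test - r_train:.3f}")
```

The memorizer reproduces every training label, so its empirical risk is zero, yet its held-out risk is noticeably worse than the line's. Low empirical risk alone says little when the hypothesis class can fit the noise.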

i.i.d. and the elephant in the room

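The whole framework assumes the examples are i.i.d. (independent and identically distributed): each pair is drawn independently, and training and test data come from the same distribution $\mathcal{D}$. In production this is almost never strictly true. Users change. Products change. Adversaries adapt. Many ML failures in deployment are distribution-shift failures.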
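
A small deterministic sketch shows what breaks when the assumption fails. A line is fitted to inputs in $[0, 1]$ where the true relationship is $y = x^2$, then evaluated on fresh inputs from the same range and on inputs shifted to $[2, 3]$. This is covariate shift, one particular kind of distribution shift: the labelling rule stays fixed and only the input distribution moves. The quadratic target, the ranges, and the function names are assumptions chosen for illustration.

```python
def target(x):
    """True labelling rule (an assumed toy relationship)."""
    return x * x


def risk(f, data):
    """Mean squared loss of f over a sample."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def fit_linear(data):
    """Ordinary least squares for a single feature."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    w = (sum((x - mx) * (y - my) for x, y in data)
         / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: w * x + (my - w * mx)


# Training inputs cover [0, 1] on an evenly spaced grid.
train = [(i / 20, target(i / 20)) for i in range(21)]

# Fresh points from the same range vs. points from a shifted range.
same_range = [((i + 0.5) / 20, target((i + 0.5) / 20)) for i in range(20)]
shifted = [(2 + (i + 0.5) / 20, target(2 + (i + 0.5) / 20)) for i in range(20)]

f = fit_linear(train)
in_dist_risk = risk(f, same_range)
shifted_risk = risk(f, shifted)

print(f"risk on [0, 1]: {in_dist_risk:.4f}")
print(f"risk on [2, 3]: {shifted_risk:.4f}")
```

On $[0, 1]$ the line is a reasonable approximation of the curve, so the risk stays small. On $[2, 3]$ the same line is far from the curve and the risk grows by orders of magnitude, even though nothing about the model or the labelling rule changed.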
