Supervised learning asks: given pairs $(x_i, y_i)$ drawn from some distribution $\mathcal{D}$, find a function $f$ such that $f(x) \approx y$ for new pairs $(x, y)$ drawn from the same $\mathcal{D}$.
The four pieces
1. Input space $\mathcal{X}$: feature vectors, images, text, anything.
2. Output space $\mathcal{Y}$: real values (regression), discrete labels (classification), structured outputs (sequences, sets).
3. Hypothesis class $\mathcal{H}$: the set of candidate functions we're willing to consider. Linear models, trees, and neural nets each define a different $\mathcal{H}$.
4. Loss function $\ell$: how badly we're penalized when $f(x) \neq y$.
What we actually optimize
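We can't compute the expected loss over the true distribution, because we never observe that distribution directly. Instead we minimize the empirical risk, the average loss $\ell$ over the $n$ training pairs $(x_i, y_i)$:

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$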
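As a minimal sketch of empirical risk minimization, the snippet below searches a deliberately tiny hypothesis class for the function with the lowest average training loss. Everything in it is an illustrative assumption rather than part of the original text: the synthetic data (y = 2x plus Gaussian noise), the squared loss, the class of lines through the origin, and the helper names `sample`, `empirical_risk`, and `make_linear`.

```python
import random


def squared_loss(prediction, target):
    """Squared error, one common choice of loss for regression."""
    return (prediction - target) ** 2


def empirical_risk(f, data, loss=squared_loss):
    """Average loss of f over the (x, y) pairs in data."""
    return sum(loss(f(x), y) for x, y in data) / len(data)


def make_linear(w):
    """One member of the hypothesis class: f(x) = w * x."""
    return lambda x: w * x


def sample(n, rng):
    """Draw n pairs from a synthetic distribution (assumed here): y = 2x + noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + rng.gauss(0.0, 0.1)
        data.append((x, y))
    return data


rng = random.Random(0)
train = sample(200, rng)

# Hypothesis class: slopes 0.0, 0.1, ..., 4.0 (a finite grid, for simplicity).
candidates = [i / 10 for i in range(41)]

# Empirical risk minimization: pick the candidate with the lowest training risk.
best_w = min(candidates, key=lambda w: empirical_risk(make_linear(w), train))

print("selected slope:", best_w)
print("empirical risk:", empirical_risk(make_linear(best_w), train))
```

Real learners replace the grid search with an optimizer such as gradient descent or a closed-form solve, but the objective has the same shape: an average of per-example losses.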
The hope: if our hypothesis class isn't too expressive and our training set is large enough, the function that minimizes empirical risk will also have low expected risk on unseen data. The difference between empirical risk and true risk, often called the generalization gap, is what regularization techniques try to control.
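The gap between training risk and true risk can be made visible on synthetic data, where we control the distribution and can approximate the true risk with a large fresh sample. The sketch below compares a small hypothesis class (lines through the origin) with a very expressive one (a 1-nearest-neighbour memorizer). The data-generating process, the noise level, and the function names are all assumptions chosen for illustration.

```python
import random


def sample(n, rng, noise=0.3):
    """Draw n i.i.d. pairs with y = 2x + Gaussian noise (a synthetic stand-in)."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        pairs.append((x, 2.0 * x + rng.gauss(0.0, noise)))
    return pairs


def risk(f, data):
    """Mean squared error of f on data."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


def fit_linear(train):
    """Least-squares slope for f(x) = w * x (a small hypothesis class)."""
    w = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: w * x


def fit_memorizer(train):
    """1-nearest-neighbour lookup: expressive enough to fit the training set exactly."""
    def f(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return f


rng = random.Random(0)
train = sample(50, rng)
fresh = sample(2000, rng)  # large fresh sample approximates the true risk

results = {}
for name, fit in [("linear", fit_linear), ("memorizer", fit_memorizer)]:
    f = fit(train)
    train_risk, fresh_risk = risk(f, train), risk(f, fresh)
    results[name] = (train_risk, fresh_risk)
    print(f"{name}: train={train_risk:.3f} fresh={fresh_risk:.3f} "
          f"gap={fresh_risk - train_risk:.3f}")
```

The memorizer reaches zero training risk because it reproduces the noise in each training label, and that is the part that does not carry over to fresh data. Regularization can be read as a way of steering the learner away from such fits.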
i.i.d. and the elephant in the room
The whole framework assumes training and test examples are drawn independently from the same distribution. In production this is almost never strictly true. Users change. Products change. Adversaries adapt. Many ML failures in production are, at root, distribution-shift failures.
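A small simulation shows how quietly this assumption can break. In the sketch below, the relationship between x and y never changes (y = x^2 plus noise, an assumed toy setup), but the inputs seen after deployment come from a different range than the training inputs. This particular kind of shift is usually called covariate shift. The function names and ranges are hypothetical.

```python
import random


def sample(n, rng, low, high):
    """y = x^2 + noise; only the input range (low, high) differs between calls."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(low, high)
        pairs.append((x, x * x + rng.gauss(0.0, 0.05)))
    return pairs


def fit_line(data):
    """Ordinary least squares for f(x) = slope * x + intercept."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
             / sum((x - mean_x) ** 2 for x, _ in data))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def risk(f, data):
    """Mean squared error of f on data."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
train = sample(200, rng, 0.0, 1.0)      # inputs seen during training
same_dist = sample(1000, rng, 0.0, 1.0) # test inputs from the same range
shifted = sample(1000, rng, 2.0, 3.0)   # inputs after the population moved

model = fit_line(train)
in_dist_risk = risk(model, same_dist)
shifted_risk = risk(model, shifted)

print(f"risk on same distribution: {in_dist_risk:.4f}")
print(f"risk after shift:          {shifted_risk:.4f}")
```

On the training range a straight line is a reasonable approximation of the curve, so held-out risk from the same range looks fine. Nothing in that number warns that the same model extrapolates badly once the inputs move.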