Bias-Variance Tradeoff — Section 1: Supervised Learning Foundations

For squared-error regression, the expected error on a new point decomposes into three pieces, of which only the last is irreducible:

$$E[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$

Bias: how far the average prediction (over different training sets) is from the true function. Typically caused by a hypothesis class too restrictive to represent the target.
Variance: how much predictions wobble across different training sets. Caused by a model flexible enough to fit the noise in its particular training sample.
Irreducible noise $\sigma^2$: the variance of the label noise, a floor that no model can go below.
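
The three terms can be estimated by simulation: draw many training sets, refit the model each time, and look at the spread of its predictions at one fixed point. The sketch below is illustrative only. The target `sin(2*pi*x)`, the noise level `sigma = 0.3`, the query point `x0 = 0.25`, and the helper names `fit_line` and `decompose` are all assumptions made for the example, not part of the notes above.

```python
import math
import random
import statistics


def true_f(x):
    # Assumed "curvy" target function for the demo.
    return math.sin(2 * math.pi * x)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def decompose(x0=0.25, n_train=30, n_trials=2000, sigma=0.3, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise and total error at x0."""
    rng = random.Random(seed)
    preds, sq_errs = [], []
    for _ in range(n_trials):
        # Fresh training set on every trial.
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        a, b = fit_line(xs, ys)
        pred = a + b * x0
        preds.append(pred)
        # Fresh noisy label at x0 to measure the actual squared error.
        y_new = true_f(x0) + rng.gauss(0, sigma)
        sq_errs.append((y_new - pred) ** 2)
    bias2 = (statistics.fmean(preds) - true_f(x0)) ** 2
    var = statistics.pvariance(preds)
    mse = statistics.fmean(sq_errs)
    return bias2, var, sigma ** 2, mse


bias2, var, noise, mse = decompose()
print(f"bias^2={bias2:.3f}  variance={var:.3f}  noise={noise:.3f}")
print(f"sum={bias2 + var + noise:.3f}  measured error={mse:.3f}")
```

Under these assumptions the straight line is too rigid for the sine target, so the bias term should dominate, and the measured error should agree with the sum of the three terms up to Monte Carlo noise.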

The classical picture

High-bias, low-variance: linear regression on a curvy target. Predictions are stable but systematically off.
Low-bias, high-variance: deep tree on small data. Fits training noise exactly; new predictions swing wildly.
The sweet spot lives in the middle: as model complexity increases, bias falls and variance rises, so total error is U-shaped.
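
The classical tradeoff can be reproduced with a single complexity knob. As a rough sketch, the code below swaps in k-nearest-neighbour averaging for the linear model and deep tree mentioned above, because `k` moves smoothly between the two extremes: `k = 1` copies the nearest noisy label (flexible, high variance), while `k = n_train` predicts the global mean (rigid, high bias). The target function, noise level, query point, and the names `knn_predict` and `sweep` are assumptions chosen for illustration.

```python
import math
import random
import statistics


def true_f(x):
    # Assumed target function for the demo.
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the labels of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda j: abs(xs[j] - x0))[:k]
    return statistics.fmean([ys[j] for j in nearest])


def sweep(ks=(1, 5, 30), x0=0.25, n_train=30, n_trials=1000, sigma=0.3, seed=0):
    """Return {k: (bias^2, variance)} estimated over many training sets."""
    rng = random.Random(seed)
    preds = {k: [] for k in ks}
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        for k in ks:
            preds[k].append(knn_predict(xs, ys, x0, k))
    results = {}
    for k in ks:
        bias2 = (statistics.fmean(preds[k]) - true_f(x0)) ** 2
        var = statistics.pvariance(preds[k])
        results[k] = (bias2, var)
    return results


results = sweep()
for k, (bias2, var) in results.items():
    print(f"k={k:2d}  bias^2={bias2:.3f}  variance={var:.3f}  sum={bias2 + var:.3f}")
```

In this setup the intermediate `k` should give the smallest sum of bias squared and variance, with the two extremes each losing on one of the terms.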

What modern deep learning broke

Massively overparameterized models break the classical picture. Networks with billions of parameters can interpolate their training data (near-zero training loss) and still generalize well. The "double descent" phenomenon shows test error first falling and then rising with complexity (the classical U-shaped regime), peaking near the interpolation threshold, then falling again as the model grows further. Implicit regularization from SGD, together with explicit weight decay and architectural choices, appears to control variance even in the overparameterized regime, but the theory is far from settled.

Practical levers

More data → lowers variance; does not reduce the bias of a fixed model class.
Stronger regularization → lowers variance, can raise bias.
More complex model → lowers bias and raises variance (classical); past the interpolation threshold both can fall (modern, depending on scale).
Ensembling → averages out variance.
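
The ensembling lever is the easiest of these to check numerically. The sketch below compares a single 1-nearest-neighbour predictor with a bagged version that averages the same predictor over bootstrap resamples of the training set. Bagging is one specific form of ensembling, picked here as an assumption. The target function, noise level, number of bags, and the names `nn_predict`, `bagged_predict`, and `compare` are likewise illustrative.

```python
import math
import random
import statistics


def true_f(x):
    # Assumed target function for the demo.
    return math.sin(2 * math.pi * x)


def nn_predict(xs, ys, x0):
    """1-nearest-neighbour prediction: return the label of the closest point."""
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]


def bagged_predict(xs, ys, x0, n_bags, rng):
    """Average 1-NN predictions over bootstrap resamples of (xs, ys)."""
    n = len(xs)
    preds = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        preds.append(nn_predict([xs[i] for i in idx], [ys[i] for i in idx], x0))
    return statistics.fmean(preds)


def compare(x0=0.25, n_train=30, n_bags=25, n_trials=500, sigma=0.3, seed=0):
    """Variance at x0 of a single model vs. a bagged ensemble."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_trials):
        xs = [rng.random() for _ in range(n_train)]
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        single.append(nn_predict(xs, ys, x0))
        bagged.append(bagged_predict(xs, ys, x0, n_bags, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)


var_single, var_bagged = compare()
print(f"variance single={var_single:.3f}  bagged={var_bagged:.3f}")
```

The bagged predictor effectively spreads its weight over several nearby labels instead of one, which is why its variance across training sets should come out lower than that of the single model.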
