OLS minimizes squared error with no penalty on coefficient magnitudes. When predictors are correlated or there are many of them, coefficients can become huge and unstable. Regularization adds a penalty term that pulls coefficients toward zero.
Ridge (L2)
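Minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$. The penalty shrinks coefficients smoothly toward zero. Closed form: $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$. Ridge doesn't eliminate predictors: every $\hat{\beta}_j$ stays nonzero, just smaller.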
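As a minimal sketch of the closed-form ridge solution, the NumPy snippet below solves the penalized normal equations directly. The helper name `ridge_closed_form`, the synthetic data, and the penalty value 10 are illustrative assumptions, and the intercept is left out for brevity.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (no intercept)."""
    p = X.shape[1]
    # Solving the linear system avoids forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data with two nearly collinear predictors (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=200)

beta_ols = ridge_closed_form(X, y, lam=0.0)     # lam = 0 reduces to OLS
beta_ridge = ridge_closed_form(X, y, lam=10.0)  # penalized fit

print("OLS coefficients:  ", beta_ols)
print("ridge coefficients:", beta_ridge)
```

Comparing the two printed vectors shows the shrinkage: the ridge coefficients have a smaller norm than the OLS ones, but neither is set to zero.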
Lasso (L1)
Minimize $\|y - X\beta\|_2^2 + \lambda \|\beta\|_1$. The L1 penalty has a corner at zero, which pushes some coefficients exactly to zero, so lasso performs variable selection automatically. There is no closed form; it is solved with coordinate descent or proximal methods.
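Coordinate descent for lasso can be sketched in a few lines. The version below assumes the objective is the residual sum of squares plus lambda times the L1 norm (so the soft-threshold level is lambda / 2), skips the intercept, and runs a fixed number of sweeps instead of checking convergence. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; return exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # x_j^T x_j for each column
    for _ in range(n_iters):
        for j in range(p):
            # Partial residual: the fit with predictor j's contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # One-dimensional lasso solution for coordinate j.
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data where only predictors 0 and 3 matter (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=100)

beta_hat = lasso_coordinate_descent(X, y, lam=20.0)
print(beta_hat)
```

The soft-threshold step is where the exact zeros come from: any coordinate whose correlation with the partial residual falls below the threshold is set to zero, not merely shrunk.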
Elastic Net
A mix: minimize $\|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$. Combines lasso's selection with ridge's stability for correlated predictors (lasso alone tends to arbitrarily pick one of a correlated group).
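The correlated-group behavior can be explored with scikit-learn's `Lasso` and `ElasticNet`. This is a rough sketch on synthetic data. Note that scikit-learn parameterizes the penalty with an overall weight `alpha` and a mixing fraction `l1_ratio` (and scales the squared-error term by 1/(2n)), not with two separate lambdas; the values 0.1 and 0.5 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Two predictors that are near-duplicates of each other (illustrative only).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

# Same overall penalty weight; elastic net splits it between L1 and L2.
lasso = Lasso(alpha=0.1, max_iter=10_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X, y)

print("lasso coefficients:      ", lasso.coef_)
print("elastic net coefficients:", enet.coef_)
```

On data like this, the elastic net's ridge component tends to spread weight evenly across the two near-duplicates, while the lasso fit is free to load most of the weight onto one of them.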
Picking lambda
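Use cross-validation. Try a grid of $\lambda$ values, compute the out-of-sample error for each, and pick the one with the lowest CV error (or, for a more parsimonious model, the largest $\lambda$ whose error is within one standard error of that minimum). scikit-learn's RidgeCV and LassoCV run the grid search automatically and select the minimum-error value; both call the penalty weight alpha.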
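A short sketch of this workflow with scikit-learn follows. The grid, the 5-fold setting, and the synthetic data are assumptions for illustration. The one-standard-error step is computed by hand from `LassoCV`'s `mse_path_` and `alphas_` attributes, since the estimator itself reports only the minimum-error value.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data: 10 predictors, only the first 3 matter (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_beta = np.zeros(10)
true_beta[:3] = [2.0, -1.5, 1.0]
y = X @ true_beta + rng.normal(size=200)

alphas = np.logspace(-3, 1, 30)  # candidate penalty weights

lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)

# One-standard-error rule for lasso: the largest penalty whose mean CV
# error is within one standard error of the minimum mean CV error.
mean_mse = lasso.mse_path_.mean(axis=1)
n_folds = lasso.mse_path_.shape[1]
se_mse = lasso.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)
best = mean_mse.argmin()
within_one_se = mean_mse <= mean_mse[best] + se_mse[best]
alpha_1se = lasso.alphas_[within_one_se].max()

print("lasso, minimum CV error:", lasso.alpha_)
print("lasso, one-SE rule:     ", alpha_1se)
print("ridge, best CV score:   ", ridge.alpha_)
```

The one-SE value is never smaller than the minimum-error value, so it yields a fit that is at least as sparse.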
Standardize first
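Regularization penalizes coefficients directly, so the penalty depends on each predictor's units. If $x_1$ is measured in inches and $x_2$ in miles, then for the same underlying effect $\beta_2$ would naturally be larger, and therefore over-penalized. Standardize ($z$-score) all predictors before fitting any regularized regression.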
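The unit problem can be seen in a small NumPy experiment. In this sketch, two predictors have the same true effect per standard deviation, but one is recorded in inches (1 mile = 63,360 inches). The `ridge` helper, the penalty value 100, and the data are illustrative assumptions. The shrinkage factor is the ratio of each ridge coefficient to its OLS counterpart.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge on centered data (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 200
miles_a = rng.normal(size=n)
miles_b = rng.normal(size=n)
y = miles_a + miles_b + rng.normal(size=n)
y = y - y.mean()

# Same kind of quantity, but the first column is stored in inches.
X_raw = np.column_stack([miles_a * 63360.0, miles_b])
X_raw = X_raw - X_raw.mean(axis=0)

# z-score: center (done above) and divide by each column's standard deviation.
X_std = X_raw / X_raw.std(axis=0)

lam = 100.0
# Shrinkage factor = ridge coefficient / OLS coefficient, per predictor.
shrink_raw = ridge(X_raw, y, lam) / ridge(X_raw, y, 0.0)
shrink_std = ridge(X_std, y, lam) / ridge(X_std, y, 0.0)

print("shrinkage on raw units:   ", shrink_raw)
print("shrinkage on standardized:", shrink_std)
```

On raw units the inches column is barely shrunk while the miles column absorbs almost all of the penalty; after standardizing, both are shrunk by about the same factor. In scikit-learn the same step is commonly done by placing `StandardScaler` ahead of the estimator in a pipeline.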
Why use regularization
- High-dimensional data: $p > n$ (gene expression, text). With more predictors than observations, OLS is ill-posed; ridge/lasso handle it.
- Multicollinearity: ridge stabilizes coefficients when predictors are correlated.
- Sparse expectation: if you believe only a few predictors actually matter, lasso encodes that prior.
- Better predictions: by trading a little bias for lower variance, regularization often improves out-of-sample accuracy even when OLS would still run.
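The high-dimensional case in the list above can be checked numerically. This sketch assumes 20 observations and 50 predictors of synthetic data: the Gram matrix X^T X is rank-deficient, so the OLS normal equations have no unique solution, while adding a ridge penalty (here an arbitrary value of 1) makes the system solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50  # more predictors than observations
X = rng.normal(size=(n, p))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n)

gram = X.T @ X  # p x p matrix whose rank cannot exceed n
rank = np.linalg.matrix_rank(gram)
print("rank of X^T X:", rank, "out of", p)

# Adding lam * I makes the matrix positive definite, hence invertible.
lam = 1.0
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)
print("ridge solution is finite:", bool(np.all(np.isfinite(beta_ridge))))
```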