Regularization — Ridge and Lasso — Section 4: Linear Regression

OLS minimizes squared error with no penalty on coefficient magnitudes. When predictors are correlated or there are many of them, coefficients can become huge and unstable. Regularization adds a penalty term that pulls coefficients toward zero.

Ridge (L2)

Minimize $\sum (y - \hat{y})^2 + \lambda \sum \beta_j^2$. The penalty $\lambda \sum \beta_j^2$ shrinks coefficients smoothly toward zero, and more strongly as $\lambda$ grows. Closed form (with centered data, so the intercept is not penalized): $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$. Ridge doesn't eliminate predictors: every $\beta_j$ generally stays nonzero, just smaller.
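The closed form above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production solver: the function name `ridge_fit`, the synthetic data, and the choice $\lambda = 1$ are all assumptions made for the example, and the data are generated without an intercept so the formula applies directly.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta."""
    p = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Synthetic data (assumed for illustration): two nearly collinear predictors.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

beta_ols = ridge_fit(X, y, lam=0.0)    # lam = 0 reduces to OLS
beta_ridge = ridge_fit(X, y, lam=1.0)

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

With predictors this correlated, the OLS coefficients tend to be large and of opposite sign, while the ridge coefficients are pulled toward similar, moderate values. Both ridge coefficients remain nonzero.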

Lasso (L1)

Minimize $\sum (y - \hat{y})^2 + \lambda \sum |\beta_j|$. The L1 penalty has a corner at zero, so for large enough $\lambda$ some coefficients are set exactly to zero. This performs variable selection automatically. No closed form in general; solved with coordinate descent or proximal methods.
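A minimal sketch of cyclic coordinate descent for the lasso objective above shows where the exact zeros come from. For this objective each one-coordinate update is a soft-threshold at $\lambda/2$. The names `soft_threshold` and `lasso_cd`, the fixed number of sweeps, and the synthetic data are assumptions for illustration. A real solver would add a convergence check.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; the result is exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for sum (y - X beta)^2 + lam * sum |beta_j|."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)              # x_j^T x_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with predictor j's current contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

# Synthetic data (assumed): only the first two of five predictors matter.
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)

beta_hat = lasso_cd(X, y, lam=80.0)
print(beta_hat)
```

The `max(..., 0.0)` inside the soft-threshold is the corner at zero in action: any coordinate whose correlation with the partial residual falls below $\lambda/2$ is set to exactly zero rather than to a small value.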

Elastic Net

A mix of both penalties, added to the same squared-error loss: $\lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2$. Combines lasso's selection with ridge's stability for correlated predictors (lasso alone tends to arbitrarily pick one of a correlated group).
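The behavior on correlated predictors can be illustrated with a small coordinate-descent sketch for the squared-error loss plus the penalty above. Everything here is an assumption for the example: the function name `elastic_net_cd`, the penalty values, and the extreme case of two identical columns. Setting `lam2=0` gives a plain lasso fit for comparison.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t; the result is exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def elastic_net_cd(X, y, lam1, lam2, n_sweeps=500):
    """Coordinate descent for
    sum (y - X beta)^2 + lam1 * sum |beta_j| + lam2 * sum beta_j^2."""
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            # The L2 term only enlarges the denominator; lam2 = 0 is the lasso update.
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (col_sq[j] + lam2)
    return beta

# Synthetic data (assumed): columns 0 and 1 are identical, column 2 is noise.
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=n)])
y = 2.0 * x + 0.5 * rng.normal(size=n)

beta_lasso = elastic_net_cd(X, y, lam1=20.0, lam2=0.0)
beta_enet = elastic_net_cd(X, y, lam1=20.0, lam2=50.0)

print("lasso:      ", beta_lasso)
print("elastic net:", beta_enet)
```

In this sketch the lasso run puts essentially all the weight on whichever duplicate is updated first, which is one way the arbitrary pick shows up. With the ridge term switched on, the objective is strictly convex and the two identical columns end up sharing the weight equally.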

Picking lambda

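Use cross-validation. Try a grid of $\lambda$ values, compute out-of-sample error for each, and pick the one with the lowest CV error. For more parsimony, use the one-standard-error rule: pick the largest $\lambda$ whose CV error is within 1 SE of the minimum. scikit-learn's RidgeCV and LassoCV run the grid search and pick the minimum-error value automatically (scikit-learn calls the penalty weight alpha).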
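The grid search can be written out by hand to make the mechanics visible. The sketch below uses plain NumPy and ridge regression. The helper names `ridge_fit` and `cv_errors`, the 5-fold split, the grid, and the synthetic data are assumptions for illustration. It reports both the minimum-error $\lambda$ and the largest $\lambda$ whose mean CV error is within one standard error of that minimum.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_errors(X, y, lambdas, k=5, seed=0):
    """Held-out mean squared error, shape (len(lambdas), k)."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    errs = np.empty((len(lambdas), k))
    for f, test_idx in enumerate(folds):
        train = np.ones(n, dtype=bool)
        train[test_idx] = False                  # hold this fold out
        for i, lam in enumerate(lambdas):
            beta = ridge_fit(X[train], y[train], lam)
            errs[i, f] = np.mean((y[test_idx] - X[test_idx] @ beta) ** 2)
    return errs

# Synthetic data (assumed): 20 predictors, only 3 with nonzero effect.
rng = np.random.default_rng(3)
n, p = 80, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(size=n)

lambdas = np.logspace(-2, 3, 30)
errs = cv_errors(X, y, lambdas)
mean_err = errs.mean(axis=1)
se_err = errs.std(axis=1, ddof=1) / np.sqrt(errs.shape[1])

i_min = int(np.argmin(mean_err))
lam_min = lambdas[i_min]
# Largest lambda whose mean CV error is within one SE of the minimum.
within = mean_err <= mean_err[i_min] + se_err[i_min]
lam_1se = lambdas[within].max()

print("lambda at minimum CV error:", lam_min)
print("lambda by one-SE rule:     ", lam_1se)
```

By construction `lam_1se` is never smaller than `lam_min`, so the second choice always gives at least as much shrinkage.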

Standardize first

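Regularization penalizes coefficients directly, so the penalty depends on the units of each predictor. If $x_1$ is measured in inches and $x_2$ in miles, the same physical effect needs a much larger coefficient on the miles scale, so $\beta_2$ would naturally be larger and over-penalized. Standardize ($z$-score) all predictors before fitting any regularized regression.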
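A small experiment makes the unit dependence concrete. The sketch below fits the same ridge model twice, once with a distance recorded in miles and once with the same distance in inches, first on centered raw columns and then on $z$-scored columns. The helper names `ridge_fit`, `zscore`, and `fitted`, the value $\lambda = 50$, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def zscore(X):
    """Center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic data (assumed): a distance predictor plus one unrelated predictor.
rng = np.random.default_rng(4)
n = 60
miles = rng.uniform(1.0, 10.0, size=n)
other = rng.normal(size=n)
y = 0.5 * miles + other + 0.3 * rng.normal(size=n)
y = y - y.mean()                        # centered response, so no intercept term

X_miles = np.column_stack([miles, other])
X_inches = np.column_stack([miles * 63360.0, other])   # same distances in inches

def fitted(X, lam=50.0):
    Xc = X - X.mean(axis=0)             # center only, keep the original units
    return Xc @ ridge_fit(Xc, y, lam)

# Without standardization the fitted values depend on the unit chosen.
raw_gap = np.max(np.abs(fitted(X_miles) - fitted(X_inches)))

# After z-scoring, the two versions give the same coefficients.
beta_z_miles = ridge_fit(zscore(X_miles), y, lam=50.0)
beta_z_inches = ridge_fit(zscore(X_inches), y, lam=50.0)
std_gap = np.max(np.abs(beta_z_miles - beta_z_inches))

print("raw fitted-value gap:        ", raw_gap)
print("standardized coefficient gap:", std_gap)
```

In a train/test workflow the means and standard deviations would be computed on the training rows only and then reused for new data. The sketch skips that step for brevity.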

Why use regularization

High-dimensional data: $p \gg n$ (gene expression, text). With $p > n$, $X^T X$ is singular and OLS has no unique solution; ridge and lasso still produce usable fits.
Multicollinearity: ridge stabilizes coefficients when predictors are correlated.
Sparse expectation: if you believe only a few predictors actually matter, lasso encodes that prior.
Better predictions: regularization often improves out-of-sample accuracy even when OLS would still run, by trading a little bias for lower variance.
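The high-dimensional case in the list above can be checked numerically. In the sketch below (synthetic data, with sizes and $\lambda = 1$ chosen only for illustration), $X$ has more columns than rows, so its rank is at most $n$ and $X^T X$ cannot be inverted. Adding $\lambda I$ makes the system solvable.

```python
import numpy as np

# Synthetic data (assumed): far more predictors than observations.
rng = np.random.default_rng(5)
n, p = 20, 100
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)

# rank(X^T X) equals rank(X), which cannot exceed n, so the p x p
# matrix X^T X is singular and the OLS normal equations have no unique solution.
rank = np.linalg.matrix_rank(X)

# Every eigenvalue of X^T X + lam * I is at least lam > 0, so this system
# has exactly one solution.
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("rank of X:", rank, "out of p =", p)
print("ridge coefficient vector length:", beta_ridge.shape[0])
```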