Regularization — Section 2: Linear Models

Regularization adds a penalty $\Omega(\beta)$ to the loss to shrink coefficients toward zero. Three classics:

Ridge ($L_2$)

$$\hat{\beta} = \arg\min_{\beta} ||y - X\beta||^2 + \lambda ||\beta||_2^2$$

Closed form: $\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$. Adding $\lambda I$ improves the conditioning of $X^T X$: every eigenvalue of $X^T X$ is shifted up by $\lambda$, so the matrix is invertible for any $\lambda > 0$ even when columns of $X$ are collinear. Coefficients shrink toward zero but are almost never exactly zero. Use ridge when you have many weak predictors that each contribute a little.
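
To make the closed form concrete, here is a minimal NumPy sketch. The synthetic data, the function name `ridge_fit`, and the absence of an intercept (the columns are assumed centered) are illustrative choices, not part of the notes. It solves the linear system rather than forming the inverse, then checks the eigenvalue shift numerically.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (an assumption for illustration) with a nearly collinear pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=50)
y = X @ np.array([1.0, -2.0, 0.5, 1.0, 1.0]) + 0.1 * rng.normal(size=50)

lam = 1.0
gram = X.T @ X
ridge_gram = gram + lam * np.eye(5)

# Each eigenvalue of X^T X moves up by exactly lam.
shift_ok = np.allclose(np.linalg.eigvalsh(ridge_gram),
                       np.linalg.eigvalsh(gram) + lam)

# The condition number drops because the smallest eigenvalue is now >= lam.
cond_plain = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(ridge_gram)

# A larger lam shrinks the coefficient vector more.
beta_small = ridge_fit(X, y, 1e-3)
beta_large = ridge_fit(X, y, 100.0)
```

Using `np.linalg.solve` instead of an explicit inverse is the usual numerical choice; the result is the same estimate.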

Lasso ($L_1$)

$$\hat{\beta} = \arg\min_{\beta} ||y - X\beta||^2 + \lambda ||\beta||_1$$

No closed form in general; solve via coordinate descent or LARS. The $L_1$ ball has corners on the coordinate axes, so the optimum often lands exactly on a corner, where some coefficients are exactly zero. Feature selection comes for free. Use lasso when you believe only a sparse subset of features matters.
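
Coordinate descent for the objective above can be sketched as follows. This is a bare-bones version under stated assumptions: no intercept, no convergence check (a fixed number of sweeps), and synthetic data; the names `soft_threshold` and `lasso_cd` are hypothetical. Because the loss here is $||y - X\beta||^2$ without a factor of $1/2$, the threshold in each coordinate update is $\lambda / 2$.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; returns exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Minimize ||y - X beta||^2 + lam * ||beta||_1 by cyclic coordinate descent."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)          # ||x_j||^2 for each column
    residual = y.astype(float).copy()      # y - X beta, with beta = 0
    for _ in range(n_iters):
        for j in range(n_features):
            residual += X[:, j] * beta[j]  # remove feature j from the fit
            rho = X[:, j] @ residual       # correlation with partial residual
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= X[:, j] * beta[j]  # put the updated feature back
    return beta

# Synthetic sparse problem (an assumption): only features 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=100)

lam = 20.0
beta_hat = lasso_cd(X, y, lam)
n_selected = np.count_nonzero(beta_hat)
```

The soft-thresholding step is where the exact zeros come from: any coordinate whose correlation with the partial residual is at most $\lambda / 2$ in magnitude is set to zero rather than merely made small.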

Elastic net

Convex combination of both penalties: $\lambda (\alpha ||\beta||_1 + (1-\alpha) ||\beta||_2^2)$ with $\alpha \in [0, 1]$. It inherits lasso's sparsity but groups correlated features together: pure lasso tends to keep one feature from a correlated cluster and zero the rest more or less arbitrarily, while elastic net tends to keep the whole cluster.
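
The grouping behaviour can be seen in a small experiment. The sketch below is an illustrative coordinate-descent solver for $||y - X\beta||^2$ plus the elastic net penalty (the function name `elastic_net_cd`, the fixed sweep count, and the synthetic data are assumptions). Two columns of $X$ are exact duplicates, which is the extreme case of correlation. With $\alpha = 1$ (pure lasso) the objective does not determine how weight is split between the duplicates, and this particular solver happens to put it all on the first one it visits; with $\alpha = 0.5$ the quadratic term makes the solution unique and the duplicates share the weight equally.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; returns exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iters=500):
    """Cyclic coordinate descent for
    ||y - X beta||^2 + lam * (alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2)."""
    n_features = X.shape[1]
    beta = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)
    residual = y.astype(float).copy()
    for _ in range(n_iters):
        for j in range(n_features):
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # L1 part sets the threshold; L2 part inflates the denominator.
            beta[j] = (soft_threshold(rho, lam * alpha / 2.0)
                       / (col_sq[j] + lam * (1.0 - alpha)))
            residual -= X[:, j] * beta[j]
    return beta

# Synthetic data (an assumption): columns 0 and 1 are identical, column 2 is noise.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
z = rng.normal(size=100)
X = np.column_stack([x, x, z])
y = 2.0 * x + 0.1 * rng.normal(size=100)

beta_lasso = elastic_net_cd(X, y, lam=20.0, alpha=1.0)  # pure lasso
beta_enet = elastic_net_cd(X, y, lam=20.0, alpha=0.5)   # elastic net
```

Comparing `beta_lasso[:2]` with `beta_enet[:2]` shows the contrast: one duplicate carries essentially all the weight in the first case, while both carry the same weight in the second.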

Bayesian interpretation

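Ridge corresponds to a zero-mean Gaussian prior on $\beta$, lasso to a zero-mean Laplace prior. Under Gaussian noise, the MAP estimate equals the penalized least-squares estimate, because the penalty is the negative log prior up to scaling. This is why these are often called the "natural" priors: they are among the simplest distributions concentrated near zero.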
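
A short derivation makes the correspondence explicit. The symbols $\sigma^2$ (noise variance), $\tau^2$ (Gaussian prior variance), and $b$ (Laplace scale) are introduced here for illustration and do not appear elsewhere in these notes. Assume $y \mid \beta \sim N(X\beta, \sigma^2 I)$ and independent priors $\beta_j \sim N(0, \tau^2)$. The negative log posterior is

$$-\log p(\beta \mid y) = \frac{1}{2\sigma^2} ||y - X\beta||^2 + \frac{1}{2\tau^2} ||\beta||_2^2 + \text{const}.$$

Multiplying by $2\sigma^2$ does not change the minimizer, so

$$\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} ||y - X\beta||^2 + \frac{\sigma^2}{\tau^2} ||\beta||_2^2,$$

which is the ridge objective with $\lambda = \sigma^2 / \tau^2$. Replacing the prior with independent Laplace densities $p(\beta_j) \propto \exp(-|\beta_j| / b)$ gives

$$\hat{\beta}_{\text{MAP}} = \arg\min_{\beta} ||y - X\beta||^2 + \frac{2\sigma^2}{b} ||\beta||_1,$$

the lasso objective with $\lambda = 2\sigma^2 / b$. In both cases a tighter prior (smaller $\tau^2$ or $b$) means a larger $\lambda$.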

Picking $\lambda$

Use cross-validation. Almost never rely on a closed-form rule (BIC etc.): such rules make assumptions that real data often violates.
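
A minimal sketch of K-fold cross-validation over a grid of $\lambda$ values, using the ridge closed form as the model. The helper names `ridge_fit` and `cv_mse`, the grid, the fold count, and the synthetic data are all assumptions made for illustration; the data is treated as centered, so no intercept is fitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate: solve (X^T X + lam * I) beta = X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, n_folds=5, seed=0):
    """Mean held-out squared error of ridge with penalty lam over n_folds folds."""
    rng = np.random.default_rng(seed)       # same seed -> same folds for every lam
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        beta = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption): 60 samples, 20 features, moderate noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 20))
y = X @ rng.normal(size=20) + rng.normal(size=60)

lambdas = np.logspace(-3, 3, 13)
scores = [cv_mse(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(scores))]
```

Reusing the same fold assignment for every $\lambda$ keeps the comparison fair: differences in score then come from the penalty, not from a different split.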