Regularization adds a penalty to the loss to shrink coefficients toward zero. Three classics:
Ridge ($L_2$)
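Closed form: $\hat\beta = (X^\top X + \lambda I)^{-1} X^\top y$. The $\lambda I$ term improves the conditioning of $X^\top X$: every eigenvalue is shifted up by $\lambda$, so the inverse exists even when features are collinear. Coefficients shrink toward zero but rarely reach it. Use when you have many weak predictors that each contribute a little.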
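One way to see both the shrinkage and the conditioning effect is through the singular value decomposition. This is a sketch under assumed notation that the notes do not define: the objective is $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$, $X$ is $n \times p$ with $n \ge p$, and $X = U D V^\top$ is its thin SVD with singular values $d_i$ and singular vectors $u_i$, $v_i$.

$$
\begin{aligned}
X^\top X + \lambda I &= V\,(D^2 + \lambda I)\,V^\top, \\
\hat\beta_{\text{ridge}} &= (X^\top X + \lambda I)^{-1} X^\top y = \sum_i \frac{d_i}{d_i^2 + \lambda}\, v_i\,(u_i^\top y), \\
\kappa(X^\top X + \lambda I) &= \frac{d_{\max}^2 + \lambda}{d_{\min}^2 + \lambda} \;\le\; \frac{d_{\max}^2}{d_{\min}^2}.
\end{aligned}
$$

Least squares puts weight $1/d_i$ on direction $v_i$; ridge multiplies that by $d_i^2/(d_i^2 + \lambda) < 1$. Directions with small $d_i$ are shrunk the most, and no component is set exactly to zero.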
Lasso ($L_1$)
No closed form in general, so solve with coordinate descent or LARS. The constraint region of the $L_1$ penalty has corners on the axes, and the optimum often lands exactly on one, so some coefficients are exactly zero. Feature selection comes for free. Use when you believe only a sparse subset of features matters.
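The coordinate descent idea can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a production solver: it assumes the objective is scaled as $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, the names `soft_threshold` and `lasso_cd` are hypothetical, and the toy data is made up. Each coordinate update is a soft-thresholding step, which is where the exact zeros come from.

```python
import random


def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(c * r for c, r in zip(col, resid)) / n + z * beta[j]
            new = soft_threshold(rho, lam) / z
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new
    return beta


# Hypothetical toy data: five features, only the first two drive y.
random.seed(0)
n = 200
X = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0, 0.1) for row in X]

beta_hat = lasso_cd(X, y, lam=0.5)
print(beta_hat)
```

With this penalty the three irrelevant coefficients come out as exact zeros, while the two real ones are shrunk toward zero rather than recovered at their true values.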
Elastic net
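Convex combination of both penalties: $\lambda\left(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\right)$ with $\alpha \in [0, 1]$ (some references put a factor of $\tfrac{1}{2}$ on the $L_2$ term). Inherits lasso's sparsity but groups correlated features together: pure lasso tends to pick one and zero the rest arbitrarily, while elastic net keeps the cluster.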
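The grouping effect is easiest to see in the extreme case of two identical columns. The sketch below is an illustration under stated assumptions: the objective is taken as $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\left(\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2\right)$, the function name `enet_cd` and the toy data are hypothetical, and setting `alpha=1.0` recovers plain lasso.

```python
import random


def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def enet_cd(X, y, lam, alpha, n_sweeps=500):
    """Coordinate descent for
    (1/2n)*||y - X b||^2 + lam*(alpha*||b||_1 + (1 - alpha)*||b||_2^2)."""
    n, p = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    z = [sum(v * v for v in c) / n for c in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta; beta starts at zero
    for _ in range(n_sweeps):
        for j in range(p):
            denom = z[j] + 2.0 * lam * (1.0 - alpha)
            if denom == 0.0:
                continue
            rho = sum(c * r for c, r in zip(cols[j], resid)) / n + z[j] * beta[j]
            # The L1 part soft-thresholds; the L2 part inflates the denominator.
            new = soft_threshold(rho, lam * alpha) / denom
            delta = new - beta[j]
            resid = [r - c * delta for r, c in zip(resid, cols[j])]
            beta[j] = new
    return beta


# Hypothetical toy data: columns 0 and 1 are identical, column 2 is noise.
random.seed(1)
n = 100
base = [random.gauss(0, 1) for _ in range(n)]
noise_col = [random.gauss(0, 1) for _ in range(n)]
X = [[b, b, u] for b, u in zip(base, noise_col)]
y = [2.0 * b + random.gauss(0, 0.1) for b in base]

lasso_beta = enet_cd(X, y, lam=0.1, alpha=1.0)  # pure lasso
enet_beta = enet_cd(X, y, lam=0.1, alpha=0.5)   # elastic net
print("lasso:      ", lasso_beta)
print("elastic net:", enet_beta)
```

In this run lasso puts all the weight on whichever duplicate it visits first, which here is column 0. The $L_2$ term makes the elastic net objective strictly convex, so the two duplicates end up sharing the weight equally.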
Bayesian interpretation
Ridge = Gaussian prior on $\beta$. Lasso = Laplace prior. The MAP estimate equals the penalized MLE because the negative log-prior is exactly the penalty term. This is why these are the "natural" priors: they correspond to the simplest distributions concentrated near zero.
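The correspondence can be worked out in a few lines. This is a sketch under assumed notation that the notes do not define: Gaussian noise $y \mid \beta \sim \mathcal{N}(X\beta, \sigma^2 I)$, independent priors on each coefficient, a Gaussian prior with variance $\tau^2$, and a Laplace prior with scale $b$.

$$
\begin{aligned}
\hat\beta_{\text{MAP}} &= \arg\max_\beta \big[\log p(y \mid \beta) + \log p(\beta)\big]
 = \arg\min_\beta \Big[\tfrac{1}{2\sigma^2}\|y - X\beta\|_2^2 - \log p(\beta)\Big], \\
\beta_j \sim \mathcal{N}(0, \tau^2):\quad -\log p(\beta) &= \tfrac{1}{2\tau^2}\|\beta\|_2^2 + \text{const}
 \;\Rightarrow\; \hat\beta_{\text{MAP}} = \arg\min_\beta\, \|y - X\beta\|_2^2 + \tfrac{\sigma^2}{\tau^2}\|\beta\|_2^2, \\
\beta_j \sim \text{Laplace}(0, b):\quad -\log p(\beta) &= \tfrac{1}{b}\|\beta\|_1 + \text{const}
 \;\Rightarrow\; \hat\beta_{\text{MAP}} = \arg\min_\beta\, \|y - X\beta\|_2^2 + \tfrac{2\sigma^2}{b}\|\beta\|_1.
\end{aligned}
$$

Under these assumptions the penalty weight is $\lambda = \sigma^2/\tau^2$ for ridge and $\lambda = 2\sigma^2/b$ for lasso, so a tighter prior (smaller $\tau$ or $b$) means a larger penalty.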
Picking $\lambda$
Use cross-validation. Avoid analytic criteria (BIC etc.) in almost all cases: they rest on assumptions that real data tends to violate.
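A K-fold search over a grid of $\lambda$ values might look like the sketch below. To keep it short and dependency-free it assumes the simplest possible model, ridge with a single feature and no intercept, whose slope is $\sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The names `ridge_1d` and `kfold_cv_error`, the grid, and the toy data are all hypothetical.

```python
import random


def ridge_1d(xs, ys, lam):
    """Ridge slope for one feature, no intercept: sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def kfold_cv_error(xs, ys, lam, k=5):
    """Mean held-out squared error of ridge_1d over k contiguous folds."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        lo, hi = fold * n // k, (fold + 1) * n // k
        # Fit on everything outside [lo, hi), score on the held-out slice.
        slope = ridge_1d(xs[:lo] + xs[hi:], ys[:lo] + ys[hi:], lam)
        total += sum((y - slope * x) ** 2 for x, y in zip(xs[lo:hi], ys[lo:hi]))
    return total / n


# Hypothetical toy data, already in random order; shuffle real data first.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(40)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: kfold_cv_error(xs, ys, lam) for lam in grid}
best_lam = min(cv_errors, key=cv_errors.get)
print(best_lam, cv_errors[best_lam])
```

The same loop applies to lasso or elastic net by swapping the fitting function. A log-spaced grid is a common choice because the useful range of $\lambda$ usually spans several orders of magnitude.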