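Linear regression models $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. The maximum-likelihood estimator coincides with ordinary least squares (OLS):

$$\hat\beta = (X^\top X)^{-1} X^\top y$$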
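
A minimal NumPy sketch of the closed-form OLS estimate $\hat\beta = (X^\top X)^{-1} X^\top y$. The synthetic data, the seed, and the names `beta_true`, `beta_normal`, and `beta_lstsq` are assumptions made for illustration. It solves the normal equations directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): n rows, p features plus an intercept column.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least-squares solver; usually the safer choice when X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

On well-conditioned data the two estimates should agree to numerical precision. In practice `np.linalg.lstsq` is usually preferable to solving the normal equations, because forming $X^\top X$ squares the condition number.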
What it costs and when it breaks
- Compute: $O(np^2 + p^3)$ for $n$ rows and $p$ features, fine for $p$ in the low thousands. Beyond that, use iterative solvers (SGD, conjugate gradient).
- Multicollinearity: if columns of $X$ are nearly linearly dependent, $X^\top X$ becomes ill-conditioned. Coefficients blow up and standard errors explode, though predictions can still be fine. Detect with VIF; fix with regularization or by dropping correlated features.
- Linearity: residuals should look unstructured. Curved residual-vs-fitted plots mean you're missing nonlinear structure.
- Homoscedasticity: residual variance shouldn't depend on $x$. Funnel-shaped residuals → use weighted least squares or transform $y$.
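
The multicollinearity check can be sketched as follows. The helper `vif` is a hypothetical hand-rolled function, not a library API, and the data are synthetic. It regresses each feature on the others and reports $1 / (1 - R^2)$, alongside the condition number of the design matrix.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (X has no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress feature j on an intercept plus all remaining features.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)                   # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
cond = np.linalg.cond(np.column_stack([np.ones(n), X]))

print("VIF:", vifs)
print("condition number:", cond)
```

Here `x1` and `x2` should show very large VIFs while `x3` stays near 1. A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.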
Coefficient interpretation
$\beta_j$ is the expected change in $y$ when feature $x_j$ increases by 1, holding all other features fixed. The "holding fixed" matters: if features are correlated, the marginal effect of one feature in isolation can be very different from $\beta_j$.
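
To make the "holding fixed" point concrete, here is a small simulation with assumed synthetic data, true coefficients of 1 and 2, and a feature correlation of about 0.8. It fits the full model, then a model that omits the second feature, and compares the slope on the first feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Two correlated features: corr(x1, x2) is about 0.8 by construction.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

# Full model: the coefficient on x1 is its effect with x2 held fixed.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# x1 alone: the slope also absorbs the part of x2 that moves with x1.
X_x1 = np.column_stack([np.ones(n), x1])
beta_x1, *_ = np.linalg.lstsq(X_x1, y, rcond=None)

print("x1 coefficient, x2 held fixed:", beta_full[1])
print("x1 slope, x2 omitted:", beta_x1[1])
```

The full-model coefficient should land near the true value of 1, while the single-feature slope should land near $1 + 2 \times 0.8 = 2.6$, because a one-unit increase in `x1` comes with an average 0.8 increase in `x2`.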
When linear regression is the right tool
- Small data (< few thousand rows) where you need interpretability
- Feature engineering is rich enough to encode the structure
- You want statistically defensible confidence intervals on coefficients
- Baseline for any new problem: a linear model with good features is harder to beat than one with bad features
The default first thing you should try.
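
As an illustration of the confidence intervals mentioned above, the sketch below computes classical OLS standard errors from $\hat\sigma^2 (X^\top X)^{-1}$ on assumed synthetic data. It uses the normal quantile 1.96 as an approximation to the exact t quantile, which is an assumption that is reasonable only when the residual degrees of freedom are large. The intervals are valid only under the model's assumptions (linearity, homoscedastic Gaussian noise).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (assumed): intercept plus two features, one with a true zero effect.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Unbiased noise-variance estimate: residual sum of squares over degrees of freedom.
resid = y - X @ beta_hat
dof = n - X.shape[1]
sigma2_hat = resid @ resid / dof

# Coefficient covariance and standard errors.
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))

# Approximate 95% intervals; 1.96 stands in for the t quantile since dof is large.
z = 1.96
ci_lower = beta_hat - z * se
ci_upper = beta_hat + z * se

for b, lo, hi in zip(beta_hat, ci_lower, ci_upper):
    print(f"{b: .3f}  [{lo: .3f}, {hi: .3f}]")
```

For small samples, swap 1.96 for the appropriate t quantile with `dof` degrees of freedom.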