Linear Regression — Section 2: Linear Models

Linear regression models $y = X\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I)$. Under this Gaussian noise assumption, the maximum-likelihood estimator coincides with ordinary least squares (OLS); when $X^T X$ is invertible it has the closed form:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
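
As a minimal sketch of the closed form above, the following fits OLS on synthetic data in two ways. The data, the seed, and the names `beta_true`, `beta_normal`, and `beta_lstsq` are illustrative assumptions, not part of the original text. It solves the normal equations directly rather than forming the inverse, then compares against NumPy's least-squares solver.

```python
import numpy as np

# Synthetic data (assumed for illustration): n rows, p features, small Gaussian noise.
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: solve (X^T X) beta = X^T y without computing the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# SVD-based least squares: same answer here, better behaved when X^T X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)
```

On well-conditioned data the two estimates agree to numerical precision. In practice `np.linalg.lstsq` (or a QR-based solver) is usually the safer default because it never squares the condition number of $X$.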

What it costs and when it breaks

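Compute: $O(p^2 n + p^3)$, where $p^2 n$ is the cost of forming $X^T X$ and $p^3$ the cost of solving the resulting system. This is fine for $p$ in the low thousands. Beyond that, use iterative solvers (SGD, conjugate gradient).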
Multicollinearity: if columns of $X$ are nearly linearly dependent, $X^T X$ becomes ill-conditioned. Coefficient estimates become unstable and their standard errors inflate, although predictions can still be fine. Detect with the variance inflation factor (VIF); fix with regularization or by dropping correlated features.
Linearity: residuals should look unstructured. Curved residual-vs-fitted plots mean you're missing nonlinear structure.
Homoscedasticity: residual variance shouldn't depend on $\hat{y}$. Funnel-shaped residuals → use weighted least squares or transform $y$.
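
The multicollinearity check above can be sketched as follows. The helper `vif`, the synthetic columns, and the ridge penalty `lam = 1.0` are assumptions made for illustration. The VIF for column $j$ is computed as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing column $j$ on the remaining columns plus an intercept. The last lines show one possible form of the regularization fix: adding a ridge term lowers the condition number of $X^T X$.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (hypothetical helper)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic design (assumed): x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)
cond = np.linalg.cond(X.T @ X)

# Ridge-style fix: adding lam * I shifts every eigenvalue up and shrinks the condition number.
lam = 1.0  # illustrative value, not a recommendation
cond_ridge = np.linalg.cond(X.T @ X + lam * np.eye(X.shape[1]))

print(vifs, cond, cond_ridge)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though the threshold is a convention rather than a hard rule.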

Coefficient interpretation

$\beta_j$ is the expected change in $y$ when feature $j$ increases by 1, holding all other features fixed. The "holding fixed" matters: if features are correlated, the marginal effect of one feature in isolation can be very different from $\beta_j$.
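
A small simulation makes the distinction concrete. This is a sketch on assumed synthetic data: the true model is $y = 1 \cdot x_1 + 2 \cdot x_2 + \text{noise}$, and $x_2$ is built to have correlation of about 0.8 with $x_1$. The names `beta_full` and `beta_marg` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)  # correlated with x1 by construction
y = 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

# Multiple regression: the coefficient on x1 with x2 held fixed.
X_full = np.column_stack([np.ones(n), x1, x2])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Simple regression on x1 alone: the marginal slope, which also absorbs x2's effect.
X_marg = np.column_stack([np.ones(n), x1])
beta_marg, *_ = np.linalg.lstsq(X_marg, y, rcond=None)

print("slope on x1, x2 held fixed:", beta_full[1])
print("slope on x1, x2 ignored:   ", beta_marg[1])
```

With this setup the first slope should land near 1.0, while the second should land near $1 + 2 \times 0.8 = 2.6$, because moving $x_1$ also moves $x_2$ on average.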

When linear regression is the right tool

Small data (fewer than a few thousand rows) where you need interpretability
Feature engineering is rich enough to encode the structure
You want statistically defensible confidence intervals on coefficients
Baseline for any new problem — beating a linear model with good features is harder than beating one with bad features
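
The confidence intervals mentioned in the list above can be sketched directly from the model assumptions, under which $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$. The synthetic data and variable names below are assumptions for illustration. To stay dependency-light, this uses a normal quantile as a large-sample approximation; exact intervals under the Gaussian model would use a Student $t$ quantile with $n - p$ degrees of freedom.

```python
import numpy as np
from statistics import NormalDist

# Synthetic data (assumed): intercept plus two features, unit-variance noise.
rng = np.random.default_rng(3)
n, p = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([0.5, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Unbiased estimate of the noise variance, then the coefficient covariance matrix.
sigma2_hat = (resid @ resid) / (n - p)
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_beta))

# Two-sided 95% interval using the normal approximation to the t quantile.
z = NormalDist().inv_cdf(0.975)
ci_lower = beta_hat - z * se
ci_upper = beta_hat + z * se

for j in range(p):
    print(f"beta_{j}: {beta_hat[j]:.3f}  [{ci_lower[j]:.3f}, {ci_upper[j]:.3f}]")
```

These intervals are only as defensible as the assumptions behind them: if the residual checks for linearity or homoscedasticity fail, the standard errors computed here are not reliable.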

In short, linear regression is the default first thing you should try.
