Ordinary Least Squares — Section 4: Linear Regression

Linear regression fits a line (or hyperplane) through data by minimizing the sum of squared residuals. It's the workhorse of statistics — simple, interpretable, and a building block for nearly every fancier method.

The model

$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$ i.i.d. across observations.

OLS estimator

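Minimize $\sum_i (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the fitted value for observation $i$. Closed form: $\hat{\beta} = (X^T X)^{-1} X^T y$, where $X$ is the design matrix with a column of ones for the intercept. The inverse exists only when the columns of $X$ are linearly independent (no perfect multicollinearity).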
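The closed form can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data, not a reference implementation: the function name `ols_fit` is hypothetical, and the sketch uses `np.linalg.lstsq` rather than an explicit matrix inverse because it tends to be numerically steadier while solving the same least-squares problem.

```python
import numpy as np

def ols_fit(X_raw, y):
    """Return OLS coefficients (intercept first). Hypothetical helper for illustration."""
    X_raw = np.asarray(X_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones for the intercept, then the predictors.
    X = np.column_stack([np.ones(len(y)), X_raw])
    # Least-squares solution; equals (X^T X)^{-1} X^T y when X has full column rank.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Made-up data lying exactly on the line y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x
beta_hat = ols_fit(x, y)

# Same answer from the normal equations, written out for comparison.
X = np.column_stack([np.ones(len(y)), x])
beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
```

Because the data sit exactly on a line, both routes should recover an intercept of 1 and a slope of 2 up to floating-point error.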

What the coefficients mean

$\beta_j$ is the expected change in $y$ for a one-unit increase in $x_j$, holding all other predictors constant. The "all else equal" qualifier matters: the coefficient describes the association with $x_j$ only relative to the other predictors that are measured and included in the model, so adding or dropping a predictor can change both its value and its meaning.

$R^2$

Fraction of variance in $y$ explained by the model. $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$. Range $[0, 1]$ on the training data for a model with intercept. Higher is better, but it is easy to game: in-sample $R^2$ never decreases when a predictor is added, even a useless one. Use adjusted $R^2$, which penalizes the number of predictors, or cross-validated $R^2$ instead.
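To see how adding predictors inflates $R^2$, here is a small sketch on synthetic data. The helper names (`r_squared`, `adjusted_r_squared`, `fit_predict`) are hypothetical, and the adjusted form used is the common $1 - (1 - R^2)\frac{n-1}{n-p-1}$ with $n$ observations and $p$ predictors excluding the intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """1 - SS_res / SS_tot, matching the formula above."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def fit_predict(X_raw, y):
    """Fit OLS with an intercept and return in-sample fitted values."""
    X = np.column_stack([np.ones(len(y)), X_raw])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta_hat

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
noise = rng.normal(size=(n, 5))  # five predictors unrelated to y

r2_small = r_squared(y, fit_predict(x, y))
r2_big = r_squared(y, fit_predict(np.column_stack([x, noise]), y))
adj_small = adjusted_r_squared(r2_small, n, p=1)
adj_big = adjusted_r_squared(r2_big, n, p=6)
```

The larger model cannot have a lower in-sample $R^2$ than the smaller one it nests, which is exactly why the raw number rewards junk predictors; the adjusted version subtracts a penalty that grows with $p$.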

Inference

Each coefficient has a standard error and a t-statistic, the estimate divided by its standard error. Coefficients with $|t| > 2$ (roughly $p < 0.05$ in moderately large samples) are conventionally called "significant." Significance speaks to "is this effect distinguishable from zero," not "is this effect important." A tiny coefficient estimated with very high precision, for example from a very large sample, can be significant but practically meaningless.
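The gap between "significant" and "important" can be sketched with synthetic data. The function name `ols_inference` is hypothetical, and the standard errors use the classical formula $\hat{\sigma}^2 (X^T X)^{-1}$, which assumes homoscedastic, uncorrelated errors.

```python
import numpy as np

def ols_inference(X_raw, y):
    """Return coefficients, classical standard errors, and t-statistics."""
    X = np.column_stack([np.ones(len(y)), X_raw])
    n, k = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Unbiased estimate of the error variance: RSS / (n - k).
    sigma2_hat = resid @ resid / (n - k)
    # Classical covariance of beta_hat under homoscedastic errors.
    cov = sigma2_hat * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov))
    return beta_hat, se, beta_hat / se

# Synthetic data: a very small true slope (0.01) but a very large sample.
rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)
y = 5.0 + 0.01 * x + rng.normal(scale=0.1, size=n)

beta_hat, se, t = ols_inference(x, y)
```

With this many observations the slope's standard error is tiny, so its t-statistic lands far above 2 even though a one-unit change in `x` moves `y` by only about 0.01.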

When OLS fails

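By the Gauss-Markov theorem, OLS is BLUE (Best Linear Unbiased Estimator) under these assumptions: linearity in the parameters, errors with zero mean given the predictors, independent (or at least uncorrelated) errors, homoscedasticity (equal variance of errors), and no perfect multicollinearity. Normal errors are not needed for BLUE or for estimation; they matter only for exact inference in small samples. Real data violate some of these all the time; diagnostic plots (residuals vs fitted, Q-Q, leverage) tell you which.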
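The quantities behind those diagnostic plots are cheap to compute. The sketch below, with the hypothetical helper `ols_diagnostics` and made-up data, returns fitted values, residuals, and leverage (the diagonal of the hat matrix $H = X (X^T X)^{-1} X^T$), obtained here from a QR decomposition instead of forming $H$ explicitly.

```python
import numpy as np

def ols_diagnostics(X_raw, y):
    """Return fitted values, residuals, and leverage for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X_raw])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta_hat
    resid = y - fitted
    # With X = QR (reduced form), H = Q Q^T, so leverage is the squared row norm of Q.
    Q, _ = np.linalg.qr(X)
    leverage = np.sum(Q ** 2, axis=1)
    return fitted, resid, leverage

# Made-up data with one x value far from the rest (the last point).
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
y = np.array([1.2, 1.9, 3.1, 4.2, 19.5])

fitted, resid, leverage = ols_diagnostics(x, y)
```

Plotting `resid` against `fitted` gives the residuals-vs-fitted view, and sorting `leverage` flags points whose predictor values sit far from the bulk of the data; here the last observation should stand out. Leverages always sum to the number of fitted coefficients, which is a quick sanity check.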
