Regression Assumptions and Diagnostics — Section 4: Linear Regression

Under a set of assumptions, OLS coefficient estimates are unbiased and have the smallest variance of any linear unbiased estimator. Violations don't always invalidate the model, but you should know which ones bite and how badly.

The assumptions (Gauss-Markov)

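1. Linearity: $E[y | X] = X\beta$, so the conditional mean is linear in the predictors (equivalently, the errors have conditional mean zero).
2. Independence: errors are uncorrelated with one another.
3. Homoscedasticity: errors have equal variance regardless of $x$.
4. No perfect multicollinearity: $X^T X$ is invertible.
5. Normality of errors: not part of Gauss-Markov and not needed for estimation; it only matters for exact small-sample inference (t and F tests, confidence intervals).

Unbiasedness needs only assumptions 1 and 4. Assumptions 2 and 3 are what make OLS efficient and the usual standard errors valid.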
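As a rough illustration of how assumption 4 shows up in practice, the sketch below fits OLS with NumPy and refuses to proceed when the design matrix is rank-deficient. The helper name `fit_ols` and the synthetic data are assumptions made for this example, not part of any library.

```python
import numpy as np

def fit_ols(X, y):
    """Fit OLS by least squares. X must already include an intercept column.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Assumption 4: X^T X is invertible only if X has full column rank.
    rank = np.linalg.matrix_rank(X)
    if rank < p:
        raise ValueError(f"perfect multicollinearity: rank {rank} < {p} columns")
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return beta, residuals

# Synthetic data generated to satisfy the assumptions (true intercept 2.0, slope 0.5).
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)

beta_hat, resid = fit_ols(X, y)
print("estimated coefficients:", beta_hat)
```

Passing a design matrix with a duplicated or exactly rescaled column (for example `np.column_stack([X, 2 * X[:, 1]])`) triggers the `ValueError`, which is the numerical face of a non-invertible $X^T X$.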

Diagnostic plots

Residuals vs fitted: should look like noise. Patterns (curve, fan) suggest nonlinearity or heteroscedasticity.
Q-Q plot: residual quantiles vs normal quantiles. Should be a straight line. A bowed curve indicates skew; an S-shape indicates tails heavier or lighter than normal.
Scale-location: $\sqrt{|\text{standardized residual}|}$ vs fitted. A trend (slope ≠ 0) indicates heteroscedasticity.
Leverage / Cook's distance: identifies influential outliers — points that disproportionately move the fit.
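The quantities behind these four plots can be computed directly. The following is a minimal sketch using NumPy and the standard library; the function name `ols_diagnostics`, the dictionary keys, and the synthetic data with one planted influential point are all assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

def ols_diagnostics(X, y):
    """Return the numbers plotted in the four diagnostic plots.

    X must already contain an intercept column. Hypothetical helper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted

    # Leverage h_i: diagonal of the hat matrix X (X^T X)^{-1} X^T.
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)

    # Standardized (internally studentized) residuals.
    s2 = resid @ resid / (n - p)
    std_resid = resid / np.sqrt(s2 * (1.0 - leverage))

    # Cook's distance combines residual size and leverage.
    cooks_d = std_resid**2 * leverage / (p * (1.0 - leverage))

    # Q-Q plot coordinates: normal quantiles vs sorted standardized residuals.
    probs = (np.arange(1, n + 1) - 0.5) / n
    normal_q = np.array([NormalDist().inv_cdf(float(q)) for q in probs])

    return {
        "fitted": fitted,
        "resid": resid,
        "leverage": leverage,
        "std_resid": std_resid,
        "scale_location": np.sqrt(np.abs(std_resid)),
        "cooks_d": cooks_d,
        "qq": (normal_q, np.sort(std_resid)),
    }

# Synthetic example: 30 well-behaved points plus one far-out point off the line.
rng = np.random.default_rng(1)
x = np.append(np.linspace(0, 10, 30), 30.0)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=x.size)
y[-1] -= 40.0  # planted influential observation
X = np.column_stack([np.ones_like(x), x])

diag = ols_diagnostics(X, y)
worst = int(np.argmax(diag["cooks_d"]))
print("most influential observation:", worst)
```

Plotting `resid` against `fitted`, the two arrays in `qq` against each other, `scale_location` against `fitted`, and `std_resid` against `leverage` reproduces the four plots described above.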

Common violations and fixes

Nonlinearity: add polynomial terms, log-transform, or use spline / tree-based models.
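Heteroscedasticity: use robust (White) standard errors. Coefficients stay unbiased, but the usual standard errors are wrong and OLS is no longer the most efficient estimator.
Correlated errors: for time-series data, model the errors as autoregressive or use autocorrelation-robust (Newey-West) standard errors; for panel data, use random or fixed effects.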
Outliers: investigate, don't auto-drop. Sometimes they're the most important observations; sometimes they're data entry errors.
Multicollinearity: drop redundant predictors, combine them (PCA), or regularize (ridge/lasso).
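To make the heteroscedasticity fix concrete, here is a sketch of White's robust standard errors in their simplest (HC0) form, computed next to the classical ones. The function name `ols_with_robust_se` and the synthetic fan-shaped data are assumptions for this example; libraries such as statsmodels offer the same estimator (for instance via `cov_type="HC0"` when fitting), along with small-sample variants.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with classical and White (HC0) standard errors.

    X must already contain an intercept column. Hypothetical helper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Classical covariance: s^2 (X^T X)^{-1}, valid under homoscedasticity.
    s2 = resid @ resid / (n - p)
    se_classical = np.sqrt(np.diag(s2 * XtX_inv))

    # HC0 sandwich: (X^T X)^{-1} [X^T diag(e_i^2) X] (X^T X)^{-1}.
    meat = X.T @ (X * resid[:, None] ** 2)
    cov_hc0 = XtX_inv @ meat @ XtX_inv
    se_robust = np.sqrt(np.diag(cov_hc0))
    return beta, se_classical, se_robust

# Synthetic data whose noise spread grows with x (a "fan" in the residual plot).
rng = np.random.default_rng(3)
n = 5000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=n) * (0.1 * x**2)
X = np.column_stack([np.ones(n), x])

beta, se_classical, se_robust = ols_with_robust_se(X, y)
print("slope:", beta[1])
print("classical SE:", se_classical[1], "robust SE:", se_robust[1])
```

The coefficient estimate is the same in both cases; only the standard errors change. In this setup the classical slope standard error understates the uncertainty, which is why the robust version is the safer one to report.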

When to give up on linearity

If residuals show clear curvature, log-transform $y$ or add polynomial terms. If no single transformation fixes it, the problem is probably structural: use a spline, tree-based, or kernel method instead.
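One way to check whether a transformation has removed the curvature is to regress the straight-line residuals on a quadratic in $x$ and see how much of their variance it explains. The sketch below does this before and after a log transform; the helper `curvature_r2` and the exponential-growth data are assumptions for illustration, and the check is a heuristic rather than a formal test.

```python
import numpy as np

def curvature_r2(x, y):
    """Share of straight-line residual variance explained by a quadratic in x.

    Near 0: no visible curvature. Near 1: strong curvature. Hypothetical helper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Step 1: fit a straight line and keep its residuals.
    X1 = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    # Step 2: regress those residuals on 1, x, x^2.
    X2 = np.column_stack([np.ones_like(x), x, x**2])
    gamma, *_ = np.linalg.lstsq(X2, resid, rcond=None)
    leftover = resid - X2 @ gamma
    return 1.0 - (leftover @ leftover) / (resid @ resid)

# Synthetic exponential growth with multiplicative noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=300)
y = np.exp(0.5 + 0.6 * x + rng.normal(0, 0.05, size=300))

raw = curvature_r2(x, y)             # straight line on y
logged = curvature_r2(x, np.log(y))  # straight line on log(y)
print("curvature before log:", raw)
print("curvature after log:", logged)
```

If the value stays high after every reasonable transformation, that is the signal to move to a more flexible model.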
