OLS gives unbiased coefficient estimates under a set of assumptions. Violations don't always invalidate the model — but you should know which ones bite and how badly.
The assumptions (Gauss-Markov)
1. Linearity: $E[y \mid X] = X\beta$, i.e. the conditional mean is linear in the predictors.
2. Independence: errors are uncorrelated with one another.
3. Homoscedasticity: errors have equal variance regardless of $X$.
4. No perfect multicollinearity: $X^\top X$ is invertible.
5. Normality of errors (needed only for exact small-sample inference, not for estimation; strictly an addition to the Gauss-Markov conditions).
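In matrix form the assumptions can be written compactly. The symbols here ($y$, $X$, $\beta$, $\varepsilon$, $\sigma^2$) follow the usual textbook convention and are an assumption about notation, not something the text fixes:

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid X] = 0
  && \text{(linearity)} \\
\operatorname{Var}(\varepsilon \mid X) &= \sigma^2 I
  && \text{(uncorrelated, equal-variance errors)} \\
\hat{\beta} &= (X^\top X)^{-1} X^\top y
  && \text{(requires } X^\top X \text{ invertible)}
\end{aligned}
```

The off-diagonal zeros of $\sigma^2 I$ encode independence and the constant diagonal encodes homoscedasticity, which is why the two are often diagnosed together.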
Diagnostic plots
- Residuals vs fitted: should look like noise. Patterns (curve, fan) suggest nonlinearity or heteroscedasticity.
- Q-Q plot: residual quantiles vs normal quantiles. Should be a straight line. Bowed = skew; S-shape = tails heavier (or lighter) than normal.
- Scale-location: $\sqrt{|\text{standardized residual}|}$ vs fitted. A trend (slope ≠ 0) suggests heteroscedasticity.
- Leverage / Cook's distance: identifies influential outliers — points that disproportionately move the fit.
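The quantities behind these plots can be computed by hand for a one-predictor model. The following is a minimal sketch using only the standard library; the function names (`simple_ols`, `diagnostics`) and the toy data are hypothetical, and a real analysis would more likely rely on a statistics package and actual plots.

```python
import math


def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares and return (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def diagnostics(x, y):
    """Return the per-observation quantities used in the four plots."""
    n, p = len(x), 2  # p = number of coefficients (intercept + slope)
    b0, b1 = simple_ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]

    # Leverage: diagonal of the hat matrix, closed form for one predictor.
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Standardized residuals: residual divided by its estimated std. dev.
    s2 = sum(e ** 2 for e in resid) / (n - p)
    std_resid = [e / math.sqrt(s2 * (1 - h)) for e, h in zip(resid, leverage)]

    # Cook's distance combines residual size and leverage.
    cooks = [r ** 2 * h / (p * (1 - h)) for r, h in zip(std_resid, leverage)]

    return {
        "fitted": fitted,                                        # x-axis of most plots
        "resid": resid,                                          # residuals vs fitted
        "std_resid": std_resid,                                  # Q-Q plot input
        "scale_loc": [math.sqrt(abs(r)) for r in std_resid],     # scale-location
        "leverage": leverage,
        "cooks": cooks,
    }


# Toy data: roughly y = 2x + 1, with the last point pulled upward.
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9, 19.2, 30.0]
diag = diagnostics(x, y)
most_influential = max(range(len(x)), key=lambda i: diag["cooks"][i])
print("most influential observation index:", most_influential)
```

Plotting `resid` against `fitted`, or `scale_loc` against `fitted`, gives the first and third plots in the list; the leverages always sum to the number of coefficients, which is a quick sanity check on the computation.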
Common violations and fixes
- Nonlinearity: add polynomial terms, log-transform, or use spline / tree-based models.
- Heteroscedasticity: use robust (White) standard errors; coefficients stay unbiased, but the default standard errors are no longer valid.
- Correlated errors: for time-series data, model the errors as autoregressive; for panel data, use random or fixed effects.
- Outliers: investigate, don't auto-drop. Sometimes they're the most important observations; sometimes they're data entry errors.
- Multicollinearity: drop redundant predictors, combine them (PCA), or regularize (ridge/lasso).
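To make the heteroscedasticity fix concrete, here is a minimal sketch that computes both the classical and the White (HC0) standard error of the slope in a one-predictor model. The function name `slope_standard_errors` and the fan-shaped toy data are hypothetical, and HC0 is only one of several robust variants; the coefficient estimate itself is identical under both.

```python
import math


def slope_standard_errors(x, y):
    """Return (b1, classical_se, white_se) for the model y = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    dx = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in dx)

    # The coefficient estimate does not depend on the variance assumption.
    b1 = sum(d * (yi - y_bar) for d, yi in zip(dx, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

    # Classical: pools all residuals into one common variance estimate.
    s2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(s2 / sxx)

    # White (HC0): each observation contributes its own squared residual.
    white_se = math.sqrt(sum(d * d * e * e for d, e in zip(dx, resid))) / sxx

    return b1, classical_se, white_se


# Toy data with a fan shape: the noise amplitude grows with x.
x = list(range(1, 11))
noise = [(-1) ** i * 0.3 * xi for i, xi in enumerate(x)]
y = [1 + 2 * xi + ni for xi, ni in zip(x, noise)]

b1, classical_se, white_se = slope_standard_errors(x, y)
print(f"slope={b1:.3f} classical_se={classical_se:.4f} white_se={white_se:.4f}")
```

The two standard errors can differ in either direction, especially in small samples; the point of the robust version is that its validity does not rest on the equal-variance assumption.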
When to give up on linearity
If residuals show clear curvature, log-transform or add polynomial terms. If a single transformation doesn't fix it, the problem is probably structural — use a tree-based or kernel method instead.
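One way to check whether a single transformation is enough is to fit the line before and after transforming and compare the residuals. The sketch below uses hypothetical toy data that is exactly linear in log(x), so the transformation works perfectly here; with real data the improvement would be partial at best, and the helper name `line_residuals` is an illustration only.

```python
import math


def line_residuals(x, y):
    """Fit y = b0 + b1*x by least squares and return the residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Toy data: y is linear in log(x), so a straight line in x fits poorly.
x = [1, 2, 4, 8, 16, 32, 64]
y = [2 + 3 * math.log(xi) for xi in x]

resid_raw = line_residuals(x, y)                         # fit on raw x
resid_log = line_residuals([math.log(xi) for xi in x], y)  # fit on log(x)

rss_raw = sum(e * e for e in resid_raw)
rss_log = sum(e * e for e in resid_log)

# Curvature shows up as a run of same-signed residuals: both ends on one
# side of the line, the middle on the other.
print("raw residual signs:", ["+" if e > 0 else "-" for e in resid_raw])
print(f"RSS raw x: {rss_raw:.3f}   RSS log x: {rss_log:.3e}")
```

If the residuals after transformation still show a systematic sign pattern, that is the signal the text describes for moving to a more flexible model.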