Logistic Regression — Section 2: Linear Models

Logistic regression predicts $P(y=1 \mid x) = \sigma(\beta^T x)$, where $\sigma(z) = 1/(1+e^{-z})$. Equivalently, the log-odds are linear in $x$:

$$\log \frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \beta^T x$$
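
As a quick check that the two forms agree, here is a minimal sketch in plain Python. The coefficient and feature values are made up for illustration, and treating the first feature as a constant 1 (so the first coefficient acts as an intercept) is an assumption, not something fixed by the notes above.

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_odds(p):
    """Inverse of the sigmoid: map a probability back to log-odds."""
    return math.log(p / (1.0 - p))

def predict_proba(beta, x):
    """P(y=1 | x) for a coefficient vector beta and feature vector x."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)

beta = [-1.0, 2.0]   # hypothetical coefficients; the first acts as an intercept
x = [1.0, 0.75]      # leading 1.0 is the constant feature

p = predict_proba(beta, x)   # beta^T x = -1.0 + 2.0 * 0.75 = 0.5
recovered = log_odds(p)      # should give back beta^T x
```

Applying `log_odds` to the predicted probability recovers $\beta^T x$ up to floating-point error, which is the equivalence stated above.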

Why log-odds

Log-odds are unbounded, $(-\infty, \infty)$, which is where a linear function naturally lives. Probabilities are bounded, $[0,1]$, which is where the data lives. The sigmoid is the bridge.

A coefficient $\beta_j$ has a clean interpretation: increasing $x_j$ by 1, with the other features held fixed, multiplies the odds of $y=1$ by $e^{\beta_j}$. For example, $\beta_j = \ln 2 \approx 0.69$ doubles the odds, and $\beta_j = 0$ leaves them unchanged.
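
The odds interpretation can be verified numerically. The sketch below uses a hypothetical one-feature model with an intercept of $-2$ and a slope of $\ln 2$; both values are chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of the positive class for a probability p."""
    return p / (1.0 - p)

beta0 = -2.0          # hypothetical intercept
beta1 = math.log(2)   # slope chosen so that e^{beta1} = 2

# Odds of y=1 at x = 0, 1, 2, 3 under this model.
odds_by_x = [odds(sigmoid(beta0 + beta1 * x)) for x in range(4)]

# Ratio of consecutive odds: each unit step in x should multiply the odds by 2.
ratios = [odds_by_x[i + 1] / odds_by_x[i] for i in range(3)]
```

Every entry of `ratios` comes out at 2 (up to rounding), regardless of the starting value of `x`. The change in probability, by contrast, depends on where you start, which is why coefficients are read on the odds scale.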

Loss and optimization

Cross-entropy (negative log-likelihood):

$$L = -\sum_i \left[ y_i \log \hat{p}_i + (1-y_i) \log(1-\hat{p}_i) \right]$$

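This loss is convex, so any local minimum is a global one: no local-optima drama. The minimizer is unique when the features are not collinear, and it is finite only when the classes are not perfectly separable. Solve with gradient descent, IRLS, or coordinate descent. Adding an $L_2$ penalty (the most common choice in practice) makes the objective strongly convex, which guarantees a unique, finite minimizer.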
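
A minimal sketch of batch gradient descent on the $L_2$-penalized cross-entropy, in plain Python. The toy data, the penalty strength `lam`, the step size `lr`, and the choice to penalize every coefficient (including the one multiplying the constant feature) are all assumptions made for illustration. The gradient used is $\sum_i (\hat{p}_i - y_i)\, x_i + \lambda \beta$.

```python
import math

def sigmoid(z):
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def loss(beta, X, y, lam):
    """Cross-entropy plus (lam / 2) * ||beta||^2."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(beta, xi))
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total + 0.5 * lam * sum(b * b for b in beta)

def gradient(beta, X, y, lam):
    """Gradient of the penalized loss: sum_i (p_i - y_i) x_i + lam * beta."""
    grad = [lam * b for b in beta]
    for xi, yi in zip(X, y):
        residual = sigmoid(dot(beta, xi)) - yi
        for j, v in enumerate(xi):
            grad[j] += residual * v
    return grad

def fit(X, y, lam=0.1, lr=0.05, steps=5000):
    """Plain batch gradient descent starting from beta = 0."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(beta, X, y, lam)
        beta = [b - lr * gj for b, gj in zip(beta, g)]
    return beta

# Toy data: a constant feature plus one real feature; classes overlap,
# so the problem is not separable.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.0], [1.0, 3.5], [1.0, 4.5]]
y = [0, 0, 1, 0, 1, 1]

lam = 0.1
beta = fit(X, y, lam=lam)
initial_loss = loss([0.0, 0.0], X, y, lam)
final_loss = loss(beta, X, y, lam)
final_grad = gradient(beta, X, y, lam)
```

Because the penalized objective is strongly convex, the starting point does not matter: a small enough step size takes the iterates to the same minimizer, where `final_grad` is numerically zero. In practice a second-order method such as IRLS typically needs far fewer iterations.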

Why it's still everywhere

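Probabilities that are usually well calibrated out of the box, because the model is fit by maximum likelihood. SVMs output margins rather than probabilities, and tree-based models often need a separate calibration step.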
Interpretable coefficients in the right units (log-odds).
Fast to train even on huge sparse data: with one-hot features, computing $\beta^T x$ reduces to looking up and summing the weights of the active features.
Baseline that's surprisingly hard to beat in tabular settings when feature engineering is good.
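
To make the sparse-data point concrete, the sketch below scores one example two ways: as a dense dot product over a one-hot vector, and as a lookup over the active features only. The feature names, weights, and intercept are hypothetical.

```python
# Hypothetical learned weights, keyed by (field, value) one-hot features.
weights = {
    ("country", "DE"): 0.4,
    ("country", "US"): 0.1,
    ("device", "mobile"): -0.7,
    ("device", "desktop"): 0.2,
}
intercept = -1.2

def score_sparse(active_features):
    """Log-odds via lookup: sum the weights of the features that are 'on'."""
    return intercept + sum(weights.get(f, 0.0) for f in active_features)

def score_dense(active_features):
    """Same log-odds via an explicit one-hot vector and a full dot product."""
    columns = list(weights)  # fixed column order
    x = [1.0 if c in active_features else 0.0 for c in columns]
    return intercept + sum(weights[c] * xi for c, xi in zip(columns, x))

example = {("country", "DE"), ("device", "mobile")}
z_sparse = score_sparse(example)
z_dense = score_dense(example)
```

Both functions return the same log-odds, but the sparse version does work proportional to the number of active features rather than the full vocabulary size. That is what keeps training and scoring cheap when there are millions of one-hot columns.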

When it falls down

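Non-linear decision boundaries. The decision boundary is linear in the features, so the model cannot capture interactions or curvature on its own. Engineer interaction terms or use a non-linear model (trees, neural networks).

Class imbalance. When the positive class is rare, the intercept absorbs the low base rate, so predicted probabilities seldom cross 0.5 and the default cutoff flags few or no positives. Pick the decision threshold post hoc on held-out data rather than relying on the natural 0.5 cutoff.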
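
One way to set the threshold post hoc is to scan candidate cutoffs on held-out predictions and keep the one that maximizes a chosen metric. The sketch below uses F1 as that metric, which is an assumption; the right criterion depends on the relative cost of false positives and false negatives. The probabilities and labels are made-up validation data.

```python
def f1_at(threshold, probs, labels):
    """F1 score when predicting positive for every p >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels):
    """Try each distinct predicted probability as a cutoff; keep the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(t, probs, labels))

# Hypothetical held-out predictions for an imbalanced problem (3 of 10 positive).
probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.22, 0.30, 0.35, 0.41]
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]

f1_default = f1_at(0.5, probs, labels)   # no prediction reaches 0.5
chosen = best_threshold(probs, labels)
f1_chosen = f1_at(chosen, probs, labels)
```

On this toy data no predicted probability reaches 0.5, so the default cutoff finds no positives and scores an F1 of 0, while the scanned threshold recovers all three. The threshold should be chosen on data the model was not trained on, otherwise the selected cutoff will look better than it is.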