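Logistic regression predicts $P(y = 1 \mid x) = \sigma(w^\top x + b)$, where $\sigma(z) = 1 / (1 + e^{-z})$. Equivalently, the log-odds are linear:

$$\log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = w^\top x + b$$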
Why log-odds
Log-odds are unbounded, ranging over $(-\infty, \infty)$, which is where a linear function naturally lives. Probabilities are bounded in $[0, 1]$, which is where the data lives. The sigmoid is the bridge.
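To make the bridge concrete, here is a minimal sketch of the sigmoid and its inverse (the logit) using only the standard library. The function names are illustrative, and the sign branch is just one common way to keep `exp` from overflowing.

```python
import math

def sigmoid(z: float) -> float:
    """Map a log-odds value in (-inf, inf) to a probability."""
    # Branch on the sign so math.exp is only ever called on a non-positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p: float) -> float:
    """Map a probability strictly between 0 and 1 back to log-odds."""
    return math.log(p / (1.0 - p))

for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = sigmoid(z)
    # logit(sigmoid(z)) recovers z up to floating-point error.
    print(f"z={z:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

Any real-valued score goes in, a valid probability comes out, and the mapping is invertible away from the saturated ends.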
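A coefficient $w_j$ has a clean interpretation: increasing $x_j$ by 1, with the other features held fixed, multiplies the odds of $y = 1$ by $e^{w_j}$. The odds double at $w_j = \ln 2 \approx 0.69$, and so on.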
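A quick numerical check of that interpretation, as a sketch with a single feature. The coefficient `w = ln 2` and intercept `b = -1` are made-up values chosen so the odds ratio should come out to 2.

```python
import math

def odds(p: float) -> float:
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical one-feature model: w = ln 2, so each unit of x should double the odds.
w, b = math.log(2.0), -1.0

def predict(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Odds ratio for a one-unit increase in x; it does not depend on the starting x.
ratio_low = odds(predict(1.0)) / odds(predict(0.0))
ratio_high = odds(predict(3.0)) / odds(predict(2.0))
print(ratio_low, ratio_high, math.exp(w))
```

The probability change from a unit step depends on where you start; the odds ratio does not, which is why coefficients are read on the odds scale.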
Loss and optimization
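Cross-entropy (negative log-likelihood), with $p_i = \sigma(w^\top x_i + b)$:

$$L(w, b) = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$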
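This is convex, so any local minimum is a global one and there is no local-optima drama. Convexity alone does not guarantee a unique or even finite minimizer: on perfectly separable data the unregularized weights grow without bound. Adding L2 regularization (almost always done in practice) makes the objective strictly convex, which gives a unique global minimum. Solve with gradient descent, IRLS, or coordinate descent.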
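A minimal sketch of batch gradient descent on this objective, in pure Python. Several choices here are assumptions for illustration: the loss is averaged over examples, the L2 penalty `(l2 / 2) * ||w||^2` is applied to the weights but not the bias, and the toy data, step size, and iteration count are made up. Running it from two different starting points shows both landing on the same solution.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp is only called on a non-positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, l2=0.1, lr=0.5, steps=5000, w_init=None, b_init=0.0):
    """Batch gradient descent on mean cross-entropy + (l2 / 2) * ||w||^2."""
    n, d = len(X), len(X[0])
    w = list(w_init) if w_init is not None else [0.0] * d
    b = b_init
    for _ in range(steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = sigmoid(z) - yi  # d(loss_i)/dz for cross-entropy
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        # The penalty gradient l2 * w is added for the weights only.
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy, perfectly separable data: without the penalty the weight would keep growing.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

w_a, b_a = fit_logistic(X, y)                             # start at zero
w_b, b_b = fit_logistic(X, y, w_init=[5.0], b_init=-3.0)  # start far away
print(w_a, b_a)
print(w_b, b_b)
```

In practice a library solver would replace this loop; the point is only that the starting point does not matter once the objective is strictly convex.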
Why it's still everywhere
- Calibrated probabilities out of the box, since training directly maximizes likelihood. Trees and SVMs don't naturally give you these and typically need a separate calibration step.
- Interpretable coefficients in the right units (log-odds).
- Fast to train even on huge sparse data: with one-hot features, scoring a row is a table lookup and a sum over the weights of its active features.
- Baseline that's surprisingly hard to beat in tabular settings when feature engineering is good.
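The lookup-table point can be sketched as follows. With one-hot features, the dot product reduces to summing the weights of whichever features are active. The feature names, weights, and bias are invented for illustration and do not come from a trained model.

```python
import math

# Hypothetical learned weights, one per one-hot feature; absent features contribute 0.
weights = {"country=DE": 0.8, "device=mobile": -0.3, "plan=pro": 1.2}
bias = -1.0

def predict_proba(active_features):
    """Score a row given only the names of its active one-hot features."""
    # Cost is proportional to the number of active features, not the vocabulary size.
    z = bias + sum(weights.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-z))

print(predict_proba(["country=DE", "plan=pro"]))
print(predict_proba(["device=mobile"]))
print(predict_proba([]))
```

The same property makes gradient updates cheap: each example only touches the weights of its active features.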
When it falls down
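Non-linear decision boundaries. The model is fundamentally linear in its features, so it cannot separate classes that need a curved or disjoint boundary. Engineer interactions or use a non-linear model (trees, networks).

Class imbalance. A rare positive class pulls the intercept strongly negative, so predicted probabilities seldom reach 0.5. Fix the decision threshold post hoc, for example on a validation set, and don't rely on the natural 0.5 cutoff.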
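One way to pick a threshold post hoc is to scan candidate cutoffs on held-out predictions and keep the one that maximizes a chosen metric. The sketch below uses F1 as that metric, which is an assumption; the right metric depends on the costs of false positives and false negatives. The scores and labels are made up to mimic an imbalanced validation set.

```python
def f1_at(threshold, probs, labels):
    """F1 score when predicting positive for every prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(probs, labels):
    """Try each distinct predicted probability as a cutoff and keep the best F1."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda t: f1_at(t, probs, labels))

# Made-up validation scores: 2 positives out of 10, none scored above 0.5.
val_probs = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.30, 0.35, 0.40, 0.45]
val_labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

t = best_threshold(val_probs, val_labels)
print("F1 at 0.5:", f1_at(0.5, val_probs, val_labels))
print("best threshold:", t, "F1:", f1_at(t, val_probs, val_labels))
```

On this toy set the 0.5 cutoff flags nothing as positive, while a lower cutoff recovers both positives at the cost of one false alarm.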