Most classifiers output a score in $[0, 1]$ that is interpreted as a probability. That raises two separate questions:
1. Is the score actually calibrated? (When the model says 0.7, are those examples really positive about 70% of the time?)
2. What threshold should you use for decisions?
Calibration
A model is well calibrated if, among examples with predicted probability $p$, the actual rate of positives is also $p$, for every value of $p$.
- Logistic regression with proper training is usually well calibrated.
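- Tree ensembles are often miscalibrated, and the direction varies: random forests tend to be under-confident because averaging many trees pulls predictions away from 0 and 1, while boosted models trained hard on log loss can be over-confident and push predictions toward 0 and 1 more than warranted.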
- Neural networks with cross-entropy can be over-confident, especially after heavy training.
Diagnose with a reliability plot: bin predictions by confidence, plot mean predicted probability vs actual fraction positive. A perfect diagonal means well-calibrated.
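
The binning step behind a reliability plot can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `reliability_bins`, the equal-width binning, and the toy data are all assumptions made for the example, and it expects probabilities already in the range 0 to 1.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Return one (mean_predicted, fraction_positive, count) tuple per non-empty bin."""
    sums = [0.0] * n_bins      # sum of predicted probabilities per bin
    positives = [0] * n_bins   # number of positive labels per bin
    counts = [0] * n_bins      # number of examples per bin
    for y, p in zip(y_true, y_prob):
        # Equal-width bins on [0, 1]; p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        positives[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], positives[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Toy data (hypothetical): the model says 0.1 for five examples, one of which
# is positive, and 0.9 for five examples, four of which are positive.
y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y_prob = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1]

for mean_pred, frac_pos, count in reliability_bins(y_true, y_prob, n_bins=5):
    print(f"predicted {mean_pred:.2f}  observed {frac_pos:.2f}  n={count}")
```

Plotting the first element of each tuple against the second gives the reliability curve. Keeping the per-bin count is useful because sparsely populated bins produce noisy points.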
Calibration fixes
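- Platt scaling: fit a one-feature logistic regression from the model's raw score to the true label, using a held-out set. Two parameters (slope and intercept), simple, and it works well when the miscalibration is roughly sigmoid-shaped.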
- Isotonic regression: fit a monotone step function from score to probability. More flexible than Platt; needs more data.
- Both are post-hoc — you train your model once, then learn the mapping on validation data.
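
Both fixes are small enough to sketch without a library. The code below is an illustrative, dependency-free version under stated assumptions: `fit_platt` is a plain maximum-likelihood fit by gradient descent, `fit_isotonic` uses the pool-adjacent-violators procedure, and all function names, hyperparameters, and the held-out scores are made up for the example. In practice a library implementation is the safer choice.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators; returns (right_edge, probability) steps."""
    blocks = []  # each block: [sum_of_labels, count, largest_score_in_block]
    for s, y in sorted(zip(scores, labels)):
        if blocks and blocks[-1][2] == s:
            # Tied scores must share one probability, so pool them directly.
            blocks[-1][0] += y
            blocks[-1][1] += 1
        else:
            blocks.append([float(y), 1, s])
        # Merge backwards while a block's mean is below its predecessor's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, right = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right
    return [(right, total / count) for total, count, right in blocks]


def predict_isotonic(steps, score):
    """Step-function lookup: use the first block whose right edge covers the score."""
    for right, prob in steps:
        if score <= right:
            return prob
    return steps[-1][1]


# Hypothetical held-out raw scores (e.g. margins) and their true labels.
val_scores = [-3.0, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 3.0]
val_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(val_scores, val_labels)
steps = fit_isotonic(val_scores, val_labels)
print("Platt at score 0.0:   ", round(sigmoid(a * 0.0 + b), 3))
print("Isotonic at score 0.0:", round(predict_isotonic(steps, 0.0), 3))
```

The fitted mapping is then applied to the model's scores at prediction time. The data used to fit it should not be the data used to report final metrics.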
Decision thresholds
Default 0.5 is rarely optimal. The optimal threshold depends on:
- Class prevalence
- Relative cost of FP vs FN
- The downstream decision (suggest vs auto-act)
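
To see how relative costs move the threshold, here is a short worked derivation under simplifying assumptions: the probability $p$ is calibrated, correct decisions cost nothing, a false positive costs $c_{FP}$, and a false negative costs $c_{FN}$. These cost symbols are introduced only for this illustration. Predicting positive has expected cost $(1 - p)\,c_{FP}$ and predicting negative has expected cost $p\,c_{FN}$, so predicting positive is cheaper when:

```latex
(1 - p)\, c_{FP} < p\, c_{FN}
\quad\Longleftrightarrow\quad
p > t^{*} = \frac{c_{FP}}{c_{FP} + c_{FN}}
```

With equal costs this gives $t^{*} = 0.5$. If a false negative is nine times as costly as a false positive, $t^{*} = 1 / (1 + 9) = 0.1$. The formula is only as good as the calibration of $p$.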
Sweep the threshold on a validation set, optimizing the actual metric you care about: F1, precision at a fixed recall, or expected utility. For class imbalance, moving the threshold is the cheapest fix and often matches or beats more elaborate resampling techniques, so try it first.
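
A threshold sweep can be as simple as the sketch below. The helper names `f1_at_threshold` and `best_threshold` and the imbalanced toy data are assumptions for illustration. F1 is used here only because it is one of the metrics named above; any function with the same signature could be passed as `metric`.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when predicting positive for every p >= threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, metric=f1_at_threshold):
    """Return the candidate threshold with the highest metric value."""
    # Only thresholds equal to an observed score can change the predictions.
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: metric(y_true, y_prob, t))


# Hypothetical imbalanced validation set: 2 positives out of 10,
# and no score ever reaches 0.5.
val_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
val_prob = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.25, 0.3, 0.35, 0.4]

t = best_threshold(val_true, val_prob)
print("F1 at 0.5:", f1_at_threshold(val_true, val_prob, 0.5))
print("best threshold:", t, "F1:", f1_at_threshold(val_true, val_prob, t))
```

On this toy data the default threshold of 0.5 predicts no positives at all, while the swept threshold recovers both. The chosen threshold should then be checked on separate test data, because it was tuned on the validation set.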
When calibrated probabilities matter
- Risk scoring (insurance, medical) — you need to *act on* the probability, not just rank.
- Ensembling / stacking — uncalibrated scores combine badly.
- Anomaly detection with a precision target — calibration directly affects how many alerts you generate.
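When calibration doesn't matter (and you can ignore it): pure ranking tasks like search and recommendation, where you only need the relative order to be right. Ads CTR belongs here only if the score is used purely for ordering; once the predicted CTR is multiplied by a bid or fed into a budget, its actual value matters and calibration is back on the table.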
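
The ranking case can be checked directly: any strictly increasing transform of the scores changes the probabilities but not the order. The sketch below uses pairwise AUC as a stand-in ranking metric. The metric choice, the `auc` helper, and the data are assumptions for illustration.

```python
def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 0, 1, 1]
scores = [0.55, 0.60, 0.62, 0.70, 0.80, 0.95]

# Cubing is strictly increasing on [0, 1]: the values move a lot,
# so calibration changes, but the ordering is untouched.
distorted = [s ** 3 for s in scores]

print("AUC original: ", auc(y_true, scores))
print("AUC distorted:", auc(y_true, distorted))
print("mean score original: ", sum(scores) / len(scores))
print("mean score distorted:", sum(distorted) / len(distorted))
```

The two AUC values are identical while the mean predicted score shifts substantially, which is why a purely order-based consumer of the scores is unaffected by miscalibration.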