Calibration and Decision Thresholds — Section 7: Evaluation and Model Selection

Most classifiers output a score in $[0, 1]$ that's interpreted as a probability. That raises two separate questions:

1. Is the score actually calibrated? (When the model says 0.7, are about 70% of those examples really positive?)
2. What threshold should you use for decisions?

Calibration

A model is well-calibrated if, among examples with predicted probability $p$, the actual rate of positives is $p$.

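Logistic regression is usually well calibrated out of the box because it directly minimizes log loss; strong regularization or class reweighting can distort its probabilities.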
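Tree ensembles are often miscalibrated, and the direction depends on the method: bagged trees and random forests average many trees and so rarely predict values near 0 or 1 (under-confident), while boosted models trained for many rounds can push predictions toward 0 and 1 more than warranted (over-confident).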
Neural networks with cross-entropy can be over-confident, especially after heavy training.

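Diagnose with a reliability plot: bin predictions by predicted probability, then plot each bin's mean predicted probability against its actual fraction of positives. Points on the diagonal mean the model is well calibrated; points below it mean the model over-predicts the positive rate, and points above it mean it under-predicts.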
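
The binning step behind a reliability plot can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `reliability_bins`, the equal-width binning, and the toy data are all assumptions made for the example.

```python
def reliability_bins(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns one (mean_predicted, fraction_positive, count) tuple per
    non-empty bin, ordered from the lowest bin to the highest.
    """
    sums = [0.0] * n_bins      # sum of predicted probabilities per bin
    positives = [0] * n_bins   # number of positive labels per bin
    counts = [0] * n_bins      # number of examples per bin
    for y, p in zip(y_true, y_prob):
        # p == 1.0 would index past the end, so clamp it into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        sums[b] += p
        positives[b] += y
        counts[b] += 1
    return [
        (sums[b] / counts[b], positives[b] / counts[b], counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    ]


# Toy data (made up for illustration): the model says 0.9 for ten examples
# but only six are positive, so that bin falls below the diagonal.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0] + [0, 0, 0, 0, 1]
y_prob = [0.9] * 10 + [0.2] * 5

for pred, obs, count in reliability_bins(y_true, y_prob, n_bins=5):
    print(f"predicted={pred:.2f} observed={obs:.2f} n={count}")
```

Plotting the first tuple element against the second gives the reliability plot. In practice, scikit-learn's `calibration_curve` computes the same per-bin quantities.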

Calibration fixes

Platt scaling: fit a logistic regression $P(y=1 \mid \text{score}) = \sigma(a \cdot \text{score} + b)$ on a held-out set. It has only two parameters, so it is simple and works with little data, but it only corrects sigmoid-shaped miscalibration.
Isotonic regression: fit a monotone step function from score to probability. More flexible than Platt; needs more data.
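Both are post-hoc: you train your model once, then learn the mapping on held-out validation data. Fitting the mapping on the training set would calibrate against scores the model has already overfit to.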
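
Both fixes can be sketched in plain Python. The following is a rough illustration under stated assumptions, not a production implementation: `fit_platt` uses ordinary gradient descent rather than Platt's original optimizer, `fit_isotonic` is a bare pool-adjacent-violators pass that assumes all scores are distinct, and the names, hyperparameters, and held-out data are made up for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.5, steps=5000):
    """Fit P(y=1 | score) = sigmoid(a * score + b) by minimizing log loss.

    Plain batch gradient descent; lr and steps are illustrative, not tuned.
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # gradient of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def fit_isotonic(scores, labels):
    """Pool adjacent violators; returns sorted (score, probability) pairs.

    Assumes distinct scores (ties are not pooled in this sketch).
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block is [sum_of_labels, count]
    for _, y in pairs:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [(s, p) for (s, _), p in zip(pairs, fitted)]


# Made-up held-out scores and labels, for illustration only.
held_out_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95]
held_out_labels = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(held_out_scores, held_out_labels)
platt_probs = [sigmoid(a * s + b) for s in held_out_scores]
isotonic_map = fit_isotonic(held_out_scores, held_out_labels)
```

The Platt output is a smooth curve through the scores, while the isotonic output is a step function with flat runs wherever neighbouring labels were out of order. For real use, scikit-learn's `CalibratedClassifierCV` offers both via `method="sigmoid"` and `method="isotonic"`.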

Decision thresholds

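The default threshold of 0.5 is rarely optimal: it minimizes expected cost when the scores are calibrated and false positives and false negatives cost the same, and those conditions rarely both hold. The optimal threshold depends on: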

Class prevalence
Relative cost of FP vs FN
The downstream decision (suggest vs auto-act)
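
As a worked illustration of the cost factor, assume the score is a calibrated probability $p$, a false positive costs $c_{FP}$, a false negative costs $c_{FN}$, and correct decisions cost nothing (these symbols are introduced here for the example). Predicting positive is then the cheaper action exactly when the expected cost of a missed positive exceeds the expected cost of a false alarm:

$$
p \, c_{FN} > (1 - p) \, c_{FP} \quad\Longleftrightarrow\quad p > \frac{c_{FP}}{c_{FP} + c_{FN}}
$$

For example, if a false negative is nine times as costly as a false positive, the threshold drops to $1 / (1 + 9) = 0.1$.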

Sweep the threshold on a validation set, optimizing the metric you actually care about: F1, precision at a fixed recall, or expected utility. For class imbalance, threshold moving alone often matches or beats more elaborate resampling techniques.
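
A threshold sweep might look like the sketch below. The helper names, the cost values in `utility`, and the validation data are assumptions made for illustration. The sweep tries each distinct predicted score as a candidate threshold and keeps the one that maximizes the chosen metric.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Count (TP, FP, FN, TN) when predicting positive for y_prob >= threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn, tn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def utility(tp, fp, fn, tn, cost_fp=5.0, cost_fn=1.0):
    """Negative total cost; the two cost values are made up for the example."""
    return -(cost_fp * fp + cost_fn * fn)


def best_threshold(y_true, y_prob, metric):
    """Try every distinct score as a threshold; ties go to the lowest one."""
    candidates = sorted(set(y_prob))
    return max(
        candidates,
        key=lambda t: metric(*confusion_counts(y_true, y_prob, t)),
    )


# Made-up imbalanced validation set: 3 positives out of 12.
y_prob = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60]
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1]

print("best threshold for F1:     ", best_threshold(y_true, y_prob, f1_score))
print("best threshold for utility:", best_threshold(y_true, y_prob, utility))
```

On this toy data the two metrics pick different thresholds, and neither picks 0.5. Because the threshold is tuned on the validation set, the resulting metric should be confirmed on a separate test set.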

When calibrated probabilities matter

Risk scoring (insurance, medical) — you need to *act on* the probability, not just rank.
Ensembling / stacking — uncalibrated scores combine badly.
Anomaly detection with a precision target — calibration directly affects how many alerts you generate.

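When calibration doesn't matter (and you can ignore it): pure ranking tasks like search and recommendation, where you only need the relative order to be right. Ads CTR prediction is a common exception: when the predicted CTR is multiplied by a bid to rank or price ads, the probability itself has to be accurate.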
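
The difference between ranking and acting on a probability can be shown with a small sketch. The scores, the squaring distortion, and the loss of 1000 per positive event are all made up for illustration. Any strictly increasing distortion leaves the ranking untouched but changes every quantity computed from the probabilities themselves.

```python
def ranking(scores):
    """Indices of the examples ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical calibrated probabilities for four items.
calibrated = [0.10, 0.40, 0.70, 0.90]
# A strictly increasing distortion (squaring) that breaks calibration.
distorted = [p ** 2 for p in calibrated]

# The order of the items is identical under both sets of scores.
print("same ranking:", ranking(calibrated) == ranking(distorted))

# Expected total loss, assuming a made-up loss of 1000 per positive event.
loss_per_event = 1000
print("expected loss (calibrated):", sum(p * loss_per_event for p in calibrated))
print("expected loss (distorted): ", sum(p * loss_per_event for p in distorted))
```

A ranking metric cannot tell the two score sets apart, while the expected-loss estimate from the distorted scores is substantially too low.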