Accuracy alone is a poor metric for most real classification problems — especially with imbalanced classes. The right metric depends on the costs of different errors.
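To make the imbalance problem concrete, here is a minimal sketch with made-up data (990 negatives, 10 positives) and a hypothetical "classifier" that always predicts the majority class. The data and variable names are illustrative assumptions, not taken from any real system.

```python
# Hypothetical, made-up data: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of true positives the model actually found.
positives_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)          # 0.99
print(positives_caught)  # 0
```

The model scores 99% accuracy while finding none of the positives, which is why accuracy by itself says little here.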
Confusion matrix
For binary classification there are four cells: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Every threshold-based metric below (precision, recall, F1, and accuracy itself) is a ratio of these four counts.
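As a rough sketch, the four cells can be counted directly from paired labels and predictions. The helper name `confusion_counts` and the toy labels are assumptions for illustration, with 1 as the positive class and 0 as the negative class.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, FP, TN, FN) for binary labels where 1 = positive, 0 = negative."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1  # predicted positive, actually positive
        elif p == 1 and t == 0:
            fp += 1  # predicted positive, actually negative
        elif p == 0 and t == 0:
            tn += 1  # predicted negative, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, fp, tn, fn


# Toy example (made-up labels).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
```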
Precision and Recall
- Precision = TP / (TP + FP). Of the items we said were positive, what fraction actually are? Important when false positives are costly (spam detection — don't flag legit emails).
- Recall (sensitivity, TPR) = TP / (TP + FN). Of all the truly positive items, what fraction did we catch? Important when false negatives are costly (medical screening — don't miss the disease).
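The two definitions above can be sketched as small functions over the confusion-matrix counts. Returning 0.0 when the denominator is zero is a convention assumed here for illustration; other conventions (raising an error, returning NaN) are equally defensible.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 if nothing was predicted positive (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 if there are no actual positives (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Made-up counts: 30 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 30, 10, 20
print(precision(tp, fp))  # 0.75 -> 3 of every 4 flagged items are truly positive
print(recall(tp, fn))     # 0.6  -> we caught 60% of the truly positive items
```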
F1 score
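Harmonic mean of precision and recall: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. Useful when you care about both equally. $F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$ generalizes: $\beta > 1$ weighs recall more, $\beta < 1$ weighs precision more.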
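A minimal sketch of the weighted harmonic mean, assuming the standard F-beta definition (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to F1 at beta = 1. The function name `f_beta` and the zero-denominator fallback are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    # Assumed convention: return 0.0 when both precision and recall are zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Made-up scores: low precision, perfect recall.
p, r = 0.5, 1.0
print(round(f_beta(p, r, beta=1.0), 3))  # 0.667
print(round(f_beta(p, r, beta=2.0), 3))  # 0.833 (recall weighted more)
print(round(f_beta(p, r, beta=0.5), 3))  # 0.556 (precision weighted more)
```

Because recall is the stronger of the two scores in this example, raising beta raises the result and lowering beta reduces it.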
ROC and AUC
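The Receiver Operating Characteristic (ROC) curve plots TPR against FPR, where FPR = FP / (FP + TN), as the decision threshold varies. AUC is the area under this curve; it equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. AUC = 0.5 is random; 1.0 is perfect. AUC is threshold-independent, which is both its strength and its weakness: it summarizes ranking quality across all thresholds, but says nothing about performance at the one threshold you actually deploy.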
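The probabilistic reading of AUC can be checked directly by comparing every positive score with every negative score. This is an O(n*m) sketch for intuition only, not how libraries compute it; the function name `auc_pairwise`, the toy scores, and counting ties as half a win are assumptions made for this example.

```python
def auc_pairwise(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win (an assumed convention).
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up model scores for 3 positives and 4 negatives.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
print(round(auc_pairwise(pos, neg), 3))  # 0.917 (11 of 12 pairs ranked correctly)
```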
Precision-Recall curve
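PR curves are usually MORE informative than ROC curves for imbalanced data. When negatives vastly outnumber positives, FPR stays small even with many false positives, so a ROC curve can look great while precision is terrible. AUC-PR is the area under the PR curve.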
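A worked example of how a ROC operating point and precision can disagree under heavy imbalance. All counts below are made up for illustration (1,000 positives against 1,000,000 negatives at a single threshold).

```python
# Hypothetical confusion-matrix counts at one threshold.
tp, fn = 900, 100            # 1,000 actual positives
fp, tn = 10_000, 990_000     # 1,000,000 actual negatives

tpr = tp / (tp + fn)         # recall, the y-axis of the ROC curve
fpr = fp / (fp + tn)         # the x-axis of the ROC curve
precision = tp / (tp + fp)   # the y-axis of the PR curve

print(tpr)                   # 0.9
print(fpr)                   # 0.01
print(round(precision, 3))   # 0.083
```

A point at (FPR 0.01, TPR 0.9) sits near the top-left corner of a ROC plot, yet fewer than 1 in 10 flagged items is a true positive.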
Log loss
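$\text{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$, where $y_i \in \{0, 1\}$ is the true label and $p_i$ is the predicted probability of the positive class. Cares about CALIBRATION: penalizes confidently wrong predictions heavily. Used for probabilistic forecasting (weather, sports betting, advertising).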
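A minimal sketch of binary log loss, assuming the standard definition (the negative mean log-likelihood of the true labels under the predicted probabilities). The function name, the `eps` clipping value, and the toy inputs are assumptions for illustration; clipping only exists to avoid log(0).

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (1 = positive)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip probabilities away from 0 and 1 so log() stays finite (assumed eps).
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# One positive example, scored by two hypothetical models.
print(round(log_loss([1], [0.9]), 3))   # 0.105 (confident and right)
print(round(log_loss([1], [0.01]), 3))  # 4.605 (confident and wrong)
```

The confidently wrong prediction costs roughly 44 times more than the confidently right one, which is the heavy penalty described above.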
When accuracy is fine
Balanced classes, symmetric error costs, mostly-correct predictions. For most real problems, at least one of those fails and you need precision/recall/AUC instead.