Loss Functions — Section 1: Supervised Learning Foundations

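Your choice of loss isn't a technical detail: it defines what "correct" means for your model. Two models trained on the same data with different losses can make systematically different predictions. For example, one trained on squared error targets the conditional mean, while one trained on absolute error targets the conditional median.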

Regression losses

Squared error $L = (y - \hat{y})^2$ : penalizes large errors disproportionately. Sensitive to outliers. The conditional mean is the optimal predictor.
Absolute error $L = |y - \hat{y}|$ : outlier-robust. The conditional median is optimal.
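Huber $L_\delta = \frac{1}{2}(y-\hat{y})^2$ for $|y-\hat{y}| \leq \delta$, and $\delta(|y-\hat{y}| - \frac{1}{2}\delta)$ beyond: quadratic near zero, linear in the tails, with value and slope matching at $\delta$. A common choice for robust regression; $\delta$ sets where an error starts being treated as an outlier.
Quantile $L_\tau = \max(\tau e, (\tau-1)e)$ where $e = y - \hat{y}$: the optimal predictor is the conditional $\tau$-quantile. Use for asymmetric costs (overestimating delivery time vs underestimating); $\tau > 0.5$ penalizes under-prediction more heavily, $\tau < 0.5$ over-prediction.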
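
The "optimal predictor" claims above can be checked numerically. The following is a minimal sketch in NumPy, not a reference implementation: the helper `best_constant`, the toy data, and the grid resolution are all assumptions made for illustration. It searches for the single constant prediction that minimizes the average loss on a small sample containing one outlier.

```python
import numpy as np

def squared_error(y, y_hat):
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    return np.abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    # Quadratic for |e| <= delta, linear beyond; the two pieces meet at delta.
    e = np.abs(y - y_hat)
    return np.where(e <= delta, 0.5 * e**2, delta * (e - 0.5 * delta))

def quantile_loss(y, y_hat, tau=0.5):
    # Pinball loss with e = y - y_hat, matching the formula above.
    e = y - y_hat
    return np.maximum(tau * e, (tau - 1) * e)

def best_constant(loss, y, grid, **kwargs):
    """Hypothetical helper: grid value c minimizing the mean loss of predicting c everywhere."""
    mean_losses = [loss(y, c, **kwargs).mean() for c in grid]
    return grid[int(np.argmin(mean_losses))]

# Made-up sample with one outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0.0, 100.0, 10001)  # candidate constants, step 0.01

c_mse = best_constant(squared_error, y, grid)           # lands near the sample mean (22)
c_mae = best_constant(absolute_error, y, grid)          # lands near the sample median (3)
c_q90 = best_constant(quantile_loss, y, grid, tau=0.9)  # lands on a high quantile of the sample

print(c_mse, c_mae, c_q90)
```

The outlier drags the squared-error minimizer far from the bulk of the data, while the absolute-error minimizer stays at the median. That is the practical meaning of "sensitive to outliers" versus "outlier-robust".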

Classification losses

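Cross-entropy $L = -\sum_k y_k \log \hat{p}_k$ : the standard choice. It is a proper scoring rule, so its expected value is minimized by the true class probabilities; this encourages calibrated outputs but does not guarantee them, and trained models can still be over- or under-confident. The loss grows without bound as the probability assigned to the true class approaches 0, so confident mistakes produce a strong learning signal.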
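Hinge $L = \max(0, 1 - y \hat{f}(x))$ with labels $y \in \{-1, +1\}$ : SVM loss. Once an example is correctly classified with margin at least 1, the loss and its gradient are zero, so extra confidence is not rewarded. The score $\hat{f}(x)$ is not a probability.
Focal $L = -(1-\hat{p}_y)^\gamma \log \hat{p}_y$ with $\gamma \geq 0$ : down-weights easy examples (those with $\hat{p}_y$ near 1); $\gamma = 0$ recovers cross-entropy. Used for class imbalance, especially in detection.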
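
To make the differences between these three losses concrete, here is a small NumPy sketch. The function names, the example probabilities, and the scores are illustrative assumptions. `p_true` stands for $\hat{p}_y$, the probability the model assigns to the correct class, so the cross-entropy sum collapses to a single term for one-hot labels.

```python
import numpy as np

def cross_entropy(p_true):
    # For one-hot labels, -sum_k y_k log p_k reduces to -log p of the true class.
    return -np.log(p_true)

def hinge(y, score):
    # y must be -1 or +1; score is the raw model output f(x), not a probability.
    return np.maximum(0.0, 1.0 - y * score)

def focal(p_true, gamma=2.0):
    # Cross-entropy scaled by (1 - p_true)^gamma, which shrinks the loss on easy examples.
    return -((1.0 - p_true) ** gamma) * np.log(p_true)

# Made-up probabilities for the true class: easy, moderate, confidently wrong.
p = np.array([0.95, 0.6, 0.05])
ce = cross_entropy(p)
fl = focal(p, gamma=2.0)
ratio = fl / ce  # equals (1 - p)^gamma: the per-example down-weighting factor

# Hinge on made-up scores for a positive example (y = +1).
scores = np.array([2.5, 1.0, 0.3, -0.5])
hinge_vals = hinge(1.0, scores)  # zero once the margin y * f(x) reaches 1

print(ratio)
print(hinge_vals)
```

The ratio shows how the focal factor suppresses the easy example far more than the hard one, and the hinge values show that scores at or beyond the margin contribute nothing.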

Picking a loss

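Start from the business cost. If false positives and false negatives cost the same, cross-entropy with the default 0.5 decision threshold is fine. If they don't, either use a weighted loss or tune the decision threshold after training. Threshold tuning is cheaper and more flexible, because a change in costs needs no retraining. For calibrated probabilities, the cost-minimizing rule is to predict positive when $\hat{p} > c_{FP} / (c_{FP} + c_{FN})$, where $c_{FP}$ and $c_{FN}$ are the costs of a false positive and a false negative.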
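
Post-training threshold tuning can be sketched as a simple grid search on held-out data. Everything named here is an assumption for illustration: the helpers `expected_cost` and `tune_threshold`, the cost values, and the synthetic validation set, whose scores are calibrated by construction. In practice `p_pos` would be a trained model's predicted probabilities on a validation split.

```python
import numpy as np

def expected_cost(y_true, p_pos, threshold, cost_fp, cost_fn):
    """Average misclassification cost when predicting positive for p_pos >= threshold."""
    pred = p_pos >= threshold
    fp = np.sum(pred & (y_true == 0))   # predicted positive, actually negative
    fn = np.sum(~pred & (y_true == 1))  # predicted negative, actually positive
    return (cost_fp * fp + cost_fn * fn) / len(y_true)

def tune_threshold(y_true, p_pos, cost_fp, cost_fn, grid=None):
    """Hypothetical helper: threshold on the grid with the lowest average cost."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    costs = [expected_cost(y_true, p_pos, t, cost_fp, cost_fn) for t in grid]
    return grid[int(np.argmin(costs))]

# Synthetic validation set: labels drawn so that p_pos is calibrated by construction.
rng = np.random.default_rng(0)
p_pos = rng.uniform(size=5000)
y_true = (rng.uniform(size=5000) < p_pos).astype(int)

t_equal = tune_threshold(y_true, p_pos, cost_fp=1.0, cost_fn=1.0)
t_fn_heavy = tune_threshold(y_true, p_pos, cost_fp=1.0, cost_fn=5.0)

print(t_equal, t_fn_heavy)
```

When false negatives are made five times as costly, the selected threshold drops, so the model flags more positives. The model itself is untouched; only the decision rule changes, so the search can be rerun whenever the costs are revised.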
