Gradient boosting builds an ensemble *additively*: start with a constant prediction, then iteratively fit a new weak learner (typically a shallow tree) to the negative gradient of the loss with respect to the current predictions.
The recipe
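1. Initialize $F_0(x)$ to the optimal constant $\arg\min_c \sum_i L(y_i, c)$ (mean for MSE, log-odds for log-loss).
2. For $m = 1, \dots, M$:
   - Compute pseudo-residuals $r_{im} = -\left[\partial L(y_i, F(x_i)) / \partial F(x_i)\right]_{F = F_{m-1}}$.
   - Fit a tree $h_m$ to predict $r_{im}$.
   - Update $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$, where $\eta$ is the learning rate.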
For squared loss $L = \tfrac{1}{2}(y - F)^2$, the pseudo-residuals are the literal residuals $y_i - F_{m-1}(x_i)$, which is where the name comes from.
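The recipe can be sketched in plain Python for the squared-loss case. This is a minimal illustration, not a production implementation: it assumes 1-D inputs, uses depth-1 trees (stumps) as the weak learner, and runs on a made-up toy target $y = x^2$. All function names (`fit_stump`, `fit_gbm`, and so on) are hypothetical.

```python
# Minimal gradient boosting for squared loss on 1-D inputs.
# Weak learner: a regression stump (depth-1 tree). Names are illustrative.

def fit_stump(xs, targets):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # cannot split between equal x values
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        lv = sum(left) / len(left)    # leaf value = mean target in the leaf
        rv = sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2, lv, rv)
    if best is None:  # all x identical: fall back to a constant prediction
        mean = sum(targets) / len(targets)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gbm(xs, ys, n_trees=50, learning_rate=0.1):
    f0 = sum(ys) / len(ys)  # step 1: the optimal constant for MSE is the mean
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_trees):  # step 2
        # Pseudo-residuals: for squared loss these are plain residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrunken update of the current predictions.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return (f0, learning_rate, stumps)


def predict_gbm(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (an assumption for illustration): y = x^2 on a small grid.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]

model = fit_gbm(xs, ys, n_trees=100, learning_rate=0.1)
baseline_mse = mse(ys, [model[0]] * len(xs))             # constant-only model
boosted_mse = mse(ys, [predict_gbm(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

The training error of the boosted model should come out below that of the constant baseline, since each stump removes part of the remaining residual. Swapping in another loss would only change how `residuals` is computed and how `f0` is initialized.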
Why it works
Each tree is trained on what the current ensemble gets *wrong*. The learning rate $\eta$ (typically 0.05–0.1) prevents each tree from overcorrecting and lets the next one refine further. Many small steps tend to generalize better than a few big ones, the same intuition as SGD.
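A small worked equation makes the "many small steps" picture concrete. Write $F_m$ for the ensemble after $m$ trees, $h_m$ for the $m$-th tree, $\eta$ for the learning rate, and $r_{im} = y_i - F_{m-1}(x_i)$ for the squared-loss residual. As an idealization (an assumption for illustration; real shallow trees only approximate this), suppose each tree fits its residuals exactly on the training set, $h_m(x_i) = r_{im}$:

```latex
\begin{aligned}
r_{i,m+1} &= y_i - F_m(x_i) \\
          &= y_i - F_{m-1}(x_i) - \eta\, h_m(x_i) \\
          &= (1 - \eta)\, r_{im}, \\
\text{so}\quad r_{i,m+1} &= (1 - \eta)^{m}\, r_{i1}.
\end{aligned}
```

Under this assumption, $\eta = 1$ wipes out the training residual in a single step, which amounts to memorizing the training targets. With $\eta = 0.1$, about $0.9^{10} \approx 0.35$ of the initial residual is still left after ten trees, so later trees keep getting a say in the fit.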
Knobs that actually matter
- Number of trees $M$: too few → underfit; too many → overfit. Use early stopping on a validation set.
- Learning rate $\eta$: smaller usually generalizes better but needs more trees, so training is slower. Trade off against $M$.
- Tree depth: shallower trees (depth 3–8) limit overfitting; deeper trees capture higher-order feature interactions.
- Subsampling: stochastic gradient boosting trains each tree on a random fraction of the rows, which adds regularization and speeds up each iteration.
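As a rough illustration of how these knobs map onto a real library, here is a sketch using scikit-learn's `GradientBoostingRegressor`. It assumes scikit-learn is installed; the synthetic dataset and the specific values are placeholders chosen for the example, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data, purely for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    learning_rate=0.05,       # shrinkage applied to each tree
    max_depth=3,              # shallow trees
    subsample=0.8,            # each tree sees a random 80% of the rows
    validation_fraction=0.2,  # share of training data held out for early stopping
    n_iter_no_change=20,      # stop once the validation score stalls for 20 rounds
    random_state=0,
)
model.fit(X_train, y_train)

n_trees_used = model.n_estimators_  # number of trees actually kept
test_r2 = model.score(X_test, y_test)
print(n_trees_used, test_r2)
```

Setting a generous `n_estimators` and letting early stopping pick the actual count is one way to handle the trade-off between the number of trees and the learning rate: lower the learning rate and the stopping point should move later.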
Why it dominates tabular benchmarks
Trees handle non-linearities and feature interactions natively; boosting handles model selection by adding capacity incrementally; the learning rate decouples capacity from optimization. The combination is hard to beat on tabular data: even now, in 2026, XGBoost/LightGBM/CatBoost win most non-text, non-image competitions.