Gradient Boosting — Section 3: Tree Ensembles

Gradient boosting builds an ensemble *additively*: start with a constant prediction, then iteratively fit a new weak learner (typically a shallow tree) to the negative gradient of the loss with respect to the current predictions.

The recipe

1. Initialize $F_0(x) =$ optimal constant (mean for MSE, log-odds for log-loss).
2. For $m = 1, \dots, M$:
   - Compute pseudo-residuals $r_{im} = -\partial L(y_i, F_{m-1}(x_i)) / \partial F_{m-1}(x_i)$.
   - Fit a tree $h_m$ to predict $r_{im}$.
   - Update $F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$, where $\nu$ is the learning rate.
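The steps above can be sketched in plain Python for the squared-loss case. This is a minimal illustration under stated assumptions, not how production libraries implement boosting: inputs are 1-D, each tree is a depth-1 stump, and the names `fit_stump`, `mse`, and `gradient_boost`, as well as the synthetic sine data, are made up for the example.

```python
import math
import random


def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on 1-D inputs by least squares."""
    best = None  # (sum of squared errors, threshold, left value, right value)
    # Skipping the largest x guarantees the right leaf is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    if best is None:  # all x identical: fall back to a constant leaf
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Gradient boosting for squared loss; returns (predict, training MSE history)."""
    f0 = sum(ys) / len(ys)  # step 1: the optimal constant for MSE is the mean
    preds = [f0] * len(ys)
    trees = []
    history = [mse(ys, preds)]
    for _ in range(n_trees):
        # Pseudo-residuals: for squared loss these are plain residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        # Shrunken update: F_m = F_{m-1} + nu * h_m
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
        trees.append(tree)
        history.append(mse(ys, preds))

    def predict(x):
        return f0 + learning_rate * sum(tree(x) for tree in trees)

    return predict, history


# Synthetic data (an assumption for the demo): noisy sine curve.
rng = random.Random(0)
xs = [i / 20 for i in range(100)]
ys = [math.sin(x) + rng.gauss(0, 0.1) for x in xs]

predict, history = gradient_boost(xs, ys)
print(f"training MSE: {history[0]:.4f} -> {history[-1]:.4f}")
```

In this setup the training MSE stored in `history` should fall round by round. Training error alone does not say when to stop adding trees; that decision needs held-out data.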

For squared loss $L = \tfrac{1}{2}(y_i - F(x_i))^2$, the pseudo-residuals are the literal residuals $y_i - F(x_i)$, which is where the name "pseudo-residual" comes from.
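As a worked check of the pseudo-residual formula, here it is applied to two common losses. The squared loss uses the usual factor of $\tfrac{1}{2}$, and for log-loss the notation $p_i = 1 / (1 + e^{-F(x_i)})$ (with $F$ on the log-odds scale and $y_i \in \{0, 1\}$) is introduced here for the derivation.

$$
L = \tfrac{1}{2}\,(y_i - F(x_i))^2
\quad\Rightarrow\quad
r_i = -\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)
$$

$$
L = -\bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
\quad\Rightarrow\quad
r_i = -\frac{\partial L}{\partial F(x_i)} = y_i - p_i
$$

In both cases the pseudo-residual is "label minus current prediction", measured on the scale the loss cares about: raw values for regression, probabilities for classification.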

Why it works

Each tree is trained on what the current ensemble gets *wrong*. The learning rate $\nu$ (typically 0.05–0.1) shrinks each tree's contribution, so no single tree overcorrects and later trees can refine the fit. Many small steps tend to generalize better than a few big ones, the same intuition as using a small step size in SGD.
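A back-of-the-envelope calculation makes the "many small steps" picture concrete. It rests on an idealized assumption that real shallow trees do not satisfy: squared loss, and every tree $h_m$ reproducing the pseudo-residuals exactly on the training set, so $h_m(x_i) = y_i - F_{m-1}(x_i)$. Under that assumption the update rule gives

$$
y_i - F_m(x_i) = \bigl(y_i - F_{m-1}(x_i)\bigr) - \nu \bigl(y_i - F_{m-1}(x_i)\bigr) = (1 - \nu)\,\bigl(y_i - F_{m-1}(x_i)\bigr),
$$

and therefore

$$
y_i - F_M(x_i) = (1 - \nu)^M \,\bigl(y_i - F_0(x_i)\bigr).
$$

With $\nu = 0.1$, ten trees leave about $0.9^{10} \approx 0.35$ of the initial residual, and it takes 44 trees to get below 1%. A real tree captures only part of the residual, so the decay is slower in practice, but the geometric shape explains why a smaller $\nu$ has to be paid for with a larger $M$.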

Knobs that actually matter

Number of trees $M$: too few → underfit; too many → overfit. Use early stopping on a validation set.
Learning rate $\nu$: smaller values usually generalize better but need more trees, so training is slower. Tune it jointly with $M$.
Tree depth: shallower trees (depth 3–8) limit overfitting; deeper trees capture higher-order interactions.
Subsampling: stochastic gradient boosting trains each tree on a random fraction of the rows, which adds regularization and makes each round cheaper.
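Early stopping, mentioned above for choosing $M$, can be sketched as a small helper. This is an illustrative sketch, not any library's API: the function name `best_round`, the `patience` parameter, and the loss values are all hypothetical. It assumes you have already recorded the validation loss after each boosting round.

```python
def best_round(val_losses, patience=5):
    """Pick the number of trees to keep using patience-based early stopping.

    val_losses[m] is the validation loss after m trees (index 0 is the
    constant model). Scanning stops once the loss has failed to improve
    for `patience` consecutive rounds; the best round seen so far is returned.
    """
    best_m, best_loss = 0, val_losses[0]
    for m, loss in enumerate(val_losses):
        if loss < best_loss:
            best_m, best_loss = m, loss
        elif m - best_m >= patience:
            break  # no improvement for `patience` rounds: stop here
    return best_m


# Hypothetical validation curve: improves, plateaus, then improves once more.
losses = [1.00, 0.60, 0.40, 0.35, 0.36, 0.37, 0.39, 0.30]

print(best_round(losses, patience=3))  # → 3 (stops before seeing the late improvement)
print(best_round(losses, patience=5))  # → 7 (waits long enough to find it)
```

The two calls show the trade-off in `patience`: a small value saves training time but can stop on a temporary plateau, while a large value trains longer before giving up.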

Why it dominates tabular benchmarks

Trees handle non-linearities and feature interactions natively; boosting handles model selection by adding capacity incrementally, one tree at a time; the learning rate decouples capacity from optimization, since it controls how aggressively each tree is used rather than how expressive it is. The combination is hard to beat on tabular data: even now, in 2026, XGBoost, LightGBM, and CatBoost win most competitions that involve neither text nor images.
