XGBoost Internals — Section 3: Tree Ensembles

XGBoost is gradient boosting plus three ideas: a regularized objective, second-order optimization, and serious systems engineering. Understanding these three ideas explains most of what distinguishes modern GBMs (LightGBM, CatBoost) from classical gradient boosting.

Regularized objective

\text{Obj} = \sum_i L(y_i, \hat{y}_i) + \sum_t \Omega(f_t), \quad \Omega(f) = \gamma T + \frac{1}{2} \lambda ||w||^2

$T$ is the number of leaves, $w$ is the vector of leaf scores. The $\gamma T$ penalty discourages adding leaves (similar to pruning); the $\lambda ||w||^2$ shrinks leaf scores.
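
As a quick numeric check of the penalty term, here is a minimal Python sketch that evaluates $\Omega(f)$ for a single tree. The function name `tree_penalty` and the example values are illustrative assumptions, not part of any library.

```python
import numpy as np

def tree_penalty(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lam * ||w||^2 for one tree.

    leaf_scores is the vector w of leaf values; T is its length.
    """
    w = np.asarray(leaf_scores, dtype=float)
    return gamma * w.size + 0.5 * lam * float(np.dot(w, w))

# A 3-leaf tree with scores (0.5, -0.2, 0.1), gamma = 1.0, lambda = 2.0:
# 1.0 * 3 + 0.5 * 2.0 * (0.25 + 0.04 + 0.01) = 3.3
penalty = tree_penalty([0.5, -0.2, 0.1], gamma=1.0, lam=2.0)
```

With these values the leaf-count term (3.0) dominates the shrinkage term (0.3), which is why raising $\gamma$ tends to produce smaller trees rather than merely smaller leaf scores.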

Second-order Taylor expansion

Approximate the loss around the current predictions $\hat{y}^{(t-1)}$:

L(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) \approx L(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2

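where $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the prediction (the gradient and Hessian), evaluated at $\hat{y}_i^{(t-1)}$. Summing over the set $I_j$ of instances that fall in leaf $j$ gives a quadratic in the leaf score $w_j$, so the optimal leaf value has a closed form: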

w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}
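
The closed form can be checked numerically. The sketch below assumes logistic loss on raw (log-odds) scores, for which $g_i = p_i - y_i$ and $h_i = p_i (1 - p_i)$; the helper names `logistic_grad_hess` and `optimal_leaf_weight` are hypothetical, chosen for illustration.

```python
import numpy as np

def logistic_grad_hess(y, raw_score):
    """First and second derivatives of logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # predicted probability
    g = p - y                             # gradient g_i
    h = p * (1.0 - p)                     # Hessian h_i
    return g, h

def optimal_leaf_weight(g, h, lam):
    """w* = -sum(g) / (sum(h) + lambda) over the instances in one leaf."""
    return -np.sum(g) / (np.sum(h) + lam)

# Four instances in one leaf, all currently predicted at raw score 0 (p = 0.5).
y = np.array([1.0, 1.0, 0.0, 1.0])
raw = np.zeros(4)
g, h = logistic_grad_hess(y, raw)

# sum(g) = -1.0, sum(h) = 1.0, so w* = 1.0 / (1.0 + 1.0) = 0.5
w_star = optimal_leaf_weight(g, h, lam=1.0)
```

Note how $\lambda$ appears only in the denominator: a larger $\lambda$ pulls the leaf score toward zero, and the effect is strongest for leaves with a small Hessian sum.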

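Substituting $w_j^*$ back into the objective gives the gain from splitting a node into left ($L$) and right ($R$) children, where each sum runs over the instances in that child and the unsubscripted sums run over the parent: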

\text{Gain} = \frac{1}{2}\left[\frac{(\sum g_L)^2}{\sum h_L + \lambda} + \frac{(\sum g_R)^2}{\sum h_R + \lambda} - \frac{(\sum g)^2}{\sum h + \lambda}\right] - \gamma

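A split is kept only if its gain is positive, that is, only if the loss reduction exceeds the cost $\gamma$ of the extra leaf; splits with gain $\leq 0$ are pruned. Regularization is therefore built into the split criterion itself rather than applied as a separate heuristic.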
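
To make the gain formula concrete, here is a minimal sketch of exact greedy split finding on a single feature: sort once, then scan left to right while accumulating gradient and Hessian sums. The function names and the toy data are assumptions for illustration. The example uses squared error $\frac{1}{2}(y - \hat{y})^2$ with all current predictions at 0, so $g_i = -y_i$ and $h_i = 1$.

```python
import numpy as np

def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of a leaf with sums G and H."""
    return G * G / (H + lam)

def best_split(x, g, h, lam=1.0, gamma=0.0):
    """Scan one feature and return (threshold, gain) of the best split.

    Returns (None, 0.0) when no candidate has positive gain.
    """
    order = np.argsort(x, kind="stable")
    xs, gs, hs = x[order], g[order], h[order]
    G, H = gs.sum(), hs.sum()
    parent = leaf_score(G, H, lam)

    best_thr, best_gain = None, 0.0
    GL = HL = 0.0
    for i in range(len(xs) - 1):
        GL += gs[i]
        HL += hs[i]
        if xs[i] == xs[i + 1]:
            continue  # cannot split between equal feature values
        gain = 0.5 * (leaf_score(GL, HL, lam)
                      + leaf_score(G - GL, H - HL, lam)
                      - parent) - gamma
        if gain > best_gain:
            best_thr, best_gain = 0.5 * (xs[i] + xs[i + 1]), gain
    return best_thr, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
g, h = -y, np.ones_like(y)  # squared error, current prediction 0

thr, gain = best_split(x, g, h, lam=1.0, gamma=0.0)       # splits at 2.5
thr_pen, gain_pen = best_split(x, g, h, lam=1.0, gamma=0.5)  # no split survives
```

With $\gamma = 0$ the best threshold is 2.5 with gain $4/15$; raising $\gamma$ to 0.5 makes every candidate's gain non-positive, so the node stays a leaf.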

Systems tricks

Column block storage: features sorted once, then reused across iterations. Splits become linear scans.
Approximate split finding: a weighted quantile sketch (instances weighted by $h_i$) proposes a small set of candidate split points per feature. Needed for huge datasets where exact enumeration of every split is too expensive.
Sparse-aware splits: missing values get a "default direction" learned automatically. No need to impute.
Cache-aware prefetching, out-of-core training: industrial-scale capability that random-forest implementations historically lacked.
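
The "default direction" idea from the sparse-aware item can be sketched as follows: for a fixed threshold, evaluate the gain twice, once with all missing values sent left and once with them sent right, and keep the better option. This is a simplified illustration under stated assumptions (NaN marks a missing value, one fixed threshold, hypothetical function names), not XGBoost's actual implementation, which folds this choice into its scan over the non-missing entries.

```python
import numpy as np

def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) of a leaf with sums G and H."""
    return G * G / (H + lam)

def default_direction(x, g, h, threshold, lam=1.0, gamma=0.0):
    """Pick the side for missing values (NaN) that maximizes split gain."""
    missing = np.isnan(x)
    filled = np.where(missing, 0.0, x)       # placeholder; masked out below
    left = ~missing & (filled < threshold)
    right = ~missing & (filled >= threshold)

    G, H = g.sum(), h.sum()
    GL, HL = g[left].sum(), h[left].sum()
    GR, HR = g[right].sum(), h[right].sum()
    GM, HM = g[missing].sum(), h[missing].sum()
    parent = leaf_score(G, H, lam)

    # Option 1: missing values join the left child.
    gain_left = 0.5 * (leaf_score(GL + GM, HL + HM, lam)
                       + leaf_score(GR, HR, lam) - parent) - gamma
    # Option 2: missing values join the right child.
    gain_right = 0.5 * (leaf_score(GL, HL, lam)
                        + leaf_score(GR + GM, HR + HM, lam) - parent) - gamma

    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

# Toy data: the two rows with a missing feature both have y = 1,
# like the non-missing row on the right of the threshold.
x = np.array([1.0, 2.0, np.nan, 4.0, np.nan])
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
g, h = -y, np.ones_like(y)  # squared error, current prediction 0

direction, gain = default_direction(x, g, h, threshold=3.0, lam=1.0)
```

Here sending the missing rows right yields a gain of 0.375, while sending them left yields -0.1, so the learned default direction is "right". No imputation step is involved: the direction falls out of the same gain computation used for ordinary splits.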

What LightGBM and CatBoost change

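LightGBM: histogram-based splits (bucket continuous features into ~256 bins) → much faster training, since split finding scans bins rather than sorted values. Leaf-wise tree growth → potentially deeper, unbalanced trees, which can overfit unless the leaf count or depth is capped.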
CatBoost: native categorical encoding using ordered target statistics. Better default handling of categorical features without one-hot blowup.
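
The ordered target statistic can be sketched in a few lines. Each row's category is encoded using only the targets of rows that come earlier in some ordering, so a row never sees its own label. This is a simplified illustration, not CatBoost's implementation: the function name, the explicit `order` argument, and the smoothing form `(sum + a * prior) / (count + a)` with parameter `a` are assumptions made for this sketch, and CatBoost itself works with random permutations rather than a fixed order.

```python
import numpy as np

def ordered_target_statistic(categories, targets, order, prior, a=1.0):
    """Encode each category using only targets seen earlier in `order`.

    categories: sequence of hashable category values
    targets:    sequence of numeric targets
    order:      sequence of row indices giving the processing order
    prior:      value used when a category has no history yet
    a:          smoothing weight on the prior
    """
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        c = categories[idx]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + a * prior) / (n + a)  # uses history only
        sums[c] = s + targets[idx]                # then reveal this row's target
        counts[c] = n + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1.0, 0.0, 0.0, 1.0, 1.0]
order = range(len(categories))  # identity order, for a reproducible example

encoded = ordered_target_statistic(categories, targets, order, prior=0.5)
# encoded is [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior (0.5), and later occurrences blend the prior with the running mean of earlier targets. Because the encoding of a row excludes its own target, it avoids the target leakage that a plain per-category mean would introduce, and it needs one numeric column instead of one column per category.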