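A single decision tree is rarely accurate enough on its own: deep trees overfit (high variance) and shallow trees underfit (high bias). Ensembles of trees address both problems and dominate tabular ML. Two main flavors: bagging (random forests) and boosting (gradient boosting).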
Random forests (bagging)
Train many trees in parallel, each on a bootstrap sample of the data, considering only a random subset of features at each split. Average their predictions (or take a majority vote for classification). The randomness decorrelates the trees' errors, so averaging brings variance down without raising bias.
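The variance argument can be made concrete with a standard identity. As a simplifying assumption, suppose the B trees' predictions at a point x are identically distributed with variance σ² and pairwise correlation ρ (these symbols are introduced here for illustration):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as B grows, leaving a floor of ρσ². Bootstrapping and feature subsampling lower ρ, which is why they help beyond simply adding more trees.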
Key hyperparameters: number of trees (more is better up to a plateau, at the cost of compute), max depth (deeper = lower bias, higher variance), features per split (common defaults are sqrt(p) for classification and p/3 for regression, where p is the number of features).
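As a rough illustration of the procedure, here is a toy random forest in pure Python. It is a sketch under strong assumptions, not how production libraries work: each "tree" is a one-split stump, so the per-split feature subset is drawn once per tree, and the names fit_stump, fit_forest, and predict, along with the synthetic data, are hypothetical.

```python
import math
import random
from collections import Counter


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] <= t]
            right = [lab for row, lab in zip(X, y) if row[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    return best[1:]  # (feature, threshold, left_label, right_label)


def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    k = max(1, round(math.sqrt(p)))  # sqrt(p) features per split (classification)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap: n rows, with replacement
        feats = rng.sample(range(p), k)             # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [left if row[j] <= t else right for j, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Synthetic data (an assumption for the demo): features 0 and 1 separate
# the classes, feature 2 is pure noise.
data_rng = random.Random(1)
X, y = [], []
for i in range(40):
    label = i % 2
    base = 0.1 if label == 0 else 0.6
    X.append([base + 0.3 * data_rng.random(),
              base + 0.3 * data_rng.random(),
              data_rng.random()])
    y.append(label)

forest = fit_forest(X, y)
print(predict(forest, [0.0, 0.0, 0.9]), predict(forest, [1.0, 1.0, 0.1]))
```

Real implementations grow full trees and redraw the feature subset at every split; the stump version only keeps the sketch short.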
Gradient boosting (sequential)
Train trees one at a time, each correcting the previous ensemble's errors. Mathematically: gradient descent in function space. Each new tree fits the negative gradient of the loss with respect to the current model's predictions.
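For squared-error loss the negative gradient is just the residual, which makes the idea easy to sketch from scratch. The code below is a minimal illustration on one-dimensional inputs with one-split regression stumps; fit_stump, boost, predict, and the sine-curve data are hypothetical names and choices made for this example, not any library's API.

```python
import math


def fit_stump(x, targets):
    """Least-squares one-split regressor; assumes x has at least two distinct values."""
    best = None
    for t in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [r for xi, r in zip(x, targets) if xi <= t]
        right = [r for xi, r in zip(x, targets) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left_mean, right_mean)


def boost(x, y, n_trees=50, learning_rate=0.1):
    f0 = sum(y) / len(y)          # start from the best constant model
    pred = [f0] * len(y)
    trees, history = [], []
    for _ in range(n_trees):
        # For L = 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F.
        neg_grad = [yi - pi for yi, pi in zip(y, pred)]
        t, lm, rm = fit_stump(x, neg_grad)
        trees.append((t, lm, rm))
        # Take a shrunken step in the direction the new tree points.
        pred = [pi + learning_rate * (lm if xi <= t else rm)
                for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    return f0, trees, history


def predict(f0, trees, learning_rate, xi):
    out = f0
    for t, lm, rm in trees:
        out += learning_rate * (lm if xi <= t else rm)
    return out


x = [i / 10 for i in range(20)]
y = [math.sin(xi) for xi in x]
f0, trees, history = boost(x, y)
print(f"train MSE after 1 tree: {history[0]:.4f}, after 50: {history[-1]:.4f}")
```

With other losses only the neg_grad line changes; the libraries below also use second-order information and regularized leaf values, which this sketch omits.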
Implementations: XGBoost, LightGBM, CatBoost — each with engineering refinements (histogram-based splits, leaf-wise growth, categorical handling). LightGBM is usually the fastest; XGBoost is the most battle-tested; CatBoost handles categoricals best out of the box.
Hyperparameters that matter
- Learning rate (shrinkage): scale factor applied to each new tree. Smaller = more trees needed but better generalization. Start at 0.05-0.1.
- Number of trees: select via early stopping on a validation set.
- Max depth / num leaves: controls each tree's capacity. Shallow trees (max depth around 4-8) are usually better.
- Subsample: fraction of training rows sampled for each tree. Adds randomness, reduces overfitting.
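- L2 reg / min child weight: regularize the trees directly. L2 shrinks leaf values toward zero; min child weight blocks splits that would create leaves backed by too little data.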
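Selecting the tree count by early stopping can be sketched as a small function over a recorded validation-loss curve. This is an illustrative sketch: best_round and patience are names assumed here, and the loss values are made up. The libraries named above provide their own built-in early stopping that applies the same idea during training.

```python
def best_round(val_losses, patience=20):
    """Return the 1-based boosting round with the lowest validation loss,
    stopping once `patience` rounds pass without a new best."""
    best_loss = float("inf")
    best_idx = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_idx = loss, i
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_idx


# Made-up validation losses, one per boosting round.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.69]
print(best_round(losses, patience=3))   # 3: stops before the late improvement
print(best_round(losses, patience=10))  # 7: waits long enough to see it
```

The two calls show the trade-off in patience: too small and a noisy curve stops training early, too large and compute is wasted past the optimum. A smaller learning rate generally pushes the best round later, so the two are tuned together.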
Random forest vs gradient boosting
Random forests: easier to tune, naturally parallel, less prone to overfitting, but usually somewhat less accurate than well-tuned boosting. Gradient boosting: requires careful tuning (especially learning rate and tree count) but achieves the best accuracy on most tabular benchmarks.
When tree-based beats neural networks
For tabular data with mixed feature types, modest size, and no spatial or sequential structure among the features, tree ensembles consistently match or beat deep learning. Neural networks win when the data has spatial or sequential structure (images, text) and is large.