Random Forests and Gradient Boosting — Section 5: Classification

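A single decision tree is a high-variance learner: small changes in the training data can produce a very different tree, so it tends to generalize poorly. Ensembles of trees fix this and dominate tabular ML. Two main flavors: bagging (random forests) and boosting (gradient boosting).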

Random forests (bagging)

Train many trees in parallel, each on a bootstrap sample of the data, considering only a random subset of features at each split. For classification, take a majority vote or average the trees' predicted class probabilities. The randomness decorrelates the trees' errors, so averaging reduces variance with little increase in bias.
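
As a rough illustration of the bagging recipe described above, the sketch below builds a small forest by hand from scikit-learn decision trees. The synthetic dataset, the tree count of 100, and all variable names are assumptions made for the example; in practice one would normally reach for a ready-made random forest implementation instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 100  # assumed value, chosen only for the example
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes the tree consider a random subset of
    # sqrt(p) features at every split.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Average the per-tree class probabilities, then pick the most likely class.
avg_proba = np.mean([t.predict_proba(X_test) for t in trees], axis=0)
forest_pred = avg_proba.argmax(axis=1)

single_acc = np.mean(trees[0].predict(X_test) == y_test)
forest_acc = np.mean(forest_pred == y_test)
print(f"single tree: {single_acc:.3f}  averaged forest: {forest_acc:.3f}")
```

Comparing the two printed accuracies gives a feel for how much the averaging step helps on this particular dataset; the gap will vary with the data and the seed.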

Key hyperparameters: number of trees (more is better up to a plateau), max depth (deeper = lower bias, higher variance), features per split (sqrt(p) for classification, p/3 for regression).

Gradient boosting (sequential)

Train trees one at a time, each correcting the previous ensemble's errors. Mathematically: gradient descent in function space. Each new tree fits the negative gradient of the loss with respect to the current model's predictions.
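
To make the "negative gradient" idea concrete, here is a minimal from-scratch sketch for binary classification with log loss. For that loss, the negative gradient with respect to the raw score F is simply y - p, where p = sigmoid(F). The dataset, the depth of 3, the 50 rounds, and the helper names are assumptions for illustration. Production libraries refine this basic loop (for example with second-order leaf values), so treat it as a teaching sketch rather than how any particular library works.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, F):
    # Mean binary log loss for labels y in {0, 1} and raw scores F.
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

learning_rate = 0.1  # shrinkage applied to every new tree
n_rounds = 50        # assumed value for the example

# Initial model: the constant log-odds of the positive class.
p0 = y.mean()
F = np.full(len(y), np.log(p0 / (1 - p0)))
losses = [log_loss(y, F)]
trees = []
for _ in range(n_rounds):
    # Negative gradient of log loss w.r.t. the raw score F is (y - p).
    residual = y - sigmoid(F)
    # Each new tree is a small regression tree fitted to that gradient.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residual)
    # Take a shrunken step in function space.
    F += learning_rate * tree.predict(X)
    trees.append(tree)
    losses.append(log_loss(y, F))

print(f"training log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The loss tracked here is training loss only; it says nothing about generalization, which is why the number of rounds is normally chosen on held-out data.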

Implementations: XGBoost, LightGBM, CatBoost — each with engineering refinements (histogram-based splits, leaf-wise growth, categorical handling). LightGBM is usually the fastest; XGBoost is the most battle-tested; CatBoost handles categoricals best out of the box.

Hyperparameters that matter

Learning rate (shrinkage): scale factor applied to each new tree. Smaller = more trees needed but better generalization. Start at 0.05-0.1.
Number of trees: select via early stopping on a validation set.
Max depth / num leaves: control each tree's capacity. Shallow trees (max depth around 4-8) usually generalize better.
Subsample: fraction of training data per tree. Adds randomness, reduces overfit.
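L2 reg / min child weight: regularize directly. L2 shrinks leaf values toward zero; min child weight blocks splits whose children carry too little data.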
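
The settings above might be wired together roughly as follows, here using scikit-learn's GradientBoostingClassifier as a stand-in for the libraries named earlier. The dataset and the specific values are assumptions for the example. This estimator has no L2 or min child weight parameter, so min_samples_leaf is used as a loose substitute for the latter; that substitution is an assumption of this sketch, not an equivalence.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = GradientBoostingClassifier(
    learning_rate=0.05,       # shrinkage applied to each new tree
    n_estimators=1000,        # upper bound; early stopping picks the real count
    max_depth=4,              # shallow trees limit per-tree capacity
    subsample=0.8,            # fraction of training rows used per tree
    min_samples_leaf=5,       # rough stand-in for a min child weight constraint
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=20,      # stop after 20 rounds without improvement
    random_state=0,
)
model.fit(X_train, y_train)

trees_used = model.n_estimators_  # number of trees actually fitted
test_acc = model.score(X_test, y_test)
print(f"trees used: {trees_used}  test accuracy: {test_acc:.3f}")
```

Setting a generous tree limit and letting early stopping choose the count means the learning rate can be lowered without having to retune the number of trees by hand.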

Random forest vs gradient boosting

Random forests: easier to tune, naturally parallel, less prone to overfit, less accurate at the limit. Gradient boosting: requires careful tuning (especially learning rate and tree count) but achieves the best accuracy on most tabular benchmarks.
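
One hedged way to check this trade-off on a given dataset is to cross-validate both model families side by side. The sketch below uses scikit-learn with a synthetic dataset and lightly chosen settings, all of which are assumptions for illustration. Which model comes out ahead depends on the data and the tuning, so no particular outcome should be read into it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=8, random_state=0)

models = {
    # Trees are independent, so they can be built in parallel (n_jobs=-1).
    "random forest": RandomForestClassifier(
        n_estimators=200, max_features="sqrt", n_jobs=-1, random_state=0),
    # Trees are built sequentially; learning rate and tree count interact.
    "gradient boosting": GradientBoostingClassifier(
        learning_rate=0.1, n_estimators=200, max_depth=3, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
          for name, model in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```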

When tree-based beats neural networks

For tabular data with mixed feature types, modest size, and no spatial or sequential structure, tree ensembles consistently match or beat deep learning. Neural networks win when the data is large and has spatial or sequential structure (images, text).
