Random forests train many decision trees on bootstrap samples of the training data, with feature subsampling at each split, then average their predictions (regression) or vote (classification).
Why bootstrap
Each tree sees a sample of size $n$ drawn with replacement from the training set of size $n$. About 63% of unique rows appear in any given sample; the other ~37% are "out of bag" for that tree. This:
- Decorrelates the trees, so the variance of their average falls further than it would if every tree were fit to the same data
- Provides a free validation estimate via out-of-bag (OOB) predictions
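The 63% / 37% split can be checked with a short simulation. This is a minimal sketch using only the standard library; the helper names `bootstrap_indices` and `out_of_bag` are made up for illustration and are not from any library.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Return the rows that never appear in the bootstrap sample."""
    seen = set(in_bag)
    return [i for i in range(n) if i not in seen]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)
oob = out_of_bag(n, sample)

unique_fraction = len(set(sample)) / n  # expected to be near 1 - 1/e
oob_fraction = len(oob) / n             # expected to be near 1/e
print(f"in-bag unique rows: {unique_fraction:.3f}")
print(f"out-of-bag rows:    {oob_fraction:.3f}")
```

The out-of-bag rows are the ones a tree can be scored on without leakage, which is what makes the OOB estimate free.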
Why feature subsampling
At each split, only $m$ of the $p$ features are considered. For classification, $m = \sqrt{p}$ is the default; for regression, $m = p/3$. This further decorrelates trees. Without it, every tree would split on the dominant feature first and you'd get the same tree many times.
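As a rough sketch of the per-split draw, the snippet below picks a fresh random subset of feature indices, using the commonly cited defaults of about $\sqrt{p}$ candidates for classification and about $p/3$ for regression. The function names and the rounding rule (floor, with a minimum of 1) are assumptions; real libraries differ in both rounding and defaults.

```python
import math
import random

def n_candidate_features(p, task):
    """Number of features tried per split for a problem with p features."""
    if task == "classification":
        m = math.isqrt(p)   # floor(sqrt(p)); rounding rule is an assumption
    elif task == "regression":
        m = p // 3          # floor(p / 3); rounding rule is an assumption
    else:
        raise ValueError(f"unknown task: {task!r}")
    return max(1, m)        # always try at least one feature

def candidate_features(p, task, rng):
    """Draw the feature indices considered at a single split."""
    m = n_candidate_features(p, task)
    return sorted(rng.sample(range(p), m))  # sampled without replacement

rng = random.Random(0)
# Each call stands in for one split; successive splits get different subsets.
for _ in range(3):
    print(candidate_features(16, "classification", rng))
```

Because the subset is redrawn at every split, a dominant feature is simply unavailable at many splits, which forces trees to find alternative structure.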
Variance arithmetic
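The variance of the average of $B$ identically distributed (but correlated) random variables, each with variance $\sigma^2$ and pairwise correlation $\rho$, is:
$$\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} X_b\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
As $B \to \infty$, the second term vanishes but the first doesn't. For a fixed per-tree variance $\sigma^2$, reducing $\rho$ via bootstrap sampling and feature subsampling is the only way to lower the floor $\rho\sigma^2$.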
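The arithmetic can be sanity-checked numerically. The sketch below writes $B$ for the number of trees, $\sigma^2$ for the per-tree variance and $\rho$ for the pairwise correlation, and compares the closed form with a Monte Carlo estimate. The correlated draws come from a shared-component Gaussian construction, which is an illustrative assumption (it requires $\rho \ge 0$) and not a model of real trees.

```python
import math
import random
import statistics

def ensemble_variance(sigma2, rho, B):
    """Closed form: rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

def simulate_average(sigma2, rho, B, rng):
    """Average B Gaussians with variance sigma2 and pairwise correlation rho."""
    shared = rng.gauss(0.0, 1.0)  # component common to every "tree"
    a, b, s = math.sqrt(rho), math.sqrt(1.0 - rho), math.sqrt(sigma2)
    total = sum(s * (a * shared + b * rng.gauss(0.0, 1.0)) for _ in range(B))
    return total / B

rng = random.Random(0)
sigma2, rho, B = 1.0, 0.3, 25
draws = [simulate_average(sigma2, rho, B, rng) for _ in range(10_000)]

predicted = ensemble_variance(sigma2, rho, B)
empirical = statistics.pvariance(draws)
print(f"predicted: {predicted:.3f}, simulated: {empirical:.3f}")

# Adding trees only shrinks the second term; the rho * sigma^2 floor remains.
for trees in (1, 10, 100, 1000):
    print(trees, round(ensemble_variance(sigma2, rho, trees), 4))
```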
Practical defaults that work
- 500–1000 trees (more is fine, but with diminishing returns)
- Trees grown deep, no pruning (variance reduction does the regularization)
- $m = \sqrt{p}$ candidate features per split (of $p$ total) for classification
- A larger min_samples_leaf for noisy targets
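One way these defaults might be written down, assuming scikit-learn is the library in use (the text does not say so, although `min_samples_leaf` is one of its parameter names). The synthetic dataset is purely illustrative. `min_samples_leaf` is left at the library default of 1 because the text gives no specific value for noisy targets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only; not from the text.
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=5, random_state=0
)

clf = RandomForestClassifier(
    n_estimators=500,      # lower end of the 500-1000 range
    max_depth=None,        # grow trees fully, no pruning
    max_features="sqrt",   # about sqrt(p) candidate features per split
    min_samples_leaf=1,    # library default; raise it for noisy targets
    bootstrap=True,        # each tree is fit on a bootstrap sample
    oob_score=True,        # score each row using trees that never saw it
    n_jobs=-1,             # trees are independent, so fit them in parallel
    random_state=0,
)
clf.fit(X, y)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
```

The OOB accuracy printed here plays the role of a validation score without holding out a separate split.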
When to prefer over GBMs
Random forests need almost no tuning, train trivially in parallel, and rarely overfit. Gradient boosting usually wins on benchmark accuracy but needs careful learning-rate and depth tuning. Pick RF for fast prototypes; pick GBM when squeezing every last point matters.