Random Forests — Section 3: Tree Ensembles

Random forests train many decision trees on bootstrap samples of the training data, with feature subsampling at each split, then average their predictions (regression) or vote (classification).

Why bootstrap

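Each tree is trained on a sample of size $n$ drawn with replacement from the $n$ training rows. A given row is missed by all $n$ draws with probability $(1 - 1/n)^n \approx e^{-1}$, so about 63% of the distinct training rows appear in any given sample; the remaining ~37% are "out of bag" for that tree. This: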

Decorrelates the trees, so the variance of their average falls further than it would for trees fit to identical data
Provides a free validation estimate via out-of-bag (OOB) predictions
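
The 63% / 37% split can be checked empirically. The sketch below is a minimal illustration using only the Python standard library; the helper name `bootstrap_split` and the choice of `n = 10_000` are assumptions made for this example, not part of any library.

```python
import math
import random


def bootstrap_split(n, rng):
    """Draw n row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    seen = set(in_bag)
    # Rows never drawn are "out of bag" for this tree.
    out_of_bag = [i for i in range(n) if i not in seen]
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000              # illustrative training-set size
in_bag, oob = bootstrap_split(n, rng)

unique_fraction = len(set(in_bag)) / n
oob_fraction = len(oob) / n

print(f"distinct in-bag fraction: {unique_fraction:.3f}")
print(f"out-of-bag fraction:      {oob_fraction:.3f}")
print(f"limit 1 - 1/e:            {1 - math.exp(-1):.3f}")
```

The out-of-bag rows are the ones a tree can be scored on without leakage, which is what makes the OOB estimate essentially free.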

Why feature subsampling

At each split, only a random subset of $m$ of the $p$ features is considered. The commonly recommended defaults are $m = \sqrt{p}$ for classification and $m = p/3$ for regression. This further decorrelates the trees: without it, most trees would split on the dominant feature first and end up with very similar structure.
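
As a rough sketch of the rule above, the snippet below computes $m$ and draws the candidate features for one split. The function names `features_per_split` and `candidate_features` are hypothetical, and rounding down with a minimum of one feature is an assumption; implementations differ in how they round and in what defaults they ship.

```python
import math
import random


def features_per_split(p, task):
    """Number of features m to consider at each split (rounded down, at least 1)."""
    if task == "classification":
        return max(1, math.isqrt(p))  # floor(sqrt(p))
    if task == "regression":
        return max(1, p // 3)         # floor(p / 3)
    raise ValueError(f"unknown task: {task!r}")


def candidate_features(p, m, rng):
    """Pick m distinct feature indices out of p, uniformly at random."""
    return rng.sample(range(p), m)


rng = random.Random(0)
p = 100
m = features_per_split(p, "classification")
candidates = candidate_features(p, m, rng)
print(m, sorted(candidates))
```

A fresh subset is drawn at every split, not once per tree, so even a dominant feature is unavailable at many of the splits.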

Variance arithmetic

The variance of the average of $B$ identically distributed (but correlated) random variables, each with variance $\sigma^2$ and with pairwise correlation $\rho$, is:

$$\rho \sigma^2 + \frac{1-\rho}{B} \sigma^2$$

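As $B \to \infty$, the second term vanishes but the first does not, so adding trees alone cannot push the variance below the floor $\rho \sigma^2$. For a fixed per-tree variance $\sigma^2$, the only way to lower that floor is to reduce $\rho$, which is what bootstrap sampling and feature subsampling do.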
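
The formula can be sanity-checked by simulation. The sketch below is an illustration under stated assumptions, not a model of real trees: it stands in for tree predictions with Gaussian variables that share one common component, which gives every pair correlation $\rho$. The values $\rho = 0.3$, $\sigma^2 = 4$ and the function names are chosen only for this example.

```python
import math
import random
import statistics


def variance_of_mean(rho, sigma2, B):
    """Closed form: rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma2 + (1 - rho) / B * sigma2


def simulate_variance_of_mean(rho, sigma2, B, trials, rng):
    """Monte Carlo estimate of the variance of the mean of B equicorrelated draws."""
    sigma = math.sqrt(sigma2)
    means = []
    for _ in range(trials):
        shared = rng.gauss(0, 1)  # component common to all B variables
        draws = [
            sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0, 1))
            for _ in range(B)
        ]
        means.append(sum(draws) / B)
    return statistics.variance(means)


rng = random.Random(0)
rho, sigma2 = 0.3, 4.0  # illustrative values
results = {}
for B in (1, 10, 50):
    formula = variance_of_mean(rho, sigma2, B)
    simulated = simulate_variance_of_mean(rho, sigma2, B, trials=10_000, rng=rng)
    results[B] = (formula, simulated)
    print(f"B={B:3d}  formula={formula:.3f}  simulated={simulated:.3f}")

print(f"floor rho * sigma^2 = {rho * sigma2:.3f}")
```

Both columns should approach the floor $\rho \sigma^2$ as $B$ grows, and neither should drop below it except by sampling noise.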

Practical defaults that work

500–1000 trees (more is fine, but with diminishing returns)
Trees grown deep, no pruning (variance reduction does the regularization)
$m = \sqrt{p}$ for classification
min_samples_leaf $\geq 5$ for noisy targets
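
For concreteness, the defaults above might be written down as a plain dictionary of keyword arguments. The keys below follow scikit-learn's `RandomForestClassifier` parameter names; the variable name `rf_defaults` and the exact values (for example 500 rather than 1000 trees) are assumptions made for this sketch.

```python
# Keyword arguments named after scikit-learn's RandomForestClassifier parameters.
rf_defaults = {
    "n_estimators": 500,     # 500-1000 trees; more gives diminishing returns
    "max_depth": None,       # no depth limit and no pruning
    "max_features": "sqrt",  # m = sqrt(p) features tried at each split
    "min_samples_leaf": 5,   # larger leaves for noisy targets
    "bootstrap": True,       # each tree sees a bootstrap sample
    "oob_score": True,       # compute the out-of-bag validation estimate
    "n_jobs": -1,            # trees are independent, so use all cores
}

for name, value in rf_defaults.items():
    print(f"{name} = {value!r}")
```

With scikit-learn installed, these could be passed as `RandomForestClassifier(**rf_defaults)`; setting `oob_score` only makes sense when `bootstrap` is enabled.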

When to prefer over GBMs

Random forests need almost no tuning, train trivially in parallel because each tree is built independently, and do not overfit more as trees are added. Gradient boosting usually wins on benchmark accuracy but needs careful learning-rate and depth tuning. Pick RF for fast prototypes; pick GBM when squeezing out every last point of accuracy matters.