Decision Trees — Section 3: Tree Ensembles

A decision tree partitions the input space with axis-aligned splits and predicts a constant in each region (mean for regression, majority class for classification).

Greedy construction

At each node, pick the (feature, threshold) pair that maximally reduces an impurity measure. For classification:

Gini impurity $\sum_k p_k(1-p_k) = 1 - \sum_k p_k^2$
Entropy $-\sum_k p_k \log p_k$

Both give similar splits in practice — Gini is slightly cheaper. For regression, the natural choice is variance reduction (= MSE reduction).

The algorithm is greedy: at each step, take the locally best split. No backtracking. Optimal global trees are NP-hard, so we live with greediness — and ensemble away the inaccuracy.

Why single trees fail

High variance: a tiny change in the training data can produce a completely different tree (different first split → cascading divergence).
Discontinuous predictions: predictions jump at split boundaries. For smooth targets this is wasteful.
Bias toward features with many values: continuous features and high-cardinality categoricals can dominate splits artificially.

Why they're still useful

Handle mixed feature types and missing values natively
Robust to feature scaling
Catch non-linear interactions without explicit feature engineering
Interpretable — you can literally read the tree

The standard fix to the variance problem is bagging (random forests) or boosting (gradient boosting). A lone decision tree is mostly useful as a teaching tool or for cases where extreme interpretability matters.