A decision tree partitions the input space with axis-aligned splits and predicts a constant in each region (mean for regression, majority class for classification).
Greedy construction
At each node, pick the (feature, threshold) pair that maximally reduces an impurity measure. For classification:
- Gini impurity
- Entropy
Both give similar splits in practice — Gini is slightly cheaper. For regression, the natural choice is variance reduction (= MSE reduction).
The algorithm is greedy: at each step, take the locally best split. No backtracking. Optimal global trees are NP-hard, so we live with greediness — and ensemble away the inaccuracy.
Why single trees fail
- High variance: a tiny change in the training data can produce a completely different tree (different first split → cascading divergence).
- Discontinuous predictions: predictions jump at split boundaries. For smooth targets this is wasteful.
- Bias toward features with many values: continuous features and high-cardinality categoricals can dominate splits artificially.
Why they're still useful
- Handle mixed feature types and missing values natively
- Robust to feature scaling
- Catch non-linear interactions without explicit feature engineering
- Interpretable — you can literally read the tree
The standard fix to the variance problem is bagging (random forests) or boosting (gradient boosting). A lone decision tree is mostly useful as a teaching tool or for cases where extreme interpretability matters.