A decision tree partitions the data by repeatedly choosing the predictor and threshold that best separate the response. Trees are conceptually simple and interpretable, and they form the basis of random forests and gradient boosting, the dominant methods for tabular data.
How splits are chosen
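For classification, measure node "purity" with Gini impurity or entropy. For regression, use variance reduction. At each node, try every candidate split (every variable × every threshold) and pick the one with the largest purity gain, meaning the largest drop from the parent's impurity to the size-weighted average impurity of the two children. Then recurse on each child.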
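The exhaustive search described above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: the function names `gini` and `best_split`, the use of midpoints between consecutive distinct values as candidate thresholds, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (feature, threshold) pair and keep the largest impurity drop.

    Returns (gain, feature_index, threshold), or None if no split helps.
    """
    n = len(labels)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds (an assumption of this sketch): midpoints
        # between consecutive distinct values of feature j.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            # Size-weighted impurity of the two children.
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, j, t)
    return best

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [6.0, 2.0], [7.0, 5.0], [8.0, 1.0]]
labels = ["a", "a", "a", "b", "b", "b"]
gain, feature, threshold = best_split(rows, labels)
print(feature, threshold, round(gain, 3))  # prints: 0 4.5 0.5
```

A full tree builder would call `best_split` on the root, partition the rows by the chosen threshold, and repeat on each side. For regression, the same loop applies with variance in place of `gini`.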
Stopping
Trees overfit by default: keep splitting and nearly every leaf ends up holding a single training example. Stop when the maximum depth is reached, a leaf would have too few samples, or the purity gain is too small. Alternatively, grow the tree fully and then prune it back, using cross-validation to choose how much to prune.
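As a rough sketch of both approaches, the example below uses scikit-learn's `DecisionTreeClassifier`. The dataset, the specific parameter values, and the choice of cost-complexity pruning (`ccp_alpha`) as the pruning method are assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# No stopping rule: the tree keeps splitting until it runs out of useful splits.
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Early stopping with the three rules named above (values are illustrative).
stopped = DecisionTreeClassifier(
    max_depth=4,                  # maximum depth
    min_samples_leaf=10,          # minimum samples allowed in a leaf
    min_impurity_decrease=0.001,  # minimum purity gain required to split
    random_state=0,
).fit(X, y)

# Pruning: list the candidate pruning strengths for the full tree, then keep
# the one with the best 5-fold cross-validated accuracy.
path = full.cost_complexity_pruning_path(X, y)
# Clip at zero in case floating-point error yields a tiny negative alpha.
alphas = sorted({float(max(a, 0.0)) for a in path.ccp_alphas})
scores = {
    alpha: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in alphas
}
best_alpha = max(scores, key=scores.get)
pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)

for name, tree in [("full", full), ("stopped", stopped), ("pruned", pruned)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Comparing the printed depth and leaf counts shows how much each strategy shrinks the tree relative to the unrestricted one.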
Strengths
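- Handle mixed data types (numeric + categorical) without preprocessing in principle, though some implementations, scikit-learn's included, still require categorical predictors to be encoded as numbers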
- Capture nonlinear interactions automatically
- Output interpretable rules
- Invariant to monotonic transforms of the predictors (no need to standardize)
- Can handle missing values natively (via surrogate splits in CART), though support varies by implementation
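The invariance to monotonic transforms can be checked directly: a split depends only on the ordering of a predictor's values, so a log transform leaves the chosen partition unchanged. The sketch below uses a single-feature regression split. The helper names `sse` and `best_partition` and the data are hypothetical, made up for the illustration.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_partition(xs, ys):
    """Indices on the left side of the variance-minimising split on xs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between equal values
        left, right = order[:k], order[k:]
        cost = sse([ys[i] for i in left]) + sse([ys[i] for i in right])
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

# Hypothetical predictor spanning five orders of magnitude.
x = [1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0]
y = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
log_x = [math.log(v) for v in x]

raw_split = best_partition(x, y)
log_split = best_partition(log_x, y)
print(sorted(raw_split), sorted(log_split))  # same indices on both scales
```

Only the threshold value changes between the two scales; the set of training points sent to each side is the same.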
Weaknesses
- High variance — small data changes can flip large parts of the tree
- Greedy splitting can miss globally good partitions
- Poor at smooth functions, since predictions are piecewise constant; for roughly linear relationships, linear regression is a better fit
- Decision boundaries are axis-aligned — diagonal patterns require deep trees
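The axis-aligned limitation is easy to reproduce on synthetic data. In the sketch below, the class boundary is the diagonal x0 = x1. The data, the random seed, and the hand-crafted difference feature are assumptions chosen for the demonstration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # true boundary is the diagonal x0 = x1

# A single axis-aligned split cannot follow the diagonal.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# An unrestricted tree fits the training data, but only by approximating
# the diagonal with a staircase of many splits.
deep = DecisionTreeClassifier(random_state=0).fit(X, y)

# With the engineered feature x0 - x1, the boundary becomes one threshold.
X_diff = (X[:, 0] - X[:, 1]).reshape(-1, 1)
stump_diff = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_diff, y)

print("stump on raw features:", stump.score(X, y))
print("full tree on raw features:", deep.score(X, y), "depth:", deep.get_depth())
print("stump on x0 - x1:", stump_diff.score(X_diff, y))
```

The scores are training accuracies, which is enough here because the point is how many splits the tree needs to represent the boundary, not how well it generalizes.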
CART vs C4.5 vs ID3
CART (used by scikit-learn) handles both classification and regression, splitting on Gini impurity or variance. ID3 and its successors C4.5 and C5.0 are classification-focused variants that split on entropy-based criteria. The differences barely matter for tabular work today: CART-style trees are the standard building blocks for ensembles.
When to use a single tree
Almost never. Single trees are interpretable but usually less accurate than ensembles. For real predictions, use random forests or gradient boosting (next lessons). When interpretation matters more than accuracy, a depth-limited single tree (depth 3-4) provides simple rules.
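To show what such rules look like, here is a minimal sketch that fits a depth-3 tree with scikit-learn and prints it with `export_text`. The iris dataset and the depth of 3 are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Cap the depth so the whole tree fits on one screen.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Render the fitted tree as nested if/else rules with readable feature names.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each printed line is a threshold test on one feature, and each leaf reports the predicted class, so the output can be read as a short list of if/else rules.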