Decision Trees — Section 5: Classification

A decision tree partitions the data by repeatedly choosing the predictor and threshold that best separate the response. Trees are conceptually simple and interpretable, and they form the basis of random forests and gradient boosting, the dominant methods for tabular data.

How splits are chosen

For classification, measure how mixed the classes in a node are with Gini impurity or entropy (lower means purer). For regression, use variance reduction. At each node, try every candidate split (every variable × every threshold) and pick the one that reduces impurity the most. Then recurse on each child node.
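
The search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it: the function names gini and best_split are hypothetical, the toy data is made up, and candidate thresholds are assumed to be midpoints between consecutive distinct values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Exhaustive search; returns (gain, feature_index, threshold) or None.

    rows is a list of equal-length lists of numeric feature values.
    """
    n = len(rows)
    parent = gini(labels)
    best = None
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Assumed candidate thresholds: midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            # Weighted impurity of the two children.
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best

# Made-up toy data: feature 0 separates the classes, feature 1 does not.
X = [[2.0, 7.0], [3.0, 1.0], [6.0, 8.0], [7.0, 2.0]]
y = ["a", "a", "b", "b"]
gain, feature, threshold = best_split(X, y)
print(feature, threshold, gain)  # 0 4.5 0.5
```

Splitting feature 0 at 4.5 yields two pure children, so the gain equals the parent's impurity of 0.5. A real implementation would sort each feature once and update class counts incrementally instead of rescanning the data for every threshold.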

Stopping

Trees overfit by default: keep splitting and every leaf ends up pure, often holding a single training example. Stop when the maximum depth is reached, a leaf would have too few samples, or the purity gain is too small. Alternatively, grow the tree fully and then prune it back, using cross-validation to choose how much to prune.
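
To make the three stopping rules concrete, here is a small sketch of a recursive tree builder on a single numeric feature. The names build, max_depth, min_leaf and min_gain are hypothetical, the data is made up, and pruning is not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build(xs, ys, depth=0, max_depth=3, min_leaf=2, min_gain=0.01):
    """Grow a tree on one numeric feature; returns nested dicts."""
    leaf = {"predict": Counter(ys).most_common(1)[0][0], "n": len(ys)}
    # Rule 1: stop at the depth limit (a pure node also needs no split).
    if depth >= max_depth or len(set(ys)) == 1:
        return leaf
    best_gain, best_t = None, None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Rule 2: skip splits that would create a leaf with too few samples.
        if min(len(left), len(right)) < min_leaf:
            continue
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        gain = gini(ys) - child
        if best_gain is None or gain > best_gain:
            best_gain, best_t = gain, t
    # Rule 3: stop when the best available purity gain is too small.
    if best_gain is None or best_gain < min_gain:
        return leaf
    go_left = [(x, y) for x, y in zip(xs, ys) if x <= best_t]
    go_right = [(x, y) for x, y in zip(xs, ys) if x > best_t]
    kwargs = dict(max_depth=max_depth, min_leaf=min_leaf, min_gain=min_gain)
    return {
        "threshold": best_t,
        "left": build(*zip(*go_left), depth=depth + 1, **kwargs),
        "right": build(*zip(*go_right), depth=depth + 1, **kwargs),
    }

def count_leaves(node):
    if "predict" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

def depth_of(node):
    if "predict" in node:
        return 0
    return 1 + max(depth_of(node["left"]), depth_of(node["right"]))

# Made-up noisy data: the label mostly increases with x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

full = build(xs, ys, max_depth=len(xs), min_leaf=1, min_gain=0.0)
small = build(xs, ys, max_depth=2, min_leaf=2, min_gain=0.01)
print("unrestricted leaves:", count_leaves(full))
print("restricted leaves:", count_leaves(small), "depth:", depth_of(small))
```

With the restrictions in place the tree stops at three leaves and depth 2, while the unrestricted tree keeps splitting until every leaf is pure and so memorizes the noise.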

Strengths

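Handle mixed data types (numeric + categorical) with little preprocessing, although some implementations, including scikit-learn, need categorical predictors encoded as numbers first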
Capture nonlinear interactions automatically
Output interpretable rules
Invariant to monotonic transforms of the predictors (no need to standardize)
Can handle missing values (via surrogate splits, where the implementation supports them)
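
The invariance to monotonic transforms in the list above can be checked directly: a split depends only on the ordering of a predictor's values, so taking logs moves the threshold but not the resulting partition. The sketch below uses a hypothetical best_split helper and made-up data.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, left_indices) of the best Gini split on one feature."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = frozenset(i for i, x in enumerate(xs) if x <= t)
        l = [ys[i] for i in left]
        r = [ys[i] for i in range(len(xs)) if i not in left]
        # Minimizing weighted child impurity is the same as maximizing gain.
        child = (len(l) * gini(l) + len(r) * gini(r)) / len(ys)
        if best is None or child < best[0]:
            best = (child, t, left)
    return best[1], best[2]

# Made-up skewed predictor with one noisy label.
xs = [0.5, 1.0, 4.0, 20.0, 35.0, 100.0, 350.0]
ys = ["n", "n", "n", "y", "n", "y", "y"]

t_raw, left_raw = best_split(xs, ys)
t_log, left_log = best_split([math.log(x) for x in xs], ys)

print("raw threshold:", t_raw, "log threshold:", t_log)
print("same partition:", left_raw == left_log)  # True
```

The two thresholds live on different scales, but the same three observations go to the left child in both cases, so the fitted tree makes identical predictions on the training data.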

Weaknesses

High variance — small data changes can flip large parts of the tree
Greedy splitting can miss globally good partitions
Poor at approximating smooth functions, because predictions are piecewise constant; for roughly linear relationships, linear regression is a better fit
Decision boundaries are axis-aligned — diagonal patterns require deep trees
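
The axis-aligned limitation can be illustrated with a class boundary that runs along a diagonal. The sketch below assumes scikit-learn is installed and uses a made-up grid of points labelled by whether the first coordinate exceeds the second.

```python
from sklearn.tree import DecisionTreeClassifier

# Grid points (a, b) with a != b; the class is 1 when a > b,
# so the true boundary is the diagonal a = b.
X = [[a, b] for a in range(10) for b in range(10) if a != b]
y = [int(a > b) for a, b in X]

# A single axis-aligned split versus an unrestricted tree.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

print("depth-1 training accuracy:", stump.score(X, y))
print("unrestricted training accuracy:", full.score(X, y))
print("unrestricted depth:", full.get_depth(), "leaves:", full.get_n_leaves())
```

One split cannot separate the classes, and the unrestricted tree only fits the training points by approximating the diagonal with a staircase of many small rectangles. A linear model with the single feature a - b would capture the same boundary with one coefficient.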

CART vs C4.5 vs ID3

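CART (scikit-learn uses an optimized version of it) handles both classification and regression, with Gini impurity for classification and variance reduction for regression. ID3 and its successors C4.5 and C5.0 are classification-focused variants that split on entropy-based criteria such as information gain. The differences barely matter for tabular work today, since most practitioners use CART-style trees as building blocks for ensembles.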

When to use a single tree

Almost never. Single trees are interpretable but usually less accurate than ensembles. For real predictions use random forests or gradient boosting (next lessons). When interpretation matters more than accuracy, a depth-limited single tree (depth 3-4) provides simple rules.
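
As a sketch of the interpretation use case, the snippet below fits a depth-3 tree and prints its rules as text. It assumes scikit-learn is installed; the iris dataset is only a convenient stand-in and is not part of this lesson.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Cap the depth so the resulting rule set stays small enough to read.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested if/else rules.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Each printed path from the root to a leaf reads as one rule, which is the main reason to keep a single shallow tree even when an ensemble would predict better.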
