Hyperparameter Tuning — Section 7: Evaluation and Model Selection

Choices of learning rate, regularization strength, tree depth, and similar settings account for substantial variation in accuracy. Four approaches are in common use.

Grid search

Enumerate every combination in the Cartesian product of candidate values. Simple, but the number of combinations grows exponentially with the number of hyperparameters, so it is practical only for 1–3 hyperparameters.
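A minimal sketch of the enumeration, using only the standard library. The hyperparameter names and candidate values are illustrative assumptions, not recommendations.

```python
import itertools

# Hypothetical search space: names and candidate values are illustrative only.
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [3, 5, 7],
    "l2": [0.0, 0.1, 1.0],
}

def grid_configs(space):
    """Yield one dict per point of the Cartesian product of candidate values."""
    names = list(space)
    for values in itertools.product(*(space[n] for n in names)):
        yield dict(zip(names, values))

configs = list(grid_configs(grid))
print(len(configs))  # 3 * 3 * 3 = 27
```

Adding a fourth hyperparameter with three candidate values would triple the count to 81, which is the exponential growth described above.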

Random search

Sample combinations randomly from specified distributions. Surprisingly effective: when only a few hyperparameters dominate (the typical case), random search tends to beat grid search for the same compute budget. A grid with k values per hyperparameter only ever tries k distinct values of the one that matters, while n random trials try n distinct values of it. Bergstra & Bengio (2012) made this argument convincingly.

For most projects, random search of ~50–100 trials is the right starting point.
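A minimal sketch of random search, using only the standard library. The ranges, the hyperparameter names, and the validation_score function are assumptions made for illustration; in practice validation_score would train a model and evaluate it on held-out data.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; the ranges are illustrative assumptions."""
    return {
        # Log-uniform sampling: equal probability per order of magnitude.
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "l2": 10 ** rng.uniform(-6, 0),
        "max_depth": rng.randint(2, 12),
    }

def validation_score(config):
    """Hypothetical stand-in for train-then-evaluate; higher is better.
    Only learning_rate matters here, mimicking one dominant hyperparameter."""
    return -(math.log10(config["learning_rate"]) + 3) ** 2

def random_search(n_trials, seed=0):
    """Sample n_trials configurations and return the best-scoring one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return max(trials, key=validation_score)

best = random_search(n_trials=50)
print(best)
```

Sampling learning rates and regularization strengths on a log scale is a common choice because plausible values span several orders of magnitude.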

Bayesian optimization

Treat the validation metric as a function of the hyperparameters and fit a probabilistic surrogate model to it (classically a Gaussian process; some tools, including Optuna and Hyperopt, default to a tree-structured Parzen estimator instead). Use the model to pick the next point to evaluate by trading off exploration (uncertain regions) and exploitation (promising regions).

Tools: Optuna, Ax, Hyperopt, scikit-optimize. Works well when:

Trials are expensive (each costs hours)
5–20 hyperparameters
Continuous or ordinal search space

Less useful when individual trials are very cheap (random search wins) or when the search space is huge with many useless dimensions.
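To make the fit-then-pick loop concrete, here is a toy sketch in pure Python: a one-dimensional Gaussian process with a squared-exponential kernel and an upper-confidence-bound rule for choosing the next point. Everything in it (the kernel length scale, the kappa weight, the candidate grid, and the objective function) is an assumption for illustration. It is not how any of the tools listed above is implemented, and for real work one of those libraries is the better choice.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, ys)      # K^-1 y
    v = solve(K, k_star)      # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v)), 0.0)
    return mean, math.sqrt(var)

def objective(x):
    """Hypothetical validation metric over one hyperparameter scaled to [0, 1]."""
    return -(x - 0.7) ** 2

def bayes_opt(n_iter=10, kappa=2.0):
    xs = [0.0, 1.0]                          # initial design: the two endpoints
    ys = [objective(x) for x in xs]
    candidates = [i / 100 for i in range(101)]
    for _ in range(n_iter):
        def ucb(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean + kappa * std        # exploitation + exploration
        x_next = max((c for c in candidates if c not in xs), key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

best_x, best_y = bayes_opt()
print(best_x, best_y)
```

The kappa parameter controls the trade-off: larger values favor uncertain regions, smaller values favor regions where the surrogate already predicts a good score.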

Hyperband / ASHA

Allocate small budgets to many configurations, then give progressively more compute to the promising ones. Bad configurations are stopped early, so little compute is wasted on them. Especially useful when partial training is informative, which is typical of deep learning.

ASHA (Asynchronous Successive Halving Algorithm) is the asynchronous variant most widely used today. It is built into Ray Tune.
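The core idea can be sketched as synchronous successive halving, the building block that Hyperband and ASHA extend. This is a simplified illustration, not ASHA itself: ASHA promotes configurations asynchronously rather than waiting for a whole rung to finish. The train_step function and its toy scoring rule are hypothetical stand-ins for real training.

```python
import random

def train_step(config, state):
    """Hypothetical stand-in for one unit of training.
    Returns the new state and a validation score (higher is better)."""
    state = state + 1
    # Toy model: configs with learning_rate near 0.01 reach higher scores.
    quality = 1.0 / (1.0 + abs(config["learning_rate"] - 0.01) * 100)
    return state, quality * (1 - 0.5 ** state)

def successive_halving(configs, min_budget=1, eta=2):
    """Train every survivor up to the current budget, keep the top 1/eta,
    multiply the budget by eta, and repeat until one config remains."""
    survivors = [{"config": c, "state": 0, "score": 0.0} for c in configs]
    budget = min_budget
    while len(survivors) > 1:
        for trial in survivors:
            # Resume from where the trial stopped; no retraining from scratch.
            while trial["state"] < budget:
                trial["state"], trial["score"] = train_step(
                    trial["config"], trial["state"]
                )
        survivors.sort(key=lambda t: t["score"], reverse=True)
        survivors = survivors[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0]

rng = random.Random(0)
configs = [{"learning_rate": 10 ** rng.uniform(-4, -1)} for _ in range(16)]
winner = successive_halving(configs)
print(winner["config"], winner["state"])
```

With 16 configurations and eta=2, the rungs hold 16, 8, 4, and 2 trials at cumulative budgets of 1, 2, 4, and 8 steps, so only the final two trials receive the full budget. In this toy model early scores rank configurations perfectly; with real training curves they are only a noisy signal, which is why the method works best when partial training is informative.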

What to actually do

1–3 hyperparameters, fast trials: grid or random.
4–10 hyperparameters, slow trials: Bayesian or random.
Deep learning with intermediate metrics: ASHA.
Always: log everything (config + metric + commit hash). The hyperparameter you've already tried is the most expensive one to retry.
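One way to follow the logging advice is an append-only JSON Lines file, sketched below with the standard library. The function names, the record fields, and the file layout are assumptions for illustration; an experiment tracker would serve the same purpose.

```python
import json
import subprocess
import time

def current_commit():
    """Return the current git commit hash, or "unknown" outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"

def log_trial(path, config, metric, commit=None):
    """Append one trial as a JSON line so earlier records survive a crash."""
    record = {
        "time": time.time(),
        "commit": commit if commit is not None else current_commit(),
        "config": config,
        "metric": metric,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def already_tried(path, config):
    """Return the logged metric for this exact config, or None if not found."""
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["config"] == config:
                    return record["metric"]
    except FileNotFoundError:
        pass
    return None
```

Calling already_tried before launching a trial avoids paying twice for the same configuration. Note that it matches on the config alone, so a result logged under a different commit is still returned; whether that is acceptable depends on what changed between commits.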