The choice of hyperparameters (learning rates, regularization strengths, tree depths, and so on) accounts for substantial variation in accuracy. Four approaches to tuning them follow.
Grid search
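Enumerate every combination in the Cartesian product of candidate values. Simple, but the number of trials grows exponentially with the number of hyperparameters: 5 candidate values for each of 6 hyperparameters is already 5^6 = 15,625 trials. Only practical for 1–3 hyperparameters.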
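A minimal sketch of the enumeration, using only the standard library. The hyperparameter names and candidate values are illustrative assumptions, not recommendations.

```python
from itertools import product

# Hypothetical candidate values; names and ranges are illustrative only.
grid = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5, 7],
    "reg_lambda": [0.0, 1.0],
}

def grid_configs(grid):
    """Yield one dict per point in the Cartesian product of candidate values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

configs = list(grid_configs(grid))
print(len(configs))  # 2 * 3 * 2 = 12 combinations
```

Adding a fourth hyperparameter with three candidate values would triple the count to 36, which is the exponential growth in practice.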
Random search
Sample combinations randomly from specified distributions. Surprisingly effective: when only a few hyperparameters dominate (the typical case), random search beats grid search for the same compute budget, because grid search spends most of its trials varying parameters that don't matter while repeating the same few values of the ones that do. Bergstra & Bengio (2012) made this argument convincingly.
For most projects, random search of ~50–100 trials is the right starting point.
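A rough sketch of random search with the standard library. The search ranges and `toy_objective` are assumptions made up for illustration. In a real project the objective would train a model and return its validation score.

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; the ranges are illustrative assumptions."""
    return {
        # Log-uniform: learning rates are usually searched on a log scale.
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "max_depth": rng.randint(2, 10),  # inclusive integer range
        "dropout": rng.uniform(0.0, 0.5),
    }

def toy_objective(config):
    """Stand-in for a validation score (higher is better).

    Only learning_rate matters much here, mimicking the case where
    a few hyperparameters dominate.
    """
    log_lr = math.log10(config["learning_rate"])
    return -(log_lr + 2.5) ** 2 - 0.01 * config["dropout"]

def random_search(objective, n_trials, seed=0):
    """Evaluate n_trials random configurations and return the best (score, config)."""
    rng = random.Random(seed)
    scored = [(objective(c), c) for c in (sample_config(rng) for _ in range(n_trials))]
    return max(scored, key=lambda pair: pair[0])

best_score, best_config = random_search(toy_objective, n_trials=50)
print(best_score, best_config)
```

Every trial draws a fresh value for every hyperparameter, so 50 trials test 50 distinct learning rates. A grid with the same budget would test only a handful.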
Bayesian optimization
Treat the validation metric as a function of the hyperparameters and fit a probabilistic surrogate model to it (classically a Gaussian process; Optuna's default sampler and Hyperopt use a tree-structured Parzen estimator instead). Use the model to pick the next point to evaluate, trading off exploration (uncertain regions) against exploitation (promising regions).
Tools: Optuna, Ax, Hyperopt, scikit-optimize. Works well when:
- Trials are expensive (each costs hours)
- 5–20 hyperparameters
- Continuous or ordinal search space
Less useful when individual trials are very cheap (random search wins) or when the search space is huge with many useless dimensions.
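To make the loop concrete, here is a deliberately small sketch of Bayesian optimization over a single hyperparameter (log10 of the learning rate) in pure Python. Everything in it is an assumption chosen for illustration: the RBF kernel and its length scale, the upper-confidence-bound acquisition rule, the candidate grid, and `toy_objective`. The libraries listed above use more careful models and acquisition functions.

```python
import math

def rbf(a, b, length_scale=0.5):
    """Squared-exponential kernel: nearby inputs get similar predicted scores."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(K, ys)))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, math.sqrt(max(var, 0.0))

def bayes_opt(objective, low, high, n_init=3, n_iter=10, kappa=2.0):
    """Maximize objective on [low, high] using a GP surrogate and a UCB rule."""
    candidates = [low + (high - low) * i / 60 for i in range(61)]
    xs = [low + (high - low) * (i + 0.5) / n_init for i in range(n_init)]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        offset = sum(ys) / len(ys)  # center scores for the zero-mean GP
        centered = [y - offset for y in ys]

        def ucb(x):
            mean, std = gp_posterior(xs, centered, x)
            return mean + kappa * std  # exploitation + exploration

        x_next = max((c for c in candidates if c not in xs), key=ucb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]

def toy_objective(log_lr):
    """Hypothetical validation score that peaks at log10(lr) = -2.2."""
    return -(log_lr + 2.2) ** 2

best_log_lr, best_score = bayes_opt(toy_objective, low=-4.0, high=-1.0)
print(best_log_lr, best_score)
```

The `kappa` parameter sets the trade-off: larger values favor uncertain regions, smaller values favor regions the model already rates highly.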
Hyperband / ASHA
Allocate small budgets to many configurations, then give progressively more compute to the promising ones. Bad configurations are stopped early, so little compute is wasted on them. Especially useful when partial training is informative, which is typical of deep learning.
ASHA (Asynchronous Successive Halving Algorithm) is the standard modern version. It is built into Ray Tune.
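The core idea can be sketched as plain synchronous successive halving, the building block that Hyperband and ASHA extend. This is not ASHA itself: there is no asynchronous promotion and only one bracket. `toy_partial_score` and the `quality` field are hypothetical stand-ins for "train for `budget` epochs and report the validation score".

```python
import math
import random

def toy_partial_score(config, budget):
    """Hypothetical learning curve: the score rises with budget toward config['quality']."""
    return config["quality"] * (1 - math.exp(-budget / 10))

def successive_halving(configs, min_budget=1, eta=3, evaluate=toy_partial_score):
    """Keep the top 1/eta of configurations at each rung and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget per configuration, number of configurations evaluated)
    while True:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((budget, len(ranked)))
        if len(ranked) == 1:
            return ranked[0], history
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta

rng = random.Random(0)
configs = [{"id": i, "quality": rng.random()} for i in range(27)]
best, history = successive_halving(configs)
print(history)  # [(1, 27), (3, 9), (9, 3), (27, 1)]
```

In this run 27 configurations get 1 unit of budget each, but only one gets the full 27. The sketch re-evaluates survivors from scratch at each rung; a real implementation would resume training from a checkpoint.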
What to actually do
- 1–3 hyperparameters, fast trials: grid or random.
- 4–10 hyperparameters, slow trials: Bayesian or random.
- Deep learning with intermediate metrics: ASHA.
- Always: log everything (config + metric + commit hash). A configuration you already tried but did not record is the most expensive one to rerun.
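One lightweight way to do the logging is an append-only JSON Lines file, sketched below. The file name, record fields, and helper names are assumptions for illustration. Experiment trackers do the same job with more features.

```python
import json
import os
import subprocess
import tempfile
import time

def current_commit():
    """Return the current git commit hash, or 'unknown' if git or a repository is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"

def log_trial(path, config, metric):
    """Append one trial as a JSON line: config, metric, commit hash, timestamp."""
    record = {
        "config": config,
        "metric": metric,
        "commit": current_commit(),
        "time": time.time(),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")
    return record

def load_trials(path):
    """Read every logged trial back as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage: a temporary directory stands in for the project's log location.
path = os.path.join(tempfile.mkdtemp(), "trials.jsonl")
log_trial(path, {"learning_rate": 0.01, "max_depth": 5}, metric=0.912)
trials = load_trials(path)
print(trials[0]["config"], trials[0]["metric"])
```

Because each trial is appended as its own line, a crashed run still leaves every completed trial on disk.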