Hypothesis testing is a framework for asking "is the observed difference statistically meaningful, or could it have arisen by chance?" The mechanics are routine; the interpretation is where most people trip up.
Null and alternative
The null hypothesis $H_0$ is the default, usually "no effect" or "no difference." The alternative $H_1$ is what you'd like to demonstrate. You assume $H_0$ is true and compute the probability of seeing data at least as extreme as yours.
p-value
The probability of observing data at least as extreme as what you saw, assuming $H_0$ is true. Small $p$ → the data would be unusual if $H_0$ held → reject $H_0$. Conventional threshold: $\alpha = 0.05$.
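As a concrete illustration, here is a minimal sketch of an exact two-sided p-value for a coin-flip experiment. The scenario (60 heads in 100 flips, null of a fair coin) and the function name `binomial_two_sided_p` are assumptions chosen for the example, and "at least as extreme" is taken to mean "no more probable under the null than the observed count", which is one common convention among several.

```python
from math import comb

def binomial_two_sided_p(k, n, p0=0.5):
    """Exact two-sided p-value for k successes in n trials under H0: p = p0."""
    # Probability of every possible outcome 0..n under the null.
    pmf = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    # "At least as extreme" here means: no more likely than what we observed.
    # The small relative tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical data: 60 heads in 100 flips of a supposedly fair coin.
p_value = binomial_two_sided_p(60, 100)
print(f"p = {p_value:.4f}")
```

For this hypothetical data the p-value lands just above 0.05, so the conventional threshold would not reject the fair-coin null.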
What p is NOT
- NOT "the probability that $H_0$ is true"
- NOT "the probability that the effect is real"
- NOT "the probability of a false positive across all your experiments"
It is purely a statement under the null. To talk about the probability that $H_0$ is true, you need Bayesian inference and a prior.
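To see why a prior is needed, the sketch below applies Bayes' rule to the event "the test rejected at level alpha". It is a simplification: it conditions on rejection rather than on the exact p-value, and the prior probabilities and the power of 0.8 are illustrative assumptions, not values from the text.

```python
def prob_null_given_rejection(prior_null, alpha=0.05, power=0.8):
    """P(H0 is true | test rejected H0), by Bayes' rule."""
    false_positive = alpha * prior_null        # H0 true and rejected
    true_positive = power * (1 - prior_null)   # H0 false and rejected
    return false_positive / (false_positive + true_positive)

# Same alpha and power, very different answers depending on the prior.
for prior_null in (0.5, 0.9, 0.99):
    posterior = prob_null_given_rejection(prior_null)
    print(f"P(H0) = {prior_null:.2f} -> P(H0 | rejected) = {posterior:.3f}")
```

With a 50/50 prior, a rejection leaves about a 6% chance that the null is true; if 99% of tested hypotheses are null, the same rejection leaves about an 86% chance. The threshold of 0.05 alone says nothing about which situation you are in.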
Type I and Type II errors
Type I (false positive): rejecting $H_0$ when it's true. The probability is $\alpha$, your significance threshold. Type II (false negative): failing to reject $H_0$ when the alternative $H_1$ is true. The probability is $\beta$; statistical power is $1 - \beta$.
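Both error rates can be estimated by simulation. The sketch below assumes a two-sided one-sample z-test with known standard deviation 1, a sample size of 25, and a hypothetical true effect of 0.5 under the alternative; all of these numbers, and the helper name `rejection_rate`, are assumptions made for the example.

```python
import random
from statistics import NormalDist, fmean

def rejection_rate(true_mean, n=25, alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated experiments in which H0: mean = 0 is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # z statistic with known sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

type_i_rate = rejection_rate(true_mean=0.0)  # H0 true: estimates alpha
power = rejection_rate(true_mean=0.5)        # H1 true: estimates 1 - beta
print(f"Type I rate ~ {type_i_rate:.3f}, power ~ {power:.3f}, beta ~ {1 - power:.3f}")
```

The first estimate should sit close to 0.05 by construction. For this particular setup the power comes out at roughly 0.7, so about three in ten real effects of this size would be missed.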
One-tailed vs two-tailed
A two-tailed test rejects $H_0$ if the effect is large in either direction. A one-tailed test rejects only in one direction. Choose one-tailed only if you've pre-specified the direction and you wouldn't care about an effect in the other direction. A one-tailed test has more power in the chosen direction; picking the direction after seeing the data, or using one without pre-registering, is p-hacking.
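The difference is easy to see for a z statistic. The following sketch, which assumes a standard normal test statistic and a hypothetical observed value of 1.8, computes the p-value under each choice of tail.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """One-tailed and two-tailed p-values for a standard normal statistic."""
    std_normal = NormalDist()
    upper = 1 - std_normal.cdf(z)  # H1: effect is positive
    lower = std_normal.cdf(z)      # H1: effect is negative
    two_sided = 2 * min(upper, lower)
    return {"greater": upper, "less": lower, "two-sided": two_sided}

p = z_test_p_values(1.8)  # hypothetical observed z statistic
for alternative, value in p.items():
    print(f"{alternative:>9}: p = {value:.4f}")
```

Here the upper-tail p-value is about 0.036 while the two-tailed one is about 0.072: the same data clear the 0.05 threshold or not depending on a choice that should have been made before looking.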
Multiple testing
Run 20 independent tests at $\alpha = 0.05$ with every null true and you'd expect ~1 false positive by chance, with roughly a 64% chance of at least one. Correction methods: Bonferroni (divide $\alpha$ by the number of tests; conservative), Benjamini-Hochberg (controls the false discovery rate; less conservative). Always correct when testing many hypotheses.
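Both corrections are short enough to write out. The sketch below is a plain-Python illustration, not a reference implementation; the function names and the five p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= (k / m) * alpha
    and reject the k hypotheses with the smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject

# Hypothetical p-values from five tests.
p_values = [0.001, 0.012, 0.021, 0.039, 0.30]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

On these made-up values Bonferroni rejects only the smallest p-value, while Benjamini-Hochberg rejects the four smallest, which shows the trade-off between strict control of any false positive and control of the proportion of false discoveries.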