Hypothesis testing

The frequentist recipe for deciding whether a dataset rules out a specific parameter value:

  1. State a null hypothesis H₀ and an alternative.
  2. Pick a test statistic whose sampling distribution is known under the null.
  3. Observe the data, compute the statistic.
  4. Report the probability of seeing something at least as extreme under the null — the p-value.
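
The four steps can be sketched for the simplest case, a two-sided z-test on a mean. This is a minimal illustration rather than a general-purpose routine: it assumes the standard deviation `sigma` is known, and the function name `z_test` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean

def z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: mu = mu0, assuming sigma is known."""
    n = len(sample)
    # Step 2: a statistic that is standard Normal under H0.
    z = sqrt(n) * (fmean(sample) - mu0) / sigma
    # Step 4: probability of a statistic at least this extreme under H0.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Step 3: observe the data (made-up numbers, for illustration only).
sample = [0.9, -0.3, 1.4, 0.2, 0.7, -0.5, 1.1, 0.4]
z, p = z_test(sample, mu0=0.0, sigma=1.0)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```
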
[Figure: density plotted against z = √n · (x̄ − μ₀), showing the null N(0, 1), the alternative N(μ√n, 1), and the rejection region. Example readout: observed z = 1.60, two-sided p-value = 0.1096, so H₀ is not rejected at α = 0.05. Critical values ±1.960; standard error 0.224; power against true μ = 0.40 is 43.2%.]
Under H₀: μ = 0, the test statistic is standard Normal. P-value is the two-sided tail area past the observed z. Power (green vs. rejection region) rises with |μ| and with √n.
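
The figure's readout can be checked numerically. The sketch below assumes unit variance and n = 20; neither is stated in the text, but both follow from the reported standard error of 0.224 (1/√20 ≈ 0.224). The helper names `two_sided_p` and `power` are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def two_sided_p(z):
    """Two-sided tail area past the observed z under N(0, 1)."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def power(mu, n, alpha=0.05, sigma=1.0):
    """P(reject H0: mu = 0) when the true mean is mu, two-sided z-test."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = mu * sqrt(n) / sigma  # mean of z under the alternative
    upper = 1 - STD_NORMAL.cdf(z_crit - shift)  # mass above +z_crit
    lower = STD_NORMAL.cdf(-z_crit - shift)     # mass below -z_crit
    return upper + lower

print(f"p-value at z = 1.60: {two_sided_p(1.60):.4f}")
print(f"power at mu = 0.40, n = 20: {power(0.40, 20):.3f}")
```

Under those assumptions the two printed values should match the 0.1096 and 43.2% shown in the readout. At μ = 0 the same `power` function returns α, which is the false-positive rate.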

What to notice

  • P-value = tail area past the observed z. The rose shaded regions are the rejection zone, the values of z beyond the critical values. Drag the observed z-line across the boundary and watch the p-value cross α.
  • Power is the alternative’s tail inside the rejection region. The dashed green curve is the distribution under the true μ. Its mass that falls inside the rejection zone is the probability the test detects the effect. Raising n squeezes the sampling distribution of x̄ toward its mean μ (on the z scale, it pushes the alternative’s center μ√n away from 0), sending power toward 1.
  • P-values are uniform under the null. At true μ = 0, the observed z is itself Normal(0, 1), so the tail probability is uniform on [0, 1]. That’s why “p < 0.05 by chance” occurs 5% of the time when the null is true — not a bug, the definition.
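
The uniformity claim is easy to check by simulation. The sketch below draws z directly from N(0, 1), which is what the null implies, and counts how often the p-value falls below a few thresholds. The seed and the number of simulations are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable
std_normal = NormalDist()

# Under H0 the observed z is a draw from N(0, 1).
n_sims = 20_000
p_values = [
    2 * (1 - std_normal.cdf(abs(random.gauss(0, 1))))
    for _ in range(n_sims)
]

# Uniform on [0, 1] means P(p < a) = a for every threshold a.
for a in (0.05, 0.25, 0.50):
    frac = sum(p < a for p in p_values) / n_sims
    print(f"fraction with p < {a:.2f}: {frac:.3f}")
```

Each printed fraction should land close to its threshold, up to simulation noise.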

What a p-value is not

  • Not the probability that H₀ is true. That would require a prior over H₀ (and you’d need Bayes).
  • Not the probability that the result is due to chance. “Chance” isn’t a well-defined event.
  • Not a measure of effect size. A tiny effect with huge n hits p < 0.001; a huge effect with tiny n can sit at p = 0.3.
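
The last point can be made concrete with two hypothetical studies. The sketch assumes H₀: μ = 0 with unit variance and treats the observed mean as the effect; the specific effects and sample sizes are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(effect, n, sigma=1.0):
    """p-value when the observed mean equals `effect` (H0: mu = 0, known sigma)."""
    z = sqrt(n) * effect / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# A hundredth of a standard deviation, but a quarter of a million observations.
tiny_effect_huge_n = two_sided_p(effect=0.01, n=250_000)  # z = 5
# A full standard deviation, but a single observation.
huge_effect_tiny_n = two_sided_p(effect=1.0, n=1)         # z = 1

print(f"tiny effect, huge n: p = {tiny_effect_huge_n:.1e}")
print(f"huge effect, tiny n: p = {huge_effect_tiny_n:.2f}")
```

The first p-value is far below 0.001 and the second is near 0.3, even though the second effect is a hundred times larger.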

Power, effect size, and sample size

These three quantities are locked together: at a given α, fixing any two determines the third.
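
For the two-sided z-test of H₀: μ = 0 used above, one way to write the relationship is sketched here. It assumes unit variance, matching the N(μ√n, 1) alternative in the figure. The notation is introduced for this sketch and is not from the original text: Φ is the standard Normal CDF and z_{α/2} is the upper α/2 critical value (1.960 at α = 0.05).

```latex
\text{power}(\mu, n)
  = 1 - \Phi\!\left(z_{\alpha/2} - \mu\sqrt{n}\right)
      + \Phi\!\left(-z_{\alpha/2} - \mu\sqrt{n}\right)
```

For μ > 0 the second term is negligible once μ√n is well above zero. Dropping it and solving for n gives the usual approximation:

```latex
n \approx \left( \frac{z_{\alpha/2} + \Phi^{-1}(\text{power})}{\mu} \right)^{2}
```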

Practically:

  • Power analysis before the study. Plug in the smallest effect you care about and the power you want (typically 0.8); solve for n.
  • Post-hoc power isn’t a thing. Computing power at the observed effect size is circular — it’s just a transformation of the p-value.
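
Both points can be sketched in a few lines. The code assumes the same two-sided z-test with unit variance and α = 0.05; the function names `required_n` and `post_hoc_power` are illustrative. The search for n uses the exact power function rather than a closed-form approximation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(mu, n, alpha=0.05, sigma=1.0):
    """P(reject H0: mu = 0) when the true mean is mu, two-sided z-test."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = mu * sqrt(n) / sigma
    return 1 - STD_NORMAL.cdf(z_crit - shift) + STD_NORMAL.cdf(-z_crit - shift)

def required_n(mu, target=0.8, alpha=0.05, sigma=1.0):
    """Smallest n whose power at effect mu reaches the target."""
    if mu == 0:
        raise ValueError("a zero effect never reaches power above alpha")
    n = 1
    while power(mu, n, alpha, sigma) < target:
        n += 1
    return n

def post_hoc_power(p_value, alpha=0.05):
    """Power evaluated at the observed effect; it depends on the data only through p."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_value / 2)  # |z| recovered from the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 1 - STD_NORMAL.cdf(z_crit - z_obs) + STD_NORMAL.cdf(-z_crit - z_obs)

print("n needed for effect 0.4 at power 0.8:", required_n(0.4))
for p in (0.01, 0.05, 0.20):
    print(f"p = {p:.2f} -> post-hoc power = {post_hoc_power(p):.3f}")
```

Under these assumptions, detecting an effect of 0.4 with power 0.8 takes n = 50. `post_hoc_power` takes only the p-value as input, and a result sitting exactly at p = 0.05 always comes out with post-hoc power of about 0.5. That is the circularity: the number carries no information beyond p itself.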