3 min read · Daniel Kosbab

What the p-value actually tells you (and doesn't)

The p-value is the most widely misreported number in science.

Most people, including researchers who should know better, read a p-value of 0.03 as "there's a 3% chance the effect is due to chance" or "there's a 97% chance the effect is real." Both readings are wrong.

What it actually is

The p-value is the probability of observing data at least as extreme as what you saw, assuming the null hypothesis is true.

Read that again. Three things to notice.

  1. It is a probability about data, not about hypotheses.
  2. It is conditioned on the null being true.
  3. It includes all more-extreme outcomes, not just the one you got.

So a p-value of 0.03 says: if there were no real effect, you would see data this extreme or more extreme only 3 percent of the time.
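That "data this extreme or more extreme, assuming the null" definition can be made concrete by simulation. A minimal sketch, using an assumed example not from the post: you observe 60 heads in 100 coin flips and ask how often a fair coin (the null) would produce a result at least that far from 50.

```python
import random

random.seed(0)

def simulate_p_value(observed_heads=60, n_flips=100, n_sims=20_000):
    """Two-sided p-value for a fair-coin null, by simulation.

    Counts how often a fair coin produces a result at least as
    extreme (as far from n_flips / 2) as the observed one.
    """
    observed_dev = abs(observed_heads - n_flips / 2)
    extreme = 0
    for _ in range(n_sims):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        if abs(heads - n_flips / 2) >= observed_dev:
            extreme += 1
    return extreme / n_sims

p = simulate_p_value()
print(p)  # close to the exact binomial value of ~0.057
```

Note what the function never touches: any probability that the coin is fair. It conditions on fairness and only counts tail outcomes.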

What it does not say:

  • It does not say there is a 3 percent chance no effect exists.
  • It does not say there is a 97 percent chance the effect is real.
  • It does not say anything about how large or important the effect is.

Why the misreading is so sticky

The correct reading requires keeping track of what is conditioned on what. P(data | null) is what the p-value is. P(null | data) is what people want. These are different quantities, and the relationship between them depends on a prior.

Bayes' theorem connects them:

P(null | data) = P(data | null) · P(null) / P(data)

You need P(null), the prior probability the null is true, and P(data), the marginal probability of the data, to get P(null | data). The p-value alone supplies neither.

This isn't a pedantic quibble. In a field where the null is almost always true, because most tested ideas are dead ends, a significant result still has a good chance of being a false positive, so a small p-value is weak grounds for concluding the null is false. In a field where the null is rarely true, the same small p-value is much stronger grounds. The p-value doesn't know the difference. A Bayesian calculation does.

What the p-value is good for

One thing, specifically: a calibrated signal of surprise under the null, used as one input in a larger decision.

Use it to:

  • Judge whether your result would be surprising under the null.
  • Set a threshold for further investigation.
  • Compare evidence across studies that share a null and a design.

Do not use it to:

  • Claim an effect is real.
  • Rank effect sizes.
  • Decide whether a treatment works in practice.

The bigger problem

p < 0.05 became a bright line because it's convenient, not because it's principled. A study with a tiny effect and a huge sample can clear it easily. A study with a large effect and a small sample can fail it. The line says nothing about whether the effect matters.

A drug that lowers blood pressure by 0.3 mmHg with p = 0.001 is statistically significant and clinically useless. A drug that lowers it by 10 mmHg with p = 0.08 is statistically non-significant and potentially important. The p-value ranks neither.
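The blood-pressure comparison can be reproduced with a one-sample z-test. A sketch under assumptions the post doesn't state: a known population SD of 12 mmHg and sample sizes of 20,000 and 5, chosen only to make the two p-values land near the quoted ones.

```python
import math

def z_test_p(effect, sd, n):
    """Two-sided p-value for a one-sample z-test of mean effect vs 0.

    Assumes a known population SD -- a simplification for illustration.
    P(|Z| > z) for a standard normal equals erfc(z / sqrt(2)).
    """
    z = effect / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))

# sd = 12 mmHg and both sample sizes are assumptions, not from the post:
print(z_test_p(effect=0.3, sd=12, n=20_000))  # tiny effect, huge sample: p < 0.001
print(z_test_p(effect=10.0, sd=12, n=5))      # big effect, small sample: p > 0.05
```

The 0.3 mmHg effect clears p < 0.05 by brute force of sample size; the 10 mmHg effect misses it for lack of one. Nothing in either p-value reflects which number a patient should care about.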

The working rule

Treat a small p-value as permission to look more carefully, not as the answer. The answer is about effect size, replicability, and whether the thing you measured is the thing you meant to measure.

p < 0.05 passes the null test. It does not pass the truth test. Keep the two separate.

© 2026 Daniel Kosbab
