Why this exists
“Significant” doesn’t tell you if a result is big enough to matter. With enough data, even a tiny, useless change can be statistically significant. Effect size fixes that by answering a different question: How large is the difference?
Effect size is the “so what?” number. P‑values answer “is there an effect?”; effect size answers “how large is it?” You need both to make a good decision.
What is effect size?
It’s a single number that describes the magnitude of a difference or relationship. For comparing two averages (like time on page, revenue per user, test scores), the most common choice is Cohen’s d.
- Cohen’s d: difference between two means measured in units of a standard deviation
- Think of it as: how many “spread units” apart the groups are
- Useful for comparing across studies, metrics, and sample sizes
Cohen’s d, in one sentence
It’s the difference in means divided by a typical amount of spread. If the difference is half a standard deviation, then d = 0.5.
A 60‑second example
You test a new onboarding screen.
- Control average time to complete: 120 seconds
- Variant average time: 132 seconds
- Typical spread (standard deviation): about 60 seconds
Cohen’s d is:
d = (132 − 120) / 60 = 0.20
That’s a small effect. Whether to ship depends on context: does a small time increase help activation? Is there a trade‑off with completion rate? Effect size helps you weigh that.
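If you prefer to see that arithmetic as code, here it is in a few lines of Python (the numbers come straight from the example above; using a single shared “typical spread” is the same simplification the example makes):

```python
# Sanity-check of the onboarding example; all values are from the text.
control_mean = 120.0   # control: average seconds to complete
variant_mean = 132.0   # variant: average seconds to complete
typical_sd = 60.0      # stated typical spread (standard deviation)

d = (variant_mean - control_mean) / typical_sd
print(f"Cohen's d = {d:.2f}")  # -> Cohen's d = 0.20
```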
How to compute it (without getting lost in math)
For two independent groups, Cohen’s d uses a pooled standard deviation (a blended estimate of spread):
s_p = sqrt( ((n1−1)·s1^2 + (n2−1)·s2^2) / (n1 + n2 − 2) )
d = (mean2 − mean1) / s_p
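As a minimal sketch, those two formulas map onto dependency‑free Python like this (the function name is mine; group2 plays the variant’s role, so a positive d means the variant’s mean is higher):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    # Sample variances with n-1 in the denominator, matching the formula above.
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    s_p = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m2 - m1) / s_p
```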
Practical guidance:
- If the group spreads (s1, s2) look similar, using their pooled (or even average) SD is fine for intuition.
- If spreads are very different, consider Glass’s Δ: divide by the control group’s SD.
- For before/after with the same people, use a paired version (standardize by the SD of the differences).
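The last two bullets translate almost directly into code; a minimal sketch using Python’s statistics module (the function names are mine):

```python
import statistics

def glass_delta(control, variant):
    """Glass's Δ: standardize by the control group's SD only."""
    return (statistics.mean(variant) - statistics.mean(control)) / statistics.stdev(control)

def paired_d(before, after):
    """Paired effect size: mean difference over the SD of the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.mean(diffs) / statistics.stdev(diffs)
```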
What the numbers mean (use context!)
Common conventions for absolute values of d:
- 0.2 ≈ small (noticeable with lots of users)
- 0.5 ≈ medium (clear, meaningful)
- 0.8 ≈ large (substantial, obvious)
These are rules of thumb. A “small” effect can be huge at scale (conversion rate), while a “large” effect on a noisy vanity metric might be irrelevant. Always describe the practical impact in your own units (e.g., seconds saved, dollars per user, % activated).
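For what it’s worth, those conventions reduce to a tiny lookup (the function name is mine; treat the label as a starting point, never a verdict):

```python
def describe_d(d):
    """Map |d| onto the conventional labels -- rules of thumb, not laws."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below the conventional 'small' threshold"
```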
What to report
- The effect size (d) and which group is “better” (sign matters)
- A 95% confidence interval around d (how precise you are; see the sketch below)
- The practical translation in your units (“about 12 seconds slower per user”)
- Any assumptions (independent groups, similar spreads; or say if you used Glass’s Δ or a paired approach)
Confidence tells you how sure you are; effect size tells you how big it is. Report both.
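One common way to get that confidence interval is a large‑sample normal approximation (a sketch; exact intervals use the noncentral t distribution, and the sample sizes in the call below are made up):

```python
import math

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d (large-sample normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical sample sizes, just to show the call.
lo, hi = d_confidence_interval(0.20, n1=200, n2=200)
print(f"d = 0.20, 95% CI [{lo:.2f}, {hi:.2f}]")  # roughly [0.00, 0.40]
```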
When not to use Cohen’s d
- Binary outcomes (converted vs. not): use risk difference, risk ratio, or odds ratio instead
- Very different spreads between groups: consider Glass’s Δ or Welch‑based approaches
- Same people measured twice: use a paired effect size
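For the binary case in the first bullet, the alternatives are simple ratios of conversion rates; a sketch with made‑up counts (the function name is mine):

```python
def binary_effect_sizes(conversions_a, n_a, conversions_b, n_b):
    """Risk difference, risk ratio, and odds ratio for a binary outcome."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    risk_difference = p_b - p_a
    risk_ratio = p_b / p_a
    odds_ratio = (p_b / (1 - p_b)) / (p_a / (1 - p_a))
    return risk_difference, risk_ratio, odds_ratio

# Hypothetical: 100/1000 convert in control, 120/1000 in variant.
print(binary_effect_sizes(100, 1000, 120, 1000))  # (0.02, 1.2, ~1.23)
```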
For the curious (optional details)
- Small samples tend to make d a bit optimistic. A common correction is Hedges’ g, which slightly shrinks d toward zero. If you’re publishing or samples are small, report g as well.
- If you already have a t‑test: d ≈ t × sqrt(1/n1 + 1/n2) for independent groups (a quick back‑of‑the‑envelope link between significance and size).
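Both of these fit in a couple of lines (a sketch; the shrink factor shown is the common approximation to the exact Hedges’ g correction):

```python
import math

def hedges_g(d, n1, n2):
    """Hedges' g: shrink d slightly toward zero for small samples."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

def d_from_t(t, n1, n2):
    """Back-of-the-envelope d from an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)
```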
TL;DR
- Effect size answers “how large is the difference?”—the decision‑making piece p‑values can’t give you
- Cohen’s d standardizes a mean difference using a typical spread (standard deviation)
- Use context and confidence intervals; choose alternatives for binary, paired, or highly unequal‑spread cases