Why this exists
“Significant” doesn’t tell you if a result is big enough to matter. With enough data, even a tiny, useless change can be statistically significant. Effect size fixes that by answering a different question: How large is the difference?
Effect size is the “so what?” number. P‑values answer “is there an effect?”; effect size answers “how large is it?” You need both to make a good decision.
What is effect size?
It’s a single number that describes the magnitude of a difference or relationship. For comparing two averages (like time on page, revenue per user, test scores), the most common choice is Cohen’s d.
- Cohen’s d: difference between two means measured in units of a standard deviation
- Think of it as: how many “spread units” apart the groups are
- Useful for comparing across studies, metrics, and sample sizes
Cohen’s d, in one sentence
It’s the difference in means divided by a typical amount of spread. If the difference is half a standard deviation, then d = 0.5.
A 60‑second example
You test a new onboarding screen.
- Control average time to complete: 120 seconds
- Variant average time: 132 seconds
- Typical spread (standard deviation): about 60 seconds
Cohen’s d is:
d = (132 − 120) / 60 = 0.20
That’s a small effect. Whether to ship depends on context: does a small time increase help activation? Is there a trade‑off with completion rate? Effect size helps you weigh that.
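If you prefer to see that arithmetic as code, here it is in a few lines of Python (the numbers come straight from the example above; using a single shared “typical spread” is the same simplification the example makes):

```python
# Sanity-check of the onboarding example; all values are from the text.
control_mean = 120.0   # control: average seconds to complete
variant_mean = 132.0   # variant: average seconds to complete
typical_sd = 60.0      # stated typical spread (standard deviation)

d = (variant_mean - control_mean) / typical_sd
print(f"Cohen's d = {d:.2f}")  # -> Cohen's d = 0.20
```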
How to compute it (without getting lost in math)
For two independent groups, Cohen’s d uses a pooled standard deviation (a blended estimate of spread):
s_p = sqrt( ((n1−1)·s1^2 + (n2−1)·s2^2) / (n1 + n2 − 2) )
d = (mean2 − mean1) / s_p
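As a minimal sketch, those two formulas map onto dependency‑free Python like this (the function name is mine; group2 plays the variant’s role, so a positive d means the variant’s mean is higher):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    # Sample variances with n-1 in the denominator, matching the formula above.
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    s_p = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m2 - m1) / s_p
```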
Practical guidance:
- If the group spreads (s1, s2) look similar, using their pooled (or even average) SD is fine for intuition.
- If spreads are very different, consider Glass’s Δ: divide by the control group’s SD.
- For before/after with the same people, use a paired version (standardize by the SD of the differences).
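The last two bullets translate almost directly into code; a minimal sketch using Python’s statistics module (the function names are mine):

```python
import statistics

def glass_delta(control, variant):
    """Glass's Δ: standardize by the control group's SD only."""
    return (statistics.mean(variant) - statistics.mean(control)) / statistics.stdev(control)

def paired_d(before, after):
    """Paired effect size: mean difference over the SD of the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.mean(diffs) / statistics.stdev(diffs)
```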
What the numbers mean (use context!)
Common conventions for absolute values of d:
- 0.2 ≈ small (noticeable with lots of users)
- 0.5 ≈ medium (clear, meaningful)
- 0.8 ≈ large (substantial, obvious)
These are rules of thumb. A “small” effect can be huge at scale (conversion rate), while a “large” effect on a noisy vanity metric might be irrelevant. Always describe the practical impact in your own units (e.g., seconds saved, dollars per user, % activated).
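For what it’s worth, those conventions reduce to a tiny lookup (the function name is mine; treat the label as a starting point, never a verdict):

```python
def describe_d(d):
    """Map |d| onto the conventional labels -- rules of thumb, not laws."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below the conventional 'small' threshold"
```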
What to report
- The effect size (d) and which group is “better” (sign matters)
- A 95% confidence interval around d (how precise you are; see the sketch below)
- The practical translation in your units (“about 12 seconds slower per user”)
- Any assumptions (independent groups, similar spreads; or say if you used Glass’s Δ or a paired approach)
Confidence tells you how sure you are; effect size tells you how big it is. Report both.
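One common way to get that confidence interval is a large‑sample normal approximation (a sketch; exact intervals use the noncentral t distribution, and the sample sizes in the call below are made up):

```python
import math

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% CI for Cohen's d (large-sample normal approximation)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical sample sizes, just to show the call.
lo, hi = d_confidence_interval(0.20, n1=200, n2=200)
print(f"d = 0.20, 95% CI [{lo:.2f}, {hi:.2f}]")  # roughly [0.00, 0.40]
```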
When not to use Cohen’s d
- Binary outcomes (converted vs. not): use risk difference, risk ratio, or odds ratio instead
- Very different spreads between groups: consider Glass’s Δ or Welch‑based approaches
- Same people measured twice: use a paired effect size
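For the binary case in the first bullet, the alternatives are simple ratios of conversion rates; a sketch with made‑up counts (the function name is mine):

```python
def binary_effect_sizes(conversions_a, n_a, conversions_b, n_b):
    """Risk difference, risk ratio, and odds ratio for a binary outcome."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    risk_difference = p_b - p_a
    risk_ratio = p_b / p_a
    odds_ratio = (p_b / (1 - p_b)) / (p_a / (1 - p_a))
    return risk_difference, risk_ratio, odds_ratio

# Hypothetical: 100/1000 convert in control, 120/1000 in variant.
print(binary_effect_sizes(100, 1000, 120, 1000))  # (0.02, 1.2, ~1.23)
```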
For the curious (optional details)
- Small samples tend to make d a bit optimistic. A common correction is Hedges’ g, which slightly shrinks d toward zero. If you’re publishing or samples are small, report g as well.
- If you already have a t‑test: d ≈ t × sqrt(1/n1 + 1/n2) for independent groups (a quick back‑of‑the‑envelope link between significance and size).
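Both of these fit in a couple of lines (a sketch; the shrink factor shown is the common approximation to the exact Hedges’ g correction):

```python
import math

def hedges_g(d, n1, n2):
    """Hedges' g: shrink d slightly toward zero for small samples."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

def d_from_t(t, n1, n2):
    """Back-of-the-envelope d from an independent-samples t statistic."""
    return t * math.sqrt(1 / n1 + 1 / n2)
```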
TL;DR
- Effect size answers “how large is the difference?”—the decision‑making piece p‑values can’t give you
- Cohen’s d standardizes a mean difference using a typical spread (standard deviation)
- Use context and confidence intervals; choose alternatives for binary, paired, or highly unequal‑spread cases