How to Read A/B Test Results: A Visual Guide
a-b-testing
statistics
product
The TL;DR
- Primary metric: the one the change is meant to move (e.g., conversion rate).
- Lagging metric: value follows later (e.g., spend per user, ARPU).
- Guardrail metrics: must not degrade (e.g., crash rate, latency, unsubscribe rate).
- Stat sig (statistical significance): p-value < α (commonly 0.05).
- Gray confidence intervals: the CI of the difference includes 0 → not significant yet.
- Decide: ship if primary improves, guardrails safe, and effect size is meaningful; otherwise iterate.
A fake A/B results report you can play with
Below is a results-only report for a single test. It shows all metrics from our business case: a primary, a lagging metric, and two guardrails.
95% CI
α = 5.0%
Metric | Type | Group A (CI) | Group B (CI) | Difference (CI) | Rel. Change | P-Value | Result |
---|---|---|---|---|---|---|---|
Adoption Rate | proportion | 20.00% (18.89% – 21.11%) | 22.50% (21.34% – 23.66%) | 2.50% (0.90% – 4.10%) | 12.50% | 0.0022 | B higher Significant |
Spend per User | continuous | 25.00 (24.78 – 25.22) | 28.00 (27.78 – 28.22) | 3.00 (2.69 – 3.31) | 12.00% | 0.0000 | B higher Significant |
Crash Rate | proportion | 1.00% (0.72% – 1.28%) | 1.40% (1.07% – 1.73%) | 0.40% (-0.03% – 0.83%) | 40.00% | 0.0662 | Not Significant |
Page Load Time (s) | continuous | 1.90 (1.89 – 1.91) | 2.10 (2.09 – 2.11) | 0.20 (0.18 – 0.22) | 10.53% | 0.0000 | B higher Significant |
- Legend:
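- Colored mini-bar: significant; green = B higher (good), red = B lower (bad) for metrics where higher is better. For Crash Rate and Page Load Time, higher is worse, so a significant "B higher" there is a regression.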
- Gray mini-bar: not significant; CI includes 0
- Difference (CI): B − A with confidence interval
What each section means
Primary, lagging, and guardrail metrics
- Primary: the decision-maker. If this doesn’t meaningfully improve, you likely don’t ship.
- Lagging: downstream value. Often noisier or slower to move (e.g., spend per user, LTV). Treat signals here as directional unless you designed power for it.
- Guardrails: safety checks to prevent harm (e.g., error rate, latency, churn). You don’t ship if these degrade materially—even if the primary improves.
Reading the table: row by row
We’ll focus on the primary metric only and read one row deeply.
95% CI
α = 5.0%
Metric | Type | Group A (CI) | Group B (CI) | Difference (CI) | Rel. Change | P-Value | Result |
---|---|---|---|---|---|---|---|
Adoption Rate | proportion | 20.00% (18.89% – 21.11%) | 22.50% (21.34% – 23.66%) | 2.50% (0.90% – 4.10%) | 12.50% | 0.0022 | B higher Significant |
- Group A / Group B (with CI): shows the point estimate and uncertainty. With 5,000 users per group and 1,000 vs 1,125 successes, A ≈ 20.0% and B ≈ 22.5% with tight intervals due to large n.
- Difference (CI): B − A ≈ +2.5 percentage points. If its CI excludes 0, the result is significant at the chosen confidence level.
- Rel. Change + mini CI bar: +12.5% relative lift (2.50 points on a 20.00% baseline); the mini bar visualizes the uncertainty around that relative lift. A colored bar indicates significance; gray means not yet significant.
- P-Value: the probability of observing an effect at least this extreme if there is truly no difference. If p < α (0.05 here, matching the 95% CI), we call it statistically significant. The p-value alone says nothing about size, so combine it with the CI to judge direction and magnitude.
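The arithmetic behind this row can be reproduced in a few lines. This is a minimal sketch, assuming the report uses normal-approximation (Wald) intervals and an unpooled two-sided z-test; the article does not say which method the report uses, and the helper name `two_proportion_summary` is made up for illustration. The inputs are the counts quoted above: 1,000 and 1,125 successes out of 5,000 users per group.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(succ_a, n_a, succ_b, n_b, alpha=0.05):
    """Summarize a two-group proportion test (hypothetical helper).

    Assumes Wald intervals and an unpooled two-sided z-test.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05

    # Point estimates and their standard errors
    p_a, p_b = succ_a / n_a, succ_b / n_b
    se_a = sqrt(p_a * (1 - p_a) / n_a)
    se_b = sqrt(p_b * (1 - p_b) / n_b)

    # Difference B - A, its standard error, and the two-sided p-value
    diff = p_b - p_a
    se_diff = sqrt(se_a ** 2 + se_b ** 2)
    p_value = 2 * (1 - std_normal.cdf(abs(diff / se_diff)))

    return {
        "ci_a": (p_a - z_crit * se_a, p_a + z_crit * se_a),
        "ci_b": (p_b - z_crit * se_b, p_b + z_crit * se_b),
        "diff": diff,
        "ci_diff": (diff - z_crit * se_diff, diff + z_crit * se_diff),
        "rel_change": diff / p_a,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


row = two_proportion_summary(1000, 5000, 1125, 5000)
print(f"Group A CI:    {row['ci_a'][0]:.2%} to {row['ci_a'][1]:.2%}")
print(f"Group B CI:    {row['ci_b'][0]:.2%} to {row['ci_b'][1]:.2%}")
print(f"Difference CI: {row['ci_diff'][0]:.2%} to {row['ci_diff'][1]:.2%}")
print(f"Rel. change:   {row['rel_change']:.2%}")
print(f"P-value:       {row['p_value']:.4f}")
```

Under these assumptions the printed values should agree with the Adoption Rate row at the displayed precision. A pooled z-test or a Wilson interval would give slightly different numbers.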
Decision framework (use this in real life)
- Meaningful = clears your pre-registered threshold (MDE, revenue delta, latency budget, etc.).
- If it’s trending (gray CI but promising), extend duration or increase sample size before deciding.
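The decision rule from the TL;DR and the two notes above can be written down as a small function. This is a sketch only: the function `decide`, its arguments, and the 2-point threshold in the usage lines are illustrative assumptions, and it assumes a primary metric where higher is better and a single yes/no judgment on whether the guardrails held.

```python
def decide(diff, ci_low, ci_high, min_meaningful, guardrails_ok):
    """Return 'ship', 'extend', or 'iterate' (hypothetical helper).

    diff, ci_low, ci_high: B - A for the primary metric and its CI bounds.
    min_meaningful: pre-registered threshold (positive; higher is better).
    guardrails_ok: True if no guardrail degraded beyond its tolerance.
    """
    if not guardrails_ok:
        return "iterate"  # guardrail harm blocks shipping, whatever the primary shows

    significant = ci_low > 0 or ci_high < 0  # CI excludes 0
    if significant and diff >= min_meaningful:
        return "ship"
    if not significant and diff > 0:
        return "extend"  # gray CI but trending: collect more data
    return "iterate"


# Primary row from the report, with an assumed 2-point threshold
print(decide(0.025, 0.0090, 0.0410, min_meaningful=0.02, guardrails_ok=True))
# Same primary result, but a guardrail judged to have degraded materially
print(decide(0.025, 0.0090, 0.0410, min_meaningful=0.02, guardrails_ok=False))
```

Whether a guardrail change counts as material is a judgment made against a tolerance set before the test, which is why it is passed in here as a flag and not computed.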
What gray confidence intervals mean (and why you see them often)
- Gray = your difference CI includes 0 at the current sample size; the evidence isn’t sufficient yet to rule out “no effect.”
- This often happens early in a test or with noisy metrics (like spend per user). It’s a prompt to collect more data, not proof of no effect.
- If the CI is very wide, you’re underpowered; either increase sample size or accept a higher MDE.
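The Crash Rate row in the report is a concrete gray case. The sketch below recomputes its difference CI at several sample sizes, assuming a Wald interval, equal group sizes, and (purely for illustration) that the observed rates of 1.0% and 1.4% stay fixed as data accumulates. In a real test the point estimates move too, so this shows only how the interval width depends on n.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(p_a, p_b, n_per_group, alpha=0.05):
    """Wald CI for B - A with equal group sizes (assumed method)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_group)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se


# Crash Rate: 1.0% in A vs 1.4% in B, held fixed while n grows
for n in (1_000, 5_000, 10_000, 20_000):
    low, high = diff_ci(0.010, 0.014, n)
    label = "gray (includes 0)" if low <= 0 <= high else "colored (excludes 0)"
    print(f"n = {n:>6}: {low:+.2%} to {high:+.2%} -> {label}")
```

At 5,000 users per group the interval reproduces the report's -0.03% to 0.83%, which just barely includes 0; at 10,000 per group the same observed rates would give an interval that excludes 0.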
Practical checklist
- Before you start: define primary, lagging, and guardrail metrics; set α, power, and MDE.
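- During: don’t stop early because an interim peek looks good (repeated peeking inflates false positives); watch guardrails; confirm traffic is split evenly between groups and the implementation is clean.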
- After: read intervals first, then p-values; check direction and business meaning; segment if relevant; document learnings.
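One way to check the "even traffic" item is a chi-square goodness-of-fit test on the group sizes, often called a sample ratio mismatch check. The sketch below is an assumption about how you might do it, not something the report itself shows; the function name `srm_p_value` and the user counts are hypothetical. It uses only the standard library, relying on the fact that for one degree of freedom the chi-square tail probability equals `erfc(sqrt(chi2 / 2))`.

```python
from math import erfc, sqrt


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """P-value of a chi-square test that the A/B split matches the plan.

    Hypothetical helper; valid for two groups only (1 degree of freedom).
    """
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for a planned 50/50 split
print(f"5000 vs 5050: p = {srm_p_value(5000, 5050):.3f}")
print(f"5000 vs 5300: p = {srm_p_value(5000, 5300):.4f}")
```

A very small p-value suggests the assignment or logging is off, in which case the metric comparisons are suspect regardless of how they look.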
Glossary in one minute
- Statistical significance (stat sig): result unlikely under the null; often p < 0.05.
- Confidence interval (CI): plausible range for the true effect. If the difference CI excludes 0, we call it significant at that level.
- Power: probability of detecting a true effect of a given size. Common target: 80%.
- MDE: smallest effect you care to reliably detect.
- Guardrail: metric that must not regress beyond a tolerated threshold.
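Power, MDE, and α are tied together through sample size. As a rough sketch, the standard normal-approximation formula for comparing two proportions gives the users needed per group; the function name `sample_size_per_group` is made up, and the 20% baseline with a 2.5-point MDE is borrowed from the sample report only as an illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_base, mde_abs, alpha=0.05, power=0.80):
    """Approximate users per group for a two-sided two-proportion test.

    Hypothetical helper using the unpooled normal approximation.
    mde_abs is the absolute lift to detect (0.025 = 2.5 percentage points).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_new = p_base + mde_abs
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


for mde in (0.05, 0.025, 0.01):
    n = sample_size_per_group(0.20, mde)
    print(f"MDE = {mde:.1%} absolute -> about {n:,} users per group")
```

Because the MDE enters the formula squared, halving it roughly quadruples the required sample, which is the trade-off behind "increase sample size or accept a higher MDE."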
If your primary improves, guardrails hold, and the effect clears your minimum meaningful threshold, you can ship with confidence, then keep monitoring in production.