How to Read A/B Test Results: A Visual Guide

The TL;DR

Primary metric: the one the change is meant to move (e.g., conversion rate).
Lagging metric: downstream value that shows up later (e.g., spend per user, ARPU).
Guardrail metrics: must not degrade (e.g., crash rate, latency, unsubscribe rate).
Stat sig (statistical significance): p-value < α (commonly 0.05).
Gray confidence intervals: the CI for the difference includes 0 → not significant yet.
Decide: ship if primary improves, guardrails safe, and effect size is meaningful; otherwise iterate.

A fake A/B results report you can play with

Below is a results-only report for a single test. It shows all metrics from our business case: a primary, a lagging metric, and two guardrails.

Report settings: 95% CI, α = 5.0%

Values include 95% confidence intervals. Rows are highlighted when the difference excludes 0.
Metric	Type	Group A (CI)	Group B (CI)	Difference (CI)	Rel. Change	P-Value	Result
Adoption Rate	proportion	20.00% (18.89% – 21.11%)	22.50% (21.34% – 23.66%)	2.50% (0.90% – 4.10%)	12.50%	0.0022	B higher Significant
Spend per User	continuous	25.00 (24.78 – 25.22)	28.00 (27.78 – 28.22)	3.00 (2.69 – 3.31)	12.00%	< 0.0001	B higher Significant
Crash Rate	proportion	1.00% (0.72% – 1.28%)	1.40% (1.07% – 1.73%)	0.40% (-0.03% – 0.83%)	40.00%	0.0662	Not Significant
Page Load Time (s)	continuous	1.90 (1.89 – 1.91)	2.10 (2.09 – 2.11)	0.20 (0.18 – 0.22)	10.53%	< 0.0001	B higher Significant

Legend:
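- Colored mini-bar: significant; green = B higher, red = B lower. "Higher" is only good for metrics you want to increase (Adoption Rate, Spend per User); for guardrails such as Crash Rate and Page Load Time, B higher is a regression.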
- Gray mini-bar: not significant; CI includes 0
- Difference (CI): B − A with confidence interval
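
To make the legend concrete, here is a minimal Python sketch that derives each row's significance label from its Difference (CI) column alone. The bounds are copied from the report above; the `classify` helper and its label wording are illustrative assumptions, not the tool that produced the report.

```python
# Difference CIs (B - A) copied from the report above.
# Proportion rows are in percentage points.
rows = [
    ("Adoption Rate", 0.90, 4.10),
    ("Spend per User", 2.69, 3.31),
    ("Crash Rate", -0.03, 0.83),
    ("Page Load Time (s)", 0.18, 0.22),
]


def classify(ci_low, ci_high):
    """Label a row from its difference CI (hypothetical helper)."""
    if ci_low > 0:
        return "B higher, significant"
    if ci_high < 0:
        return "B lower, significant"
    return "not significant (CI includes 0)"


results = {name: classify(low, high) for name, low, high in rows}
for name, label in results.items():
    print(f"{name}: {label}")
```

Only Crash Rate comes out gray, because its interval straddles 0. Whether "B higher" is good or bad still depends on the metric.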

What each section means

Primary, lagging, and guardrail metrics

Primary: the decision-maker. If this doesn’t meaningfully improve, you likely don’t ship.
Lagging: downstream value. Often noisier or slower to move (e.g., spend per user, LTV). Treat signals here as directional unless you designed power for it.
Guardrails: safety checks to prevent harm (e.g., error rate, latency, churn). You don’t ship if these degrade materially—even if the primary improves.

Reading the table: row by row

We’ll focus on the primary metric only and read one row deeply.

Values include 95% confidence intervals. Rows are highlighted when the difference excludes 0.
Metric	Type	Group A (CI)	Group B (CI)	Difference (CI)	Rel. Change	P-Value	Result
Adoption Rate	proportion	20.00% (18.89% – 21.11%)	22.50% (21.34% – 23.66%)	2.50% (0.90% – 4.10%)	12.50%	0.0022	B higher Significant

Group A / Group B (with CI): shows the point estimate and uncertainty. With 5,000 users per group and 1,000 vs 1,125 successes, A ≈ 20.0% and B ≈ 22.5% with tight intervals due to large n.
Difference (CI): B − A ≈ +2.5 percentage points. If its CI excludes 0, the result is significant at the chosen confidence level.
Rel. Change + mini CI bar: ≈ +12.5% relative lift; the mini bar visualizes the uncertainty around that relative lift. A colored bar indicates significance; gray means not yet significant.
P-Value: the probability of observing an effect at least this extreme if there is truly no difference. If p < α (0.05 here, matching the 95% CI), we call the result statistically significant. Combine this with the CI to judge direction and magnitude.
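
The numbers in this row can be rebuilt from the raw counts. The sketch below assumes a normal-approximation (Wald) interval with an unpooled standard error and a two-sided z-test. The report does not say which method it used, so treat this as one plausible reconstruction; the function name `two_proportion_summary` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_summary(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Difference in proportions (B - A) with a Wald CI and two-sided p-value."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_b - p_a
    # Unpooled standard error of the difference (an assumption, see lead-in).
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z = diff / se
    return {
        "diff": diff,
        "ci_low": diff - z_crit * se,
        "ci_high": diff + z_crit * se,
        "rel_change": diff / p_a,
        "p_value": 2 * (1 - NormalDist().cdf(abs(z))),
    }


# Counts quoted above: 5,000 users per group, 1,000 vs 1,125 successes.
summary = two_proportion_summary(1000, 5000, 1125, 5000)
print(f"diff = {summary['diff']:.2%} "
      f"({summary['ci_low']:.2%} to {summary['ci_high']:.2%}), "
      f"rel. change = {summary['rel_change']:.2%}, "
      f"p = {summary['p_value']:.4f}")
```

Under these assumptions the output should match the row: a 2.50-point difference with a CI of roughly 0.90% to 4.10%, a 12.50% relative lift, and p ≈ 0.0022.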

Decision framework (use this in real life)

Meaningful = clears your pre-registered threshold (MDE, revenue delta, latency budget, etc.).
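If it’s trending (gray CI but a promising point estimate), extend the duration or increase the sample size before deciding, ideally up to a sample size planned from your MDE rather than until the result happens to turn significant.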
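
One way to encode this framework is sketched below. The `decide` function, the rule that a guardrail counts as degraded only when the whole CI of its increase sits above the tolerance, and every threshold in the example are assumptions for illustration. A stricter team might compare the CI's upper bound instead.

```python
def decide(primary, min_meaningful_effect, guardrails):
    """Return a decision string (hypothetical policy, see lead-in).

    primary: (diff, ci_low, ci_high) for B - A on a higher-is-better metric.
    guardrails: {name: ((ci_low, ci_high), tolerated_increase)} for
                higher-is-worse metrics, all expressed as B - A.
    """
    degraded = [
        name
        for name, ((ci_low, _ci_high), tolerance) in guardrails.items()
        if ci_low > tolerance
    ]
    if degraded:
        return "do not ship: guardrail degraded (" + ", ".join(degraded) + ")"

    diff, ci_low, ci_high = primary
    if ci_low > 0 and diff >= min_meaningful_effect:
        return "ship"
    if ci_low <= 0 <= ci_high and diff > 0:
        return "extend: promising but not significant"
    return "iterate"


# Intervals from the report; the 2.0-point threshold and the guardrail
# tolerances are made-up values, not part of the report.
primary = (2.50, 0.90, 4.10)
strict = {
    "Crash Rate": ((-0.03, 0.83), 0.50),
    "Page Load Time (s)": ((0.18, 0.22), 0.10),
}
lenient = {**strict, "Page Load Time (s)": ((0.18, 0.22), 0.25)}

print(decide(primary, 2.0, strict))
print(decide(primary, 2.0, lenient))
print(decide((1.0, -0.5, 2.5), 2.0, lenient))
```

The same report yields a different decision depending on the latency budget, which is why the tolerances need to be fixed before the test starts.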

What gray confidence intervals mean (and why you see them often)

Gray = your difference CI includes 0 at the current sample size; the evidence isn’t sufficient yet to rule out “no effect.”
This often happens early in a test or with noisy metrics (like spend per user). It’s a prompt to collect more data, not proof of no effect.
If the CI is very wide, you’re underpowered; either increase sample size or accept a higher MDE.
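
To see how sample size and MDE trade off, the sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test. The function name and the example inputs (20% baseline, 2.5-point MDE, 80% power) are illustrative assumptions. A dedicated power calculator may give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / mde_abs ** 2)


# Hypothetical planning inputs: 20% baseline, 2.5-point vs 1-point MDE.
n_large_mde = sample_size_per_group(0.20, 0.025)
n_small_mde = sample_size_per_group(0.20, 0.010)
print(n_large_mde, n_small_mde)
```

With these inputs the first call lands at roughly 4,200 users per group, and shrinking the MDE to 1 point multiplies the requirement several times over. That is the practical meaning of "increase sample size or accept a higher MDE."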

Practical checklist

Before you start: define primary, lagging, and guardrail metrics; set α, power, and MDE.
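During: don’t stop early just because an interim look crosses significance (peeking); watch guardrails; confirm the traffic split matches the planned allocation and the implementation is clean.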
After: read intervals first, then p-values; check direction and business meaning; segment if relevant; document learnings.

Glossary in one minute

Statistical significance (stat sig): result unlikely under the null; often p < 0.05.
Confidence interval (CI): plausible range for the true effect. If the difference CI excludes 0, we call it significant at that level.
Power: probability of detecting a true effect of a given size. Common target: 80%.
MDE: smallest effect you care to reliably detect.
Guardrail: metric that must not regress beyond a tolerated threshold.

If your primary improves, guardrails hold, and the effect clears your minimum meaningful threshold, you can confidently ship—and then continue monitoring in production.