How to Read A/B Test Results: A Visual Guide


The TL;DR

  • Primary metric: the one the change is meant to move (e.g., conversion rate).
  • Lagging metric: value follows later (e.g., spend per user, ARPU).
  • Guardrail metrics: must not degrade (e.g., crash rate, latency, unsubscribe rate).
  • Stat sig (statistical significance): p-value < α (commonly 0.05).
  • Gray confidence intervals: the difference CI includes 0 → not significant yet.
  • Decide: ship if primary improves, guardrails safe, and effect size is meaningful; otherwise iterate.

A fake A/B results report you can play with

Below is a results-only report for a single test. It shows all the metrics from our business case: a primary, a lagging metric, and two guardrails; after the legend there is a short sketch of how the continuous rows are typically computed.

Settings: 95% CI, α = 5.0%

| Metric | Type | Group A (CI) | Group B (CI) | Difference (CI) | Rel. Change | P-Value | Result |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adoption Rate | proportion | 20.00% (18.89% – 21.11%) | 22.50% (21.34% – 23.66%) | 2.50% (0.90% – 4.10%) | 12.50% | 0.0022 | B higher, Significant |
| Spend per User | continuous | 25.00 (24.78 – 25.22) | 28.00 (27.78 – 28.22) | 3.00 (2.69 – 3.31) | 12.00% | 0.0000 | B higher, Significant |
| Crash Rate | proportion | 1.00% (0.72% – 1.28%) | 1.40% (1.07% – 1.73%) | 0.40% (-0.03% – 0.83%) | 40.00% | 0.0662 | Not Significant |
| Page Load Time (s) | continuous | 1.90 (1.89 – 1.91) | 2.10 (2.09 – 2.11) | 0.20 (0.18 – 0.22) | 10.53% | 0.0000 | B higher, Significant |
Values include 95% confidence intervals. Rows are highlighted when the difference excludes 0.
  • Legend:
    • Colored mini-bar: significant; green = B higher (good), red = B lower (bad)
    • Gray mini-bar: not significant; CI includes 0
    • Difference (CI): B − A with confidence interval
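
The proportion rows (Adoption Rate, Crash Rate) usually come from a two-proportion test, while the continuous rows (Spend per User, Page Load Time) are usually compared with a two-sample t-test on per-user values. Here is a minimal sketch of the continuous case; the group size (5,000 users) and standard deviation (8) are assumptions chosen to roughly match the Spend per User row, and the data is simulated, so the printed numbers will land close to, but not exactly on, the report's values.

```python
# Minimal sketch: comparing a continuous metric (e.g., spend per user)
# between two groups with Welch's t-test and a 95% CI on the difference.
# The data below is simulated for illustration; a real report uses logged values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 5_000                                            # users per group (assumed)
group_a = rng.normal(loc=25.0, scale=8.0, size=n)    # hypothetical spend, Group A
group_b = rng.normal(loc=28.0, scale=8.0, size=n)    # hypothetical spend, Group B

# Welch's t-test: does not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)

# 95% CI for the difference in means (B − A), normal approximation
# (fine at this sample size).
diff = group_b.mean() - group_a.mean()
se = np.sqrt(group_a.var(ddof=1) / n + group_b.var(ddof=1) / n)
z = stats.norm.ppf(0.975)
print(f"difference (B − A): {diff:.2f}")
print(f"95% CI: ({diff - z * se:.2f}, {diff + z * se:.2f})")
print(f"p-value: {p_value:.4f}")
```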

What each section means

Primary, lagging, and guardrail metrics

  • Primary: the decision-maker. If this doesn’t meaningfully improve, you likely don’t ship.
  • Lagging: downstream value. Often noisier or slower to move (e.g., spend per user, LTV). Treat signals here as directional unless you designed power for it.
  • Guardrails: safety checks to prevent harm (e.g., error rate, latency, churn). You don’t ship if these degrade materially—even if the primary improves.

Reading the table: row by row

We’ll focus on the primary metric only and read one row deeply; a short code sketch after the bullet list below reproduces its numbers.

Settings: 95% CI, α = 5.0%

| Metric | Type | Group A (CI) | Group B (CI) | Difference (CI) | Rel. Change | P-Value | Result |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adoption Rate | proportion | 20.00% (18.89% – 21.11%) | 22.50% (21.34% – 23.66%) | 2.50% (0.90% – 4.10%) | 12.50% | 0.0022 | B higher, Significant |

Values include 95% confidence intervals. Rows are highlighted when the difference excludes 0.
  • Group A / Group B (with CI): shows the point estimate and uncertainty. With 5,000 users per group and 1,000 vs 1,125 successes, A ≈ 20.0% and B ≈ 22.5% with tight intervals due to large n.
  • Difference (CI): B − A ≈ +2.5 percentage points. If its CI excludes 0, the result is significant at the chosen confidence level.
  • Rel. Change + mini CI bar: ≈ +12.5% relative lift; the mini bar visualizes the uncertainty around that relative lift. A colored bar indicates significance; gray means not yet significant.
  • P-Value: the probability of observing an effect at least this extreme if there were truly no difference. If p < 0.05 (at 95% confidence), we call it statistically significant. Combine this with the CI to judge direction and magnitude.
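
If you want to verify these numbers yourself, the sketch below reproduces the Adoption Rate row with a standard two-proportion z-test: a pooled standard error for the p-value and an unpooled one for the difference CI. The report's exact method isn't stated, but this common recipe recovers the same values (+2.5 pp, CI 0.90% – 4.10%, p ≈ 0.0022).

```python
# Reproducing the Adoption Rate row: 1,000/5,000 (A) vs 1,125/5,000 (B).
import numpy as np
from scipy import stats

n_a, x_a = 5_000, 1_000        # Group A: users, adopters
n_b, x_b = 5_000, 1_125        # Group B: users, adopters
p_a, p_b = x_a / n_a, x_b / n_b
diff = p_b - p_a               # B − A, in absolute (percentage-point) terms

# p-value: two-proportion z-test with a pooled standard error.
p_pool = (x_a + x_b) / (n_a + n_b)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z_stat = diff / se_pool
p_value = 2 * stats.norm.sf(abs(z_stat))

# 95% CI for the difference, using the unpooled standard error.
se_diff = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = stats.norm.ppf(0.975)
ci = (diff - z * se_diff, diff + z * se_diff)

print(f"A: {p_a:.2%}  B: {p_b:.2%}  diff: {diff:+.2%}")
print(f"95% CI for diff: ({ci[0]:.2%}, {ci[1]:.2%})")   # ≈ (0.90%, 4.10%)
print(f"relative change: {diff / p_a:+.1%}")            # ≈ +12.5%
print(f"p-value: {p_value:.4f}")                        # ≈ 0.0022
```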

Decision framework (use this in real life)

  • Ship when the primary metric improves significantly, the effect size is meaningful, and the guardrails hold; otherwise iterate (a sketch of this logic follows this list).
  • Meaningful = clears your pre-registered threshold (MDE, revenue delta, latency budget, etc.).
  • If it’s trending (gray CI but promising), extend the duration or increase the sample size before deciding.
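
Here is one way to encode that logic as a quick sanity check. The field names and thresholds are placeholders, not a real tool's API; the point is that the ship decision combines significance, a pre-registered minimum effect, and guardrail status rather than the p-value alone.

```python
# Illustrative decision helper: combines significance, effect size vs. MDE,
# and guardrail status. Names and thresholds are placeholders.
from dataclasses import dataclass

@dataclass
class PrimaryResult:
    significant: bool       # difference CI excludes 0 at the chosen alpha
    relative_lift: float    # e.g., 0.125 for +12.5%
    mde: float              # pre-registered minimum detectable effect

def decide(primary: PrimaryResult, guardrails_ok: bool, trending: bool) -> str:
    """Return a recommended action for one completed (or interim) test."""
    if primary.significant and primary.relative_lift >= primary.mde and guardrails_ok:
        return "ship"
    if trending and guardrails_ok:
        return "extend test (collect more data)"
    return "iterate (redesign or drop)"

# Hypothetical example: a +12.5% significant lift against a 10% MDE,
# with guardrails assumed safe.
print(decide(PrimaryResult(True, 0.125, 0.10), guardrails_ok=True, trending=False))
```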

What gray confidence intervals mean (and why you see them often)

  • Gray = your difference CI includes 0 at the current sample size; the evidence isn’t sufficient yet to rule out “no effect.”
  • This often happens early in a test or with noisy metrics (like spend per user). It’s a prompt to collect more data, not proof of no effect.
  • If the CI is very wide, you’re underpowered; either increase the sample size or accept a larger MDE (a quick sample-size sketch follows this list).
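
For a rough sense of “how much more data,” the standard sample-size approximation for a two-proportion test is sketched below. It assumes a two-sided test at α = 0.05 with 80% power; the baseline rate and MDE are inputs you pick, and real planning tools may use slightly different formulas.

```python
# Rough sample-size sketch for a two-proportion test (two-sided, per group).
# Assumes alpha = 0.05 and 80% power by default; baseline and MDE are your inputs.
import math
from scipy import stats

def sample_size_per_group(baseline: float, mde_rel: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per group needed to detect a relative lift of
    `mde_rel` over `baseline` (e.g., baseline=0.20, mde_rel=0.10 → 20% vs 22%)."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the test
    z_beta = stats.norm.ppf(power)            # critical value for the power target
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# Detecting a 10% relative lift on a 20% baseline needs roughly 6,500 users per group.
print(sample_size_per_group(baseline=0.20, mde_rel=0.10))
```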

Practical checklist

  • Before you start: define primary, lagging, and guardrail metrics; set α, power, and MDE (a pre-registration sketch follows this checklist).
  • During: don’t stop early just because an interim peek looks significant; watch guardrails; ensure an even traffic split and a clean implementation.
  • After: read intervals first, then p-values; check direction and business meaning; segment if relevant; document learnings.
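
One lightweight way to make the “before you start” step concrete is to write the plan down as data before launch. Everything below is illustrative rather than a specific tool's format; the point is that α, power, MDE, and the metric roles are fixed before anyone sees results.

```python
# Illustrative pre-registration of an A/B test plan (all names are placeholders).
# Writing this down before launch prevents moving the goalposts mid-test.
test_plan = {
    "name": "new-onboarding-flow",
    "alpha": 0.05,                  # significance level (95% confidence)
    "power": 0.80,                  # chance of detecting the MDE if it is real
    "mde_relative": 0.10,           # smallest lift worth shipping (+10%)
    "primary_metric": "adoption_rate",
    "lagging_metrics": ["spend_per_user"],
    "guardrail_metrics": {
        "crash_rate": "must not increase materially",
        "page_load_time_s": "must stay within the latency budget",
    },
    "planned_sample_size_per_group": 6_500,   # from the power calculation above
    "stopping_rule": "fixed horizon; no peeking-based early stops",
}
```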

Glossary in one minute

  • Statistical significance (stat sig): result unlikely under the null; often p < 0.05.
  • Confidence interval (CI): plausible range for the true effect. If the difference CI excludes 0, we call it significant at that level.
  • Power: probability to detect a true effect of a given size. Common target: 80%.
  • MDE (minimum detectable effect): the smallest effect you care to reliably detect.
  • Guardrail: metric that must not regress beyond a tolerated threshold.