A/B Testing Sample Size in Python: Power, MDE, and Guardrail Metrics

This page is the focused companion to the broader A/B test engineering guide. It targets the practical question teams ask before launch: how much traffic does this experiment need?

Contents
  1. What sample size means
  2. Python sample size calculator
  3. Planning checklist
  4. Guardrail metrics
  5. FAQ

What sample size means

A/B testing sample size is the number of experimental units needed to detect a practically meaningful effect at a chosen false-positive rate (significance level) and power. For conversion metrics, the planning inputs are usually the baseline conversion rate, minimum detectable effect (MDE), significance level, power, and allocation ratio.

Sample size is not only a statistics question. It is a product decision: what effect is worth detecting, how long can the experiment run, which users count as units, and which guardrails must not move in the wrong direction?

Python sample size calculator

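The code below uses only the Python standard library and a normal approximation for a two-sided, two-sample proportion test. It assumes equal allocation between two variants and treats the minimum detectable effect as an absolute difference in rates, so 0.01 means one percentage point, not a 1% relative lift. It is a planning estimate, not a substitute for the exact analysis plan your team will use.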

from math import ceil, sqrt
from statistics import NormalDist


def ab_sample_size_proportions(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
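    """Return the required sample size per variant for a two-sided test.

    minimum_detectable_effect is an absolute difference in rates
    (0.01 means one percentage point). Equal allocation is assumed.
    """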
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if minimum_detectable_effect <= 0:
        raise ValueError("minimum_detectable_effect must be positive")
    if not 0 < alpha < 1 or not 0 < power < 1:
        raise ValueError("alpha and power must be between 0 and 1")

    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    if not 0 < p2 < 1:
        raise ValueError("baseline plus effect must be between 0 and 1")

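    # Critical values for the two-sided significance level and the target power.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Pooled rate under the null hypothesis (equal allocation).
    pooled = (p1 + p2) / 2
    # First term: variability under the null; second term: under the alternative.
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)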


# Example: 10% baseline, +1 percentage point absolute lift.
print(ab_sample_size_proportions(0.10, 0.01, power=0.8))  # prints 14751
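
Written out, the calculator evaluates the expression below for the per-variant sample size. The symbols are notation introduced here for illustration and do not appear in the original code: p_1 is the baseline rate, p_2 = p_1 + MDE, and \bar{p} = (p_1 + p_2)/2.

```latex
n = \left\lceil
  \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        + z_{\text{power}}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
       {(p_2 - p_1)^{2}}
\right\rceil
```

Because the effect enters as a square in the denominator, halving the MDE roughly quadruples the required sample size.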

Planning checklist

| Decision | Why it matters | Common mistake |
| --- | --- | --- |
| Unit of analysis | User, account, session, or order changes the denominator. | Counting sessions when randomization happened by user. |
| MDE | Defines the smallest effect worth waiting for. | Choosing an effect that is statistically detectable but commercially irrelevant. |
| Power | Controls the chance of detecting the planned effect. | Using a default without checking traffic constraints. |
| Run duration | Needs full weekly cycles and enough exposure. | Stopping as soon as a p-value crosses a threshold. |
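
The run-duration row can be turned into a rough estimate. The sketch below is one possible approach, not a prescribed method: the function name `run_duration_days` and its parameters are hypothetical, and it assumes `daily_eligible_units` counts new randomization units per day rather than returning ones.

```python
from math import ceil


def run_duration_days(
    sample_size_per_variant: int,
    daily_eligible_units: int,
    variants: int = 2,
    traffic_fraction: float = 1.0,
) -> int:
    """Estimate run length in days, rounded up to whole weeks."""
    if sample_size_per_variant <= 0 or daily_eligible_units <= 0:
        raise ValueError("sample size and daily units must be positive")
    if variants < 2 or not 0 < traffic_fraction <= 1:
        raise ValueError("need at least 2 variants and a fraction in (0, 1]")

    # New units entering the experiment each day, across all variants.
    daily_enrolled = daily_eligible_units * traffic_fraction
    raw_days = ceil(sample_size_per_variant * variants / daily_enrolled)
    # Round up to full weeks so every weekday is represented equally.
    return ceil(raw_days / 7) * 7


# Hypothetical traffic: 4,000 new users per day, half of them enrolled.
print(run_duration_days(14_751, 4_000, traffic_fraction=0.5))  # prints 21
```

If the estimate is longer than the team can wait, the usual levers are a larger MDE, more traffic, or a more sensitive metric, not an early stop.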

Guardrail metrics

Guardrails are metrics that should not get worse while the primary metric improves. For AI and data products, useful guardrails may include latency, cost, complaint rate, refund rate, support tickets, model failure rate, or downstream data quality.

If a guardrail is important enough to block a launch, plan its sensitivity before the experiment starts. Otherwise the team may discover too late that the test can detect upside but cannot detect harm.
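
As a rough illustration of planning guardrail sensitivity, the sketch below sizes a continuous guardrail such as latency. It is an assumption-laden example, not the guide's method: the function name `guardrail_sample_size_means` is hypothetical, the test is one-sided because only harm matters, and both variants are assumed to share the same standard deviation and receive equal traffic.

```python
from math import ceil
from statistics import NormalDist


def guardrail_sample_size_means(
    std_dev: float,
    tolerable_harm: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Per-variant n to detect a mean shift of tolerable_harm (one-sided)."""
    if std_dev <= 0 or tolerable_harm <= 0:
        raise ValueError("std_dev and tolerable_harm must be positive")
    if not 0 < alpha < 1 or not 0 < power < 1:
        raise ValueError("alpha and power must be between 0 and 1")

    # One-sided critical value: only degradation of the guardrail is tested.
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_power = NormalDist().inv_cdf(power)
    # Normal approximation for a difference in means with equal variances.
    return ceil(2 * (z_alpha + z_power) ** 2 * std_dev**2 / tolerable_harm**2)


# Hypothetical numbers: latency std dev of 400 ms, a 20 ms regression blocks launch.
print(guardrail_sample_size_means(std_dev=400, tolerable_harm=20))
```

Comparing this number with the primary metric's sample size shows which of the two actually determines how long the experiment must run.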

For related material on experimentation architecture, see the A/B Test Engineering Guide and Uplift Modeling in Industry.

FAQ

What inputs do you need for A/B testing sample size?

You need the baseline rate (or, for a continuous metric, its variance), the minimum detectable effect, the significance level, the desired power, the allocation ratio, and the unit of analysis.
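
To show how the allocation ratio enters the calculation, here is a minimal sketch that generalizes the normal approximation to unequal splits. The function name `ab_sample_size_unequal` and the `allocation_ratio` parameter (treatment units per control unit) are hypothetical names chosen for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def ab_sample_size_unequal(
    baseline_rate: float,
    minimum_detectable_effect: float,
    allocation_ratio: float = 1.0,
    alpha: float = 0.05,
    power: float = 0.8,
) -> tuple[int, int]:
    """Return (control_n, treatment_n) for a two-sided proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    if not 0 < p1 < 1 or not 0 < p2 < 1 or p1 == p2:
        raise ValueError("rates must be in (0, 1) and the effect non-zero")
    if allocation_ratio <= 0:
        raise ValueError("allocation_ratio must be positive")

    k = allocation_ratio  # treatment units per control unit
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # The pooled rate weights each arm by its share of traffic.
    pooled = (p1 + k * p2) / (1 + k)
    numerator = (
        z_alpha * sqrt((1 + 1 / k) * pooled * (1 - pooled))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2) / k)
    ) ** 2
    control_n = ceil(numerator / (p2 - p1) ** 2)
    return control_n, ceil(k * control_n)


print(ab_sample_size_unequal(0.10, 0.01))                         # 50/50 split
print(ab_sample_size_unequal(0.10, 0.01, allocation_ratio=0.25))  # 80/20 split
```

With a ratio of 1 this reduces to the equal-allocation formula. Any other ratio increases the total number of units needed for the same power.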

What is MDE in an A/B test?

MDE means minimum detectable effect. It is the smallest effect size the experiment is designed to reliably detect at a chosen significance level and power.
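
When traffic is fixed, the question often runs the other way: what MDE can this experiment support? The sketch below inverts the same normal approximation by bisection. The helper names `required_n` and `achievable_mde` are hypothetical, and the result is an absolute lift under equal allocation.

```python
from math import sqrt
from statistics import NormalDist


def required_n(p1: float, effect: float, alpha: float, power: float) -> float:
    """Unrounded per-variant n for an absolute lift of `effect`."""
    p2 = p1 + effect
    pooled = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    root = z_alpha * sqrt(2 * pooled * (1 - pooled)) + z_power * sqrt(
        p1 * (1 - p1) + p2 * (1 - p2)
    )
    return (root / effect) ** 2


def achievable_mde(
    baseline_rate: float,
    n_per_variant: int,
    alpha: float = 0.05,
    power: float = 0.8,
) -> float:
    """Smallest absolute lift detectable with n_per_variant units per arm."""
    if not 0 < baseline_rate < 1 or n_per_variant <= 0:
        raise ValueError("baseline_rate must be in (0, 1) and n positive")
    low, high = 1e-9, 1 - baseline_rate - 1e-9
    if required_n(baseline_rate, high, alpha, power) > n_per_variant:
        raise ValueError("n_per_variant is too small for any feasible lift")
    # Required n shrinks as the effect grows, so bisect on the effect.
    for _ in range(100):
        mid = (low + high) / 2
        if required_n(baseline_rate, mid, alpha, power) > n_per_variant:
            low = mid
        else:
            high = mid
    return high


# Hypothetical: 10% baseline and 5,000 users available per variant.
print(round(achievable_mde(0.10, 5_000), 4))
```

If the achievable MDE is larger than any lift the change could plausibly produce, the experiment is underpowered before it starts.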

Should guardrail metrics affect sample size planning?

Yes. Guardrail metrics can require longer runs or larger samples if the team needs enough sensitivity to detect product, quality, latency, or revenue harm.
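
One way to check this before launch is to compute the power a guardrail would have at the sample size already planned for the primary metric. The sketch below rearranges the same normal approximation. The function name `achieved_power` and the refund-rate numbers are hypothetical, and the tiny probability of rejecting in the wrong direction is ignored.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(
    baseline_rate: float,
    absolute_change: float,
    n_per_variant: int,
    alpha: float = 0.05,
) -> float:
    """Approximate power of a two-sided test to detect absolute_change."""
    p1 = baseline_rate
    p2 = baseline_rate + absolute_change
    if not 0 < p1 < 1 or not 0 < p2 < 1 or absolute_change == 0:
        raise ValueError("rates must be in (0, 1) and the change non-zero")
    if n_per_variant <= 0:
        raise ValueError("n_per_variant must be positive")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    pooled = (p1 + p2) / 2
    # Solve the sample size formula for z_power, then map it to a probability.
    z_power = (
        abs(absolute_change) * sqrt(n_per_variant)
        - z_alpha * sqrt(2 * pooled * (1 - pooled))
    ) / sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    return NormalDist().cdf(z_power)


# Hypothetical guardrail: 2% refund rate, harm of +0.3 percentage points,
# with n sized for a primary metric (10% baseline, +1 point MDE).
print(round(achieved_power(0.02, 0.003, 14_751), 2))
```

With these hypothetical inputs the power comes out well below 0.8, so a test sized only for the primary metric would miss that level of refund harm more often than it catches it.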