
How Do You Design and Interpret an A/B Test?

Difficulty: Medium · Major: Data Science · Companies: Meta, Booking

Concept

A/B testing is a controlled experiment methodology used to compare two or more variants of a product, algorithm, or user experience to identify which performs better based on a predefined business or behavioral metric.
It’s the foundation of data-driven decision making, widely used in product optimization, marketing, and web analytics.

At its core, A/B testing isolates a single change — such as a new button color, pricing model, or recommendation algorithm — and measures its impact relative to a baseline.


1. Core Principles

A well-designed A/B test is built upon three fundamental principles:

  1. Randomization – Users (or units) are randomly assigned to control (A) and treatment (B) groups to eliminate selection bias.
  2. Isolation of Variables – Only one element should differ between variants to attribute changes to that variable.
  3. Statistical Rigor – Hypothesis testing is used to determine whether observed differences are significant or due to random chance.

2. Experimental Design Steps

Step 1: Define Hypothesis

Formulate a measurable and testable statement.


H₀ (Null Hypothesis): Variant B has no effect compared to A.
H₁ (Alternative Hypothesis): Variant B performs better than A.

Example:

“Changing the checkout button color from blue to green increases conversion rate.”

Step 2: Identify Success Metric

Choose a primary metric aligned with the experiment’s goal (e.g., CTR, conversion rate, revenue per session).
Optionally track secondary metrics (e.g., bounce rate) to detect unintended effects.

Step 3: Random Assignment

Randomly assign users into groups:

  • Control (A): Existing experience.
  • Treatment (B): Modified experience.

This ensures comparable distributions of demographics and behaviors across groups.
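In practice, assignment is often implemented as deterministic, hash-based bucketing so that a returning user always sees the same variant. A minimal sketch (the salt name and 50/50 split are illustrative assumptions):

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str = "checkout_button_v1") -> str:
    """Deterministically map a user to a bucket in [0, 100), then split 50/50."""
    digest = hashlib.md5(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "control" if bucket < 50 else "treatment"

print(assign_variant("user_12345"))  # the same user always receives the same variant
```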

Step 4: Determine Sample Size and Duration

Use power analysis to estimate the minimum number of observations required.
This depends on:

  • Expected effect size
  • Desired statistical power (commonly 0.8)
  • Significance level (typically 0.05)

Avoid stopping early — premature termination increases false positives.
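As a sketch, the power analysis for a two-proportion test can be done with statsmodels; the baseline and target rates below are illustrative:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.012   # current conversion rate (illustrative)
target_rate = 0.014     # smallest lift worth detecting (illustrative)

effect_size = proportion_effectsize(target_rate, baseline_rate)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Minimum sample size per group: {n_per_group:,.0f}")
```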

Step 5: Run the Experiment

Deploy both versions concurrently under similar conditions.
Ensure consistent logging and event tracking.

Step 6: Statistical Evaluation

Once sufficient data is collected, use an appropriate test (a code sketch follows this list):

  • Two-proportion z-test – For binary outcomes (conversion rate).
  • t-test – For continuous outcomes (time on page, revenue).
  • Chi-square test – For categorical comparisons.
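A minimal sketch of how these map onto common Python calls; the counts and simulated revenue data below are purely illustrative:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Binary outcome (conversion): two-proportion z-test
conversions = np.array([420, 505])       # successes in A and B (hypothetical)
exposures = np.array([35_000, 35_000])   # users exposed in A and B (hypothetical)
z_stat, p_binary = proportions_ztest(conversions, exposures)

# Continuous outcome (e.g., revenue per session): Welch's t-test on simulated data
rng = np.random.default_rng(0)
revenue_a = rng.gamma(2.0, 5.0, size=1_000)
revenue_b = rng.gamma(2.0, 5.2, size=1_000)
t_stat, p_continuous = stats.ttest_ind(revenue_a, revenue_b, equal_var=False)

# Categorical outcome: chi-square test on a contingency table (converted vs. not)
table = np.array([[420, 34_580], [505, 34_495]])
chi2, p_categorical, dof, _ = stats.chi2_contingency(table)
```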

3. Mathematical Foundation


z = (pB - pA) / sqrt( p*(1 - p) * (1/nA + 1/nB) )

Where:

  • pA, pB are conversion rates of A and B.
  • p is the pooled conversion rate.
  • nA, nB are sample sizes of each group.

If |z| > 1.96, reject the null hypothesis at the 95% confidence level (two-sided test).

p-value < 0.05 indicates a statistically significant difference.
Complement this with confidence intervals (CIs) to understand the plausible range of improvement.
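A short sketch that applies this formula directly, using illustrative counts (roughly the 1.2% vs 1.4% conversion rates from the example in the next section, over a hypothetical 35,000 users per group):

```python
import math
from scipy.stats import norm

n_a, n_b = 35_000, 35_000      # hypothetical sample sizes
conv_a, conv_b = 420, 490      # hypothetical conversions (1.2% vs 1.4%)

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)

se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se_pooled
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value

# 95% CI for the absolute difference, using the unpooled standard error
se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
ci = (p_b - p_a - 1.96 * se_diff, p_b - p_a + 1.96 * se_diff)
print(f"z = {z:.2f}, p = {p_value:.3f}, 95% CI for lift = ({ci[0]:.4f}, {ci[1]:.4f})")
```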


4. Example

Suppose Meta runs an A/B test on ad placement:

  • Control (A): Ads appear at the sidebar.
  • Treatment (B): Ads appear inline within the news feed.

Results:

  • CTR(A) = 1.2%, CTR(B) = 1.4%, p-value = 0.02 → statistically significant improvement.

However, deeper analysis shows:

  • Revenue per user increased only marginally.
  • Time spent on site decreased slightly.

Interpretation: Despite statistical significance, the change may lack practical significance — a key nuance interviewers look for.
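One way to make that judgment explicit is to compare the confidence interval of the lift against a pre-agreed minimum practical effect. A minimal sketch, where both the interval and the threshold are illustrative assumptions:

```python
# 95% CI for the absolute lift, taken from the z-test sketch above (illustrative numbers)
ci_lower, ci_upper = 0.0003, 0.0037
min_practical_lift = 0.0025    # smallest lift the business considers worth shipping (illustrative)

if ci_lower > min_practical_lift:
    print("Ship: statistically and practically significant.")
elif ci_upper < min_practical_lift:
    print("Do not ship: the effect is too small to matter.")
else:
    print("Practical significance unclear: roll out cautiously or collect more data.")
```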


5. Common Pitfalls

| Pitfall | Description | Prevention |
| --- | --- | --- |
| Early Stopping | Ending the experiment when results seem favorable. | Predefine sample size and stopping criteria. |
| Peeking Bias | Repeatedly checking results inflates false positives. | Use sequential testing corrections (e.g., Bonferroni, alpha spending). |
| Metric Contamination | Secondary effects on unrelated metrics are misinterpreted. | Monitor multiple KPIs and investigate trade-offs. |
| Uneven Traffic Allocation | Disproportionate traffic across groups biases results. | Use random assignment and ensure consistent exposure. |
| External Events | Marketing campaigns or seasonality distort results. | Control timing and external factors where possible. |

6. Advanced Topics

  • Multi-armed Bandits: Adaptively allocate traffic to better-performing variants in real time.
  • Sequential Analysis: Allows continuous monitoring without inflating false positives.
  • CUPED (Controlled-experiment Using Pre-Experiment Data): Reduces variance by adjusting for pre-experiment covariates (see the sketch after this list).
  • Bayesian A/B Testing: Estimates posterior probability that one variant outperforms another, offering intuitive interpretation.
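Of these, CUPED is the most straightforward to sketch: the in-experiment metric is adjusted using a covariate measured before the test (for example, each user's pre-experiment spend), which shrinks variance without biasing the treatment effect. A minimal illustration with simulated data:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """CUPED: remove the component of y explained by the pre-experiment covariate x."""
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(42)
pre = rng.normal(10.0, 3.0, size=5_000)               # simulated pre-experiment spend
post = 0.8 * pre + rng.normal(2.0, 2.0, size=5_000)   # in-experiment spend, correlated with pre

adjusted = cuped_adjust(post, pre)
print(f"Variance before CUPED: {post.var():.2f}, after: {adjusted.var():.2f}")
```

The stronger the correlation between the pre-experiment covariate and the experiment metric, the larger the variance reduction, and the smaller the sample needed to detect the same effect.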

7. Real-World Case Study

Booking.com Experimentation Culture

Booking.com runs over 1,000 A/B tests simultaneously.
They emphasize that most experiments fail — but learning is the true output.
Bayesian and frequentist frameworks coexist to enable rapid, data-driven iteration while controlling risk.


Tips for Application

  • When to discuss:
    When describing experimentation design, product analytics, or decision-making frameworks.

  • Interview Tip:
    Demonstrate depth:

    “We achieved a 6.2% lift in conversions after implementing an A/B test with CUPED adjustment. Although the p-value was 0.03, the confidence interval included lifts as low as roughly 1%, so we rolled out cautiously.”


Key takeaway:
A/B testing is not merely about statistical significance — it’s about designing controlled, unbiased experiments that drive actionable business insights, balancing rigor with practical interpretation.