Connor Holly


Experiment Operations

Productivity

experiments · product · validation

What it does

Provides a disciplined framework for running product experiments (A/B tests, feature rollouts, pricing changes) that prevents the most common failure mode: redefining success after seeing the data.

The pattern

Pre-lock everything before the experiment starts:

  • Hypothesis. One sentence: "Changing X will cause Y because Z." If you cannot state the because, you are exploring, not experimenting.
  • Primary metric. Exactly one. Not two, not "primary and secondary." One number that determines success or failure.
  • Guardrail metrics. Maximum three. These are metrics that must not degrade. If your experiment boosts signups but crashes retention, the guardrails catch it.
  • Segment. Who sees the experiment? New users only? All users? A specific geography or plan tier?
  • Time window. How long does the experiment run? Set this before launch. Do not extend it because the results are "almost significant."
  • Decision rule. Written in advance: "If primary metric improves by 5%+ with 95% confidence, ship it. If guardrail metrics degrade by more than 2%, stop it. Otherwise, do not ship."
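The pre-lock checklist above can be captured in code so it literally cannot change after launch. A minimal sketch, assuming a simple in-house setup; the names `ExperimentPlan` and `DecisionRule` are illustrative, not a real library:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DecisionRule:
    min_lift: float = 0.05            # ship if primary metric improves by 5%+
    confidence: float = 0.95          # at 95% confidence
    max_guardrail_drop: float = 0.02  # stop if any guardrail degrades > 2%


@dataclass(frozen=True)  # frozen: the plan cannot be mutated after launch
class ExperimentPlan:
    hypothesis: str        # "Changing X will cause Y because Z."
    primary_metric: str    # exactly one
    guardrails: tuple      # metrics that must not degrade (max three)
    segment: str           # who sees the experiment
    days: int              # fixed time window, set before launch
    rule: DecisionRule = field(default_factory=DecisionRule)

    def __post_init__(self):
        if len(self.guardrails) > 3:
            raise ValueError("maximum three guardrail metrics")
        if "because" not in self.hypothesis:
            raise ValueError("hypothesis must state the 'because'")
```

The `frozen=True` flag is the point: once the plan object exists, the time window and decision rule cannot be quietly edited mid-experiment.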

Daily monitoring:

Check guardrail metrics every day. Set alert thresholds. A guardrail breach does not wait for the time window to expire. It triggers an immediate pause and investigation.
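The daily check can be as simple as comparing treatment against control for each guardrail. A hedged sketch, assuming each metric is a rate you can read from both arms; the function name and the 2% threshold are illustrative:

```python
def breached_guardrails(control, treatment, max_drop=0.02):
    """Return the guardrail metrics that degraded beyond the threshold.

    control / treatment: dicts mapping metric name -> observed rate.
    """
    breaches = []
    for metric, base in control.items():
        drop = (base - treatment[metric]) / base  # relative degradation
        if drop > max_drop:
            breaches.append(metric)
    return breaches
```

A non-empty return value from a run like this is the alert: pause immediately rather than waiting for the window to expire.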

Three outcomes, no ambiguity:

  1. Guardrail breach. Pause the experiment immediately. Investigate the cause. If it is a real degradation, kill the experiment. If it is a data issue, fix the data and restart.
  2. Flat primary metric. The experiment did not work. Stop it. Do not extend the time window hoping for a signal. Write a brief decision memo explaining what you expected and what happened.
  3. Positive lift on primary metric, guardrails intact. Ship it. Write a decision memo with the evidence.
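Because the rule was written in advance, the three outcomes can be decided mechanically. A minimal sketch; the thresholds mirror the decision rule above, and the p-value is assumed to come from whatever significance test your stats tooling already runs:

```python
def decide(lift, p_value, guardrail_breached, min_lift=0.05, alpha=0.05):
    """Apply the pre-committed decision rule; returns one of three outcomes."""
    if guardrail_breached:
        return "pause"  # outcome 1: investigate, then kill or fix-and-restart
    if lift >= min_lift and p_value < alpha:
        return "ship"   # outcome 3: guardrails intact, lift is real
    return "stop"       # outcome 2: flat or insufficient; do not extend
```

Note that "almost significant" and "3% when we said 5%" both fall through to `"stop"` by construction, which is exactly the goalpost-moving this pattern is designed to prevent.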

Decision memo. Every experiment ends with a short document: hypothesis, result, data, decision, and lessons learned. This prevents re-running the same experiment six months later because nobody remembers the outcome.
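The memo itself does not need tooling, but a fixed template keeps every experiment's record comparable. A minimal sketch; the field names simply echo the list above:

```python
MEMO_TEMPLATE = """\
Decision memo: {name}
Hypothesis: {hypothesis}
Result: {result}
Data: {data}
Decision: {decision}
Lessons learned: {lessons}
"""


def render_memo(**fields):
    """Fill the memo template; raises KeyError if a field is missing."""
    return MEMO_TEMPLATE.format(**fields)
```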

Key decisions

One primary metric, enforced. Multiple primary metrics let you cherry-pick the one that looks good. One metric forces clarity about what you are actually optimizing for.

Pre-committed decision rules. Writing the decision rule before seeing data removes the temptation to move goalposts. "We said 5% lift. We got 3%. We do not ship." This is the hardest discipline to maintain and the most valuable.

Guardrails over long observation. It is better to catch a problem on day 2 than to discover it in a post-mortem on day 14. Daily guardrail checks are non-negotiable.

Kill experiments that do not show signal. Flat results are a result. They mean the intervention does not matter for this metric. Move on.

When to use it

Before any product change where you have a measurable hypothesis and can run a controlled test. Pricing changes, onboarding flow modifications, feature launches with holdout groups, UI redesigns. Not needed for bug fixes, infrastructure changes, or changes where you have no reasonable baseline to compare against.