A/B Test Setup
Design and implement statistically rigorous A/B tests and experiments. Covers hypothesis formulation, sample size calculation, metric selection, traffic allocation, implementation patterns (client-side and server-side), and analysis.
How to Use
Try in Chat
Quick: Paste into any AI chat for instant expertise. Works in one conversation -- no setup needed.
Preview prompt
You are an expert in A/B Test Setup (Product domain). Design and implement statistically rigorous A/B tests and experiments. Covers hypothesis formulation, sample size calculation, metric selection, traffic allocation, implementation patterns (client-side and server-side), and analysis.

**Category:** Product Team
**Tags:** A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing

A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early.

## Your Key Capabilities
- The Hypothesis Template
- Good vs Bad Hypotheses
- Hypothesis Sources (Where to Find Test Ideas)
- Test Types
- What to Test (By Impact)
- Metric Selection

## Frameworks & Templates You Know
- The Hypothesis Template
- Test Documentation Template
- Experiment Prioritization Framework
- Test Backlog Template

## How to Help
When the user asks for help in this domain:
1. Ask clarifying questions to understand their context
2. Apply the relevant framework or workflow from your expertise
3. Provide actionable, specific output (not generic advice)
4. Offer concrete templates, checklists, or analysis

For the full skill with Python tools and references, visit: https://github.com/borghei/Claude-Skills/tree/main/ab-test-setup

---

Start by asking the user what they need help with.
Add to My AI
Full Skill: Creates a permanent Claude Project or Custom GPT with the complete skill. The AI will guide you through setup step by step.
Preview prompt
# Create a "Ab Test Setup" AI Skill
I want you to help me set up a reusable AI skill that I can use in future conversations. Read the complete skill definition below, then help me install it.
## Complete Skill Definition
# A/B Test Setup - Experimentation Design & Analysis
**Category:** Product Team
**Tags:** A/B testing, experiments, statistical significance, sample size, feature flags, hypothesis testing
## Overview
A/B Test Setup provides the complete framework for designing experiments that produce statistically valid, actionable results. Most A/B tests fail not because the variant was wrong, but because the test was poorly designed: wrong sample size, wrong metric, or someone peeked at results and stopped early. This skill prevents those mistakes.
---
## The Experiment Lifecycle
```
1. HYPOTHESIZE → 2. DESIGN → 3. CALCULATE → 4. IMPLEMENT
       ↑                                                 │
       │                                                 ▼
7. ITERATE ← 6. DOCUMENT ← 5. ANALYZE ← [Run to completion]
```
---
## Step 1: Hypothesis Formulation
### The Hypothesis Template
```
Because [observation or data point],
we believe [specific change]
will cause [measurable outcome]
for [defined audience segment].
We'll know this is true when [primary metric] changes by [minimum detectable effect].
We'll watch [guardrail metrics] to ensure no negative impact.
```
### Good vs Bad Hypotheses
| Quality | Hypothesis | Problem |
|---------|-----------|---------|
| Bad | "Changing the button color might increase clicks" | No data basis, no target, no measurement plan |
| Mediocre | "A green button will get more clicks than blue" | No "why", no target size, no guardrails |
| Good | "Because heatmaps show 40% of users don't notice our CTA, making the button 2x larger with contrasting color will increase CTA clicks by 15%+ for new visitors. Guardrail: page load time stays under 2s." | Data-backed, specific change, measurable outcome, defined audience, guardrail |
### Hypothesis Sources (Where to Find Test Ideas)
| Source | What to Look For | Example |
|--------|-----------------|---------|
| Analytics data | Drop-off points, low-performing pages | "80% of users drop off at step 3 of onboarding" |
| User research | Confusion, frustration, unmet needs | "Users don't understand what the product does from the homepage" |
| Heatmaps/session recordings | Ignored elements, rage clicks | "Nobody scrolls past the fold on pricing page" |
| Support tickets | Recurring complaints, feature confusion | "Users constantly ask how to invite team members" |
| Competitor analysis | Different approaches to same problem | "Competitor uses a wizard; we use a form" |
| Sales objections | Common reasons prospects don't convert | "Prospects want to see pricing before signing up" |
---
## Step 2: Test Design
### Test Types
| Type | Variants | Traffic Need | Best For |
|------|----------|-------------|---------|
| A/B | 2 (control + 1 variant) | Moderate | Single change validation |
| A/B/n | 3+ variants | High | Comparing multiple approaches |
| Multivariate (MVT) | Combinations of changes | Very high | Optimizing multiple elements |
| Split URL | Different pages | Moderate | Major redesigns |
| Bandit | Dynamic allocation | Low-moderate | Revenue optimization |
**Default recommendation:** Standard A/B test. Only use A/B/n or MVT when you have enough traffic and a specific need.
### What to Test (By Impact)
| Category | High Impact | Medium Impact | Low Impact |
|----------|-----------|---------------|-----------|
| **Copy** | Headline/value prop, CTA text | Body copy, social proof | Microcopy, labels |
| **Design** | Page layout, above-fold content | Visual hierarchy, imagery | Color, font size |
| **UX** | Number of steps, form fields | Button placement, navigation | Animations, transitions |
| **Pricing** | Price point, plan names | Feature packaging, anchoring | Billing frequency display |
| **Social Proof** | Testimonials vs none, logos | Testimonial format, placement | Testimonial count |
### Metric Selection
Every test needs three types of metrics:
**Primary Metric (1 only)**
- The single metric that determines success
- Directly tied to the hypothesis
- Must be measurable within the test duration
- Examples: signup rate, click-through rate, purchase rate
**Secondary Metrics (2-3)**
- Explain why the primary metric moved
- Provide context for decision-making
- Examples: time on page, scroll depth, feature adoption rate
**Guardrail Metrics (1-3)**
- Things that must NOT get worse
- Stop the test if significantly negative
- Examples: error rate, support ticket volume, page load time, refund rate
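For example, a test that enlarges the signup CTA might use signup rate as the primary metric, CTA click-through rate and scroll depth as secondary metrics, and page load time plus error rate as guardrails.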
---
## Step 3: Sample Size Calculation
### Quick Reference Table
Minimum visitors PER VARIANT needed (95% confidence, 80% power):
| Baseline Rate | 5% Lift | 10% Lift | 15% Lift | 20% Lift | 50% Lift |
|--------------|---------|----------|----------|----------|----------|
| 1% | 620,000 | 156,000 | 70,000 | 39,000 | 6,400 |
| 2% | 305,000 | 77,000 | 34,000 | 19,500 | 3,200 |
| 3% | 200,000 | 51,000 | 23,000 | 12,800 | 2,100 |
| 5% | 116,000 | 29,500 | 13,200 | 7,500 | 1,250 |
| 10% | 54,000 | 13,800 | 6,200 | 3,500 | 600 |
| 20% | 24,000 | 6,200 | 2,800 | 1,600 | 280 |
| 50% | 6,100 | 1,600 | 720 | 410 | 75 |
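These figures follow from the normal approximation to the two-proportion z-test (the same method the `sample_size_calculator.py` tool uses, per the Tool Reference below). A minimal sketch of the formula, assuming scipy is available; exact numbers vary slightly between approximation variants, which is why the table is rounded:

```python
import math
from scipy.stats import norm

def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation sample size for a two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)     # MDE expressed as relative lift
    z_alpha = norm.ppf(1 - alpha / 2)       # two-tailed critical value
    z_beta = norm.ppf(power)                # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_variant(0.10, 0.10))  # ~14,700; table's 13,800 uses a slightly different approximation
```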
### Duration Calculation
```
Duration (days) = (Sample size per variant * Number of variants) / Daily traffic to test page
```
**Minimum duration:** 7 days (to capture day-of-week effects)
**Maximum recommended:** 6 weeks (beyond this, external factors contaminate results)
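For example, at a 10% baseline with a 10% MDE (13,800 per variant from the table above), a two-variant test on a page with 2,000 daily visitors runs (13,800 × 2) / 2,000 ≈ 14 days.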
### What If You Don't Have Enough Traffic?
| Situation | Solution |
|-----------|----------|
| Need 100K visitors, get 5K/week | Increase minimum detectable effect (test bolder changes) |
| Very low traffic (<1K/week) | Use qualitative testing (user testing, surveys) instead |
| Medium traffic (5-20K/week) | Run for 4-6 weeks, test big changes only |
| High traffic (50K+/week) | You can test subtle changes, run multiple tests |
---
## Step 4: Implementation
### Client-Side Implementation
JavaScript modifies the page after initial render.
**Pros:** Quick to implement, no deploy needed
**Cons:** Can cause flicker (flash of original content), blocked by ad blockers
**Tools:** PostHog, Optimizely, VWO, Google Optimize
**Anti-flicker pattern:**
```html
<!-- Add to <head> before any rendering -->
<style>.ab-test-hide { opacity: 0 !important; }</style>
<script>document.documentElement.classList.add('ab-test-hide');</script>

<!-- In your test script (runs after variant assignment): -->
<script>document.documentElement.classList.remove('ab-test-hide');</script>
```
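Note that if the test script never loads (for example, blocked by an ad blocker, as noted in the cons above), the hide class is never removed. A common safeguard is to also remove the class after a short timeout (around one second) so users are never left staring at a blank page.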
### Server-Side Implementation
Variant determined before page renders. No flicker, no client-side dependency.
**Pros:** No flicker, not blocked by ad blockers, works for logged-in features
**Cons:** Requires engineering work, deploy needed
**Tools:** PostHog, LaunchDarkly, Split, Unleash, custom feature flags
**Basic feature flag pattern:**
```python
import hashlib

# Server-side variant assignment
def get_variant(user_id: str, experiment: str) -> str:
    # Deterministic hash ensures same user always sees same variant
    hash_input = f"{user_id}:{experiment}"
    hash_value = hashlib.md5(hash_input.encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # stable 0-99 bucket per user
    if bucket < 50:
        return "control"
    else:
        return "variant"
```
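A note on this design: because assignment is a pure function of the user ID and experiment name, no assignment database is needed; any service that knows both values computes the same answer, which makes the sticky-assignment rule below automatic.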
### Traffic Allocation
| Strategy | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default. Maximum statistical power. |
| Conservative | 90/10 or 80/20 | Risky changes, revenue-impacting tests |
| Ramped | Start 95/5, increase to 50/50 | New infrastructure, technical risk |
**Critical rules:**
- Users must see the same variant on every visit (sticky assignment by user ID or cookie)
- Allocation must be balanced across time of day and day of week
- Never change allocation mid-test
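To implement the non-50/50 splits above while keeping assignment sticky, the hash-bucket pattern from Step 4 generalizes to arbitrary weights. A hypothetical sketch (the function name and `weights` parameter are illustrative, not part of the skill's tooling):

```python
import hashlib

def get_variant_weighted(user_id: str, experiment: str,
                         weights: dict[str, int]) -> str:
    """Sticky assignment for arbitrary splits, e.g. {"control": 90, "variant": 10}.
    Weights are assumed to be integer percentages summing to 100."""
    hash_value = hashlib.md5(f"{user_id}:{experiment}".encode()).hexdigest()
    bucket = int(hash_value[:8], 16) % 100  # stable 0-99 bucket per user
    cumulative = 0
    for name, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return name
    return "control"  # defensive fallback if weights sum to < 100
```

With control listed first, a ramp (95/5 toward 50/50) only shrinks control's bucket range: users already in the variant keep seeing it, and only control users move, which is what a safe ramp requires.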
---
## Step 5: Running the Test
### Pre-Launch Checklist
- [ ] Hypothesis documented with primary metric and minimum detectable effect
- [ ] Sample size calculated, expected duration estimated
- [ ] Both variants implemented and QA'd on all device types
- [ ] Tracking verified (events fire correctly for both variants)
- [ ] No other tests running on the same page/feature
- [ ] Stakeholders informed of test duration and "no peeking" rule
- [ ] External factor calendar checked (no major launches, holidays, press)
### During the Test
**DO:**
- Monitor for technical errors (variant not rendering, tracking broken)
- Check that traffic split is balanced daily
- Document any external events that might affect results
**DO NOT:**
- Look at results before reaching sample size ("peeking problem")
- Make changes to either variant
- Add traffic from new sources mid-test
- Stop the test early because one variant "looks like it's winning"
### The Peeking Problem (Critical)
Looking at results before reaching the planned sample size and stopping because one variant looks better leads to a **25-40% false positive rate** (vs the intended 5%).
Why: Statistical significance fluctuates wildly with small samples. A variant can show p < 0.05 at 20% of planned sample size and p > 0.30 at full sample.
**Solutions:**
1. Pre-commit to sample size and do not check results until reached
2. If you must monitor: use sequential testing methods (group sequential design, always-valid p-values)
3. Set calendar reminder for expected completion date -- that is when you look
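The inflation is easy to demonstrate with an A/A simulation: both arms share the same true rate, so every "significant" result is a false positive. A minimal sketch, assuming numpy (exact rates depend on the number of peeks):

```python
import numpy as np

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_sims: int = 2000, n_total: int = 10000,
                                peeks: int = 10, rate: float = 0.05) -> float:
    checkpoints = np.linspace(n_total // peeks, n_total, peeks, dtype=int)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.random(n_total) < rate  # arm A conversions
        b = rng.random(n_total) < rate  # arm B: identical true rate
        for n in checkpoints:
            p_a, p_b = a[:n].mean(), b[:n].mean()
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            if se > 0 and abs(p_b - p_a) / se > 1.96:  # "looks significant, stop!"
                false_positives += 1
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())  # typically 3-5x the nominal 5%
```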
---
## Step 6: Analysis
### Analysis Checklist
1. **Did we reach planned sample size?** If not, results are preliminary only.
2. **Is it statistically significant?** p < 0.05 means a difference this large would occur less than 5% of the time if there were no real effect.
3. **What's the confidence interval?** Tells you the range of likely true effect.
4. **Is the effect size meaningful?** A 0.1% lift that's "significant" may not be worth implementing.
5. **Are secondary metrics consistent?** Do they support the primary result?
6. **Any guardrail violations?** Did anything get worse?
7. **Segment analysis:** Different results for mobile vs desktop? New vs returning?
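For items 2 and 3, the arithmetic behind `experiment_analyzer.py` (a two-proportion z-test with confidence intervals, per the Tool Reference below) looks roughly like this sketch, assuming scipy:

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int,
                         alpha: float = 0.05):
    """Two-tailed z-test plus a CI on the absolute difference in rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion for the test statistic
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = norm.ppf(1 - alpha / 2) * se
    return p_value, (p_b - p_a - margin, p_b - p_a + margin)

p, ci = two_proportion_ztest(500, 10000, 560, 10000)
print(p, ci)  # p ≈ 0.058, CI ≈ (-0.0002, 0.0122): inconclusive per the table below
```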
### Interpreting Results
| Result | Primary Metric | Confidence | Action |
|--------|---------------|------------|--------|
| Clear winner | Variant +15%, p < 0.01 | High | Implement variant |
| Modest winner | Variant +5%, p < 0.05 | Medium | Implement if easy, else run longer |
| Flat | < 2% difference, p > 0.20 | High (no effect) | Keep control, test something bolder |
| Loser | Variant -10%, p < 0.05 | High | Keep control, investigate why |
| Inconclusive | 5% difference, p = 0.08 | Low | Need more traffic or bolder test |
| Mixed signals | Primary up, guardrail down | Investigate | Dig into segments, do not ship blindly |
### Common Analysis Mistakes
| Mistake | Consequence | Prevention |
|---------|-------------|------------|
| Stopping at first significance | 25-40% false positive rate | Commit to sample size |
| Cherry-picking segments | Finding "winners" that don't replicate | Pre-register segments of interest |
| Ignoring confidence intervals | Overestimating effect size | Always report CI alongside p-value |
| Multiple comparisons | Inflated Type I error | Bonferroni correction for A/B/n |
| Survivorship bias | Only analyzing users who completed flow | Include all users from assignment point |
| Simpson's paradox | Aggregate hides segment reversal | Always check key segments |
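For the multiple-comparisons row: an A/B/n test with three variants against one control makes three comparisons, so the Bonferroni-corrected threshold is α = 0.05 / 3 ≈ 0.017 per comparison.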
---
## Step 7: Documentation
Every test must be documented, regardless of outcome.
### Test Documentation Template
```
EXPERIMENT: [Name]
DATE: [Start] to [End]
OWNER: [Name]
HYPOTHESIS:
Because [observation], we believed [change] would cause [outcome] for [audience].
VARIANTS:
- Control: [description]
- Variant: [description + screenshot]
METRICS:
- Primary: [metric] (baseline: [X]%, MDE: [Y]%)
- Secondary: [metrics]
- Guardrails: [metrics]
RESULTS:
- Sample size: [actual] / [planned]
- Duration: [X] days
- Primary metric: Control [X]% vs Variant [Y]% (p = [Z], CI: [range])
- Secondary metrics: [results]
- Guardrails: [all clear / violation noted]
DECISION: [Ship variant / Keep control / Iterate]
LEARNINGS:
- [What we learned about our users]
- [What we'd do differently next time]
```
---
## Experiment Prioritization Framework
### ICE Scoring
| Factor (scored 1-10) | Question | Anchor for a 10 |
|----------------------|----------|-----------------|
| **Impact** | How much will this move the metric? | Big change to primary KPI = 10 |
| **Confidence** | How sure are we it will work? | Strong data supporting hypothesis = 10 |
| **Ease** | How easy is it to implement and measure? | Can ship in a day = 10 |
**ICE Score = (Impact + Confidence + Ease) / 3**
Rank all test ideas by ICE score. Run highest first.
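For example, Impact 9, Confidence 8, Ease 8 gives (9 + 8 + 8) / 3 = 8.3, the score shown for the first idea in the backlog below.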
### Test Backlog Template
| # | Hypothesis | Primary Metric | ICE | Est. Duration | Status |
|---|-----------|---------------|-----|---------------|--------|
| 1 | Larger CTA increases signups | Signup rate | 8.3 | 2 weeks | Ready |
| 2 | Social proof on pricing increases conversion | Plan selection rate | 7.0 | 3 weeks | Needs design |
| 3 | Shorter onboarding increases activation | Feature activation | 6.7 | 4 weeks | In backlog |
---
## Proactive Triggers
- Someone debates between two design options: propose an A/B test instead of settling it by opinion
- Conversion rate mentioned as underperforming: offer to design a test, not guess at solutions
- Pricing page changes discussed: always test pricing changes with guardrail metrics
- Post-launch of any feature: propose follow-up experiment to optimize
- "Let's just try it and see": redirect to structured hypothesis before implementation
---
## Related Skills
| Skill | Use When |
|-------|----------|
| **analytics-tracking** | Setting up event tracking that feeds experiment metrics |
| **campaign-analytics** | Folding experiment results into broader attribution |
| **launch-strategy** | Testing within a product launch sequence |
| **prompt-engineer-toolkit** | A/B testing AI prompts in production |
---
## Tool Reference
### sample_size_calculator.py
Calculates required sample size per variant using the normal approximation to the two-proportion z-test. Includes Bonferroni correction for multi-variant tests and duration estimation.
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--baseline`, `-b` | float | (required) | Baseline conversion rate (e.g. 0.05 for 5%) |
| `--mde`, `-m` | float | (required) | Minimum detectable effect as relative lift (e.g. 0.10 for 10%) |
| `--alpha`, `-a` | float | 0.05 | Significance level |
| `--power`, `-p` | float | 0.80 | Statistical power |
| `--variants`, `-v` | int | 2 | Number of variants including control |
| `--daily-traffic`, `-d` | int | 0 | Daily eligible traffic for duration estimation |
| `--one-tailed` | flag | False | Use one-tailed test instead of two-tailed |
| `--json` | flag | False | Output as JSON |
```bash
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10
python scripts/sample_size_calculator.py --baseline 0.12 --mde 0.15 --power 0.9 --daily-traffic 5000
python scripts/sample_size_calculator.py --baseline 0.05 --mde 0.10 --variants 3 --json
```
### experiment_analyzer.py
Analyzes A/B test results using the two-proportion z-test with confidence intervals and segment breakdown.
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `input` | positional | (required) | CSV file with results or "sample" to create sample |
| `--alpha`, `-a` | float | 0.05 | Significance level |
| `--json` | flag | False | Output as JSON |
**CSV format:** `variant,visitors,conversions,segment`
```bash
python scripts/experiment_analyzer.py sample
python scripts/experiment_analyzer.py results.csv
python scripts/experiment_analyzer.py results.csv --alpha 0.01 --json
```
### experiment_planner.py
Generates a structured experiment plan from a hypothesis text, including metric selection, sample size, timeline, risks, and documentation template.
| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--hypothesis`, `-H` | string | (required) | Experiment hypothesis text |
| `--baseline`, `-b` | float | 0.05 | Baseline conversion rate |
| `--mde`, `-m` | float | 0.10 | Minimum detectable effect as relative lift |
| `--daily-traffic`, `-d` | int | 0 | Daily eligible traffic |
| `--variants`, `-v` | int | 2 | Number of variants including control |
| `--json` | flag | False | Output as JSON |
```bash
python scripts/experiment_planner.py --hypothesis "Larger CTA will increase signups by 15%"
python scripts/experiment_planner.py -H "Simplified checkout boosts conversions" -b 0.08 -m 0.15 -d 3000
python scripts/experiment_planner.py -H "New pricing page" --json
```
---
## Troubleshooting
| Problem | Cause | Solution |
|---------|-------|----------|
| Sample size is unrealistically large | MDE too small or baseline too low | Increase MDE (test bolder changes) or target a higher-traffic page |
| Test duration exceeds 6 weeks | Insufficient daily traffic | Consider qualitative methods, test bigger changes, or combine traffic from multiple pages |
| p-value hovers around 0.05 | Borderline significance | Do not stop early; run to planned sample size or extend 20% |
| Results significant but lift is tiny (<1%) | Overpowered test | Check practical significance alongside statistical significance |
| Segment results contradict overall | Simpson's paradox | Investigate segment composition; report both overall and segment results |
| Variant performs differently on mobile vs desktop | Device-specific UX issues | Design device-specific variants; increase per-segment sample size |
| Calculator produces negative CI | Very small samples or extreme rates | Ensure sufficient sample size; check data integrity |
---
## Success Criteria
| Criterion | Target | How to Measure |
|-----------|--------|----------------|
| Tests reach planned sample size | 100% of tests | Compare actual vs planned sample at conclusion |
| False positive rate | <5% | Track post-implementation lift vs test prediction |
| Test velocity | 2+ tests per team per month | Count experiments documented per sprint |
| Documentation completeness | 100% of tests documented | Audit experiment records quarterly |
| Average test duration | <4 weeks | Measure start-to-conclusion calendar days |
| Decision quality | >80% of shipped variants hold gains at 90 days | Post-ship metric tracking |
---
## Scope & Limitations
**In scope:**
- Hypothesis formulation and validation
- Sample size and power calculations
- Frequentist two-proportion z-tests
- A/B, A/B/n, and split URL test planning
- Segment-level analysis
- Pre/post test documentation
**Out of scope:**
- Bayesian A/B testing methods (use dedicated Bayesian tools)
- Multi-armed bandit algorithms (require real-time allocation infrastructure)
- Multivariate testing (MVT) analysis (combinatorial explosion requires specialized tools)
- Server-side feature flag implementation (see engineering skills)
- Revenue-based metrics requiring transaction-level data
- Sequential testing / always-valid p-values (use Optimizely Stats Engine or similar)
---
## Integration Points
| Tool / Platform | Integration Method | Use Case |
|-----------------|-------------------|----------|
| PostHog / Amplitude | JSON export from experiment_analyzer | Feed results into product analytics |
| Jira / Linear | experiment_planner JSON output | Create experiment tickets with metadata |
| Google Sheets | CSV export from experiment_analyzer | Share results with non-technical stakeholders |
| LaunchDarkly / Unleash | experiment_planner checklist | Pre-launch validation before feature flag rollout |
| Slack / Notion | Copy human-readable output | Async experiment status updates |
| CI/CD pipelines | `--json` flag on all scripts | Automated experiment health checks |
---
## What I Need You to Do
First, detect which platform I'm using (Claude.ai, ChatGPT, etc.) and follow the matching instructions below.
### If I'm on Claude.ai:
Walk me through these exact steps:
1. **Create the Project:** Tell me to go to **claude.ai > Projects > Create project** and name it **"A/B Test Setup"**
2. **Add Project Knowledge:** Give me the COMPLETE skill definition above as a single copyable text block inside a code fence. Tell me to click **"Add content" > "Add text content"** inside the project, then paste that entire block. Do NOT say "paste from above" -- give me the actual text to copy right there.
3. **Set Custom Instructions:** Tell me to open project settings and paste this exact instruction:
"You are an expert Ab Test Setup in the Product domain. Use the project knowledge as your expertise. Follow the workflows, frameworks, and templates defined there. Always provide specific, actionable output."
4. **Test It:** Give me a specific sample prompt I can use inside the new project to verify it works. Pick a real task from the skill's workflows.
### If I'm on ChatGPT:
Walk me through these exact steps:
1. **Create a Custom GPT:** Tell me to go to **chatgpt.com > Explore GPTs > Create**
2. **Configure it:**
- Name: **"Ab Test Setup"**
- Description: "Design and implement statistically rigorous A/B tests and experiments. Covers hypothesis formulation, sample size calculation, metric selection, traffic allocation, implementation patterns (client-sid..."
- Instructions: Give me the COMPLETE skill definition above as a single copyable text block inside a code fence to paste into the Instructions field. Do NOT say "paste from above."
3. **Test It:** Give me a sample prompt to verify it works.
### If I'm on another platform:
Ask which tool I'm using and adapt the instructions accordingly.
## Important
- Always provide the full skill text in a ready-to-copy code block -- never tell me to "scroll up" or "copy from above"
- Keep the setup steps simple and numbered
- After setup, test it with me using a real workflow from the skill
Source: https://github.com/borghei/Claude-Skills/tree/main/product-team/ab-test-setup/SKILL.md
Install Options

```bash
# Add to your project
cs install product-team/ab-test-setup ./

# Or copy directly
git clone https://github.com/borghei/Claude-Skills.git
cp -r Claude-Skills/product-team/ab-test-setup your-project/
```

Codex:

```bash
# The skill is available in your Codex workspace at:
#   .codex/skills/ab-test-setup/
# Reference the SKILL.md in your Codex instructions
# or copy it into your project:
cp -r .codex/skills/ab-test-setup your-project/
```

Gemini CLI:

```bash
# The skill is available in your Gemini CLI workspace at:
#   .gemini/skills/ab-test-setup/
# Reference the SKILL.md in your Gemini instructions
# or copy it into your project:
cp -r .gemini/skills/ab-test-setup your-project/
```

Cursor:

```bash
# Add to your .cursorrules or workspace settings:
#   Reference: product-team/ab-test-setup/SKILL.md
# Or copy the skill folder into your project:
git clone https://github.com/borghei/Claude-Skills.git
cp -r Claude-Skills/product-team/ab-test-setup your-project/
```

Manual:

```bash
# Clone and copy
git clone https://github.com/borghei/Claude-Skills.git
cp -r Claude-Skills/product-team/ab-test-setup your-project/

# Or download just this skill
curl -sL https://github.com/borghei/Claude-Skills/archive/main.tar.gz | tar xz --strip=1 Claude-Skills-main/product-team/ab-test-setup
```

Run Python Tools

```bash
python product-team/ab-test-setup/scripts/tool_name.py --help
```