Experiment Prompt: When Does TDD Actually Pay Off?
Type: Reusable experiment prompt (hand this whole file to Claude to execute)
Harness: scripts/run_tdd_experiment.py — must be extended (see "Extend the harness")
Motivation: FAQ.md Q1, tdd-vs-nontdd-report.md,
3sizes-3arms-report.md
Prior run results: when-tdd-pays-report.md
Supersedes: ambiguous-requirements-experiment.md (Axis A) and
experiment-prompt-design-discovery.md (Axis B), merged here into one factorial.
The prior studies found no advantage for test-first — but they controlled away the two conditions TDD's claimed benefits live in: ambiguous requirements (where a failing test is meant to surface unstated decisions) and open design with real refactoring (where incremental tests are meant to discover a better structure). This experiment crosses both on one task suite and asks the unified question: under what conditions does TDD's claimed value actually appear — and is it largest exactly where the prior experiment was blind?
What the first run found (2026-06-23, claude-sonnet-4-6)
The first execution of this experiment (288 cells, 4 tasks, 3 trials per cell) produced four verdicts. They reframe the research questions for the next run:
| Finding | Implication for next run |
|---|---|
| ambiguity hypothesis REJECTED (reversed): test-after 67% EDGE vs tdd-refactor 33% under vague spec | The anchoring effect is real. Add test-after-refactor to test whether deferring test commitment AND refactoring combines both advantages |
| changeability hypothesis CONFIRMED: tdd-refactor lowest blast radius (664 vs 700–770) | Refactoring matters for changeability. The new arm tests whether test-first is necessary to get it |
| mechanism-isolation hypothesis CONFIRMED: tdd-no-refactor (701) ≈ test-after (700) | The refactor step is load-bearing, not test ordering. But the experiment confounds refactoring with its test safety net — test-after-refactor isolates this |
| clarity-interaction hypothesis NOT CONFIRMED: TDD's changeability gap is similar across clarity conditions | The advantage is structural (from refactoring), not spec-dependent |
The critical confound the first run could not resolve: tdd-refactor's changeability advantage might come from (a) refactoring itself, (b) having tests before refactoring that catch regressions, or (c) the iterative TDD cycle shaping the design incrementally. The first run cannot distinguish these because all three are present simultaneously in the tdd-refactor arm.
The predicted best path — untested: test-after-refactor (write code → write tests
against the working implementation → refactor with test safety net) is predicted to:
- Match test-after's EDGE advantage (67%) by deferring test commitment
- Match tdd-refactor's changeability (664 blast radius) by refactoring under test coverage
- Cost less than tdd-refactor ($0.44/stage) since there are no iterative red-green cycles
- Dominate every other arm across all three dimensions simultaneously
This prediction is the primary question for the next run.
Token cost summary from first run (most to least expensive per stage):
| arm | cost/stage | EDGE under vague | blast radius |
|---|---|---|---|
| tdd-refactor | $0.44 | 33% | 664 (best) |
| bduf | $0.24 | 0–67% | 770 |
| tdd-no-refactor | $0.22 | 0% | 701 |
| test-after | $0.19 | 67% (best) | 700 |
| test-after-refactor | untested | predicted: ~67% | predicted: ~664 |
Vague requirements finding (unchanged for next run): The notifier task produced EDGE=0% for every workflow — per-channel retry semantics were not inferrable from the vague spec. This is a spec-gap, not a workflow-gap: no coding methodology recovers information that was never stated. The correct response to a vague spec is a clarifying conversation before any code is written. The next run should preserve a task with an irreducible spec-gap to confirm this floor holds across workflows.
Prompt
Run one controlled experiment that crosses requirement clarity with coding workflow on a suite of open-design tasks, and grades each cell on two axes at once: contract inference under ambiguity and changeability of the resulting design. Reuse the isolation/cost primitives in scripts/run_tdd_experiment.py, but extend it to (a) select a clear or vague spec, (b) grade Stage 0 with a split CORE/EDGE acceptance, (c) apply a withheld change chain, and (d) record changeability, structural, and multi-rater review measurements per stage. Author open-design tasks with a deliberate design trap and both a clear and a vague spec over an identical hidden contract. Work on a feature branch; commit fixtures, the harness extension, raw data, and a report; do not open a PR unless asked.
Two factors:
- Clarity: clear (spec states architecture and edge-case decisions) vs vague (omits both; the hidden acceptance is unchanged, so the contract is fixed and only what the agent is told changes).
- Workflow: tdd-refactor (strict RED-GREEN-REFACTOR, refactor mandatory), tdd-no-refactor (test-first, never restructure), test-after (code first, tests last), test-after-refactor (all production code first, then tests against the working implementation, then refactor with the test safety net in place; see "Arms"), bduf (design up front, then implement, then tests), ship (run the /specs → /plan → /build pipeline: the /specs phase authors acceptance criteria before any code is written; /plan produces Gherkin scenarios per slice; /build executes RED-GREEN-REFACTOR with inline review). ship runs at vague only: its value is in testing whether structured spec synthesis resolves ambiguity better than test-first's failing-test-as-specification; under a clear spec both arms trivially converge and no comparison is possible.
Grade every cell the same way: CORE/EDGE Stage-0 acceptance (ambiguity), a withheld change chain (changeability), deterministic radon structural metrics, and a blind multi-rater code+test review score. The unit of inference is the task; compare paired across tasks.
Research questions & hypotheses (pre-register before looking at results)
- ambiguity-inference / ambiguity. Does workflow change how well the agent infers unstated decisions? ambiguity hypothesis: under vague, tdd-refactor passes more EDGE assertions than test-after; under clear there is no gap (a workflow × clarity interaction). Null: vagueness degrades all arms equally (test-first just locks in its own happy-path guess).
- changeability / design. Does workflow change the changeability of the design? changeability hypothesis: tdd-refactor absorbs the change chain at lower cumulative cost / smaller blast radius than test-after and bduf. mechanism-isolation hypothesis: on changeability, tdd-no-refactor ≈ test-after < tdd-refactor ⇒ the benefit comes from refactoring, not test ordering.
- clarity-interaction / the headline interaction. Is TDD's advantage (on EDGE and changeability) largest in the vague + open-design cell, i.e. exactly where the prior null experiment could not look? This is the cell both claims predict TDD should win.
- spec-synthesis / spec synthesis vs test-first. Does ship's explicit acceptance-criteria synthesis (the /specs phase authors the contract before any code is written) resolve ambiguity as well as or better than tdd-refactor's failing-test-as-specification approach? spec-synthesis hypothesis: under vague, ship EDGE pass rate ≥ tdd-refactor because the /specs phase forces the agent to state every acceptance decision upfront, including those the vague spec omitted. spec-synthesis-changeability hypothesis (mechanism): ship absorbs the change chain at a cost and blast radius no higher than tdd-refactor's, because its inline review checkpoints in /build catch structural issues that tdd-refactor's refactor step misses. Null: the /specs phase makes the same happy-path assumptions as any other arm; synthesising a spec from a vague prompt does not reliably surface EDGE decisions.
- test-after-refactor / dominance. Does deferring test commitment and adding a refactor phase dominate all existing arms simultaneously? dominance hypothesis: under vague, test-after-refactor EDGE pass rate ≥ test-after (≈67%), because deferred tests capture the actual emergent contract, including edge cases the working implementation surfaced, AND blast radius ≈ tdd-refactor (≈664), because refactoring under test coverage provides the same structural safety net as TDD's refactor step. If both hold, test-after-refactor dominates every existing arm: better EDGE than tdd-refactor, better changeability than test-after, lower cost than tdd-refactor (no iterative red-green cycles during initial build). Null: refactoring with tests written post-implementation produces the same structure as no-refactor arms; the incremental TDD red-green cycle shapes design in ways a post-facto test-then-refactor pass cannot replicate.
Design: clarity × workflow (fractional factorial, paired by task)
| | tdd-refactor | tdd-no-refactor | test-after | test-after-refactor | bduf | ship |
|---|---|---|---|---|---|---|
| clear | ✓ anchor | – | ✓ anchor | ✓ anchor | – | – |
| vague | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Run all 6 arms at vague (the novel regime where both benefits should appear)
and only the 3 anchor arms at clear (to establish the interaction baseline) —
9 arm-clarity cells per task instead of 12. test-after-refactor × clear is
added as an anchor to give test-after-refactor a clean baseline: under a clear spec with no design
trap, how much does the refactor step help relative to test-after (the same code
written without a subsequent refactor)? bduf × vague is deliberately kept: it
tests whether committing to a design before requirements are clear helps or hurts.
ship × vague tests whether committing to an explicit acceptance contract (via
/specs) helps or hurts — a different mechanism from bduf. ship requires a
plugin-enabled $HOME per cell (same setup as the build-pipeline arm in the
prior 3-arm study). Model fixed (e.g. claude-sonnet-4-6), reported.
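The fractional design can be enumerated mechanically, which makes the "9 instead of 12" count easy to check. The following is a minimal sketch, not the harness's actual code: the names ARMS, CLEAR_ANCHORS, and cells_for_task are hypothetical, and only the arm names and the anchor selection come from the grid above.

```python
from itertools import product

# Arm names come from the design grid; the constant and function names are
# hypothetical and are not taken from run_tdd_experiment.py.
ARMS = [
    "tdd-refactor", "tdd-no-refactor", "test-after",
    "test-after-refactor", "bduf", "ship",
]
CLEAR_ANCHORS = {"tdd-refactor", "test-after", "test-after-refactor"}


def cells_for_task(task: str, trials: int = 3) -> list[tuple[str, str, str, int]]:
    """Return (task, clarity, arm, trial) for every cell that should run."""
    cells = []
    for clarity, arm in product(["clear", "vague"], ARMS):
        # Every arm runs at vague; only the anchor arms also run at clear.
        if clarity == "clear" and arm not in CLEAR_ANCHORS:
            continue
        cells.extend((task, clarity, arm, t) for t in range(1, trials + 1))
    return cells


if __name__ == "__main__":
    cells = cells_for_task("notifier")
    arm_clarity = {(clarity, arm) for _, clarity, arm, _ in cells}
    print(len(arm_clarity), "arm-clarity cells,", len(cells), "runs")
```

Generating the cell list from one place keeps the pre-registered design and the executed design from drifting apart.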
Shared substrate: open-design tasks (clear + vague over one hidden contract)
Author 4–6 tasks. Each needs a genuinely open design space (≥2 viable architectures) and a trap (a naive happy-path build passes Stage 0 but is punished by a later change). For each task:
- golden-repo.tar.gz: a stub package (public API surface only).
- spec_clear.md: states behavior, the public surface, the intended module shape, and the edge-case decisions (empty input, ties, errors, ordering, rounding). ≥8 acceptance scenarios.
- spec_vague.md: same goal + public surface, but omits both the architecture guidance and every edge-case decision the EDGE acceptance checks. Genuinely buildable, not contradictory: the ambiguity is unstated decisions, not broken requirements.
- Hidden acceptance (identical across clarity):
  - acc_core.py: behavior stated even in the vague spec (happy path).
  - acc_edge.py: the omitted/ambiguous decisions. acc_core ∪ acc_edge = the full Stage-0 contract.
  - acc_change1.py … acc_changeK.py (K = 3–5): each chain stage's contract, including regression assertions for all prior stages.
- change1.md … changeK.md: the withheld change chain, each modifying behavior on a different axis the spec didn't foreshadow; design ≥1 change to punish the naive design and reward a decoupled one.
- Validate against TWO reference solutions, a naive one and a clean one: both pass acc_core; the clean one also passes acc_edge and absorbs the whole chain cheaply; the naive one is forced into a large rewrite by the trap change. This proves the design signal exists. Never grade with a broken/impossible chain.
- evals/experiments/exp-tdd-pays-<name>.json with an experiment block listing the clarity variants and the chain (specClear, specVague, coreGrade/edgeGrade, changeChain, gradeChain).
Suggested tasks: notifier (multi-channel dispatch), pricing (stacking discount
rules), report-render (pluggable formats), event-store (append + projections),
command-registry (plugin dispatch), workflow (state machine + guards).
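One possible shape for the experiment block is sketched below. The key names (specClear, specVague, coreGrade, edgeGrade, changeChain, gradeChain) and the fixture filenames come from the list above; the nesting, the value types, and the helper name experiment_block are assumptions, since the real schema is defined by the harness.

```python
import json


def experiment_block(k: int) -> dict:
    """Build a hypothetical `experiment` block for a task with a K-stage chain."""
    if not 3 <= k <= 5:
        raise ValueError("the prompt asks for K between 3 and 5")
    return {
        "experiment": {
            "specClear": "spec_clear.md",
            "specVague": "spec_vague.md",
            "coreGrade": "acc_core.py",
            "edgeGrade": "acc_edge.py",
            # One change prompt and one acceptance file per chain stage.
            "changeChain": [f"change{i}.md" for i in range(1, k + 1)],
            "gradeChain": [f"acc_change{i}.py" for i in range(1, k + 1)],
        }
    }


if __name__ == "__main__":
    print(json.dumps(experiment_block(3), indent=2))
```

Keeping changeChain and gradeChain as parallel lists makes a missing or misnumbered stage easy to detect before any cell runs.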
Grading pipeline (one pass, five instruments)
Each cell runs Stage 0 build → change chain 1..K, with acceptance hidden during every build and injected only at grading:
| Instrument | Measures | Answers |
|---|---|---|
| Stage-0 CORE vs EDGE pass | contract inference | ambiguity-inference (under clear, EDGE≈100% for all arms; this is the baseline that makes EDGE-under-vague interpretable) |
| Withheld change chain: dispatch cost + turns, blast radius (files/functions touched, public-API churn, whether the prior suite caught the regression before the fix) | changeability | changeability / refactoring-vs-ordering |
| radon cc/mi over production modules at every stage | structural trajectory | changeability (does complexity grow faster as changes pile up?) |
| Multi-rater review: structure, complexity, naming, performance on prod + test-review on tests, K=3 passes averaged (mean ± stddev) | code/test quality | shared "final review" (beats reviewer variance) |
| Interpretation variance across trials (distinct EDGE behaviors observed) | convergence | ambiguity-inference secondary (does tdd-refactor converge more?) |
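The "K=3 passes averaged (mean ± stddev)" instrument can be sketched with the standard library alone. The function names are hypothetical, and comparing a gap against the larger of the two standard deviations is one assumed reading of the noise rule, not something this prompt specifies.

```python
from statistics import mean, stdev


def summarize_passes(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over K review passes (needs K >= 2)."""
    return mean(scores), stdev(scores)


def is_noise(a: list[float], b: list[float]) -> bool:
    """True when the gap between two cells is smaller than the larger stddev.

    Using max(sd_a, sd_b) is an assumption; a pooled estimate would also work.
    """
    mean_a, sd_a = summarize_passes(a)
    mean_b, sd_b = summarize_passes(b)
    return abs(mean_a - mean_b) < max(sd_a, sd_b)


if __name__ == "__main__":
    # 0/19/4 is the naming-agent spread reported from the prior run.
    m, sd = summarize_passes([0, 19, 4])
    print(f"{m:.2f} ± {sd:.2f}")
```

A spread like 0/19/4 yields a standard deviation larger than its mean, which is why the review score is treated as secondary evidence.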
Fixed procedure (follow exactly)
0. Preconditions
pip install coverage pytest radon. The five instruction arms (tdd-refactor,
tdd-no-refactor, test-after, test-after-refactor, bduf) need no plugin. The ship
arm requires a plugin-enabled $HOME template: build it once with
cp -r ~/.claude/plugins $TPL/.claude/ and pass --build-home-template $TPL to the
harness (same pattern as the build-pipeline arm in the prior 3-arm study; cost
reads from the JSON result, not the plugin meter). Confirm the model id and that
nested claude -p works (IS_SANDBOX=1 is set by the harness).
1. Model
One fixed, capable model for the whole run, reported. The cost/quality winner flips with model — hold it constant.
2. Arms (add to ARM_PROMPTS; keep PYTEST_RULE)
- tdd-refactor: strict TDD; after EACH test passes, REFACTOR toward the cleanest module boundaries/naming/duplication, re-run tests (stay green), then write the next test. Do not defer refactoring.
- tdd-no-refactor: test-first, but write only the MINIMUM to pass each test and DO NOT restructure; go straight to the next test.
- test-after: all production code first, tests last.
- test-after-refactor: all production code first (same as test-after), then write tests against the working implementation, capturing the actual contract including edge behaviours the implementation surfaced, then refactor with the test safety net in place. Do not write tests before seeing a working implementation; do not refactor before tests are green. This arm tests the hypothesis that deferred test commitment and a test-protected refactor combine additively: test-after's EDGE advantage (tests capture what the code actually does, not what a red-phase guess assumed) plus tdd-refactor's changeability advantage (structural refactoring under test coverage), at lower cost than tdd-refactor (no iterative red-green cycles during initial build).
- bduf: first write a short DESIGN.md (modules + public interfaces), then implement the spec to that design, then write the tests. Note: bduf commits to an architecture upfront; it does not author acceptance criteria. This isolates the "design commitment" mechanism from the "acceptance synthesis" mechanism tested by ship.
- ship (vague only; plugin arm): invoke the /specs → /plan → /build pipeline headlessly (self-approve at each human gate so it does not stall). /specs authors explicit acceptance criteria from the vague spec before any code is written; /plan decomposes the feature into Gherkin-backed slices; /build executes RED-GREEN-REFACTOR with three-stage inline review checkpoints. The key difference from all instruction arms: the agent must synthesise the acceptance contract (including the edge-case decisions the vague spec omitted) as an explicit artifact before implementation begins.
The spec the arm reads is spec_clear.md or spec_vague.md per the cell.
3. Author the open-design tasks (the craft step: do this first)
Per "Shared substrate". Calibrate two things by piloting one task:
- Trap: naive ref passes Stage 0 but is rewritten by the trap change; clean ref absorbs it. If both absorb equally → no design signal, re-author.
- Vagueness: under vague, acc_core is ~always passable and acc_edge is sometimes missed. If EDGE is always passed, the vague spec leaked the answer; if never, it's unbuildable.
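The two calibration checks can be written down as small predicates over pilot results. This is a sketch under stated assumptions: the function names, the 0.9 floor used for "~always", and the strict inequality used for the trap are all hypothetical choices, not values this prompt fixes.

```python
CORE_FLOOR = 0.9  # assumed reading of "~always passable"


def classify_vagueness(core_pass: list[bool], edge_pass: list[bool]) -> str:
    """Classify a pilot task from per-trial CORE/EDGE results under the vague spec."""
    core_rate = sum(core_pass) / len(core_pass)
    edge_rate = sum(edge_pass) / len(edge_pass)
    if core_rate < CORE_FLOOR:
        return "core-unstable"   # the happy path itself is not reliably buildable
    if edge_rate == 1.0:
        return "leaked"          # the vague spec gave the edge decisions away
    if edge_rate == 0.0:
        return "unbuildable"     # the edge decisions are never recovered
    return "calibrated"


def trap_has_signal(naive_blast: int, clean_blast: int) -> bool:
    """The trap works only if the naive reference pays more than the clean one."""
    return naive_blast > clean_blast


if __name__ == "__main__":
    print(classify_vagueness([True] * 6, [True, False, True, False, False, True]))
```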
4. Extend the harness
Extend run_tdd_experiment.py (or a sibling) to, per cell:
- Select spec_clear.md / spec_vague.md.
- Stage 0 build → grade acc_core.py and acc_edge.py separately.
- Change-chain stages 1..K, each seeded from the previous stage's files (fresh dispatch, files only), prompt "apply {changeN}; keep existing tests green", graded by acc_changeN.py.
- Per stage record: correctness; changeability (cost/turns + blast radius from git diff between stages + public-API churn + prior-suite regression catch); radon cc/mi; and the K=3 averaged review-panel score (mean ± stddev). Keep acceptance hidden during each build.
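The file-level part of the blast radius can be derived from git diff --numstat, whose output is one tab-separated "added, deleted, path" line per file. The sketch below covers only files touched and line churn; functions touched and public-API churn would need separate analysis. The function names and the returned dictionary keys are hypothetical.

```python
import subprocess


def parse_numstat(text: str) -> dict:
    """Summarize `git diff --numstat` output: files touched and lines changed."""
    paths, added, deleted = [], 0, 0
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        a, d, path = parts
        paths.append(path)
        # Binary files are reported as "-" for both counts; count the file only.
        if a != "-":
            added += int(a)
        if d != "-":
            deleted += int(d)
    return {
        "files_touched": len(paths),
        "lines_added": added,
        "lines_deleted": deleted,
        "paths": paths,
    }


def blast_radius(repo: str, prev_ref: str, curr_ref: str) -> dict:
    """Run git in `repo` and summarize the diff between two stage commits."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--numstat", prev_ref, curr_ref],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_numstat(out)
```

Separating the parser from the git call keeps the metric testable without a repository on disk.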
5. Execute (sharded: instruction arms are cheap)
N = 3–5 trials per cell. Shard by task; cells are isolated, so run many runners
concurrently. Bound every test/coverage/mutant run with a wall-clock timeout
(already landed). Monitor non-destructively (row counts / pgrep); never kill the
session's own claude.
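For reference, the wall-clock bound follows a common pattern built on subprocess.run with a timeout. The harness already has its own version; the sketch below is only an illustration of the idea, and run_bounded and its return shape are hypothetical.

```python
import subprocess
import sys


def run_bounded(cmd: list[str], timeout_s: float) -> dict:
    """Run a command with a wall-clock limit; report a timeout instead of hanging."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising TimeoutExpired.
        return {"timed_out": True, "returncode": None, "stdout": ""}
    return {"timed_out": False, "returncode": proc.returncode, "stdout": proc.stdout}


if __name__ == "__main__":
    result = run_bounded([sys.executable, "-c", "print('green')"], timeout_s=30)
    print(result)
```

Only the direct child is killed on timeout, so a test runner that spawns its own workers may need process-group handling on top of this.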
6. Analyze (per task, then paired across tasks)
- ambiguity-inference primary (EDGE pass): tdd-refactor − test-after under vague, paired across tasks (sign + Wilcoxon); read the workflow × clarity interaction against the clear anchors.
- changeability primary: cumulative cost + cumulative blast radius to absorb the whole chain, per task per arm; paired arm differences. refactoring-vs-ordering: the tdd-no-refactor vs test-after vs tdd-refactor isolation.
- clarity-interaction (the headline): quantify whether each arm's EDGE and changeability advantage is largest in vague + open-design.
- spec-synthesis: ship − tdd-refactor on EDGE pass rate and cumulative changeability under vague, paired across tasks. Also compare ship to bduf on EDGE pass rate; this isolates acceptance-criteria synthesis from architecture commitment as mechanisms for resolving ambiguity. Report ship's cost premium over the cheapest vague arm to frame the "does spec synthesis pay?" trade-off.
- test-after-refactor dominance: three-way test against the dominance hypothesis. (a) EDGE under vague: test-after-refactor ≥ test-after? If yes, deferred tests survive a refactor phase without losing their EDGE advantage. If no, the refactor changes what the code does, so tests written before refactoring no longer match the final contract. (b) Blast radius: test-after-refactor ≈ tdd-refactor? If yes, test-protected refactoring produces equivalent structural benefit regardless of when the tests were written. If no, the iterative TDD cycle produces a design that a post-implementation refactor cannot replicate. (c) Cost: test-after-refactor < tdd-refactor? This should hold by construction (no red-green iteration during build). All three must confirm to declare the dominance hypothesis supported. A partial confirm is informative: it identifies which component of tdd-refactor's advantage (EDGE, changeability, or both) requires early test commitment.
- Secondaries: radon trajectory; multi-rater code/test/design score (mean ± stddev; treat differences smaller than the stddev as noise); interpretation variance; regression-catch rate.
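The paired sign test named in the first bullet needs nothing beyond the standard library. The sketch below is an exact two-sided version over per-task differences; the function name and return shape are hypothetical, and the numbers in the usage line are made-up placeholders, not results. If SciPy is available, scipy.stats.wilcoxon would cover the Wilcoxon half of the comparison.

```python
from math import comb


def sign_test(diffs: list[float]) -> dict:
    """Exact two-sided sign test on paired per-task differences (ties dropped)."""
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    n = pos + neg
    if n == 0:
        return {"n": 0, "positive": 0, "negative": 0, "p_value": 1.0}
    k = min(pos, neg)
    # P(X <= k) under Binomial(n, 0.5), doubled for two sides and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return {"n": n, "positive": pos, "negative": neg, "p_value": min(1.0, 2 * tail)}


if __name__ == "__main__":
    # Placeholder per-task differences (arm A minus arm B), one entry per task.
    print(sign_test([0.33, 0.0, 0.34, 0.33, 0.17]))
```

With the task as the unit of inference, power is limited by the task count: even a unanimous result gives a two-sided p of 0.125 at four tasks, 0.0625 at five, and 0.03125 at six, so the sign test cannot reach 0.05 with fewer than six non-tied tasks.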
7. Report
Write docs/experiments/when-tdd-pays-report.md: the clarity × workflow grid, the
ambiguity-inference / changeability / clarity-interaction verdicts and the
refactoring-vs-ordering mechanism isolation, the ship-vs-tdd-refactor
spec-synthesis verdict, the test-after-refactor dominance verdict (all three
conditions: EDGE ≥ test-after, blast radius ≈ tdd-refactor, cost < tdd-refactor),
honest limitations (n, single model, autonomous-only, reviewer variance,
trap+vagueness calibration), reproducibility commands, and a recommendation.
Commit the report and raw data under docs/experiments/data/.
Best path forward (based on first-run evidence, pending test-after-refactor confirmation)
The first run and the predicted outcome of the second run point to the same practical workflow hierarchy. Apply this pending test-after-refactor confirmation:
If requirements are clear and the design is settled:
test-after — cheapest arm ($0.19/stage), EDGE comparable to all arms under clear
spec (design is low-risk), blast radius acceptable (700). There is no need for
the refactor step when the design space is constrained.
If requirements are vague or the design is open:
test-after-refactor (predicted) — write all production code first so you see the
full shape of the problem before committing to a test contract, then write tests
against the working implementation to lock in the actual contract including emergent
edge behaviours, then refactor with the test safety net. Expected outcome: EDGE
≈67% (matching test-after), blast radius ≈664 (matching tdd-refactor), cost
≈$0.25–0.30/stage (between test-after and tdd-refactor).
If requirements are irreducibly vague (spec-gap, not ambiguity): Stop before any code. The notifier finding from the first run is definitive: when a vague spec omits semantics that cannot be inferred from context (e.g. per-channel retry behaviour), no coding workflow recovers the missing information. EDGE=0% for every arm regardless of workflow. The correct response is a clarifying conversation before any implementation starts. The workflow choice is irrelevant until the spec is complete enough to make edge decisions discoverable.
Decision table (predicted, pending test-after-refactor):
| Condition | Recommended arm | Why |
|---|---|---|
| Clear spec, settled design | test-after | Cheapest; EDGE gap closes under clear spec |
| Vague spec OR open design | test-after-refactor | Combines deferred commitment (EDGE) + test-safe refactor (changeability) at moderate cost |
| Need explicit acceptance artefact | ship | /specs phase forces upfront contract synthesis; test if it resolves vagueness better than failing tests |
| Spec has irreducible gaps | Clarify first | No workflow recovers unstated semantics |
What would falsify this: test-after-refactor's null result. If test-after-refactor blast
radius is significantly worse than tdd-refactor (≥10% higher on average), then
the iterative TDD cycle shapes the design in ways a post-facto refactor cannot
replicate, and tdd-refactor remains the correct choice for open-design tasks
despite its higher cost. If test-after-refactor EDGE falls below test-after
(i.e. the refactor phase changes the implementation enough that post-refactor tests
diverge from pre-refactor ones), then the arm provides no EDGE advantage over
plain test-after and is simply tdd-refactor written in a different order.
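The three dominance conditions and the 10% falsification threshold can be expressed as one check. This is a sketch under stated assumptions: ArmSummary and dominance_verdict are hypothetical names, "≈" is read as "less than 10% above tdd-refactor", and the check compares averages only, without any significance test. The test-after and tdd-refactor figures are the first-run values; the candidate figures are this document's predictions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmSummary:
    edge_rate: float       # EDGE pass rate under vague, 0.0 to 1.0
    blast_radius: float    # average blast radius over the change chain
    cost_per_stage: float  # dollars per stage


def dominance_verdict(
    candidate: ArmSummary,
    test_after: ArmSummary,
    tdd_refactor: ArmSummary,
    blast_tolerance: float = 0.10,
) -> dict:
    """Evaluate the three dominance conditions; all must hold."""
    limit = tdd_refactor.blast_radius * (1 + blast_tolerance)
    checks = {
        "edge_ge_test_after": candidate.edge_rate >= test_after.edge_rate,
        "blast_within_tolerance": candidate.blast_radius < limit,
        "cost_lt_tdd_refactor": candidate.cost_per_stage < tdd_refactor.cost_per_stage,
    }
    checks["dominance_supported"] = all(checks.values())
    return checks


if __name__ == "__main__":
    test_after = ArmSummary(0.67, 700, 0.19)     # first-run values
    tdd_refactor = ArmSummary(0.33, 664, 0.44)   # first-run values
    predicted = ArmSummary(0.67, 664, 0.30)      # prediction, not a measurement
    print(dominance_verdict(predicted, test_after, tdd_refactor))
```

Reporting each condition separately preserves the partial-confirmation case, where only one of EDGE or changeability holds.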
Guardrails (lessons already paid for: do not relearn)
- Hide acceptance during every build (Stage 0 and each chain stage), or all arms just make the given tests pass and both signals die.
- Verify the refactor arm actually refactors. tdd-refactor must show non-trivial structural churn between green and stage-end (diff intermediate vs final). If it doesn't, you're re-running the prior null: the whole point is that the earlier test-first arm stopped at GREEN.
- Calibrate the trap AND the vagueness (per step 3): naive-punished/clean-absorbs, and core-always/edge-sometimes. An un-calibrated task answers nothing.
- Beat reviewer variance. Average K ≥ 3 review passes and report stddev; lean on the deterministic radon metrics and objective blast-radius/EDGE numbers as primary evidence, not the LLM review score (the prior run saw a naming agent score 0/19/4 on near-identical code).
- Cost/turns come from the JSON result, not the plugin meter (it doesn't fire in nested dispatch). The harness reads --output-format json.
- Hold the model fixed and report it.
- Pre-register N, both primaries (EDGE under vague; cumulative changeability), the clarity-interaction, and the spec-synthesis ship-vs-tdd-refactor comparison before running; the task is the unit of inference; no data-dependent stopping.
- Parallelize but isolate: each cell gets its own worktree + $HOME. The ship arm additionally needs its own plugin-enabled $HOME template (see step 0); do not share $HOME between ship cells and instruction cells.
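The isolation guardrail can be illustrated with a per-cell environment builder. This is a sketch, not the harness's implementation: make_cell_env and its directory layout are hypothetical, and a plain working directory stands in for the git worktree the real harness would create.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def make_cell_env(root: Path, cell_id: str, home_template: Optional[Path] = None) -> dict:
    """Create a private $HOME and work dir for one cell and return its environment."""
    cell_dir = root / cell_id
    home = cell_dir / "home"
    work = cell_dir / "work"
    # Fails if the cell directory already exists, so cells are never reused.
    work.mkdir(parents=True)
    if home_template is not None:
        # ship cells start from the plugin-enabled template (see step 0).
        shutil.copytree(home_template, home)
    else:
        home.mkdir()
    env = dict(os.environ)
    env["HOME"] = str(home)
    env["IS_SANDBOX"] = "1"
    return {"env": env, "home": home, "work": work}
```

Passing the returned env to each subprocess, rather than mutating os.environ, keeps concurrently running cells from seeing each other's $HOME.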
Out of scope (by decision)
- Human / designer-in-the-loop and the clarification-oracle arm. Every arm runs fully autonomously; the experiment measures what the workflow alone produces.
Expected deliverables
- evals/experiments/exp-tdd-pays-<task>.json + open-design fixtures (stub, spec_clear.md, spec_vague.md, hidden acc_core.py / acc_edge.py, change1..K.md, acc_change1..K.py), each validated against a naive and a clean reference.
- The harness extension (clarity selection + CORE/EDGE grading + change-chain stages + blast-radius/radon/multi-rater instrumentation + ship plugin-home support).
- Raw data JSONL under docs/experiments/data/.
- One report with the clarity × workflow grid and the ambiguity-inference / changeability / refactoring-vs-ordering / clarity-interaction / spec-synthesis / test-after-refactor verdicts and recommendation.