When Does TDD Actually Pay Off?
Experiment date: 2026-06-23
Model (fixed): claude-sonnet-4-6
Branch: claude/vigilant-lamport-u7t3af-n9c3os
Related: 02-experiment-prompt-when-tdd-pays.md, tdd-vs-nontdd-report.md, 3sizes-3arms-report.md
Executive Summary
This experiment crossed requirement clarity (clear vs vague spec) with coding workflow (TDD with refactoring, TDD without refactoring, test-after, big-design-up-front) on four open-design tasks, each containing a deliberate design trap. It produced 288 graded dispatches across 72 cells (24 arm-task-clarity combinations × 3 trials, with 4 graded stages per cell).
Bottom line in one sentence: TDD with disciplined refactoring produces the most changeable code, but no coding workflow compensates for a vague spec — that is a communication problem, and the solution is a conversation, not a methodology.
Findings at a glance
| Question | Finding |
|---|---|
| Does TDD produce better edge-case coverage under vague spec? | No — the opposite. test-after: 67%, tdd-refactor: 33% (ambiguity hypothesis rejected, reversed) |
| Does TDD produce more changeable code? | Yes. tdd-refactor: 664 mean Δlines vs 700–770 for all other arms (changeability hypothesis confirmed) |
| Is refactoring the mechanism, or test-first ordering? | Refactoring. tdd-no-refactor (701) ≈ test-after (700); removing the refactor step erases the advantage (mechanism-isolation hypothesis confirmed) |
| Is TDD's advantage largest under vague spec? | No. The gap is consistent across clarity conditions (clarity-interaction hypothesis not confirmed) |
| What about vague requirements? | Fix the spec first. The notifier task — where the spec omitted per-channel retry semantics — produced 0% on EDGE assertions (behavioural tests for decisions the spec left unstated) for every workflow. No amount of TDD or upfront design recovers information that was never stated. |
Workflow decision guide
| Situation | Recommended workflow |
|---|---|
| Vague requirements | Stop and clarify first. Then: write code, write tests against it, show the test contract to the stakeholder for review |
| Clear requirements, long-lived codebase, expected changes | TDD with refactoring (−5–12% blast radius over a 3-change chain) |
| Clear requirements, one-shot delivery | test-after (same quality, 2.3× cheaper than TDD) |
| Speed-first, throwaway code | tdd-no-refactor or test-after (same changeability, lower cost) |
Key numbers
- tdd-refactor blast radius: 664 mean Δlines (lowest)
- test-after EDGE (omitted-decision) pass rate under vague spec: 67% (highest; tdd-refactor: 33%)
- Cost: tdd-refactor $0.44/stage vs test-after $0.19/stage
- Refactoring matters: tdd-no-refactor (no refactor step) = 701 lines, indistinguishable from test-after (700) — the green→refactor cycle is load-bearing
- Spec-gap is irreducible: notifier EDGE pass rate = 0% for all four workflows under vague spec
Pre-registration (recorded before any graded result was seen)
Timestamp: 2026-06-23T15:31:59Z
Data state at registration: all four JSONL files had 0 rows.
| Item | Value |
|---|---|
| N per cell | 3 trials |
| Primary endpoint 1 | EDGE (omitted-decision assertions) pass-rate under vague spec (tdd-refactor vs test-after) |
| Primary endpoint 2 | Cumulative changeability = Σ blast-radius lines changed across 3-change chain |
| Clarity × workflow interaction | Is tdd-refactor's advantage on EDGE and changeability largest in the vague+open-design cell? |
Hypotheses (pre-registered):
- ambiguity hypothesis: under vague, tdd-refactor passes more EDGE (omitted-decision) assertions than test-after. Under clear there is no gap. Null: vagueness degrades all arms equally.
- changeability hypothesis: tdd-refactor absorbs the 3-change chain at lower cumulative lines-changed than test-after and bduf.
- mechanism-isolation hypothesis: tdd-refactor < tdd-no-refactor ≈ test-after; the benefit comes from refactoring, not test ordering.
- clarity-interaction hypothesis: TDD's advantage is largest in vague + open-design, exactly the cell the prior null experiments could not test.
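Primary endpoint 2 sums the lines changed at each of the three change stages. The report does not state the harness's exact counting rule, so the sketch below assumes blast radius means added plus removed lines between consecutive versions of a module; `lines_changed` and `cumulative_blast_radius` are hypothetical names used only for illustration.

```python
import difflib


def lines_changed(before: str, after: str) -> int:
    """Count removed plus added lines between two versions of a file.

    Assumption: a modified line counts twice (one removal, one addition).
    """
    a, b = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return sum(
        (i2 - i1) + (j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )


def cumulative_blast_radius(versions: list[str]) -> int:
    """Sum lines changed across consecutive versions (stage 0, then each change)."""
    return sum(lines_changed(a, b) for a, b in zip(versions, versions[1:]))


# Toy chain: one line edited, one line added, one line deleted.
chain = ["a\nb\nc", "a\nB\nc", "a\nB\nc\nd", "a\nc\nd"]
print(cumulative_blast_radius(chain))
```

Under this counting rule a design that absorbs a change by adding a few lines scores lower than one that has to rewrite existing lines, which is the property the trap changes are meant to expose.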
Design
Clarity × workflow matrix
| clarity | tdd-refactor | tdd-no-refactor | test-after | bduf |
|---|---|---|---|---|
| clear | ✓ anchor | – | ✓ anchor | – |
| vague | ✓ | ✓ | ✓ | ✓ |
6 arm-clarity cells per task × 3 trials × 4 tasks = 72 cells, each with one Stage-0 build + a 3-stage change chain = 288 graded dispatches (plus K=3 multi-rater review passes at the last change stage per cell).
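The cell and dispatch counts follow directly from the matrix. As a quick check, the sketch below enumerates them; the stage labels are illustrative names for the Stage-0 build and the three change stages, not identifiers taken from the harness.

```python
from itertools import product

TASKS = ["pricing", "notifier", "report-render", "event-store"]

# Arm-clarity cells from the matrix: two clear anchors plus four vague arms.
ARM_CLARITY = [
    ("tdd-refactor", "clear"),
    ("test-after", "clear"),
    ("tdd-refactor", "vague"),
    ("tdd-no-refactor", "vague"),
    ("test-after", "vague"),
    ("bduf", "vague"),
]
TRIALS = 3
STAGES = ["stage0", "change1", "change2", "change3"]  # illustrative labels

# One cell per (task, arm, clarity, trial).
cells = [
    (task, arm, clarity, trial)
    for task, (arm, clarity), trial in product(TASKS, ARM_CLARITY, range(1, TRIALS + 1))
]
# One graded dispatch per (cell, stage).
dispatches = list(product(cells, STAGES))

print(len(cells), len(dispatches))  # 72 288
```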
Tasks
Four open-design tasks, each with a deliberate design trap: naive implementations pass the Stage-0 CORE acceptance but are punished by the "trap change" later in the chain. Clean implementations with the right abstraction absorb the trap change with minimal surgery.
| Task | Module | Trap change | Trap description |
|---|---|---|---|
| exp-tdd-pays-pricing | pricing.py | change2 (category-scoped discounts) | Inline per-discount loops cannot scope by item category without restructuring; a Discount.compute_savings(items) abstraction handles it naturally |
| exp-tdd-pays-notifier | notifier.py | change2 (per-channel retry) | Flat send() loop cannot carry per-channel retry policy; a channel-wrapper or registry design adds it cleanly |
| exp-tdd-pays-report-render | report_render.py | change3 (streaming render_stream()) | Handlers returning strings need a wrapper layer; a registry that can dispatch to streaming vs non-streaming naturally handles it |
| exp-tdd-pays-event-store | event_store.py | change3 (projection snapshots) | Flat global event list always scans from version 1; per-stream storage with a snapshot dict adds it with minimal changes |
Grading
Each stage is graded against two acceptance test suites:
- CORE (acc_core.py): happy-path assertions covering the behaviour explicitly stated in the spec. Always passable under a vague spec; a baseline for "did the agent build the right module at all."
- EDGE (acc_edge.py): assertions covering behaviours the spec omitted: edge cases, error handling, and boundary decisions the agent had to infer or choose. Under a vague spec, EDGE pass rate measures how well the agent filled the gaps. This is the primary discriminator for ambiguity-inference.
Stage 0 grades both. Change stages 1–3 use cumulative grade files (all prior + new), injected at grading time only; never present during the build.
Experiment execution
# Reproduce (4 tasks in parallel)
for TASK in pricing notifier report-render event-store; do
python3 scripts/run_tdd_pays_experiment.py \
--only "exp-tdd-pays-${TASK}" \
--trials 3 \
--model claude-sonnet-4-6 \
--out "docs/experiments/data/tdd-pays-${TASK}-2026-06-23.jsonl" \
--run-root "/tmp/tdd-pays-${TASK}-run" &
done
wait
Analysis:
python3 scripts/analyze_tdd_pays.py \
--data docs/experiments/data/tdd-pays-*-2026-06-23.jsonl \
--out /tmp/analysis.md
Results
Data status: Complete. All 288 graded stages collected (4 tasks × 6 arm-clarity pairs × 3 trials × 4 stages). Raw data committed under docs/experiments/data/. Two contaminated trials (high_turn_count) are noted below; results include them in the aggregate unless otherwise stated.
Coverage at analysis time
All 24 arm-task-clarity combinations complete at n=3 trials each (72 cells total).
| task | arm | clarity | n trials |
|---|---|---|---|
| all 4 tasks | tdd-refactor | clear | 3 |
| all 4 tasks | test-after | clear | 3 |
| all 4 tasks | tdd-refactor | vague | 3 |
| all 4 tasks | tdd-no-refactor | vague | 3 |
| all 4 tasks | test-after | vague | 3 |
| all 4 tasks | bduf | vague | 3 |
Contaminated trials (high_turn_count):
- pricing/tdd-refactor/clear t3: turns=40, CORE/EDGE failed (change stages passed)
- pricing/tdd-refactor/vague t3: turns=43, CORE/EDGE failed (change stages passed)
Both are included in the aggregate numbers. The pricing/tdd-refactor arm has 2/3 valid stage0 trials; this reduces its effective EDGE sample but does not invalidate the cell (all 3 change stages ran).
Stage-0 CORE and EDGE pass rates
CORE pass rate
| task | arm | clarity | pass rate | n |
|---|---|---|---|---|
| event-store | bduf | vague | 67% | 3 |
| event-store | tdd-no-refactor | vague | 0% | 3 |
| event-store | tdd-refactor | clear | 100% | 3 |
| event-store | tdd-refactor | vague | 67% | 3 |
| event-store | test-after | clear | 100% | 3 |
| event-store | test-after | vague | 100% | 3 |
| notifier | all arms | both | 100% | 3 each |
| pricing | bduf | vague | 100% | 3 |
| pricing | tdd-no-refactor | vague | 100% | 3 |
| pricing | tdd-refactor | clear | 67%† | 3 |
| pricing | tdd-refactor | vague | 67%† | 3 |
| pricing | test-after | both | 100% | 3 each |
| report-render | all arms | both | 100% | 3 each |
†One contaminated trial in each pricing/tdd-refactor cell (turns=40 under clear, turns=43 under vague) failed Stage-0 CORE; the change stages passed.
EDGE pass rate (primary ambiguity-inference discriminator)
| task | arm | clarity | pass rate | n |
|---|---|---|---|---|
| event-store | bduf | vague | 67% | 3 |
| event-store | tdd-no-refactor | vague | 0% | 3 |
| event-store | tdd-refactor | clear | 100% | 3 |
| event-store | tdd-refactor | vague | 33% | 3 |
| event-store | test-after | clear | 100% | 3 |
| event-store | test-after | vague | 100% | 3 |
| notifier | bduf | vague | 0% | 3 |
| notifier | tdd-no-refactor | vague | 0% | 3 |
| notifier | tdd-refactor | clear | 100% | 3 |
| notifier | tdd-refactor | vague | 0% | 3 |
| notifier | test-after | clear | 100% | 3 |
| notifier | test-after | vague | 0% | 3 |
| pricing | bduf | vague | 0% | 3 |
| pricing | tdd-no-refactor | vague | 0% | 3 |
| pricing | tdd-refactor | clear | 67% | 3 |
| pricing | tdd-refactor | vague | 0% | 3 |
| pricing | test-after | clear | 100% | 3 |
| pricing | test-after | vague | 67% | 3 |
| report-render | all arms | both | 100% | 3 each |
Key observations:
- report-render EDGE=100% for ALL arms under vague spec — the vague spec was insufficiently discriminating for this task (its edge assertions test naturally inferred behaviours). The trap signal comes from change3 blast radius, not EDGE.
- notifier EDGE=0% for ALL arms under vague spec — the vague spec omits information that no workflow can compensate for (per-channel retry semantics not derivable from the spec alone). This is a spec-gap, not a workflow-gap.
- pricing and event-store provide the informative EDGE discrimination.
Change-stage pass rates and blast radius
Blast radius: all arms, all tasks (complete)
| arm | clarity | mean Δlines | n (arm-task pairs) |
|---|---|---|---|
| tdd-refactor | clear | 651 | 4 |
| tdd-refactor | vague | 678 | 4 |
| test-after | clear | 690 | 4 |
| test-after | vague | 710 | 4 |
| tdd-no-refactor | vague | 701 | 4 |
| bduf | vague | 764 | 4 |
Pooled across clarity:
| arm | pooled mean Δlines | n cells |
|---|---|---|
| tdd-refactor | 664 | 8 |
| test-after | 700 | 8 |
| tdd-no-refactor | 701 | 4 |
| bduf | 770 | 4 |
tdd-refactor has the lowest cumulative blast radius of any arm, under both clarity conditions.
Blast radius per task (clear-spec anchor, n=3 per arm)
| task | tdd-refactor | test-after | Δ (tdd − ta) | % |
|---|---|---|---|---|
| pricing | 609 | 626 | −17 | −2.7% |
| report-render | 655 | 685 | −29 | −4.3% |
| event-store | 592 | 598 | −6 | −1.0% |
| notifier | 748 | 850 | −99 | −11.6% |
| pooled | 651 | 689 | −38 | −5.5% |
Trap change specifically (clear spec, n=3 each)
| task | trap | tdd-refactor | test-after | Δ |
|---|---|---|---|---|
| pricing | change2 | 188 | 200 | −12 |
| report-render | change3 | 229 | 253 | −24 |
| event-store | change3 | 222 | 224 | −2 (tie) |
| notifier | change2 | 212 | 255 | −44 |
Trap changes pooled: tdd-refactor 213 vs test-after 233 (−20 lines, −8.6%). Notifier trap (per-channel retry) shows the largest penalty for naive design.
ambiguity-inference verdict: Contract inference under ambiguity
Primary endpoint 1: EDGE pass-rate under vague spec (tdd-refactor vs test-after)
| task | tdd-refactor/vague | test-after/vague | Δ |
|---|---|---|---|
| event-store | 33% (1/3) | 100% (3/3) | −67 pp |
| notifier | 0% (0/3) | 0% (0/3) | 0 |
| pricing | 0% (0/3) | 67% (2/3) | −67 pp |
| report-render | 100% (3/3) | 100% (3/3) | 0 |
| pooled | 33% (4/12) | 67% (8/12) | −33 pp |
ambiguity hypothesis: REJECTED (direction reversed). Under vague spec, test-after achieves 67% EDGE pass rate vs tdd-refactor's 33% — the opposite of the pre-registered hypothesis.
The null-hypothesis (vagueness degrades all arms equally) is also rejected for pricing and event-store: test-after is substantially more resistant to spec ambiguity than tdd-refactor on those tasks.
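The pooled figures come from summing pass counts across tasks rather than averaging rounded percentages. A minimal sketch of that pooling, using the per-task counts from the table above (the function name `pooled_rate` is an illustrative choice):

```python
from fractions import Fraction

# (passes, trials) per task under the vague spec, copied from the table above.
edge_vague = {
    "tdd-refactor": {
        "event-store": (1, 3), "notifier": (0, 3),
        "pricing": (0, 3), "report-render": (3, 3),
    },
    "test-after": {
        "event-store": (3, 3), "notifier": (0, 3),
        "pricing": (2, 3), "report-render": (3, 3),
    },
}


def pooled_rate(per_task: dict[str, tuple[int, int]]) -> Fraction:
    """Pool raw counts across tasks: total passes over total trials."""
    passes = sum(p for p, _ in per_task.values())
    trials = sum(n for _, n in per_task.values())
    return Fraction(passes, trials)


rates = {arm: pooled_rate(cells) for arm, cells in edge_vague.items()}
gap_pp = float(rates["tdd-refactor"] - rates["test-after"]) * 100
print(rates, round(gap_pp, 1))
```

With only 12 trials per arm, a single trial moves the pooled rate by about 8 percentage points, which is worth keeping in mind when reading the gap.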
All vague arms comparison:
| task | tdd-refactor | tdd-no-refactor | test-after | bduf |
|---|---|---|---|---|
| event-store | 33% | 0%* | 100% | 67% |
| notifier | 0% | 0% | 0% | 0% |
| pricing | 0% | 0% | 67% | 0% |
| report-render | 100% | 100% | 100% | 100% |
*tdd-no-refactor/event-store: 3/3 full CORE failures (completely wrong API), not just EDGE misses.
Task-level interpretation:
- notifier (EDGE=0% all arms): The vague spec omits per-channel retry semantics that are not inferrable from context. This is a spec-gap, not a workflow-gap. No workflow overcomes missing information.
- report-render (EDGE=100% all arms): The vague spec is insufficiently ambiguous — all edge behaviours (None passthrough, exceptions, column ordering) are natural inferences. Not a discriminating task for ambiguity-inference.
- event-store (test-after=100% vs tdd-no-refactor=0%): The starkest contrast. Writing tests after seeing the implementation appears to capture the emergent contract more completely. tdd-no-refactor collapses entirely (all CORE fails) — jumping to code without a design step produces incoherent implementations under vague spec.
- pricing (test-after=67% vs tdd-refactor=0%): TDD's red tests anchor on an incomplete interpretation of the spec; test-after's post-hoc coverage is more comprehensive.
Mechanism hypothesis: The finding suggests that under vague spec, TDD's red-test cycle enforces early commitment to a specific interpretation of the requirements — which may be the wrong one. Test-after allows the agent to build something working, then write tests that capture its actual behaviour, producing better EDGE coverage.
changeability verdict: Cumulative changeability
Primary endpoint 2: Σ blast-radius lines changed across 3-change chain
| arm | mean Δlines | n cells | vs tdd-refactor |
|---|---|---|---|
| tdd-refactor | 664 | 8 | baseline |
| test-after | 700 | 8 | +36 (+5.4%) |
| tdd-no-refactor | 701 | 4 | +37 (+5.6%) |
| bduf | 770 | 4 | +106 (+16%) |
changeability hypothesis: CONFIRMED. tdd-refactor has the lowest cumulative blast radius across all arms and conditions. The advantage holds on all 4 tasks (1% to 12% fewer lines than test-after under the clear spec) and under both clarity conditions (clear: −38 lines; vague: −32 lines).
The bduf penalty is the most striking: +16% more churn than tdd-refactor, driven by notifier (notifier/bduf/vague mean = 983 lines vs tdd-refactor/clear = 748 lines).
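The "vs tdd-refactor" column is each arm's pooled mean relative to the tdd-refactor baseline. The sketch below reproduces that arithmetic from the pooled means in the table; `churn_vs_baseline` is a hypothetical helper name.

```python
# Pooled mean Δlines per arm, copied from the table above.
pooled = {"tdd-refactor": 664, "test-after": 700, "tdd-no-refactor": 701, "bduf": 770}


def churn_vs_baseline(means: dict[str, int], baseline_arm: str) -> dict[str, tuple[int, float]]:
    """Return (absolute, relative) extra churn of each arm over the baseline arm."""
    base = means[baseline_arm]
    return {arm: (value - base, (value - base) / base) for arm, value in means.items()}


for arm, (delta, rel) in churn_vs_baseline(pooled, "tdd-refactor").items():
    print(f"{arm:16s} {delta:+4d} lines  {rel:+.1%}")
```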
refactoring-vs-ordering verdict: Mechanism isolation (refactoring vs test ordering)
| arm | mean Δlines | condition |
|---|---|---|
| tdd-refactor | 664 | clear + vague |
| test-after | 700 | clear + vague |
| tdd-no-refactor | 701 | vague only |
mechanism-isolation hypothesis: CONFIRMED. tdd-no-refactor (701) ≈ test-after (700), both substantially above tdd-refactor (664). Removing the refactoring step from TDD (tdd-no-refactor) eliminates the changeability advantage — it performs identically to writing tests after the fact.
This isolates the mechanism: the benefit of TDD for changeability comes from the refactoring step, not from test-first ordering. The red-test alone adds no changeability value; the green→refactor cycle is the operative step.
Note: tdd-no-refactor/event-store produced 3/3 CORE failures under vague spec (contributing to the blast-radius average via failed change attempts). Excluding event-store from tdd-no-refactor still gives ~725 lines vs test-after's ~710 for the other three tasks — the ordering remains the same.
clarity-interaction verdict: The headline interaction (clarity × workflow)
Is tdd-refactor's changeability advantage largest under vague spec?
| clarity | test-after mean | tdd-refactor mean | Δ (ta − tdd) |
|---|---|---|---|
| clear | 690 | 651 | +39 lines |
| vague | 710 | 678 | +32 lines |
The gap is marginally larger under clear spec (+39 lines) than vague spec (+32 lines). There is no interaction: tdd-refactor's changeability advantage is consistent across both clarity conditions.
clarity-interaction hypothesis: NOT CONFIRMED. The clarity × workflow interaction does not appear for changeability. For EDGE pass rate, the interaction runs opposite to the hypothesis: under clear spec the two arms are near-equal (test-after 12/12, tdd-refactor 11/12, the one miss being the contaminated pricing trial); under vague spec test-after outperforms tdd-refactor. If anything, the clarity × workflow interaction favours test-after, not tdd-refactor.
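The interaction question is a difference of differences: how much the test-after minus tdd-refactor gap changes when moving from clear to vague. A small sketch of that contrast, using the four cell means from the table above:

```python
# Mean Δlines per (clarity, arm), copied from the table above.
means = {
    ("clear", "test-after"): 690,
    ("clear", "tdd-refactor"): 651,
    ("vague", "test-after"): 710,
    ("vague", "tdd-refactor"): 678,
}


def gap(clarity: str) -> int:
    """tdd-refactor's advantage (lines saved vs test-after) at one clarity level."""
    return means[(clarity, "test-after")] - means[(clarity, "tdd-refactor")]


# The hypothesis predicted a clearly positive value (larger advantage under vague).
interaction = gap("vague") - gap("clear")
print(gap("clear"), gap("vague"), interaction)  # 39 32 -7
```

A contrast of −7 lines on means near 700 is small relative to the arm-level gaps, which is why the report reads it as no interaction rather than a reversed one.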
Code and test quality (cross-arm, complete)
| arm | coverage % | test_quality /10 | complexity /10 | avg_cc | avg_mi |
|---|---|---|---|---|---|
| tdd-refactor | 98.8% | 7.26 | 7.82 | 2.23 | 67.4 |
| tdd-no-refactor | 99.0% | 7.22 | 7.94 | 2.25 | 72.7 |
| test-after | 99.0% | 7.49 | 7.85 | 2.52 | 61.9 |
| bduf | 99.0% | 7.64 | 7.81 | 2.35 | 62.5 |
Coverage = branch coverage by agent's own tests (before grade files injected).
test_quality, complexity = K=3 multi-rater review scores (0–10), change3 stage.
avg_cc = radon cyclomatic complexity; avg_mi = maintainability index (>65 = maintainable).
Observations:
- Coverage is near-identical across all arms (~99%) — test-first ordering does not produce higher self-coverage than test-after.
- test_quality is highest for bduf (7.64) and test-after (7.49), lower for tdd arms (7.22–7.26). The differences are small but consistent.
- avg_mi is highest (most maintainable) for tdd-no-refactor (72.7); tdd-refactor (67.4) also clears the 65 threshold, while test-after (61.9) and bduf (62.5) fall below it. MI does not track the blast-radius finding: the arm with the highest MI (tdd-no-refactor) does not have the lowest churn.
- avg_cc is tightly clustered (2.23–2.52); test-after has highest cyclomatic complexity despite similar quality scores.
Multi-rater review scores (K=3 passes, complete)
| arm | complexity | naming | performance | structure | test_quality |
|---|---|---|---|---|---|
| tdd-refactor | 7.82 | 8.76 | 7.46 | 7.50 | 7.26 |
| tdd-no-refactor | 7.94 | 8.70 | 7.67 | 7.56 | 7.22 |
| test-after | 7.85 | 8.79 | 7.71 | 7.62 | 7.49 |
| bduf | 7.81 | 8.67 | 7.56 | 7.67 | 7.64 |
Naming is consistently highest (8.67–8.79) and performance/test_quality lowest (7.22–7.71) across all arms. The spread between arms is narrow (≤0.4 points) on every dimension — arms are not meaningfully differentiated by multi-rater review scores.
Cost summary
| arm | mean cost/stage | n stages | total |
|---|---|---|---|
| tdd-refactor | $0.44 | 96 | $42.27 |
| bduf | $0.24 | 48 | $11.71 |
| tdd-no-refactor | $0.22 | 48 | $10.74 |
| test-after | $0.19 | 96 | $17.91 |
tdd-refactor is the most expensive arm (2.3× test-after per stage) due to iterative test cycles accumulating context across the TDD loop. test-after is the cheapest arm. The combination of changeability advantage AND higher cost makes tdd-refactor a deliberate trade-off.
Discussion
Prior context
The two prior experiments (tdd-vs-nontdd-report.md,
3sizes-3arms-report.md) found no significant advantage
for test-first across a range of task sizes. Both studies used clear specs and
single-shot tasks with no change chain — precisely the conditions where TDD's claimed
benefits (ambiguity resolution and design improvement under feedback) are absent.
This experiment adds both missing conditions simultaneously: vague specs that leave real decisions unstated, and a multi-stage change chain that punishes rigid designs.
Summary of verdicts
| Hypothesis | Direction | Result |
|---|---|---|
| ambiguity hypothesis: TDD passes more EDGE under vague | tdd-refactor > test-after | REJECTED — reversed (test-after 67% vs tdd-refactor 33%) |
| changeability hypothesis: TDD has lower cumulative blast radius | tdd-refactor < all others | CONFIRMED (664 vs 700–770) |
| mechanism-isolation hypothesis: Refactoring is the mechanism | tdd-no-refactor ≈ test-after | CONFIRMED (701 vs 700) |
| clarity-interaction hypothesis: Advantage largest under vague | gap larger at vague | NOT CONFIRMED (gap similar, slightly larger at clear) |
The ambiguity hypothesis reversal: why test-after wins on EDGE under vague spec
The pre-registered hypothesis assumed that TDD's red-test cycle would force explicit edge-case decisions early, producing better contract inference under ambiguity. The data shows the opposite: writing tests after building a working system produces higher EDGE pass rates under vague spec.
The most important finding first — vague requirements are a communication problem, not a technical one. The notifier task makes this unavoidable: its vague spec omitted per-channel retry semantics that are not inferrable from context. Every workflow — TDD, test-after, BDUF — scored 0% EDGE. No methodology compensates for information that was never stated. The correct response to a vague spec is a conversation with the stakeholder, not a choice of coding workflow.
For the ambiguities that are recoverable from context (pricing, event-store), two mechanisms explain why test-after outperforms TDD:
- Anchoring effect: TDD's red tests commit to a specific interpretation of the vague spec before any implementation feedback is available. That commitment may be systematically incomplete, missing the edge decisions the EDGE tests care about. test-after sees a full working implementation first, then writes tests that capture its actual behaviour, including emergent edge handling that the spec didn't specify.
- Spec-gap vs workflow-gap: The notifier result (0% EDGE for all arms) establishes a ceiling: some ambiguities are irreducible. The event-store and pricing results show that where information IS recoverable from context, test-after recovers more of it. TDD's anchoring effect is a liability exactly where you'd hope it would help.
The tdd-no-refactor arm provides a further clue: it collapses entirely on event-store (0/3 CORE, 0/3 change stages), while test-after passes 3/3. Both arms write tests, but tdd-no-refactor writes them before seeing a working system — and under a vague spec, those early tests do not constrain the design enough to produce a valid implementation. The order of seeing-code-then-writing-tests appears protective.
The changeability hypothesis/mechanism-isolation hypothesis result: refactoring is the changeability driver
tdd-no-refactor (701) ≈ test-after (700) > tdd-refactor (664) confirms that the green→refactor cycle, not test-first ordering, drives the changeability advantage. The prior studies had no change chain and so could not measure changeability; this run both confirms the changeability hypothesis and adds the mechanistic isolation those studies could not provide.
The practical implication: teams who do test-first without disciplined refactoring get the cost premium of TDD (2.3× per stage) with none of the changeability benefit. The refactoring step is load-bearing.
Design trap calibration
Pricing (trap: change2, category-scoped discounts):
- Naive: single-pass loop applying each discount to the global subtotal; change2 requires scanning all items with a category filter.
- Clean: Discount.compute_savings(items, current_total); change2 is a 2-line change.
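To make the clean pricing shape concrete, here is a minimal sketch under stated assumptions. Only the `Discount.compute_savings(items, current_total)` signature comes from the report; the `Item` fields, the percentage semantics, and the `total` helper are hypothetical, and floats are used only to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    name: str
    category: str
    price: float


@dataclass(frozen=True)
class Discount:
    percent: float
    category: Optional[str] = None  # assumption: None means "applies to every item"

    def compute_savings(self, items: list[Item], current_total: float) -> float:
        # Unscoped discounts take a percentage of the running total.
        if self.category is None:
            return current_total * self.percent / 100
        # The category-scoped case is the only code change2 has to add.
        scoped = sum(i.price for i in items if i.category == self.category)
        return scoped * self.percent / 100


def total(items: list[Item], discounts: list[Discount]) -> float:
    """Apply each discount in order; the loop never inspects discount internals."""
    running = sum(i.price for i in items)
    for discount in discounts:
        running -= discount.compute_savings(items, running)
    return running


basket = [Item("book", "books", 20.0), Item("pen", "office", 10.0)]
print(total(basket, [Discount(10, category="books")]))  # 28.0
```

Because `total` delegates to `compute_savings`, scoping by category touches only the discount class, whereas an inline loop over the subtotal would need restructuring.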
Notifier (trap: change2, per-channel retry):
- Naive: flat send() loop with handler(msg) calls; retry state requires a register_channel signature change.
- Clean: per-channel dict with {"handler": fn, "max_retries": 0, ...}; retry is a 1-line change to register_channel.
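A sketch of the clean notifier shape follows. The per-channel dict and the `register_channel` and `send` names come from the report; the `Notifier` class wrapper and the retry semantics (immediate retry, any exception counts as a failed attempt, `send` returns per-channel success) are assumptions, since the graded retry contract is not stated here.

```python
from typing import Callable


class Notifier:
    def __init__(self) -> None:
        # {channel_name: {"handler": fn, "max_retries": int}}
        self._channels: dict[str, dict] = {}

    def register_channel(
        self, name: str, handler: Callable[[str], None], max_retries: int = 0
    ) -> None:
        # Retry policy lives next to the handler, so adding it is a local change.
        self._channels[name] = {"handler": handler, "max_retries": max_retries}

    def send(self, msg: str) -> dict[str, bool]:
        """Deliver msg to every channel; return whether each channel succeeded."""
        results: dict[str, bool] = {}
        for name, channel in self._channels.items():
            results[name] = False
            # One initial attempt plus up to max_retries further attempts.
            for _attempt in range(channel["max_retries"] + 1):
                try:
                    channel["handler"](msg)
                except Exception:
                    continue
                results[name] = True
                break
        return results
```

A flat loop that calls `handler(msg)` directly has nowhere to keep the retry count, which is why the naive design needs a signature change at change2.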
Report-render (trap: change3, streaming render_stream()):
- Naive: render() returns handler(data) directly as a string; streaming requires restructuring the dispatch.
- Clean: registry maps format to {"handler": fn}; render_stream() wraps with yield without touching render().
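The registry design can be sketched as below. `render`, `render_stream`, and the `{"handler": fn}` entry shape come from the report; `register_format`, the row-tuple input, and the `csv_rows` handler are hypothetical. The sketch also assumes handlers are row-wise composable (no header or footer), which the report does not specify.

```python
from typing import Callable, Iterable, Iterator

Handler = Callable[[list], str]

# {format_name: {"handler": fn}}
_registry: dict[str, dict[str, Handler]] = {}


def register_format(fmt: str, handler: Handler) -> None:
    _registry[fmt] = {"handler": handler}


def render(fmt: str, rows: Iterable[tuple]) -> str:
    """Render all rows at once; unchanged by the streaming addition."""
    return _registry[fmt]["handler"](list(rows))


def render_stream(fmt: str, rows: Iterable[tuple]) -> Iterator[str]:
    """Yield one rendered chunk per row, reusing the registered handler."""
    handler = _registry[fmt]["handler"]
    for row in rows:
        yield handler([row])


def csv_rows(rows: list) -> str:
    return "\n".join(",".join(str(cell) for cell in row) for row in rows)


register_format("csv", csv_rows)
print(list(render_stream("csv", [(1, "a"), (2, "b")])))  # ['1,a', '2,b']
```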
Event-store (trap: change3, projection snapshots):
- Naive: flat global list of all events; project() always scans from the start.
- Clean: per-stream dict {stream_id: [events]}; snapshot is a 3-line addition.
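A minimal sketch of the per-stream layout with a snapshot dict follows. The `{stream_id: [events]}` shape and `project()` come from the report; the `EventStore` class, the reducer-style projection, and the assumption of a single projection per stream with an immutable state value are illustrative choices.

```python
from collections import defaultdict
from typing import Any, Callable


class EventStore:
    def __init__(self) -> None:
        self._streams: dict[str, list] = defaultdict(list)  # {stream_id: [events]}
        self._snapshots: dict[str, tuple[int, Any]] = {}    # {stream_id: (version, state)}

    def append(self, stream_id: str, event: Any) -> int:
        """Append an event and return the stream's new 1-based version."""
        self._streams[stream_id].append(event)
        return len(self._streams[stream_id])

    def project(self, stream_id: str, reducer: Callable[[Any, Any], Any], initial: Any) -> Any:
        """Fold events into a state, resuming from the last snapshot if present."""
        version, state = self._snapshots.get(stream_id, (0, initial))
        events = self._streams[stream_id]
        for event in events[version:]:
            state = reducer(state, event)
        self._snapshots[stream_id] = (len(events), state)
        return state


store = EventStore()
for amount in (1, 2, 3):
    store.append("acct", amount)
print(store.project("acct", lambda state, event: state + event, 0))  # 6
```

With a flat global list, the store would have to filter every event by stream and replay from the first one on each call; keeping events per stream makes "resume from version N" a slice.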
Key calibration result: all trap changes absorbed efficiently in tdd-refactor/clear (clear spec + refactored codebase = minimal trap penalty). The trap signal is largest in vague-spec cells with less disciplined design.
Vagueness calibration
The vague specs were authored to omit architecture guidance and edge-case decisions without making the task impossible. Expected profile: CORE ~100%, EDGE 50–80%.
Actual profile diverged:
- report-render: EDGE ~100% (spec leaked enough to always infer EDGE) — weak discriminator
- notifier: EDGE ~0% (spec too sparse to infer retry semantics) — discriminator floor
The informative range was pricing and event-store (EDGE 0–100% depending on arm), which provided the cleanest ambiguity-inference signal.
Limitations
- n = 3 per cell (pre-registered). Small for parametric tests. Verdicts use sign tests and direction of pooled means across tasks; effect sizes should be replicated at higher N before drawing strong conclusions.
- Single model, single temperature. Results may not generalise across models. The prior studies used the same claude-sonnet-4-6 model, which is a strength for comparability but a limitation for generalisability.
- Autonomous-only. No human-in-the-loop, no clarification oracle. Real TDD practitioners use the red test to prompt a conversation. The experiment measures what the workflow structure alone produces.
- Reviewer variance. Multi-rater review uses the same model with K=3 passes. LLM reviewer variance can be high; the deterministic blast-radius and EDGE counts are primary. Review scores are secondary.
- report-render weak EDGE calibration. All arms pass EDGE regardless of clarity condition — this task is not ambiguity-inference-informative. Its trap signal (change3 blast radius) is present but weaker than notifier.
- notifier as a spec-gap floor. All arms fail EDGE under vague for notifier. This limits ambiguity-inference signal to 2 of 4 tasks (pricing, event-store) — still directionally consistent but narrows the evidence base.
- Two contaminated trials (high_turn_count). pricing/tdd-refactor/clear t3 (turns=40) and pricing/tdd-refactor/vague t3 (turns=43) hit the turn limit. Both are flagged contamination: high_turn_count; their change stages completed and are included in blast-radius totals.
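The first limitation can be made concrete with the sign test it mentions. The report does not say which variant was used, so the sketch below assumes a one-sided exact binomial sign test with ties dropped; `sign_test_p` is a hypothetical helper.

```python
from math import comb


def sign_test_p(n_favourable: int, n_total: int) -> float:
    """One-sided exact sign test: P(X >= n_favourable) when each sign is a fair coin."""
    tail = sum(comb(n_total, k) for k in range(n_favourable, n_total + 1))
    return tail / 2 ** n_total


# All 4 tasks show lower blast radius for tdd-refactor under the clear spec.
print(sign_test_p(4, 4))  # 0.0625
```

With only four tasks, even a unanimous direction cannot go below p = 0.0625 one-sided, which is why the verdicts lean on direction and consistency rather than significance thresholds.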
Recommendation
On vague requirements: this is a communication problem
The notifier result is the clearest finding in the entire dataset: when the spec omits information that is not inferrable from context, every workflow scores 0% on the behavioural assertions that depend on it. TDD, test-after, and BDUF are indistinguishable at the floor. No coding methodology compensates for information that was never stated.
The practical response to a vague spec is not a workflow choice — it is a conversation. Before building, identify what the spec leaves unstated and ask. The cost of a clarifying question is minutes; the cost of building the wrong contract is discovered later, and measured in the blast-radius numbers in this report.
For the ambiguities that are contextually recoverable, test-after (67% EDGE under vague) outperforms TDD (33%) because it defers commitment until after a working implementation exists. The workflow that follows from the data:
- Identify what is missing from the spec and ask the stakeholder
- Build something working
- Write tests against what you built — not what you imagined
- Show the test contract to the stakeholder as a precise statement of assumed behaviour
This surfaces decisions that were implicit and turns them into an explicit conversation, which is what the spec should have contained in the first place.
On changeability: TDD with refactoring works, but only the refactoring step matters
For long-lived codebases with expected changes: Use TDD with disciplined refactoring. The blast-radius advantage (~5–12% fewer lines across a 3-change chain) is consistent across all 4 tasks and both clarity conditions. It compounds with code longevity.
The refactoring step is load-bearing. tdd-no-refactor (701 mean Δlines) ≈ test-after (700) — removing the green→refactor cycle eliminates the changeability advantage entirely. Teams doing test-first without disciplined refactoring pay the cost premium of TDD (2.3× per stage) with none of the benefit.
Cost-adjusted decision guide
| Situation | Recommended workflow | Why |
|---|---|---|
| Vague requirements | Clarify first, then test-after | No workflow beats a conversation; test-after then captures the contract you actually built |
| Clear requirements, long-lived codebase | TDD with refactoring | −5–12% blast radius over a change chain |
| Clear requirements, one-shot delivery | test-after | Same quality, 2.3× cheaper than TDD |
| Speed-first / throwaway | test-after or tdd-no-refactor | Same changeability as each other, lower cost |
What this adds to the prior null results: The prior two studies found no TDD advantage under clear specs with no change chain. This experiment confirms the advantage is real — but only for changeability under a change chain, not for ambiguity resolution. TDD pays off, but not for the reason most commonly claimed, and only when the refactoring step is taken seriously.
Report generated by claude-sonnet-4-6 in a remote Claude Code session.
Raw data: docs/experiments/data/
Analysis script: scripts/analyze_tdd_pays.py
Second Run: spec-synthesis and test-after-refactor (pre-registration)
Pre-registration timestamp: 2026-06-24 (before any second-run data collected)
The first run could not answer two questions because the relevant arms were missing:
- spec-synthesis: Does the ship arm's explicit acceptance-criteria synthesis (/specs → /plan → /build) resolve ambiguity as well as or better than tdd-refactor's failing-test-as-specification approach?
- test-after-refactor: Does test-after-refactor (code → tests against working impl → refactor) dominate all existing arms simultaneously on EDGE pass rate, blast radius, and cost?
Second-run pre-registration
| Item | Value |
|---|---|
| N per new cell | 3 trials (same as first run) |
| New primary: test-after-refactor Condition 1 | EDGE under vague: test-after-refactor ≥ test-after (−5 pp tolerance) |
| New primary: test-after-refactor Condition 2 | Cumulative blast radius: test-after-refactor within 10% of tdd-refactor |
| New primary: test-after-refactor Condition 3 | Cost/stage: test-after-refactor < tdd-refactor |
| New primary: spec-synthesis | ship EDGE under vague ≥ tdd-refactor EDGE under vague |
spec-synthesis hypothesis: under vague, ship EDGE pass rate ≥ tdd-refactor because /specs forces the agent to state every acceptance decision before any code is written. Null: /specs makes the same happy-path assumptions as any other arm — spec synthesis from a vague prompt does not reliably surface EDGE decisions.
dominance hypothesis: test-after-refactor dominates every existing arm simultaneously: EDGE ≥ test-after (deferred tests capture actual contract), blast radius ≈ tdd-refactor (refactoring under tests provides same structural safety net), cost < tdd-refactor (no iterative red-green cycles during initial build). All three conditions must hold. Null: the refactor phase changes the implementation enough that post-refactor tests diverge, or the iterative TDD cycle shapes design in ways a post-implementation refactor cannot replicate.
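The dominance hypothesis is a conjunction of three pre-registered conditions, so it can be written as a single check. In the sketch below, `ArmSummary` and `dominance_holds` are hypothetical names, the reference values are the first-run figures from this report, and the candidate values are made-up inputs for illustration only, not second-run results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmSummary:
    edge_vague_pct: float   # EDGE pass rate under vague spec, in percent
    blast_radius: float     # cumulative mean Δlines over the 3-change chain
    cost_per_stage: float   # USD per stage


def dominance_holds(candidate: ArmSummary, test_after: ArmSummary, tdd_refactor: ArmSummary) -> dict[str, bool]:
    """Evaluate the three pre-registered conditions; all must hold."""
    checks = {
        # Condition 1: EDGE at least test-after's, with a 5 pp tolerance.
        "edge": candidate.edge_vague_pct >= test_after.edge_vague_pct - 5,
        # Condition 2: blast radius less than 10% above tdd-refactor's.
        "blast_radius": candidate.blast_radius < tdd_refactor.blast_radius * 1.10,
        # Condition 3: strictly cheaper per stage than tdd-refactor.
        "cost": candidate.cost_per_stage < tdd_refactor.cost_per_stage,
    }
    checks["all"] = all(checks.values())
    return checks


first_run_test_after = ArmSummary(67, 700, 0.19)
first_run_tdd_refactor = ArmSummary(33, 664, 0.44)
illustrative = ArmSummary(67, 664, 0.27)  # made-up candidate, not measured data
print(dominance_holds(illustrative, first_run_test_after, first_run_tdd_refactor))
```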
Second-run design matrix
| clarity | tdd-refactor | tdd-no-refactor | test-after | test-after-refactor | bduf | ship |
|---|---|---|---|---|---|---|
| clear | ✓ (first run) | – | ✓ (first run) | ✓ new | – | – |
| vague | ✓ (first run) | ✓ (first run) | ✓ (first run) | ✓ new | ✓ (first run) | ✓ new |
3 new arm-clarity cells × 4 tasks × 3 trials = 36 new cells, 144 new dispatches.
Execution commands (second run)
# Precondition for 'ship' arm: build a plugin-enabled HOME template once
TPL=/tmp/ship-template
mkdir -p "$TPL/.claude"
cp -r ~/.claude/plugins "$TPL/.claude/" # or wherever plugins live
# Run only the new second-run cells (test-after-refactor + ship)
for TASK in pricing notifier report-render event-store; do
python3 scripts/run_tdd_pays_experiment.py \
--only "exp-tdd-pays-${TASK}" \
--clarity second \
--trials 3 \
--model claude-sonnet-4-6 \
--ship-home-template $TPL \
--out "docs/experiments/data/tdd-pays-${TASK}-run2-2026-06-24.jsonl" \
--run-root "/tmp/tdd-pays-${TASK}-run2" &
done
wait
Combined analysis across both runs:
```bash
python3 scripts/analyze_tdd_pays.py \
  --data \
    docs/experiments/data/tdd-pays-*-2026-06-23.jsonl \
    docs/experiments/data/tdd-pays-*-run2-2026-06-24.jsonl \
  --out /tmp/combined-analysis.md
```
Predicted outcomes (to be updated with actuals)¶
Based on the first-run evidence (see experiment document, "Best path forward" section):
| Metric | Predicted |
|---|---|
| test-after-refactor EDGE / vague | ~67% (matching test-after) — deferred tests survive refactor |
| test-after-refactor blast radius | ~664 (matching tdd-refactor) — refactoring under tests provides same structural benefit |
| test-after-refactor cost/stage | ~$0.25–0.30 (between test-after $0.19 and tdd-refactor $0.44) |
| ship EDGE / vague | ≥ tdd-refactor (33%) — explicit spec synthesis forces unstated decisions |
| ship changeability | ≤ tdd-refactor (664) — inline review checkpoints in /build catch structural issues |
dominance hypothesis falsification criteria: If test-after-refactor blast radius exceeds tdd-refactor by ≥10%, the iterative TDD cycle shapes design in ways a post-implementation refactor cannot replicate, and tdd-refactor remains the correct choice for open-design tasks despite its higher cost.
Second-Run Results (2026-06-24)¶
Data: docs/experiments/data/tdd-pays-*-run2-2026-06-24.jsonl
Combined analysis across 465 rows (first + second run).
spec-synthesis: Ship arm vs tdd-refactor¶
Hypothesis (spec-synthesis): ship EDGE pass rate under vague ≥ tdd-refactor EDGE pass rate under vague.
Execution note (2026-06-24). The automated harness recorded every ship trial as a 900-second CCR dispatch timeout, leaving only synthesized-failure placeholder rows (all CORE/EDGE = 0%). The ship arm was therefore re-run manually with no dispatch timeout: 4 tasks × 3 trials × (stage0 + 3 changes) = 48 graded stages, each a full autonomous /specs → /plan → /build pipeline on the vague spec, with the CORE/EDGE graders sealed (copied and executed, never read) until each stage was frozen. The timeout placeholders have been replaced in the run2 data by these real results. The manual run captured CORE/EDGE pass-fail only, not blast radius or cost.
| task | ship EDGE | tdd-refactor EDGE | Δ |
|---|---|---|---|
| exp-tdd-pays-event-store | 0% (n=3) | 33% (n=3) | −33 pp |
| exp-tdd-pays-notifier | 0% (n=3) | 0% (n=3) | 0 pp |
| exp-tdd-pays-pricing | 0% (n=3) | 0% (n=3) | 0 pp |
| exp-tdd-pays-report-render | 100% (n=3) | 100% (n=3) | 0 pp |
Pooled EDGE: ship 25% (3/12) vs tdd-refactor 33% (4/12)
Verdict: spec-synthesis hypothesis REJECTED (null supported). Run to completion, the ship pipeline's
explicit /specs acceptance-criteria synthesis did not surface omitted edge
decisions better than tdd-refactor's failing-test-as-specification approach — 25% vs
33% pooled EDGE, with ship matching or trailing tdd-refactor on every task. The
pre-registered null holds: /specs makes the same happy-path assumptions as any other
arm and does not reliably surface EDGE decisions from a vague prompt.
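The pooled figures follow from summing passes and trials over the four tasks. A minimal sketch, with the per-task counts read off the table above; the 1/3 for tdd-refactor on event-store is inferred from the reported 33% (n=3).

```python
# EDGE results under vague spec as (passes, trials) per task.
EDGE = {
    "ship": {
        "event-store": (0, 3),
        "notifier": (0, 3),
        "pricing": (0, 3),
        "report-render": (3, 3),
    },
    "tdd-refactor": {
        "event-store": (1, 3),  # inferred from "33% (n=3)"
        "notifier": (0, 3),
        "pricing": (0, 3),
        "report-render": (3, 3),
    },
}


def pooled(arm):
    """Return (total passes, total trials) for one arm across all tasks."""
    passes = sum(p for p, _ in EDGE[arm].values())
    trials = sum(n for _, n in EDGE[arm].values())
    return passes, trials


for arm in EDGE:
    passes, trials = pooled(arm)
    print(f"{arm}: {passes}/{trials} = {passes / trials:.0%}")
# ship: 3/12 = 25%
# tdd-refactor: 4/12 = 33%
```

Pooling by trial rather than averaging per-task percentages gives the same answer here only because every cell has n=3.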
Mechanism (from the agents' own acceptance criteria). Each trial saved the /specs
document it generated. These criteria enumerated many ambiguities but resolved them to
happy-path defaults. /specs landed on the correct edge behaviour only for
report-render — the one task whose omitted decisions (insertion-order iteration,
by-reference returns) are the natural defaults — and either omitted or explicitly
mis-resolved them for pricing (discount priority / exclusive groups), notifier
(per-channel priority, exception-as-False), and event-store (optimistic-concurrency
trigger, initial_state). Writing acceptance criteria from a vague prompt does not, by
itself, force the unstated decisions into view; it produces a confident-looking spec
built on the same assumptions a happy-path implementation makes.
What the ship workflow produces. The ship arm runs the dev-team pipeline end to
end, self-approving at each gate. /specs first turns the vague spec.md into an
explicit specification — an intent description, an architecture spec, numbered
GIVEN/WHEN/THEN acceptance criteria, an "explicit decisions on omitted behaviors"
table, and a self-checked consistency gate — then /plan decomposes it into an
incremental TDD plan, and /build implements it RED-GREEN-REFACTOR. Every trial emitted
these artifacts (a 54–145-line ACCEPTANCE.md plus per-stage plan files), and the
pipeline ran to completion on all 48 stages with no human input.
Those artifacts show why the EDGE scores came out flat: /specs does drag the omitted
decisions onto the page, but it then resolves them to the happy path and certifies the
result complete. The event-store spec is the clearest case — it listed the trap
decisions explicitly and chose the wrong side of both:
> D1 — OptimisticConcurrencyError usage … append does not take expected_version in this version
>
> D5 — project initial state … always starts at None; not configurable
…then stamped "Consistency Gate: PASS — every behavior maps to an acceptance
criterion." Both decisions are the exact opposite of what acc_core.py requires (an
expected_version conflict must raise OptimisticConcurrencyError; project(…,
initial_state=…) must be honoured). So the failure is not an oversight the process
forgot to consider — /specs surfaced the decision, reasoned about it, committed to the
wrong default, and signed off. A self-authored spec from a vague prompt manufactures
false confidence: it reads as thorough and internally consistent while encoding the same
assumptions a happy-path implementation would have made silently. That is the core spec-synthesis
result — explicit spec synthesis relocates the guess from the code to the spec; it does
not eliminate it.
CORE and changeability. ship CORE under vague was 100% for pricing, notifier, and
report-render but 0% for event-store. That task's acc_core.py encodes the
optimistic-concurrency and initial_state behaviours as core (not edge) acceptance,
so failing to surface them fails CORE outright: the agents' own suites passed while the
hidden acceptance failed, which is the "looks done, isn't" signal. Across the change
chain, ship kept CORE green for pricing, notifier, and report-render, and the
change-specific graders passed 33/36; event-store stayed at 0% throughout. The
secondary spec-synthesis prediction (ship changeability ≤ tdd-refactor) was not measured
in this manual run: blast radius and cost were not captured, so the analyzer's "ship
blast radius 0" is an artifact of the absent fields, not a real zero.
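One way to keep absent fields from surfacing as a zero is to aggregate only the rows that carry the field and report "not measured" otherwise. The sketch below is an assumption about how such a guard could look, not the analyzer's actual code: the field names (`blast_radius`, `edge_pass`) and the row values are hypothetical and purely illustrative.

```python
from statistics import mean


def mean_or_none(rows, field):
    """Mean of `field` over rows that recorded it; None if no row did."""
    values = [row[field] for row in rows if row.get(field) is not None]
    return mean(values) if values else None


# Illustrative rows only, not experiment data.
manual_rows = [
    {"arm": "ship", "edge_pass": False},  # manual run: no blast radius captured
    {"arm": "ship", "edge_pass": True},
]
harness_rows = [
    {"arm": "example-arm", "blast_radius": 100},
    {"arm": "example-arm", "blast_radius": 200},
]

print(mean_or_none(manual_rows, "blast_radius"))   # None  (not measured)
print(mean_or_none(harness_rows, "blast_radius"))  # 150
```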
Practical implication. The earlier "ship is too slow for the dispatch budget" caveat
was an environment artifact, not a property of the method. Run to completion, the
/specs→/plan→/build pipeline is fully comparable to the other arms on contract
inference — and on this benchmark it does not beat them. Explicit up-front spec synthesis
is not a substitute for clarifying a vague spec with the stakeholder.
test-after-refactor: test-after-refactor dominance¶
Hypothesis (dominance): test-after-refactor dominates all existing arms simultaneously on EDGE pass rate (≥ test-after), blast radius (within 10% of tdd-refactor), and cost (< tdd-refactor). All three conditions must hold.
Condition 1: EDGE pass rate under vague (test-after-refactor ≥ test-after, −5 pp tolerance)¶
| task | test-after-refactor EDGE | test-after EDGE | tdd-refactor EDGE | Condition 1 |
|---|---|---|---|---|
| exp-tdd-pays-event-store | 0% (n=3) | 100% (n=3) | 33% (n=3) | ✗ (−100 pp) |
| exp-tdd-pays-notifier | 0% (n=4) | 0% (n=3) | 0% (n=3) | ✓ (0 pp) |
| exp-tdd-pays-pricing | 0% (n=6) | 67% (n=3) | 0% (n=3) | ✗ (−67 pp) |
| exp-tdd-pays-report-render | 0% (n=3) | 100% (n=3) | 100% (n=3) | ✗ (−100 pp) |
Pooled: test-after-refactor 0% vs test-after 67% → Condition 1 FAILS
The test-after-refactor arm under vague spec produced a 0% EDGE pass rate on all four tasks, falling short of test-after on three of them (notifier tied at 0%). The refactor phase appears to remove or rewrite the edge-case-covering tests that were written against the working implementation, leaving the final suite less comprehensive than unrefactored test-after.
Condition 2: Blast radius within 10% of tdd-refactor¶
| arm | mean Δlines | vs tdd-refactor | Condition 2 |
|---|---|---|---|
| test-after-refactor | 678 | +2.1% | ✓ |
| tdd-refactor | 664 | — | — |
| test-after | 700 | +5.4% | — |
Condition 2 HOLDS — test-after-refactor blast radius (678 lines) is within 2.1% of tdd-refactor (664 lines), well inside the 10% tolerance.
Clarification — the two refactor arms are equivalent on changeability, and that is consistent with "refactoring is the mechanism," not a contradiction of it. It is tempting to read tdd-refactor (664) < test-after-refactor (678) as "test-first ordering buys extra changeability on top of refactoring." The data does not support that reading. The 14-line gap is +2.1% — inside the 10% tolerance and, at n=3 per cell, inside the noise. The robust, consistent signal in this dataset is refactor vs no-refactor, not TDD vs test-after:
| arm | refactor step | mean Δlines |
|---|---|---|
| tdd-refactor | yes | 664 |
| test-after-refactor | yes | 678 |
| test-after | no | 700 |
| tdd-no-refactor | no | 701 |
Both refactor arms (664, 678) sit roughly 3–5% below both non-refactor arms (700, 701), and adding a refactor pass to test-after moved it 700 → 678, into the same band as tdd-refactor. Test-first ordering does not separate from test-after once both refactor. The headline mechanism-isolation finding (the green→refactor cycle, not test ordering, drives changeability) holds, and the refactor-arm tie is a second confirmation of it.
The residual 14-line gap, if it is real rather than noise, has two candidate explanations, neither established by this experiment:
- Refactoring granularity. tdd-refactor refactors in small steps on every green, so cleanup tracks the design as it grows; test-after-refactor does one cleanup pass after the whole component is built. A late lump-sum refactor may have less leverage than many incremental ones.
- Safety-net erosion. test-after-refactor's refactor phase rewrites or deletes its own tests (the same effect that collapsed its EDGE coverage to 0% under vague spec, Condition 1). A churned test suite is a weaker safety net when the three follow-up changes land, which could nudge later blast radius up.
Distinguishing "real but small effect" from "noise," and adjudicating between these two mechanisms, requires a dedicated higher-power experiment — see Proposed follow-up: refactoring cadence.
Condition 3: Cost per stage < tdd-refactor¶
| arm | mean cost/stage |
|---|---|
| test-after | $0.19 |
| tdd-no-refactor | $0.22 |
| bduf | $0.24 |
| test-after-refactor | $0.35 |
| tdd-refactor | $0.44 |
Condition 3 HOLDS — test-after-refactor ($0.35/stage) is 20% cheaper than tdd-refactor ($0.44/stage).
dominance hypothesis Overall Verdict¶
dominance hypothesis NOT SUPPORTED. Condition 1 fails decisively: under a vague spec, test-after-refactor produces a 0% EDGE pass rate on all four tasks (0% pooled), worse than both test-after (67%) and tdd-refactor (33%). The refactoring phase degrades edge-case coverage when the spec is ambiguous: the agent refactors away the tests that document its own decisions.
Conditions 2 and 3 both hold: the blast radius is equivalent to tdd-refactor (+2.1%) and the cost is 20% lower. But the EDGE failure dominates.
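The three-condition verdict can be restated as a small check over the pooled numbers reported above. This is a sketch of the decision rule as described in this report, not the analyzer's code; the function and parameter names are hypothetical, and "within 10%" is read here as an absolute relative difference.

```python
def dominance_check(
    edge_tar, edge_ta, blast_tar, blast_tddr, cost_tar, cost_tddr,
    edge_tolerance_pp=5.0, blast_tolerance=0.10,
):
    """Evaluate the three dominance conditions for test-after-refactor (tar)."""
    blast_rel = (blast_tar - blast_tddr) / blast_tddr
    conditions = {
        # Condition 1: EDGE (in %) no more than 5 pp below test-after.
        "edge": edge_tar >= edge_ta - edge_tolerance_pp,
        # Condition 2: blast radius within 10% of tdd-refactor.
        "blast_radius": abs(blast_rel) <= blast_tolerance,
        # Condition 3: strictly cheaper per stage than tdd-refactor.
        "cost": cost_tar < cost_tddr,
    }
    conditions["dominates"] = all(conditions.values())
    return conditions, blast_rel


result, blast_rel = dominance_check(
    edge_tar=0, edge_ta=67,          # pooled EDGE %, vague spec
    blast_tar=678, blast_tddr=664,   # mean Δlines
    cost_tar=0.35, cost_tddr=0.44,   # mean $ per stage
)
print(result)
print(f"blast radius vs tdd-refactor: {blast_rel:+.1%}")  # +2.1%
```

Because all three conditions must hold, the single EDGE failure is enough to reject dominance regardless of how comfortably the other two pass.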
The mechanism: Under vague spec, test-after-refactor suffers from an ordering problem that test-after avoids. In test-after, tests written against the working implementation capture the agent's edge-case choices and stay in place. In test-after-refactor, those tests are then exposed to a refactor phase that rewrites the implementation — the agent often also rewrites or removes tests it considers redundant, stripping out the documented edge-case decisions. The result is structurally clean code with no EDGE coverage.
Revised decision guide (incorporating second-run results):
| Situation | Recommended workflow |
|---|---|
| Vague requirements | Stop and clarify first. No workflow recovers omitted decisions — including agentic spec synthesis: ship's /specs scored 25% EDGE under vague, no better than tdd-refactor (33%). |
| Clear requirements, long-lived codebase | tdd-refactor (lowest blast radius, best structural benefit) |
| Clear requirements, one-shot or cost-sensitive | test-after (same EDGE quality as tdd-refactor under clear spec, 2.3× cheaper) |
| Want refactoring benefits without TDD overhead | test-after-refactor (clear spec only — vague spec destroys edge-case coverage during refactor) |
| Speed-first, throwaway code | tdd-no-refactor or test-after (same changeability, lowest cost) |
Proposed follow-up: refactoring cadence and the test safety net¶
Status: proposed, not yet run. Pre-registration drafted; no data collected.
The second run left the small gap between the two refactor arms unresolved (see the Condition 2 clarification). tdd-refactor (664) and test-after-refactor (678) are statistically indistinguishable on changeability at n=3: the 14-line gap is inside the noise, but no formal equivalence test has been run. Two questions remain:
- Is the gap real at all, or does it vanish under adequate statistical power?
- If real, which mechanism produces it — refactoring granularity (many small in-loop refactors vs one post-hoc pass) or safety-net erosion (the one-shot refactor churning its own test suite)?
Hypotheses (to pre-register before any data)¶
- equivalence hypothesis: continuous-refactor and one-shot-refactor workflows are equivalent on cumulative blast radius within a ±5% margin (TOST). This is the default the second-run data points to; the experiment is powered to reject it if a real effect exists.
- granularity hypothesis: holding the test-protection factor constant, continuous refactoring yields lower blast radius than one-shot refactoring. Predicts the gap survives even when tests are protected from churn.
- safety-net hypothesis: freezing the test suite during the refactor raises EDGE pass rate (tests that document edge decisions survive) and lowers subsequent blast radius relative to a free refactor. Predicts test-suite churn mediates the blast-radius difference.
granularity hypothesis and safety-net hypothesis are not mutually exclusive; the design separates their contributions.
Design — a 2×2, adequately powered¶
Cross refactor granularity × test protection during refactor, with tdd-refactor as an external reference:
| | tests free to change in refactor | tests frozen during refactor |
|---|---|---|
| one-shot refactor (single pass after build) | test-after-refactor (current arm) | test-after-refactor-frozen |
| continuous refactor (refactor each increment) | test-after-continuous | test-after-continuous-frozen |
Plus tdd-refactor (continuous, test-first) as the reference point from the prior runs. Reuse the same 4 tasks, vague and clear spec, 3-change chain.
Power. The n=3 cells could not resolve a 2% effect. Estimate the per-cell blast-radius SD from the existing run1+run2 data, then size n for 80% power to detect a 5% difference (and to make the ±5% TOST equivalence test meaningful). Expect this to require roughly n = 12–15 per cell rather than 3; pre-register the exact n from the power calc.
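The sizing step can be sketched with the standard normal-approximation formula for comparing two means, n ≈ 2·((z₁₋α/₂ + z_power)·σ/δ)². The SD values below are placeholders for illustration, not estimates from the run1+run2 data, and sizing for the TOST itself would need a separate calculation.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell(sd, delta, alpha=0.05, power=0.80):
    """Approximate per-cell n to detect a mean difference `delta` (two-sided)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# 5% of the tdd-refactor mean blast radius (664 Δlines).
delta = 0.05 * 664

# Placeholder SDs: replace with the per-cell SD estimated from the existing data.
for sd in (20, 30, 40):
    print(f"assumed SD={sd}: n per cell = {n_per_cell(sd, delta)}")
```

The required n grows with the square of the SD, so the pre-registered n is quite sensitive to the SD estimate.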
Instrumentation (new, per stage). The current harness records blast radius and pass-rates; add:
- refactor granularity (actual): count of distinct refactor edits between first-green and stage-complete — verifies the assigned arm behaved as intended.
- test-suite churn during refactor: test LOC added + deleted between first-green and post-refactor. This is the mediator variable for safety-net hypothesis.
- carry forward CORE/EDGE pass rates and cost/stage.
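The churn metric could be derived from `git diff --numstat` output taken between the first-green and post-refactor snapshots. The sketch below only parses that output; how the two snapshots are identified and the rule for what counts as a test file (here, any path containing "test") are assumptions, and the sample text is illustrative.

```python
def suite_churn(numstat_text, is_test_path=lambda path: "test" in path.lower()):
    """Sum added + deleted lines over test files in `git diff --numstat` output."""
    churn = 0
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue  # numstat reports binary files as "-"
        if is_test_path(path):
            churn += int(added) + int(deleted)
    return churn


# Illustrative numstat output (columns: added, deleted, path).
sample = (
    "12\t3\ttests/test_store.py\n"
    "40\t25\tsrc/store.py\n"
    "0\t18\ttests/test_edges.py\n"
    "-\t-\tassets/logo.png\n"
)
print(suite_churn(sample))  # 33
```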
Analysis plan (pre-registered)¶
- Headline: TOST equivalence test on cumulative blast radius, continuous vs one-shot, ±5% margin → resolves question 1 (real vs noise) directly, including a "confirmed equivalent" outcome as a valid result.
- granularity hypothesis: two-factor model on blast radius; granularity main effect with test-protection held constant.
- safety-net hypothesis: mediation — does test-suite churn account for the granularity/protection effect on blast radius? Plus the direct EDGE comparison frozen vs free (this also re-tests the test-after-refactor Condition 1 finding that free refactor destroys edge coverage under vague spec).
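The headline equivalence test can be sketched as two one-sided tests on the difference in means. This version uses a normal approximation from the standard library to stay dependency-free; at n ≈ 12–15 a t-based TOST from a statistics package would be the more appropriate choice. The sample values are synthetic and for illustration only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_equivalence(a, b, margin, alpha=0.05):
    """Two one-sided z-tests for |mean(a) - mean(b)| < margin."""
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist()
    # H0 (lower): diff <= -margin; reject when diff is well above -margin.
    p_lower = 1 - z.cdf((diff + margin) / se)
    # H0 (upper): diff >= +margin; reject when diff is well below +margin.
    p_upper = z.cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha


margin = 0.05 * 664  # ±5% of the tdd-refactor mean blast radius

# Synthetic blast-radius samples, not experiment data.
continuous = [660, 670, 665, 668, 662, 666]
one_shot = [664, 672, 668, 670, 661, 667]

diff, p_value, equivalent = tost_equivalence(continuous, one_shot, margin)
print(f"diff={diff:.2f}, p={p_value:.4f}, equivalent={equivalent}")
```

Equivalence is declared only when both one-sided nulls are rejected, which is why the larger of the two p-values is reported.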
What each outcome would mean¶
| Result | Interpretation |
|---|---|
| TOST confirms equivalence | The refactor-arm tie is real; "refactoring is the mechanism" is the whole story, and how you refactor (granularity, ordering) does not move changeability. Strongest, simplest takeaway. |
| Granularity effect, no churn mediation | Continuous refactoring is independently better; recommend refactoring in small steps regardless of test ordering. |
| Churn mediation (safety-net hypothesis) | The cost of one-shot refactor is collateral test-suite damage; recommend protecting/regenerating the test suite across a refactor, and the test-after-refactor EDGE collapse and the changeability residual share one root cause. |
A null here is a publishable result: confirming that the two refactor workflows are genuinely interchangeable on changeability would let teams choose between them on the axes that did separate cleanly — cost and edge-case robustness under vague specs.