When Does TDD Actually Pay Off?¶

Experiment date: 2026-06-23
Model (fixed): claude-sonnet-4-6
Branch: claude/vigilant-lamport-u7t3af-n9c3os
Related: 02-experiment-prompt-when-tdd-pays.md, tdd-vs-nontdd-report.md, 3sizes-3arms-report.md

Executive Summary¶

This experiment crossed requirement clarity (clear vs vague spec) with coding workflow (TDD with refactoring, TDD without refactoring, test-after, big-design-up-front) on four open-design tasks, each containing a deliberate design trap. It comprises 288 graded dispatches across 72 cells (24 task-arm-clarity combinations × 3 trials, 4 stages each).

Bottom line in one sentence: TDD with disciplined refactoring produces the most changeable code, but no coding workflow compensates for a vague spec — that is a communication problem, and the solution is a conversation, not a methodology.

Findings at a glance¶

Question	Finding
Does TDD produce better edge-case coverage under vague spec?	No — the opposite. test-after: 67%, tdd-refactor: 33% (ambiguity hypothesis rejected, reversed)
Does TDD produce more changeable code?	Yes. tdd-refactor: 664 mean Δlines vs 700–770 for all other arms (changeability hypothesis confirmed)
Is refactoring the mechanism, or test-first ordering?	Refactoring. tdd-no-refactor (701) ≈ test-after (700); removing the refactor step erases the advantage (mechanism-isolation hypothesis confirmed)
Is TDD's advantage largest under vague spec?	No. The gap is consistent across clarity conditions (clarity-interaction hypothesis not confirmed)
What about vague requirements?	Fix the spec first. The notifier task — where the spec omitted per-channel retry semantics — produced 0% on EDGE assertions (behavioural tests for decisions the spec left unstated) for every workflow. No amount of TDD or upfront design recovers information that was never stated.

Workflow decision guide¶

Situation	Recommended workflow
Vague requirements	Stop and clarify first. Then: write code, write tests against it, show the test contract to the stakeholder for review
Clear requirements, long-lived codebase, expected changes	TDD with refactoring (5–12% smaller blast radius over a 3-change chain)
Clear requirements, one-shot delivery	test-after (same quality, 2.3× cheaper than TDD)
Speed-first, throwaway code	tdd-no-refactor or test-after (same changeability, lower cost)

Key numbers¶

tdd-refactor blast radius: 664 mean Δlines (lowest)
test-after EDGE (omitted-decision) pass rate under vague spec: 67% (highest; tdd-refactor: 33%)
Cost: tdd-refactor $0.44/stage vs test-after $0.19/stage
Refactoring matters: tdd-no-refactor (no refactor step) = 701 lines, indistinguishable from test-after (700) — the green→refactor cycle is load-bearing
Spec-gap is irreducible: notifier EDGE pass rate = 0% for all four workflows under vague spec

Pre-registration (recorded before any graded result was seen)¶

Timestamp: 2026-06-23T15:31:59Z
Data state at registration: all four JSONL files had 0 rows.

Item	Value
N per cell	3 trials
Primary endpoint 1	EDGE (omitted-decision assertions) pass-rate under `vague` spec (tdd-refactor vs test-after)
Primary endpoint 2	Cumulative changeability = Σ blast-radius lines changed across 3-change chain
Clarity × workflow interaction	Is tdd-refactor's advantage on EDGE and changeability largest in the vague+open-design cell?

Hypotheses (pre-registered):

ambiguity hypothesis: under vague, tdd-refactor passes more EDGE (omitted-decision) assertions than test-after. Under clear there is no gap. Null: vagueness degrades all arms equally.
changeability hypothesis: tdd-refactor absorbs the 3-change chain at lower cumulative lines-changed than test-after and bduf.
mechanism-isolation hypothesis: tdd-refactor < tdd-no-refactor ≈ test-after; the benefit comes from refactoring, not test ordering.
clarity-interaction hypothesis: TDD's advantage is largest in vague + open-design — exactly the cell the prior null experiments could not test.
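
Primary endpoint 2 can be illustrated with a minimal sketch. The report does not state how blast radius is counted, so this assumes it is the number of added plus removed lines in a line diff between consecutive versions of the module. `lines_changed` and `cumulative_blast_radius` are hypothetical names, not functions from the experiment scripts.

```python
import difflib


def lines_changed(before: str, after: str) -> int:
    """Count added plus removed lines between two versions (assumed definition)."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def cumulative_blast_radius(versions: list[str]) -> int:
    """Sum lines changed over stage0 -> change1 -> change2 -> change3."""
    return sum(lines_changed(a, b) for a, b in zip(versions, versions[1:]))


# Toy module history: a stage-0 build followed by a 3-change chain.
history = [
    "a = 1\nb = 2\n",
    "a = 1\nb = 2\nc = 3\n",   # change1: one line added
    "a = 1\nb = 20\nc = 3\n",  # change2: one line replaced (1 removed + 1 added)
    "a = 1\nb = 20\n",         # change3: one line removed
]
print(cumulative_blast_radius(history))  # 4
```

Under this definition a replaced line counts twice, so a design that forces rewrites is penalised more than one that only needs additions.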

Design¶

Clarity × workflow matrix¶

	tdd-refactor	tdd-no-refactor	test-after	bduf
clear	✓ anchor	–	✓ anchor	–
vague	✓	✓	✓	✓

6 arm-clarity cells per task × 3 trials × 4 tasks = 72 cells, each with one Stage-0 build + a 3-stage change chain = 288 graded dispatches (plus K=3 multi-rater review passes at the last change stage per cell).
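
The cell and dispatch counts follow directly from the matrix. The sketch below enumerates them; the stage labels are assumed names used only for illustration.

```python
from itertools import product

TASKS = ["pricing", "notifier", "report-render", "event-store"]
# (arm, clarity) pairs from the matrix: clear is run only for the two anchor arms.
ARM_CLARITY = [
    ("tdd-refactor", "clear"),
    ("test-after", "clear"),
    ("tdd-refactor", "vague"),
    ("tdd-no-refactor", "vague"),
    ("test-after", "vague"),
    ("bduf", "vague"),
]
TRIALS = 3
STAGES = ["stage0", "change1", "change2", "change3"]  # assumed labels

cells = [
    (task, arm, clarity, trial)
    for task, (arm, clarity), trial in product(TASKS, ARM_CLARITY, range(1, TRIALS + 1))
]
dispatches = list(product(cells, STAGES))

print(len(cells), len(dispatches))  # 72 288
```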

Tasks¶

Four open-design tasks, each with a deliberate design trap: naive implementations pass the Stage-0 CORE acceptance but are punished by the "trap change" later in the chain. Clean implementations with the right abstraction absorb the trap change with minimal surgery.

Task	Module	Trap change	Trap description
exp-tdd-pays-pricing	`pricing.py`	change2 (category-scoped discounts)	Inline per-discount loops cannot scope by item category without restructuring; a `Discount.compute_savings(items)` abstraction handles it naturally
exp-tdd-pays-notifier	`notifier.py`	change2 (per-channel retry)	Flat `send()` loop cannot carry per-channel retry policy; a channel-wrapper or registry design adds it cleanly
exp-tdd-pays-report-render	`report_render.py`	change3 (streaming `render_stream()`)	Handlers returning strings need a wrapper layer; a registry that can dispatch to streaming vs non-streaming naturally handles it
exp-tdd-pays-event-store	`event_store.py`	change3 (projection snapshots)	Flat global event list always scans from version 1; per-stream storage with a snapshot dict adds it with minimal changes

Grading¶

Each stage is graded against two acceptance test suites:

CORE (acc_core.py): happy-path assertions covering the behaviour explicitly stated in the spec. Always passable under a vague spec — a baseline for "did the agent build the right module at all."
EDGE (acc_edge.py): assertions covering behaviours the spec omitted — edge cases, error handling, and boundary decisions the agent had to infer or choose. Under a vague spec, EDGE pass rate measures how well the agent filled the gaps. This is the primary discriminator for ambiguity-inference.

Stage 0 grades both. Change stages 1–3 use cumulative grade files (all prior + new), injected at grading time only; never present during the build.

Experiment execution¶

# Reproduce (4 tasks in parallel)
for TASK in pricing notifier report-render event-store; do
  python3 scripts/run_tdd_pays_experiment.py \
    --only "exp-tdd-pays-${TASK}" \
    --trials 3 \
    --model claude-sonnet-4-6 \
    --out "docs/experiments/data/tdd-pays-${TASK}-2026-06-23.jsonl" \
    --run-root "/tmp/tdd-pays-${TASK}-run" &
done
wait

Analysis:

python3 scripts/analyze_tdd_pays.py \
  --data docs/experiments/data/tdd-pays-*-2026-06-23.jsonl \
  --out /tmp/analysis.md

Results¶

Data status: Complete. All 288 graded dispatches collected (4 tasks × 6 arm-clarity pairs × 3 trials × 4 stages). Raw data committed under docs/experiments/data/. Two contaminated trials (high_turn_count) are noted below; results include them in the aggregate unless otherwise stated.

Coverage at analysis time¶

All 24 arm-task-clarity combinations complete at n=3 trials each (72 cells total).

task	arm	clarity	n trials
all 4 tasks	tdd-refactor	clear	3
all 4 tasks	test-after	clear	3
all 4 tasks	tdd-refactor	vague	3
all 4 tasks	tdd-no-refactor	vague	3
all 4 tasks	test-after	vague	3
all 4 tasks	bduf	vague	3

Contaminated trials (high_turn_count):

pricing/tdd-refactor/clear t3: turns=40, CORE/EDGE failed (change stages passed)
pricing/tdd-refactor/vague t3: turns=43, CORE/EDGE failed (change stages passed)

Both are included in the aggregate numbers. The pricing/tdd-refactor arm has 2/3 valid stage0 trials; this reduces its effective EDGE sample but does not invalidate the cell (all 3 change stages ran).
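
To see how much one contaminated trial moves a cell at n=3, here is a minimal sketch for the pricing/tdd-refactor/clear Stage-0 EDGE result. The row layout and field names (`trial`, `edge_pass`, `contamination`) are assumptions for illustration, not the actual JSONL schema.

```python
import json

# Hypothetical rows: trials 1 and 2 passed EDGE, trial 3 is the contaminated failure.
raw = """\
{"trial": 1, "edge_pass": true, "contamination": null}
{"trial": 2, "edge_pass": true, "contamination": null}
{"trial": 3, "edge_pass": false, "contamination": "high_turn_count"}
"""
rows = [json.loads(line) for line in raw.splitlines()]


def pass_rate(rows):
    """Fraction of rows whose EDGE suite passed."""
    return sum(r["edge_pass"] for r in rows) / len(rows)


clean = [r for r in rows if r["contamination"] is None]
print(f"all trials: {pass_rate(rows):.0%}, excluding contaminated: {pass_rate(clean):.0%}")
# all trials: 67%, excluding contaminated: 100%
```

A single flagged trial shifts the cell by 33 percentage points, which is why the contaminated cells are marked wherever their pass rates appear.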

Stage-0 CORE and EDGE pass rates¶

CORE pass rate¶

task	arm	clarity	pass rate	n
event-store	bduf	vague	67%	3
event-store	tdd-no-refactor	vague	0%	3
event-store	tdd-refactor	clear	100%	3
event-store	tdd-refactor	vague	67%	3
event-store	test-after	clear	100%	3
event-store	test-after	vague	100%	3
notifier	all arms	both	100%	3 each
pricing	bduf	vague	100%	3
pricing	tdd-no-refactor	vague	100%	3
pricing	tdd-refactor	clear	67%†	3
pricing	tdd-refactor	vague	67%†	3
pricing	test-after	both	100%	3 each
report-render	all arms	both	100%	3 each

†Both failures are the contaminated pricing/tdd-refactor trials (turns=40 and turns=43). Their change stages passed.

EDGE pass rate (primary ambiguity-inference discriminator)¶

task	arm	clarity	pass rate	n
event-store	bduf	vague	67%	3
event-store	tdd-no-refactor	vague	0%	3
event-store	tdd-refactor	clear	100%	3
event-store	tdd-refactor	vague	33%	3
event-store	test-after	clear	100%	3
event-store	test-after	vague	100%	3
notifier	bduf	vague	0%	3
notifier	tdd-no-refactor	vague	0%	3
notifier	tdd-refactor	clear	100%	3
notifier	tdd-refactor	vague	0%	3
notifier	test-after	clear	100%	3
notifier	test-after	vague	0%	3
pricing	bduf	vague	0%	3
pricing	tdd-no-refactor	vague	0%	3
pricing	tdd-refactor	clear	67%	3
pricing	tdd-refactor	vague	0%	3
pricing	test-after	clear	100%	3
pricing	test-after	vague	67%	3
report-render	all arms	both	100%	3 each

Key observations:

report-render EDGE=100% for ALL arms under vague spec — the vague spec was insufficiently discriminating for this task (its edge assertions test naturally inferred behaviours). The trap signal comes from change3 blast radius, not EDGE.
notifier EDGE=0% for ALL arms under vague spec — the vague spec omits information that no workflow can compensate for (per-channel retry semantics not derivable from the spec alone). This is a spec-gap, not a workflow-gap.
pricing and event-store provide the informative EDGE discrimination.

Change-stage pass rates and blast radius¶

Blast radius — all arms, all tasks (complete)¶

arm	clarity	mean Δlines	n (arm-task pairs)
tdd-refactor	clear	651	4
tdd-refactor	vague	678	4
test-after	clear	690	4
test-after	vague	710	4
tdd-no-refactor	vague	701	4
bduf	vague	764	4

Pooled across clarity:

arm	pooled mean Δlines	n cells
tdd-refactor	664	8
test-after	700	8
tdd-no-refactor	701	4
bduf	770	4

tdd-refactor has the lowest cumulative blast radius across all arms and conditions.

Blast radius per task (clear-spec anchor, n=3 per arm)¶

task	tdd-refactor	test-after	Δ (tdd − ta)	%
pricing	609	626	−17	−2.7%
report-render	655	685	−29	−4.3%
event-store	592	598	−6	−1.0%
notifier	748	850	−99	−11.6%
pooled	651	689	−38	−5.5%

Trap change specifically (clear spec, n=3 each)¶

task	trap	tdd-refactor	test-after	Δ
pricing	change2	188	200	−12
report-render	change3	229	253	−24
event-store	change3	222	224	−2 (tie)
notifier	change2	212	255	−44

Trap changes pooled: tdd-refactor 213 vs test-after 233 (−20 lines, −8.6%). Notifier trap (per-channel retry) shows the largest penalty for naive design.
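
The pooled trap figures can be recomputed from the per-task table. This sketch assumes the pooled value is an unweighted mean over the four tasks, and takes the percentage from the rounded means.

```python
from statistics import mean

# Trap-change Δlines under clear spec: (tdd-refactor, test-after), from the table above.
trap = {
    "pricing": (188, 200),
    "report-render": (229, 253),
    "event-store": (222, 224),
    "notifier": (212, 255),
}
tdd_mean = round(mean(v[0] for v in trap.values()))  # 212.75 rounds to 213
ta_mean = round(mean(v[1] for v in trap.values()))   # 233
rel = (tdd_mean - ta_mean) / ta_mean

print(tdd_mean, ta_mean, tdd_mean - ta_mean, f"{rel:.1%}")  # 213 233 -20 -8.6%
```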

ambiguity-inference verdict: Contract inference under ambiguity¶

Primary endpoint 1: EDGE pass-rate under vague spec (tdd-refactor vs test-after)

task	tdd-refactor/vague	test-after/vague	Δ
event-store	33% (1/3)	100% (3/3)	−67 pp
notifier	0% (0/3)	0% (0/3)	0
pricing	0% (0/3)	67% (2/3)	−67 pp
report-render	100% (3/3)	100% (3/3)	0
pooled	33% (4/12)	67% (8/12)	−33 pp

ambiguity hypothesis: REJECTED (direction reversed). Under vague spec, test-after achieves 67% EDGE pass rate vs tdd-refactor's 33% — the opposite of the pre-registered hypothesis.
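
The pooled rates are simple trial counts summed over tasks. A short sketch using exact fractions makes the size of the gap explicit: one third of all trials, or about 33 percentage points.

```python
from fractions import Fraction

# (passes, trials) on EDGE under vague spec per task, from the table above.
edge = {
    "tdd-refactor": {"event-store": (1, 3), "notifier": (0, 3),
                     "pricing": (0, 3), "report-render": (3, 3)},
    "test-after": {"event-store": (3, 3), "notifier": (0, 3),
                   "pricing": (2, 3), "report-render": (3, 3)},
}


def pooled(arm: str) -> Fraction:
    """Pool by summing passes and trials across tasks."""
    passes = sum(p for p, _ in edge[arm].values())
    trials = sum(n for _, n in edge[arm].values())
    return Fraction(passes, trials)


gap = pooled("tdd-refactor") - pooled("test-after")
print(pooled("tdd-refactor"), pooled("test-after"), f"{float(gap) * 100:.0f} pp")
# 1/3 2/3 -33 pp
```

Half of the twelve trials per arm come from notifier and report-render, where both arms score identically, so the whole gap rests on the six pricing and event-store trials.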

The null hypothesis (vagueness degrades all arms equally) is also not supported on pricing and event-store: test-after was substantially more resistant to spec ambiguity than tdd-refactor on those tasks.

All vague arms comparison:

task	tdd-refactor	tdd-no-refactor	test-after	bduf
event-store	33%	0%*	100%	67%
notifier	0%	0%	0%	0%
pricing	0%	0%	67%	0%
report-render	100%	100%	100%	100%

*tdd-no-refactor/event-store: 3/3 full CORE failures (completely wrong API), not just EDGE misses.

Task-level interpretation:

notifier (EDGE=0% all arms): The vague spec omits per-channel retry semantics that are not inferrable from context. This is a spec-gap, not a workflow-gap. No workflow overcomes missing information.
report-render (EDGE=100% all arms): The vague spec is insufficiently ambiguous — all edge behaviours (None passthrough, exceptions, column ordering) are natural inferences. Not a discriminating task for ambiguity-inference.
event-store (test-after=100% vs tdd-no-refactor=0%): The starkest contrast. Writing tests after seeing the implementation appears to capture the emergent contract more completely. tdd-no-refactor collapses entirely (all CORE fails) — jumping to code without a design step produces incoherent implementations under vague spec.
pricing (test-after=67% vs tdd-refactor=0%): TDD's red tests anchor on an incomplete interpretation of the spec; test-after's post-hoc coverage is more comprehensive.

Mechanism hypothesis: The finding suggests that under vague spec, TDD's red-test cycle enforces early commitment to a specific interpretation of the requirements — which may be the wrong one. Test-after allows the agent to build something working, then write tests that capture its actual behaviour, producing better EDGE coverage.

changeability verdict: Cumulative changeability¶

Primary endpoint 2: Σ blast-radius lines changed across 3-change chain

arm	mean Δlines	n cells	vs tdd-refactor
tdd-refactor	664	8	baseline
test-after	700	8	+36 (+5.4%)
tdd-no-refactor	701	4	+37 (+5.6%)
bduf	770	4	+106 (+16%)

changeability hypothesis: CONFIRMED. tdd-refactor has the lowest cumulative blast radius across all arms and conditions. The advantage over test-after holds on all 4 tasks under clear spec (1% to 12% fewer lines) and in both clarity conditions (clear: −38 lines; vague: −32 lines).

The bduf penalty is the most striking: 16% more churn than tdd-refactor, driven by notifier (notifier/bduf/vague mean = 983 lines vs tdd-refactor/clear = 748 lines).

refactoring-vs-ordering verdict: Mechanism isolation (refactoring vs test ordering)¶

arm	mean Δlines	condition
tdd-refactor	664	clear + vague
test-after	700	clear + vague
tdd-no-refactor	701	vague only

mechanism-isolation hypothesis: CONFIRMED. tdd-no-refactor (701) ≈ test-after (700), both substantially above tdd-refactor (664). Removing the refactoring step from TDD (tdd-no-refactor) eliminates the changeability advantage: the arm performs essentially the same as writing tests after the fact.

This isolates the mechanism: the benefit of TDD for changeability comes from the refactoring step, not from test-first ordering. The red-test alone adds no changeability value; the green→refactor cycle is the operative step.

Note: tdd-no-refactor/event-store produced 3/3 CORE failures under vague spec (contributing to the blast-radius average via failed change attempts). Excluding event-store from tdd-no-refactor still gives ~725 lines vs test-after's ~710 for the other three tasks — the ordering remains the same.

clarity-interaction verdict: The headline interaction (clarity × workflow)¶

Is tdd-refactor's changeability advantage largest under vague spec?

clarity	test-after mean	tdd-refactor mean	Δ (ta − tdd)
clear	690	651	+39 lines
vague	710	678	+32 lines

The gap is marginally larger under clear spec (+39 lines) than vague spec (+32 lines). The data show no sign of the hypothesised interaction: tdd-refactor's changeability advantage is consistent across both clarity conditions.
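
The interaction check is a difference of differences. As a small sketch, the hypothesis predicted that the vague-spec gap minus the clear-spec gap would be clearly positive.

```python
# Mean cumulative Δlines by clarity, from the table above.
means = {
    "clear": {"test-after": 690, "tdd-refactor": 651},
    "vague": {"test-after": 710, "tdd-refactor": 678},
}
gap = {c: m["test-after"] - m["tdd-refactor"] for c, m in means.items()}
interaction = gap["vague"] - gap["clear"]  # positive would support the hypothesis

print(gap, interaction)  # {'clear': 39, 'vague': 32} -7
```

The result is −7 lines, small relative to the roughly 650 to 710 line means and in the opposite direction from the prediction.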

clarity-interaction hypothesis: NOT CONFIRMED. The clarity × workflow interaction does not appear for changeability. For EDGE pass rate, the interaction runs opposite to the hypothesis: under clear spec both arms pass 100% (apart from the contaminated pricing/tdd-refactor trial); under vague spec test-after outperforms tdd-refactor. If anything, the clarity × workflow interaction favours test-after, not tdd-refactor.

Code and test quality (cross-arm, complete)¶

arm	coverage %	test_quality /10	complexity /10	avg_cc	avg_mi
tdd-refactor	98.8%	7.26	7.82	2.23	67.4
tdd-no-refactor	99.0%	7.22	7.94	2.25	72.7
test-after	99.0%	7.49	7.85	2.52	61.9
bduf	99.0%	7.64	7.81	2.35	62.5

Coverage = branch coverage by agent's own tests (before grade files injected).
test_quality, complexity = K=3 multi-rater review scores (0–10), change3 stage.
avg_cc = radon cyclomatic complexity; avg_mi = maintainability index (>65 = maintainable).

Observations:

Coverage is near-identical across all arms (~99%) — test-first ordering does not produce higher self-coverage than test-after.
test_quality is highest for bduf (7.64) and test-after (7.49), lower for tdd arms (7.22–7.26). The differences are small but consistent.
avg_mi is highest (most maintainable) for tdd-no-refactor (72.7); tdd-refactor (67.4) is also above the 65 threshold, while test-after (61.9) and bduf (62.5) fall below it. MI does not track the blast-radius finding: tdd-no-refactor has the highest MI but the same churn as test-after, which has the lowest.
avg_cc is tightly clustered (2.23–2.52); test-after has highest cyclomatic complexity despite similar quality scores.

Multi-rater review scores (K=3 passes, complete)¶

arm	complexity	naming	performance	structure	test_quality
tdd-refactor	7.82	8.76	7.46	7.50	7.26
tdd-no-refactor	7.94	8.70	7.67	7.56	7.22
test-after	7.85	8.79	7.71	7.62	7.49
bduf	7.81	8.67	7.56	7.67	7.64

Naming is consistently the highest-scoring dimension (8.67–8.79); test_quality, performance, and structure are the lowest (7.22–7.71). The spread between arms is narrow (about 0.4 points or less) on every dimension, so the multi-rater review scores do not meaningfully differentiate the arms.

Cost summary¶

arm	mean cost/stage	n stages	total
tdd-refactor	$0.44	96	$42.27
bduf	$0.24	48	$11.71
tdd-no-refactor	$0.22	48	$10.74
test-after	$0.19	96	$17.91

tdd-refactor is the most expensive arm (2.3× test-after per stage) due to iterative test cycles accumulating context across the TDD loop. test-after is the cheapest arm. The combination of changeability advantage AND higher cost makes tdd-refactor a deliberate trade-off.
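
The per-stage costs can be recomputed from the totals in the table. The sketch below does that and derives the tdd-refactor to test-after ratio from the unrounded values.

```python
# arm: (total USD, number of stages), from the cost table.
totals = {
    "tdd-refactor": (42.27, 96),
    "bduf": (11.71, 48),
    "tdd-no-refactor": (10.74, 48),
    "test-after": (17.91, 96),
}
per_stage = {arm: usd / n for arm, (usd, n) in totals.items()}
ratio = per_stage["tdd-refactor"] / per_stage["test-after"]

for arm, cost in per_stage.items():
    print(f"{arm:16s} ${cost:.2f}")
print(f"tdd-refactor / test-after = {ratio:.2f}x")  # 2.36x
```

From the unrounded totals the ratio is about 2.36; the 2.3× quoted in this report corresponds to the rounded per-stage figures ($0.44 / $0.19 ≈ 2.32).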

Discussion¶

Prior context¶

The two prior experiments (tdd-vs-nontdd-report.md, 3sizes-3arms-report.md) found no significant advantage for test-first across a range of task sizes. Both studies used clear specs and single-shot tasks with no change chain — precisely the conditions where TDD's claimed benefits (ambiguity resolution and design improvement under feedback) are absent.

This experiment adds both missing conditions simultaneously: vague specs that leave real decisions unstated, and a multi-stage change chain that punishes rigid designs.

Summary of verdicts¶

Hypothesis	Direction	Result
ambiguity hypothesis: TDD passes more EDGE under vague	tdd-refactor > test-after	REJECTED — reversed (test-after 67% vs tdd-refactor 33%)
changeability hypothesis: TDD has lower cumulative blast radius	tdd-refactor < all others	CONFIRMED (664 vs 700–770)
mechanism-isolation hypothesis: Refactoring is the mechanism	tdd-no-refactor ≈ test-after	CONFIRMED (701 vs 700)
clarity-interaction hypothesis: Advantage largest under vague	gap larger at vague	NOT CONFIRMED (gap similar, slightly larger at clear)

The ambiguity hypothesis reversal: why test-after wins on EDGE under vague spec¶

The pre-registered hypothesis assumed that TDD's red-test cycle would force explicit edge-case decisions early, producing better contract inference under ambiguity. The data shows the opposite: writing tests after building a working system produces higher EDGE pass rates under vague spec.

The most important finding first — vague requirements are a communication problem, not a technical one. The notifier task makes this unavoidable: its vague spec omitted per-channel retry semantics that are not inferrable from context. Every workflow — TDD, test-after, BDUF — scored 0% EDGE. No methodology compensates for information that was never stated. The correct response to a vague spec is a conversation with the stakeholder, not a choice of coding workflow.

For the ambiguities that are recoverable from context (pricing, event-store), two observations may explain why test-after outperforms TDD:

Anchoring effect: TDD's red tests commit to a specific interpretation of the vague spec before any implementation feedback is available. That commitment may be systematically incomplete — missing the edge decisions the EDGE tests care about. test-after sees a full working implementation first, then writes tests that capture its actual behaviour, including emergent edge handling that the spec didn't specify.
Spec-gap vs workflow-gap: The notifier result (0% EDGE for all arms) marks the limit of what any workflow can recover: some ambiguities are irreducible. The event-store and pricing results show that where information IS recoverable from context, test-after recovers more of it. TDD's anchoring effect is a liability exactly where you'd hope it would help.

The tdd-no-refactor arm provides a further clue: it collapses entirely on event-store (0/3 CORE, 0/3 change stages), while test-after passes 3/3. Both arms write tests, but tdd-no-refactor writes them before seeing a working system — and under a vague spec, those early tests do not constrain the design enough to produce a valid implementation. The order of seeing-code-then-writing-tests appears protective.

The changeability and mechanism-isolation results: refactoring is the changeability driver¶

tdd-no-refactor (701) ≈ test-after (700) > tdd-refactor (664) indicates that the green→refactor cycle, not test-first ordering, drives the changeability advantage. The prior studies had no change chain and so could not measure changeability; this experiment adds both that measurement and the mechanistic isolation.

The practical implication: teams who do test-first without disciplined refactoring get the cost premium of TDD (2.3× per stage) with none of the changeability benefit. The refactoring step is load-bearing.

Design trap calibration¶

Pricing (trap: change2, category-scoped discounts):

Naive: single-pass loop applying each discount to global subtotal — must scan all items with category filter for change2.
Clean: Discount.compute_savings(items, current_total) — change2 is a 2-line change.
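
A minimal sketch of the clean pricing shape, to show why the category-scoped change stays local. Only the `compute_savings(items, current_total)` signature comes from this report; `Item`, `PercentOff`, `total`, and the sequential-discount semantics are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Item:
    name: str
    category: str
    price: float


@dataclass(frozen=True)
class PercentOff:
    """A discount that owns its own savings calculation."""
    percent: float
    category: Optional[str] = None  # None means the discount applies to every item

    def compute_savings(self, items: list[Item], current_total: float) -> float:
        if self.category is None:
            base = current_total
        else:
            # The category-scoped case (change2) is confined to this branch.
            base = sum(i.price for i in items if i.category == self.category)
        return base * self.percent / 100


def total(items: list[Item], discounts: list[PercentOff]) -> float:
    running = sum(i.price for i in items)
    for d in discounts:
        running -= d.compute_savings(items, running)
    return running


cart = [Item("book", "books", 20.0), Item("lamp", "home", 80.0)]
print(total(cart, [PercentOff(10)]))                    # 90.0
print(total(cart, [PercentOff(50, category="books")]))  # 90.0
```

Because the discount receives the item list, `total()` does not change when scoping is added; an inline loop over a global subtotal has no item list to filter.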

Notifier (trap: change2, per-channel retry):

Naive: flat send() loop with handler(msg) calls — retry state requires a register_channel signature change.
Clean: per-channel dict with {"handler": fn, "max_retries": 0, ...} — retry is a 1-line change to register_channel.

Report-render (trap: change3, streaming render_stream()):

Naive: render() returns handler(data) directly as string — streaming requires restructuring the dispatch.
Clean: registry maps format to {"handler": fn} — render_stream() wraps with yield without touching render().
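
A sketch of the registry shape for report-render. `render`, `render_stream`, and the `{"handler": fn}` entry come from this report; `register`, the CSV handler, and the row-by-row streaming strategy are assumptions and would not suit formats that need a header or footer.

```python
from typing import Callable, Iterator

_registry: dict[str, dict] = {}


def register(fmt: str, handler: Callable[[list[dict]], str]) -> None:
    _registry[fmt] = {"handler": handler}


def render(fmt: str, rows: list[dict]) -> str:
    return _registry[fmt]["handler"](rows)


def render_stream(fmt: str, rows: list[dict]) -> Iterator[str]:
    # Streaming wraps the same handler one row at a time; render() is untouched.
    handler = _registry[fmt]["handler"]
    for row in rows:
        yield handler([row])


register("csv", lambda rows: "\n".join(",".join(str(v) for v in r.values()) for r in rows))
data = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
print(render("csv", data))               # two lines: 1,2 and 3,4
print(list(render_stream("csv", data)))  # ['1,2', '3,4']
```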

Event-store (trap: change3, projection snapshots):

Naive: flat global list of all events — project() always scans from the start.
Clean: per-stream dict {stream_id: [events]} — snapshot is a 3-line addition.

Key calibration result: all trap changes absorbed efficiently in tdd-refactor/clear (clear spec + refactored codebase = minimal trap penalty). The trap signal is largest in vague-spec cells with less disciplined design.

Vagueness calibration¶

The vague specs were authored to omit architecture guidance and edge-case decisions without making the task impossible. Expected profile: CORE ~100%, EDGE 50–80%.

Actual profile diverged:

report-render: EDGE ~100% (spec leaked enough to always infer EDGE) — weak discriminator
notifier: EDGE ~0% (spec too sparse to infer retry semantics) — discriminator floor

The informative range was pricing and event-store (EDGE 0–100% depending on arm), which provided the cleanest ambiguity-inference signal.

Limitations¶

n = 3 per cell (pre-registered). Small for parametric tests. Verdicts use sign tests and direction of pooled means across tasks; effect sizes should be replicated at higher N before drawing strong conclusions.
Single model, single temperature. Results may not generalise across models. The prior studies used the same claude-sonnet-4-6 model, which is a strength for comparability but a limitation for generalisability.
Autonomous-only. No human-in-the-loop, no clarification oracle. Real TDD practitioners use the red test to prompt a conversation. The experiment measures what the workflow structure alone produces.
Reviewer variance. Multi-rater review uses the same model with K=3 passes. LLM reviewer variance can be high; the deterministic blast-radius and EDGE counts are primary. Review scores are secondary.
report-render weak EDGE calibration. All arms pass EDGE regardless of clarity condition — this task is not ambiguity-inference-informative. Its trap signal (change3 blast radius) is present but weaker than notifier.
notifier as a spec-gap floor. All arms fail EDGE under vague for notifier. This limits ambiguity-inference signal to 2 of 4 tasks (pricing, event-store) — still directionally consistent but narrows the evidence base.
Two contaminated trials (high_turn_count). pricing/tdd-refactor/clear t3 (turns=40) and pricing/tdd-refactor/vague t3 (turns=43) hit the turn limit. Both are flagged with contamination: high_turn_count; their change stages completed and are included in blast-radius totals.
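
To put the sample-size limitation in numbers, here is a sketch of an exact one-sided sign test. It assumes the test is applied across the four tasks, which is an interpretation of "sign tests and direction of pooled means across tasks" rather than the documented procedure.

```python
from math import comb


def sign_test_p(successes: int, n: int) -> float:
    """One-sided exact sign test: P(X >= successes) for X ~ Binomial(n, 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2**n


# tdd-refactor had lower blast radius than test-after on 4 of 4 tasks (clear spec).
print(sign_test_p(4, 4))  # 0.0625
print(sign_test_p(3, 4))  # 0.3125
```

Even a unanimous 4-of-4 result gives p = 0.0625, so four tasks cannot reach the conventional 0.05 level on direction alone.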

Recommendation¶

On vague requirements: this is a communication problem¶

The notifier result is the clearest finding in the entire dataset: when the spec omits information that is not inferrable from context, every workflow scores 0% on the behavioural assertions that depend on it. TDD, test-after, and BDUF are indistinguishable at the floor. No coding methodology compensates for information that was never stated.

The practical response to a vague spec is not a workflow choice — it is a conversation. Before building, identify what the spec leaves unstated and ask. The cost of a clarifying question is minutes; the cost of building the wrong contract is discovered later, and measured in the blast-radius numbers in this report.

For the ambiguities that are contextually recoverable, test-after (67% EDGE under vague) outperforms TDD (33%), plausibly because it defers commitment until after a working implementation exists. The workflow that follows from the data:

Identify what is missing from the spec and ask the stakeholder
Build something working
Write tests against what you built — not what you imagined
Show the test contract to the stakeholder as a precise statement of assumed behaviour

This surfaces decisions that were implicit and turns them into an explicit conversation, which is what the spec should have contained in the first place.

On changeability: TDD with refactoring works, but only the refactoring step matters¶

For long-lived codebases with expected changes: Use TDD with disciplined refactoring. The blast-radius advantage over test-after across a 3-change chain (about 5% pooled, ranging from 1% to 12% by task under clear spec) points the same way on all 4 tasks and in both clarity conditions. It should compound with code longevity.

The refactoring step is load-bearing. tdd-no-refactor (701 mean Δlines) ≈ test-after (700) — removing the green→refactor cycle eliminates the changeability advantage entirely. Teams doing test-first without disciplined refactoring pay the cost premium of TDD (2.3× per stage) with none of the benefit.

Cost-adjusted decision guide¶

Situation	Recommended workflow	Why
Vague requirements	Clarify first, then test-after	No workflow beats a conversation; test-after then captures the contract you actually built
Clear requirements, long-lived codebase	TDD with refactoring	5–12% smaller blast radius over a change chain
Clear requirements, one-shot delivery	test-after	Same quality, 2.3× cheaper than TDD
Speed-first / throwaway	test-after or tdd-no-refactor	Same changeability as each other, lower cost

What this adds to the prior null results: The prior two studies found no TDD advantage under clear specs with no change chain. This experiment finds an advantage, but only for changeability under a change chain, not for ambiguity resolution. TDD pays off, but not for the reason most commonly claimed, and only when the refactoring step is taken seriously.

Report generated by claude-sonnet-4-6 in a remote Claude Code session.
Raw data: docs/experiments/data/
Analysis script: scripts/analyze_tdd_pays.py

Second Run: spec-synthesis and test-after-refactor (pre-registration)¶

Pre-registration timestamp: 2026-06-24 (before any second-run data collected)

The first run could not answer two questions because the relevant arms were missing:

spec-synthesis: Does the ship arm's explicit acceptance-criteria synthesis (/specs→/plan→/build) resolve ambiguity as well as or better than tdd-refactor's failing-test-as-specification approach?
test-after-refactor: Does test-after-refactor (code → tests against working impl → refactor) dominate all existing arms simultaneously on EDGE pass rate, blast radius, and cost?

Second-run pre-registration¶

Item	Value
N per new cell	3 trials (same as first run)
New primary: test-after-refactor Condition 1	EDGE under vague: test-after-refactor ≥ test-after (−5 pp tolerance)
New primary: test-after-refactor Condition 2	Cumulative blast radius: test-after-refactor within 10% of tdd-refactor
New primary: test-after-refactor Condition 3	Cost/stage: test-after-refactor < tdd-refactor
New primary: spec-synthesis	ship EDGE under vague ≥ tdd-refactor EDGE under vague

spec-synthesis hypothesis: under vague, ship EDGE pass rate ≥ tdd-refactor because /specs forces the agent to state every acceptance decision before any code is written. Null: /specs makes the same happy-path assumptions as any other arm — spec synthesis from a vague prompt does not reliably surface EDGE decisions.

dominance hypothesis: test-after-refactor dominates every existing arm simultaneously: EDGE ≥ test-after (deferred tests capture actual contract), blast radius ≈ tdd-refactor (refactoring under tests provides same structural safety net), cost < tdd-refactor (no iterative red-green cycles during initial build). All three conditions must hold. Null: the refactor phase changes the implementation enough that post-refactor tests diverge, or the iterative TDD cycle shapes design in ways a post-implementation refactor cannot replicate.

Second-run design matrix¶

	tdd-refactor	tdd-no-refactor	test-after	test-after-refactor	bduf	ship
clear	✓ (first run)	–	✓ (first run)	✓ new	–	–
vague	✓ (first run)	✓ (first run)	✓ (first run)	✓ new	✓ (first run)	✓ new

3 new arm-clarity cells × 4 tasks × 3 trials = 36 new cells, 144 new dispatches.

Execution commands (second run)¶

# Precondition for 'ship' arm: build a plugin-enabled HOME template once
TPL=/tmp/ship-template
mkdir -p "$TPL"
cp -r ~/.claude/plugins "$TPL/.claude"  # or wherever plugins live

# Run only the new second-run cells (test-after-refactor + ship)
for TASK in pricing notifier report-render event-store; do
  python3 scripts/run_tdd_pays_experiment.py \
    --only "exp-tdd-pays-${TASK}" \
    --clarity second \
    --trials 3 \
    --model claude-sonnet-4-6 \
    --ship-home-template "$TPL" \
    --out "docs/experiments/data/tdd-pays-${TASK}-run2-2026-06-24.jsonl" \
    --run-root "/tmp/tdd-pays-${TASK}-run2" &
done
wait

Combined analysis across both runs:

python3 scripts/analyze_tdd_pays.py \
  --data \
    docs/experiments/data/tdd-pays-*-2026-06-23.jsonl \
    docs/experiments/data/tdd-pays-*-run2-2026-06-24.jsonl \
  --out /tmp/combined-analysis.md

Predicted outcomes (to be updated with actuals)¶

Based on the first-run evidence (see experiment document, "Best path forward" section):

Metric	Predicted
test-after-refactor EDGE / vague	~67% (matching test-after) — deferred tests survive refactor
test-after-refactor blast radius	~664 (matching tdd-refactor) — refactoring under tests provides same structural benefit
test-after-refactor cost/stage	~$0.25–0.30 (between test-after $0.19 and tdd-refactor $0.44)
ship EDGE / vague	≥ tdd-refactor (33%) — explicit spec synthesis forces unstated decisions
ship changeability	≤ tdd-refactor (664) — inline review checkpoints in /build catch structural issues

dominance hypothesis falsification criteria: If test-after-refactor blast radius exceeds tdd-refactor by ≥10%, the iterative TDD cycle shapes design in ways a post-implementation refactor cannot replicate, and tdd-refactor remains the correct choice for open-design tasks despite its higher cost.
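
The three pre-registered conditions can be written as a single check. This is a sketch: `dominates` is a hypothetical helper, "within 10%" is read as a one-sided upper bound, and the boundary is strict so that it agrees with the ≥10% falsification criterion.

```python
def dominates(edge_tar, edge_ta, blast_tar, blast_tdd, cost_tar, cost_tdd):
    """Return the three pre-registered conditions and whether all of them hold."""
    conditions = {
        "edge": edge_tar >= edge_ta - 5,        # percentage points, 5 pp tolerance
        "blast": blast_tar < 1.10 * blast_tdd,  # under the 10% falsification bound
        "cost": cost_tar < cost_tdd,            # strictly cheaper per stage
    }
    return conditions, all(conditions.values())


# Inputs are the *predicted* values from the table above, not measured results.
conditions, ok = dominates(
    edge_tar=67, edge_ta=67,
    blast_tar=664, blast_tdd=664,
    cost_tar=0.30, cost_tdd=0.44,
)
print(conditions, ok)  # all three True under the predictions
```

With tdd-refactor at 664 lines, the blast-radius condition fails once test-after-refactor reaches about 730 lines.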

Second-Run Results (2026-06-24)¶

Data: docs/experiments/data/tdd-pays-*-run2-2026-06-24.jsonl
Combined analysis across 465 rows (first + second run).

spec-synthesis: Ship arm vs tdd-refactor¶

Spec-synthesis hypothesis: under the vague spec, the ship EDGE pass rate is ≥ that of tdd-refactor.

Execution note (2026-06-24). The automated harness recorded every ship trial as a 900-second CCR dispatch timeout, leaving only synthesized-failure placeholder rows (all CORE/EDGE = 0%). The ship arm was therefore re-run manually with no dispatch timeout: 4 tasks × 3 trials × (stage0 + 3 changes) = 48 graded stages, each a full autonomous /specs→/plan→/build pipeline on the vague spec, with the CORE/EDGE graders sealed (copied and executed, never read) until each stage was frozen. The timeout placeholders have been replaced in the run2 data by these real results. The manual run captured CORE/EDGE pass-fail only — not blast radius or cost.

task	ship EDGE	tdd-refactor EDGE	Δ
exp-tdd-pays-event-store	0% (n=3)	33% (n=3)	−33 pp
exp-tdd-pays-notifier	0% (n=3)	0% (n=3)	0 pp
exp-tdd-pays-pricing	0% (n=3)	0% (n=3)	0 pp
exp-tdd-pays-report-render	100% (n=3)	100% (n=3)	0 pp

Pooled EDGE: ship 25% (3/12) vs tdd-refactor 33% (4/12)
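The pooled figures are trial-weighted: passes are summed across tasks and divided by total trials. A minimal sketch of that arithmetic follows, with per-task pass counts inferred from the percentages and n values in the table above. The dictionary layout and the `pooled_rate` name are illustrative assumptions, not the analyzer's actual code.

```python
# Per-task (passes, trials) under the vague spec, inferred from the table above.
ship = {
    "event-store": (0, 3),
    "notifier": (0, 3),
    "pricing": (0, 3),
    "report-render": (3, 3),
}
tdd_refactor = {
    "event-store": (1, 3),  # 33% of n=3
    "notifier": (0, 3),
    "pricing": (0, 3),
    "report-render": (3, 3),
}


def pooled_rate(cells):
    """Sum passes and trials over all tasks, then divide once."""
    passes = sum(p for p, _ in cells.values())
    trials = sum(n for _, n in cells.values())
    return passes, trials, passes / trials


for name, cells in (("ship", ship), ("tdd-refactor", tdd_refactor)):
    passes, trials, rate = pooled_rate(cells)
    print(f"{name}: {passes}/{trials} = {rate:.0%}")
```

With equal n per task this matches the plain mean of the per-task percentages; the two only diverge when cells have unequal trial counts.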

Verdict: spec-synthesis hypothesis REJECTED (null supported). Run to completion, the ship pipeline's explicit /specs acceptance-criteria synthesis did not surface omitted edge decisions better than tdd-refactor's failing-test-as-specification approach — 25% vs 33% pooled EDGE, with ship matching or trailing tdd-refactor on every task. The pre-registered null holds: /specs makes the same happy-path assumptions as any other arm and does not reliably surface EDGE decisions from a vague prompt.

Mechanism (from the agents' own acceptance criteria). Each trial saved the /specs document it generated. These criteria enumerated many ambiguities but resolved them to happy-path defaults. /specs landed on the correct edge behaviour only for report-render — the one task whose omitted decisions (insertion-order iteration, by-reference returns) are the natural defaults — and either omitted or explicitly mis-resolved them for pricing (discount priority / exclusive groups), notifier (per-channel priority, exception-as-False), and event-store (optimistic-concurrency trigger, initial_state). Writing acceptance criteria from a vague prompt does not, by itself, force the unstated decisions into view; it produces a confident-looking spec built on the same assumptions a happy-path implementation makes.

What the ship workflow produces. The ship arm runs the dev-team pipeline end to end, self-approving at each gate. /specs first turns the vague spec.md into an explicit specification — an intent description, an architecture spec, numbered GIVEN/WHEN/THEN acceptance criteria, an "explicit decisions on omitted behaviors" table, and a self-checked consistency gate — then /plan decomposes it into an incremental TDD plan, and /build implements it RED-GREEN-REFACTOR. Every trial emitted these artifacts (a 54–145-line ACCEPTANCE.md plus per-stage plan files), and the pipeline ran to completion on all 48 stages with no human input.

Those artifacts show why the EDGE scores came out flat: /specs does drag the omitted decisions onto the page, but it then resolves them to the happy path and certifies the result complete. The event-store spec is the clearest case — it listed the trap decisions explicitly and chose the wrong side of both:

D1 — OptimisticConcurrencyError usage … append does not take expected_version in this version
D5 — project initial state … always starts at None; not configurable

…then stamped "Consistency Gate: PASS — every behavior maps to an acceptance criterion." Both decisions are the exact opposite of what acc_core.py requires (an expected_version conflict must raise OptimisticConcurrencyError; project(…, initial_state=…) must be honoured). So the failure is not an oversight the process forgot to consider — /specs surfaced the decision, reasoned about it, committed to the wrong default, and signed off. A self-authored spec from a vague prompt manufactures false confidence: it reads as thorough and internally consistent while encoding the same assumptions a happy-path implementation would have made silently. That is the core spec-synthesis result — explicit spec synthesis relocates the guess from the code to the spec; it does not eliminate it.
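For concreteness, the two behaviours the hidden acceptance expects can be sketched as below. This is an illustrative reconstruction, not the task's reference solution: the `EventStore` class, the method signatures, and the convention that a stream's version equals its current event count are all assumptions. Only the names `OptimisticConcurrencyError`, `append`, `expected_version`, `project`, and `initial_state` come from the text above.

```python
from typing import Any, Callable, Dict, List


class OptimisticConcurrencyError(Exception):
    """Raised when the caller's expected_version does not match the stream."""


class EventStore:
    def __init__(self) -> None:
        self._streams: Dict[str, List[Any]] = {}

    def append(self, stream_id: str, event: Any, expected_version: int = None) -> int:
        events = self._streams.setdefault(stream_id, [])
        # Assumption: the stream version is the number of events already stored.
        if expected_version is not None and expected_version != len(events):
            raise OptimisticConcurrencyError(
                f"expected version {expected_version}, stream is at {len(events)}"
            )
        events.append(event)
        return len(events)

    def project(self, stream_id: str, reducer: Callable[[Any, Any], Any],
                initial_state: Any = None) -> Any:
        # The fold starts from the caller's initial_state, not a hard-coded None.
        state = initial_state
        for event in self._streams.get(stream_id, []):
            state = reducer(state, event)
        return state


store = EventStore()
store.append("cart-1", {"qty": 2}, expected_version=0)
total = store.project("cart-1", lambda s, e: s + e["qty"], initial_state=0)
print(total)
```

The D1 and D5 decisions quoted above amount to dropping the `expected_version` check and ignoring `initial_state`, which is why an implementation can pass its own suite and still fail the hidden acceptance.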

CORE and changeability. ship CORE under vague was 100% for pricing, notifier, and report-render but 0% for event-store. That task's acc_core.py encodes the optimistic-concurrency and initial_state behaviours as core (not edge) acceptance, so failing to surface them fails CORE outright: the agents' own suites passed while the hidden acceptance failed (the "looks done, isn't" signal). Across the change chain, ship kept CORE green for those three tasks and the change-specific graders passed 33/36; event-store stayed at 0% throughout. The changeability half of the spec-synthesis hypothesis (spec-synthesis2) is not measured in this manual run: blast radius and cost were not captured, so the analyzer's "ship blast radius 0" is an artifact of the absent fields, not a real zero.

Practical implication. The earlier "ship is too slow for the dispatch budget" caveat was an environment artifact, not a property of the method. Run to completion, the /specs→/plan→/build pipeline is fully comparable to the other arms on contract inference — and on this benchmark it does not beat them. Explicit up-front spec synthesis is not a substitute for clarifying a vague spec with the stakeholder.

test-after-refactor: test-after-refactor dominance¶

Dominance hypothesis: test-after-refactor dominates all existing arms simultaneously on EDGE pass rate (≥ test-after), blast radius (within 10% of tdd-refactor), and cost (< tdd-refactor). All three conditions must hold.

Condition 1: EDGE pass rate under vague (test-after-refactor ≥ test-after, −5 pp tolerance)¶

task	test-after-refactor EDGE	test-after EDGE	tdd-refactor EDGE	Condition 1
exp-tdd-pays-event-store	0% (n=3)	100% (n=3)	33% (n=3)	✗ (−100 pp)
exp-tdd-pays-notifier	0% (n=4)	0% (n=3)	0% (n=3)	✓ (0 pp)
exp-tdd-pays-pricing	0% (n=6)	67% (n=3)	0% (n=3)	✗ (−67 pp)
exp-tdd-pays-report-render	0% (n=3)	100% (n=3)	100% (n=3)	✗ (−100 pp)

Pooled: test-after-refactor 0% vs test-after 67% → Condition 1 FAILS

Under the vague spec, the test-after-refactor arm scored 0% EDGE on all four tasks, falling below test-after on three of them (notifier tied at 0%). The refactor phase appears to remove or rewrite the edge-case-covering tests that were written against the working implementation, leaving the final suite less comprehensive than unrefactored test-after.

Condition 2: Blast radius within 10% of tdd-refactor¶

arm	mean Δlines	vs tdd-refactor	Condition 2
test-after-refactor	678	+2.1%	✓
tdd-refactor	664	—	—
test-after	700	+5.4%	—

Condition 2 HOLDS — test-after-refactor blast radius (678 lines) is within 2.1% of tdd-refactor (664 lines), well inside the 10% tolerance.

Clarification — the two refactor arms are equivalent on changeability, and that is consistent with "refactoring is the mechanism," not a contradiction of it. It is tempting to read tdd-refactor (664) < test-after-refactor (678) as "test-first ordering buys extra changeability on top of refactoring." The data does not support that reading. The 14-line gap is +2.1% — inside the 10% tolerance and, at n=3 per cell, inside the noise. The robust, consistent signal in this dataset is refactor vs no-refactor, not TDD vs test-after:

	refactor step	mean Δlines
tdd-refactor	yes	664
test-after-refactor	yes	678
test-after	no	700
tdd-no-refactor	no	701

Both refactor arms (664, 678) sit ~3–5% below both non-refactor arms (700, 701), and adding a refactor pass to test-after moved it 700 → 678, into the same band as tdd-refactor. Test-first ordering does not separate from test-after once both refactor. The headline mechanism-isolation finding (the green→refactor cycle, not test ordering, drives changeability) holds, and the refactor-arm tie is a second confirmation of it.

The residual 14-line gap, if it is real rather than noise, has two candidate explanations, neither established by this experiment:

Refactoring granularity. tdd-refactor refactors in small steps on every green, so cleanup tracks the design as it grows; test-after-refactor does one cleanup pass after the whole component is built. A late lump-sum refactor may have less leverage than many incremental ones.
Safety-net erosion. test-after-refactor's refactor phase rewrites or deletes its own tests (the same effect that collapsed its EDGE coverage to 0% under vague spec, Condition 1). A churned test suite is a weaker safety net when the three follow-up changes land, which could nudge later blast radius up.

Distinguishing "real but small effect" from "noise," and adjudicating between these two mechanisms, requires a dedicated higher-power experiment — see Proposed follow-up: refactoring cadence.

Condition 3: Cost per stage < tdd-refactor¶

arm	mean cost/stage
test-after	$0.19
tdd-no-refactor	$0.22
bduf	$0.24
test-after-refactor	$0.35
tdd-refactor	$0.44

Condition 3 HOLDS — test-after-refactor ($0.35/stage) is 20% cheaper than tdd-refactor ($0.44/stage).

Dominance hypothesis: Overall Verdict¶

Dominance hypothesis NOT SUPPORTED. Condition 1 fails decisively: under a vague spec, test-after-refactor produces a 0% pooled EDGE pass rate (0% on all four tasks, below test-after on three of them), worse than both test-after (67%) and tdd-refactor (33%). The refactoring phase degrades edge-case coverage when the spec is ambiguous: the agent refactors away the tests that document its own decisions.

Conditions 2 and 3 both hold: the blast radius is equivalent to tdd-refactor (+2.1%) and the cost is 20% lower. But the EDGE failure dominates.
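The three-way check can be written out as a small sketch using the pooled figures reported above. The `ArmSummary` type and `dominance_conditions` function are hypothetical names introduced for illustration; the thresholds (−5 pp EDGE tolerance, 10% blast-radius tolerance, strictly lower cost) are the ones stated in the condition headings and the falsification criterion.

```python
from dataclasses import dataclass


@dataclass
class ArmSummary:
    edge_vague_pct: float   # pooled EDGE pass rate under vague spec, in percent
    blast_radius: float     # mean changed lines over the change chain
    cost_per_stage: float   # USD per stage


def dominance_conditions(candidate, test_after, tdd_refactor,
                         edge_tolerance_pp=5.0, blast_tolerance=0.10):
    """Evaluate each condition separately; dominance needs all three."""
    return {
        "edge": candidate.edge_vague_pct >= test_after.edge_vague_pct - edge_tolerance_pp,
        "blast_radius": candidate.blast_radius < tdd_refactor.blast_radius * (1 + blast_tolerance),
        "cost": candidate.cost_per_stage < tdd_refactor.cost_per_stage,
    }


test_after_refactor = ArmSummary(edge_vague_pct=0, blast_radius=678, cost_per_stage=0.35)
test_after = ArmSummary(edge_vague_pct=67, blast_radius=700, cost_per_stage=0.19)
tdd_refactor = ArmSummary(edge_vague_pct=33, blast_radius=664, cost_per_stage=0.44)

result = dominance_conditions(test_after_refactor, test_after, tdd_refactor)
print(result, "dominates:", all(result.values()))
```

Keeping the three conditions as separate booleans, instead of one combined score, mirrors how the verdict is reported: two conditions hold, one fails, and the conjunction fails.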

The mechanism: Under vague spec, test-after-refactor suffers from an ordering problem that test-after avoids. In test-after, tests written against the working implementation capture the agent's edge-case choices and stay in place. In test-after-refactor, those tests are then exposed to a refactor phase that rewrites the implementation — the agent often also rewrites or removes tests it considers redundant, stripping out the documented edge-case decisions. The result is structurally clean code with no EDGE coverage.

Revised decision guide (incorporating second-run results):

Situation	Recommended workflow
Vague requirements	Stop and clarify first. No workflow recovers omitted decisions — including agentic spec synthesis: `ship`'s `/specs` scored 25% EDGE under vague, no better than tdd-refactor (33%).
Clear requirements, long-lived codebase	tdd-refactor (lowest blast radius, best structural benefit)
Clear requirements, one-shot or cost-sensitive	test-after (same EDGE quality as tdd-refactor under clear spec, 2.3× cheaper)
Want refactoring benefits without TDD overhead	test-after-refactor (clear spec only — vague spec destroys edge-case coverage during refactor)
Speed-first, throwaway code	tdd-no-refactor or test-after (same changeability, lowest cost)

Proposed follow-up: refactoring cadence and the test safety net¶

Status: proposed, not yet run. Pre-registration drafted; no data collected.

The second run left one issue open (see the Condition 2 clarification). tdd-refactor (664) and test-after-refactor (678) are statistically indistinguishable on changeability at n=3: the 14-line gap is inside the noise, but cells of that size cannot establish equivalence either. That leaves two questions:

Is the gap real at all, or does it vanish under adequate statistical power?
If real, which mechanism produces it — refactoring granularity (many small in-loop refactors vs one post-hoc pass) or safety-net erosion (the one-shot refactor churning its own test suite)?

Hypotheses (to pre-register before any data)¶

equivalence hypothesis: continuous-refactor and one-shot-refactor workflows are equivalent on cumulative blast radius within a ±5% margin (TOST). This is the default the second-run data points to; the experiment is powered to reject it if a real effect exists.
granularity hypothesis: holding the test-protection factor constant, continuous refactoring yields lower blast radius than one-shot refactoring. Predicts the gap survives even when tests are protected from churn.
safety-net hypothesis: freezing the test suite during the refactor raises EDGE pass rate (tests that document edge decisions survive) and lowers subsequent blast radius relative to a free refactor. Predicts test-suite churn mediates the blast-radius difference.

The granularity and safety-net hypotheses are not mutually exclusive; the design separates their contributions.

Design — a 2×2, adequately powered¶

Cross refactor granularity × test protection during refactor, with tdd-refactor as an external reference:

	tests free to change in refactor	tests frozen during refactor
one-shot refactor (single pass after build)	test-after-refactor (current arm)	test-after-refactor-frozen
continuous refactor (refactor each increment)	test-after-continuous	test-after-continuous-frozen

Plus tdd-refactor (continuous, test-first) as the reference point from the prior runs. Reuse the same 4 tasks, vague and clear spec, 3-change chain.

Power. The n=3 cells could not resolve a 2% effect. Estimate the per-cell blast-radius SD from the existing run1+run2 data, then size n for 80% power to detect a 5% difference (and to make the ±5% TOST equivalence test meaningful). Expect this to require roughly n = 12–15 per cell rather than 3; pre-register the exact n from the power calc.
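As a rough illustration of the sizing step, the sketch below uses the standard normal-approximation formula for a two-sided, two-sample comparison of means. It is an approximation only: the pre-registered calculation should use the SD measured from the run1+run2 data, and sizing for the TOST itself may differ. The SD values in the loop are hypothetical placeholders, not estimates from this experiment, and `n_per_cell` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell(sd, delta, alpha=0.05, power=0.80):
    """Trials per cell to detect a mean difference `delta` given per-cell SD `sd`.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


baseline = 664            # tdd-refactor mean blast radius from the tables above
delta = 0.05 * baseline   # the 5% difference the follow-up aims to detect

# Hypothetical SDs, for illustration only; replace with the measured value.
for sd in (20, 30, 40):
    print(f"sd={sd} lines -> n per cell = {n_per_cell(sd, delta)}")
```

Because n scales with the square of sd/delta, the required sample size is very sensitive to the SD estimate, which is the reason for estimating it from existing data before fixing n.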

Instrumentation (new, per stage). The current harness records blast radius and pass-rates; add:

refactor granularity (actual): count of distinct refactor edits between first-green and stage-complete — verifies the assigned arm behaved as intended.
test-suite churn during refactor: test LOC added + deleted between first-green and post-refactor. This is the mediator variable for safety-net hypothesis.
carry forward CORE/EDGE pass rates and cost/stage.
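One possible way to compute the test-suite churn metric is a line-level diff between the test file at first-green and after the refactor. The sketch below uses the standard library's `difflib`; the function name and the choice to count a modified line as one deletion plus one addition are assumptions, not part of the existing harness.

```python
import difflib


def suite_churn(before: str, after: str) -> dict:
    """Count test lines added and deleted between two snapshots of a test file."""
    old_lines = before.splitlines()
    new_lines = after.splitlines()
    added = deleted = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            deleted += i2 - i1
        if tag in ("replace", "insert"):
            added += j2 - j1
    return {"added": added, "deleted": deleted, "churn": added + deleted}


first_green = "def test_a(): ...\ndef test_edge(): ...\ndef test_b(): ..."
post_refactor = "def test_a(): ...\ndef test_b(): ...\ndef test_c(): ..."
print(suite_churn(first_green, post_refactor))
```

Summing this over every test file in a stage would give the per-stage mediator value; in the frozen-test arms it should be zero by construction, which doubles as a manipulation check.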

Analysis plan (pre-registered)¶

Headline: TOST equivalence test on cumulative blast radius, continuous vs one-shot, ±5% margin → resolves question 1 (real vs noise) directly, including a "confirmed equivalent" outcome as a valid result.
granularity hypothesis: two-factor model on blast radius; granularity main effect with test-protection held constant.
safety-net hypothesis: mediation — does test-suite churn account for the granularity/protection effect on blast radius? Plus the direct EDGE comparison frozen vs free (this also re-tests the test-after-refactor Condition 1 finding that free refactor destroys edge coverage under vague spec).
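The headline TOST can be sketched with the standard library alone, as two one-sided tests against the lower and upper equivalence bounds. This version uses a large-sample z approximation, which is optimistic at n of 12–15; a real analysis would more likely use a t-based implementation from a statistics package. The function name and the sample values are hypothetical, and the inputs are synthetic numbers for illustration only.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_equivalence(a, b, margin, alpha=0.05):
    """Two one-sided tests: is mean(a) - mean(b) inside (-margin, +margin)?

    Returns (p_value, equivalent). The p-value is the larger of the two
    one-sided p-values, so both bounds must be rejected to claim equivalence.
    """
    diff = mean(a) - mean(b)
    se = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    if se == 0:
        raise ValueError("zero variance in both samples; test is undefined")
    z = NormalDist()
    p_lower = 1 - z.cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = z.cdf((diff - margin) / se)      # H0: diff >= +margin
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Synthetic blast-radius samples, for illustration only.
continuous = [660, 668, 655, 671, 663, 659]
one_shot = [672, 665, 680, 669, 676, 674]
margin = 0.05 * mean(continuous)  # the pre-registered 5% margin, in lines
print(tost_equivalence(continuous, one_shot, margin))
```

Unlike a plain difference test, a non-significant result here means "equivalence not shown" and not "no effect", which is what makes a confirmed-equivalent outcome a valid finding.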

What each outcome would mean¶

Result	Interpretation
TOST confirms equivalence	The refactor-arm tie is real; "refactoring is the mechanism" is the whole story, and how you refactor (granularity, ordering) does not move changeability. Strongest, simplest takeaway.
Granularity effect, no churn mediation	Continuous refactoring is independently better; recommend refactoring in small steps regardless of test ordering.
Churn mediation (safety-net hypothesis)	The cost of one-shot refactor is collateral test-suite damage; recommend protecting/regenerating the test suite across a refactor, and the test-after-refactor EDGE collapse and the changeability residual share one root cause.

A null here is a publishable result: confirming that the two refactor workflows are genuinely interchangeable on changeability would let teams choose between them on the axes that did separate cleanly — cost and edge-case robustness under vague specs.