Consolidated Report: Build-Pipeline vs. TDD vs. Non-TDD across Small / Medium / Large Tasks
Status: Complete — one campaign, three sizes, three arms, model fixed.
Date: 2026-06-22
Model (fixed): claude-sonnet-4-6
Design + prior results: 01-experiment-prompt-3sizes-3arms.md,
tdd-vs-test-after-experiment.md,
tdd-vs-test-after-consolidated-report.md (prior campaign; not migrated into this docs set)
Runner: scripts/run_tdd_experiment.py
Analyzer: scripts/analyze_tdd_experiment.py
Raw data: data/3sizes-small-sonnet-2026-06-22.jsonl (small),
data/tdd-largetask-sonnet-2026-06-21.json (medium, folded in),
data/3sizes-large-sonnet-2026-06-22.jsonl (large),
data/3sizes-3arms-summary.json (cost/quality summary),
data/3sizes-review-findings.json (review-agent findings, large tier)
Executive summary: read this first
Three workflows — build-pipeline (the real dev-team /plan→/build),
test-first (strict RED-GREEN-REFACTOR), test-after (all code first, tests
last) — were compared on 18 tasks across three sizes (6 small katas, 6 medium
single-module features, 6 large multi-file packages), every cell isolated and
graded by hidden acceptance tests, model held fixed at claude-sonnet-4-6.
192 cells, 0 dispatch errors, 0 timeouts.
| Size | Correctness (all arms) | Cheapest arm | build-pipeline premium (× cheapest) | Quality separates? |
|---|---|---|---|---|
| small | 100% build | test-after ($0.198) | 4.74× ($0.94) | No (cov 100%, mut ≈1.0) |
| medium | 100% build | test-after ($0.215) | 2.57× ($0.55) | No (cov 100%, mut 1.0) |
| large | 100% build | test-after ($0.613) | 1.33× ($0.82) | Sensors finally cracked — but arms still tied |
Four findings, in order of strength:
- The build-pipeline's cost premium collapses as tasks get bigger: 4.74× → 2.57× → 1.33×. Its fixed planning/review overhead is a huge multiplier on a one-function kata but a minor surcharge on a real multi-file feature. On the large tier the pipeline costs only 33% more than the cheapest hand-driven arm (paired median Δ=$0.18, sign p=0.031), at identical correctness and identical test quality. This is the first run where the pipeline's price looks like a reasonable tax rather than a 3–5× luxury.
- test-first vs test-after never separates on quality and barely separates on cost, and the cost gap is not monotonic. test-after had the lower median total cost in all three sizes, but the difference is significant only at medium (Δ=$0.110, 6/6 tasks, p=0.031); at small (Δ=$0.012, p=0.22) and large (Δ=$0.0075, 3/3 split, p=1.0) the two are effectively tied. The strict-TDD cost penalty the prior sonnet run reported on single-module tasks did not generalize to either smaller or larger work.
- The large tier did what it was added to do: it broke the quality-sensor saturation. Even so, the coverage/mutation axis found no workflow that makes better-tested code. Small and medium katas pin every arm at 100% coverage / ≈1.0 mutation (no signal). On the large multi-file packages coverage fell to 96.6–98.2% and mutation to 0.911–0.924, finally a measurable spread, yet the three arms land on top of each other (≤1.6 pp coverage, ≤0.013 mutation apart). On that axis, the choice of workflow does not move the result.
- A third quality lens, running the dev-team review agents over each arm's produced code, does separate them, and it favors the pipeline. Coverage and mutation are blind to structure/complexity/duplication; the review agents are not. Re-generating one solution per arm for the 6 large tasks and running a 6-agent panel (structure, complexity, naming, performance, security, test) over each, the build-pipeline produced the fewest review-grade findings (weighted 67, median 10.5/task), cleaner than test-first (108, 19.0) and test-after (91, 15.0), and cleaner on 5 of 6 tasks than either hand-driven arm (vs test-first sign p=0.06; vs test-after p=0.22). Its inline /build review step is the plausible cause. This is directional, not conclusive (n=6; the review agents show real run-to-run variance, see the section below), but it is the first quality signal in this whole line of experiments that points anywhere, and it points at the pipeline.
Bottom line. At a fixed strong model, the workflow you choose changes cost
and review-grade structure, but not correctness or coverage/mutation test
quality. test-after is the cheapest everywhere (decisively only on mid-sized
work); strict test-first is never cheaper and drew the most review findings; the
/plan→/build pipeline is the most expensive but its premium shrinks toward
parity as task complexity grows (4.7×→2.6×→1.3×) and its code is the cleanest
on review (5/6 large tasks). The large tier is where the pipeline's economics
finally make sense — a ~1.3× premium that now buys a directional but consistent
structural-quality edge — even though, with N=6, the cost differences there are
no longer statistically distinguishable from the instruction arms.
What this run adds over the prior consolidated report
The prior report compared the arms on small katas at haiku and "larger" single-module tasks at sonnet, and left one question open: does anything separate the workflows on bigger work, given the katas saturate every sensor? This run:
- Holds the model fixed at claude-sonnet-4-6 across all three sizes so size is the only moving axis. The prior small campaign was haiku, so small is re-run here at sonnet; the prior sonnet "larger" campaign is the medium tier and is folded in unchanged (same model, tasks, harness, grading).
- Adds a genuinely large tier: six multi-file packages (≥2 source modules + a public API) with withheld behaviour-modifying changes touching ≥2 files.
- Reports a clean 3×3 (size × arm) grid with paired across-task statistics.
Arms & tiers
- build-pipeline: real dev-team /plan→/build (self-approves its plan headlessly so the human gate cannot stall it).
- test-first (TDD): strict RED-GREEN-REFACTOR.
- test-after (non-TDD): all production code first, tests authored at the end.
| Size | Tasks | Shape |
|---|---|---|
| small | word-tally, roman, fizzbuzz, rpn, rle, caesar | single-function katas |
| medium | stats, intervals, timeparse, money, matrix, csvlite | one module, multi-function |
| large | spreadsheet, json-pointer, task-scheduler, template-engine, ledger, url-router | 2–4 source modules + public API |
The large tier was authored for this run. Each is a stub package with a
≥8-scenario spec, a withheld ≥2-file change, and hidden acceptance tests
(acc.py / acc_change.py) injected only at grading. Every acceptance file was
validated against a reference solution before running (reference passes; the
shipped golden repo is stubs only, so the build cannot cheat off the graded
tests). Fixtures live in evals/fixtures/exp-tdd-<task>/ with manifests in
evals/experiments/.
Pre-registration (fixed before looking at results)
- Model: claude-sonnet-4-6, fixed for the whole run.
- Trials: N=3 per (task × arm) for the instruction arms; N=2 for build-pipeline (≈2–3× the cost/time).
- Unit of inference: the task. Per (task, arm) take the median across trials of the two-stage total cost (build + change); form paired arm differences across the 6 tasks within a size; test with an exact sign test and exact Wilcoxon signed-rank test.
- Stopping rule: run the pre-registered N; no data-dependent stopping.
- Quality read from the build stage only (self_coverage.percent, mutation.score); any value uniform across all arms is a saturated sensor, not a finding.
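The unit-of-inference rule above (median across trials per task and arm, then paired differences across tasks) can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the record layout, the helper names, and every cost value are hypothetical and do not come from the data files.

```python
from statistics import median

# Hypothetical trial records: (task, arm, two-stage total cost in USD).
# The numbers are invented for illustration only.
trials = [
    ("roman", "test-first", 0.21), ("roman", "test-first", 0.25), ("roman", "test-first", 0.22),
    ("roman", "test-after", 0.19), ("roman", "test-after", 0.20), ("roman", "test-after", 0.24),
    ("rle", "test-first", 0.30), ("rle", "test-first", 0.26), ("rle", "test-first", 0.28),
    ("rle", "test-after", 0.31), ("rle", "test-after", 0.29), ("rle", "test-after", 0.33),
]


def per_task_medians(records):
    """Collapse repeated trials to one median cost per (task, arm)."""
    grouped = {}
    for task, arm, cost in records:
        grouped.setdefault((task, arm), []).append(cost)
    return {key: median(costs) for key, costs in grouped.items()}


def paired_differences(medians, first_arm, second_arm):
    """Per-task difference first_arm - second_arm; positive means first_arm costs more."""
    tasks = sorted({task for task, _ in medians})
    return {t: medians[(t, first_arm)] - medians[(t, second_arm)] for t in tasks}


medians = per_task_medians(trials)
diffs = paired_differences(medians, "test-first", "test-after")
for task, delta in diffs.items():
    print(f"{task}: {delta:+.3f}")
```

Taking the median before pairing keeps one noisy trial from dominating a task, and pairing by task removes between-task cost variation from the comparison.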
Results: the 3×3 grid
Small (6 tasks)
| Arm | Correct build | Correct change | Median total cost | Cov% | Mutation | Median turns |
|---|---|---|---|---|---|---|
| build-pipeline | 12/12 | 12/12 | $0.940 | 100.0 | 0.988 | 15.5 |
| test-first | 18/18 | 17/18 | $0.217 | 100.0 | 1.0 | 8.0 |
| test-after | 18/18 | 17/18 | $0.198 | 100.0 | 1.0 | 8.0 |
Medium (6 tasks)
| Arm | Correct build | Correct change | Median total cost | Cov% | Mutation | Median turns |
|---|---|---|---|---|---|---|
| build-pipeline | 6/6 | 6/6 | $0.552 | 100.0 | 1.0 | 14.5 |
| test-first | 6/6 | 6/6 | $0.334 | 100.0 | 1.0 | 17.0 |
| test-after | 6/6 | 6/6 | $0.215 | 100.0 | 1.0 | 8.0 |
Large (6 tasks)
| Arm | Correct build | Correct change | Median total cost | Cov% | Mutation | Median turns |
|---|---|---|---|---|---|---|
| build-pipeline | 12/12 | 11/12 | $0.818 | 96.6 | 0.924 | 16.25 |
| test-first | 18/18 | 15/18 | $0.665 | 98.2 | 0.923 | 14.0 |
| test-after | 18/18 | 17/18 | $0.613 | 97.8 | 0.911 | 12.0 |
Paired statistics (across the 6 tasks in each size; +Δ ⇒ first arm costs more)
| Size | Pair | Median Δ | Direction | Sign p | Wilcoxon p |
|---|---|---|---|---|---|
| small | BP vs test-first | +$0.716 | BP higher 6/6 | 0.031 | 0.031 |
| small | BP vs test-after | +$0.715 | BP higher 6/6 | 0.031 | 0.031 |
| small | test-first vs test-after | +$0.012 | TF higher 5/6 | 0.219 | 0.438 |
| medium | BP vs test-first | +$0.198 | BP higher 5/6 | 0.219 | 0.094 |
| medium | BP vs test-after | +$0.336 | BP higher 6/6 | 0.031 | 0.031 |
| medium | test-first vs test-after | +$0.110 | TF higher 6/6 | 0.031 | 0.031 |
| large | BP vs test-first | +$0.179 | BP higher 5/6 | 0.219 | 0.094 |
| large | BP vs test-after | +$0.176 | BP higher 6/6 | 0.031 | 0.031 |
| large | test-first vs test-after | +$0.008 | 3/3 split | 1.000 | 0.688 |
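The p-value columns in this table can be checked by hand. The sketch below computes an exact two-sided sign test from win/loss counts and an exact Wilcoxon signed-rank test by enumerating all sign patterns. It assumes no ties and no zero differences, and the function names and example differences are hypothetical; the analyzer script may compute these differently.

```python
from itertools import product
from math import comb


def sign_test_p(wins, losses):
    """Exact two-sided sign test; ties must be dropped before calling."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_exact_p(diffs):
    """Exact two-sided Wilcoxon signed-rank p-value (assumes no ties, no zeros)."""
    n = len(diffs)
    # Rank the differences by absolute size, smallest = rank 1.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    rank = {i: r for r, i in enumerate(order, start=1)}
    w_plus = sum(rank[i] for i, d in enumerate(diffs) if d > 0)
    w_minus = sum(rank[i] for i, d in enumerate(diffs) if d < 0)
    observed = min(w_plus, w_minus)
    # Under the null every sign pattern is equally likely; count the patterns
    # whose rank sum is at least as extreme as the observed one.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if w <= observed:
            hits += 1
    return min(1.0, 2 * hits / 2 ** n)


print(round(sign_test_p(6, 0), 3))  # 0.031, the "6/6" rows
print(round(sign_test_p(5, 1), 3))  # 0.219, the "5/6" rows
print(round(sign_test_p(3, 3), 3))  # 1.0, the "3/3 split" row
# Hypothetical differences: five positive, one small negative (rank 2).
print(round(wilcoxon_exact_p([0.5, -0.1, 0.3, 0.05, 0.2, 0.4]), 3))  # 0.094
```

With six pairs, a unanimous direction gives 2/64 ≈ 0.031 under either test, which is the smallest p-value this design can produce.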
The build-pipeline cost-multiplier vs the cheapest arm falls monotonically with size: 4.74× (small) → 2.57× (medium) → 1.33× (large). The test-first/test-after ratio is 1.09 → 1.55 → 1.08 — the strict-TDD penalty peaks at medium and is negligible at both ends.
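As a quick arithmetic check, the multipliers can be recomputed from the median costs in the three results tables. The table values are rounded to three decimals, so the recomputed ratios may differ from the reported ones in the last digit. The dictionary layout and function name are illustrative assumptions.

```python
# Median two-stage total cost (USD) per arm, copied from the results tables.
median_cost = {
    "small": {"build-pipeline": 0.940, "test-first": 0.217, "test-after": 0.198},
    "medium": {"build-pipeline": 0.552, "test-first": 0.334, "test-after": 0.215},
    "large": {"build-pipeline": 0.818, "test-first": 0.665, "test-after": 0.613},
}


def premium(costs, arm="build-pipeline"):
    """Ratio of one arm's median cost to the cheapest arm's median cost."""
    return costs[arm] / min(costs.values())


for size, costs in median_cost.items():
    pipeline_ratio = premium(costs)
    tdd_ratio = costs["test-first"] / costs["test-after"]
    print(f"{size:6s} pipeline x{pipeline_ratio:.2f}  test-first/test-after x{tdd_ratio:.2f}")
```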
What separates (and what doesn't)
- Correctness does not separate the arms. Build-stage correctness is 100% for every arm at every size; per the small-tier table, the only small misses are change-stage cells (test-first and test-after at 17/18 each), not builds. The withheld changes are genuinely hard on the large tier: change-stage pass rates dip (test-first 15/18, build-pipeline 11/12, test-after 17/18). With only 1–3 failures per arm this is noise, not a workflow effect, and notably it is test-first, not test-after, that logged the most large-tier change failures.
- Coverage/mutation test quality does not separate the arms. Where those sensors have any resolution (the large tier) the three arms sit within 1.6 pp of coverage and 0.013 of mutation score. There is no "test-first writes stronger tests" signal in the coverage/mutation data.
- Review-grade quality does separate them (see the dedicated section). The dev-team review agents find the build-pipeline's code cleanest (fewest weighted findings on 5/6 large tasks vs both hand-driven arms) — a signal coverage and mutation are structurally blind to. Directional at n=6, but consistent.
- Cost separates the arms, and only cost. build-pipeline is the most expensive in all three sizes; test-after is the cheapest in all three. The interesting structure is in the magnitudes: the pipeline premium amortizes with task size, and the TDD penalty is real only in the middle.
- Turns track cost. test-after runs the fewest turns at every size (8–12, tied with test-first at 8 on small); test-first runs more on medium (17) where its cost penalty is largest; the pipeline runs 14.5–16.25 turns of planning and review overhead regardless of size, which is why it dominates cost on tiny tasks and amortizes on big ones.
Third quality lens: review-grade defect density (large tier)
Coverage and mutation measure whether the tests pin the behavior; they are
blind to how the production code is structured. To probe that, one build-stage
solution per arm was regenerated for the 6 large tasks (kept on disk; the
experiment deletes its worktrees) using the harness's exact arm prompts, and a
fixed 6-agent review panel — structure-review, complexity-review,
naming-review, performance-review, security-review, test-review — was run
over each. Findings are weighted critical=4, high=3, medium=2, low=1.
| Arm | Critical | High | Medium | Low | Findings | Weighted | Median/task |
|---|---|---|---|---|---|---|---|
| build-pipeline | 1 | 9 | 12 | 12 | 34 | 67 | 10.5 |
| test-after | 4 | 10 | 14 | 17 | 45 | 91 | 15.0 |
| test-first | 3 | 14 | 18 | 18 | 53 | 108 | 19.0 |
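The Findings and Weighted columns follow directly from the severity counts and the stated weights (critical=4, high=3, medium=2, low=1). A minimal recomputation, with the counts copied from the table above and the variable names chosen for illustration:

```python
# Severity weights as stated in the report.
WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1}

# Severity counts per arm, copied from the table above.
counts = {
    "build-pipeline": {"critical": 1, "high": 9, "medium": 12, "low": 12},
    "test-after": {"critical": 4, "high": 10, "medium": 14, "low": 17},
    "test-first": {"critical": 3, "high": 14, "medium": 18, "low": 18},
}


def weighted_score(severity_counts):
    """Sum of findings, each multiplied by its severity weight."""
    return sum(WEIGHTS[sev] * n for sev, n in severity_counts.items())


for arm, c in counts.items():
    print(arm, sum(c.values()), weighted_score(c))
```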
Paired across the 6 tasks (weighted findings; +Δ ⇒ first arm has more/worse findings):
| Pair | Median Δ | Cleaner arm wins | Sign p |
|---|---|---|---|
| build-pipeline vs test-first | −6.5 | build-pipeline in 5/6 | 0.0625 |
| build-pipeline vs test-after | −5.0 | build-pipeline in 5/6 | 0.219 |
| test-first vs test-after | +3.5 | test-after in 4/6 | 0.375 |
Reading it: the build-pipeline's code carried the lowest review-grade defect
density on 5 of 6 tasks against both hand-driven arms — the inline review step
inside /build is the obvious mechanism. test-after edged test-first (test-first
drew the most findings overall, largely from complexity-review — strict
RED-GREEN-REFACTOR left more long/deeply-nested functions on the
spreadsheet/template/router parsers). This is the only quality axis in this
whole experiment line that points anywhere.
Trust it cautiously. The review agents show real run-to-run variance on
near-identical code: naming-review scored the three arms 0 / 19 / 4 (weighted)
when the same magic-string token types exist in all three, and test-review
scored test-first's tests far harder (44) than the others (18 / 19), partly for
one over-specified assertion. Per-agent noise is large; the aggregate direction
(build-pipeline lowest on structure, complexity, performance, and test agents) is
what carries the signal, and at n=6 the sign tests are p≈0.06–0.4 — directional,
not conclusive. All 18 solutions were correct (they pass the hidden acceptance),
so these are style/structure findings, not bugs.
Why the pipeline premium shrinks (mechanism)
The /plan→/build pipeline pays a roughly fixed overhead per task: a
planning pass, batched inline review, and a structured build loop (≈15–16 turns
across all sizes). On a one-function kata that fixed cost is 4–5× the entire
hand-written solution; on a multi-file package with an 8+ scenario spec it is a third again
on top of work that was already substantial. Same surcharge, very different
denominator. This is the regime the pipeline was designed for, and the large tier
is the first place its economics are defensible — a ~1.3× premium that, per the
review lens above, also buys directionally cleaner structure (it buys no
correctness or coverage/mutation edge, but the review agents do favor it).
Limitations
- n = 6 tasks per size. The exact paired tests bottom out at p≈0.03 with 6 pairs, so single-size results are directional; the cross-size pattern (the monotone pipeline premium) is the durable signal, not any one p-value.
- Single model. Prior work showed the cost winner flips with model; this run fixes sonnet and says nothing about haiku/opus.
- build-pipeline at N=2 has less within-task stability than the instruction arms at N=3; its medians are noisier (e.g. template-engine's $2.22 outlier).
- Medium tier folded in from the 2026-06-21 sonnet run (N=1 trial per cell) rather than re-run: same model/tasks/harness/grading, but a lower trial count than small/large. Noted, not hidden.
- Quality is build-stage only. The change-stage coverage/mutation sensors had a uniform-across-arms artifact in the prior run (its Limitation 4); they are excluded here.
- The review lens is a separate, smaller study. It regenerates one solution per (arm × large task) — N=1, not the 2–3 trials of the cost data — and grades it with review agents that have measurable run-to-run variance. Its 5/6 build-pipeline result is the most interesting finding here and the least statistically settled; treat it as a hypothesis worth a powered follow-up (more trials per cell, multiple review passes averaged), not a conclusion. Findings are also self-graded by the same model family that wrote the code.
- Mutation is the built-in AST sampler (cap 40), not a full tool. It now runs under a 30 s per-test timeout (added this run — see below); an infinite-loop mutant counts as killed.
Methodology notes & a harness fix landed this run
Paired, multi-arm, repeated-trial, two-stage. Each cell (task × arm × trial ×
stage) ran in its own ephemeral git worktree and its own scratch $HOME,
dispatched as a fresh claude -p --output-format json (verified
cost/tokens/turns; no session resume). Stage 2 applied the withheld change seeded
from the Stage-1 files only. Acceptance tests were hidden during the build and
injected only at grading. Cost is read from the native JSON result. The
build-pipeline arm got a plugin-enabled $HOME per cell.
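The per-cell isolation described above could look roughly like the sketch below. It only builds the environment and the command lines and does not run anything. The helper names are hypothetical; the runner's real implementation is not shown in this report, and only the `claude -p --output-format json` invocation is taken from the text.

```python
import os
import tempfile
from pathlib import Path


def make_cell_env(base_dir):
    """Create a scratch HOME for one cell and return (home_path, env)."""
    home = Path(tempfile.mkdtemp(prefix="cell-home-", dir=base_dir))
    env = dict(os.environ)
    env["HOME"] = str(home)  # the agent sees only this throwaway home directory
    return home, env


def worktree_command(repo, worktree_path, ref="HEAD"):
    """git command that adds a detached, disposable worktree for one cell."""
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(worktree_path), ref]


def dispatch_command(prompt):
    """Fresh headless dispatch with a JSON result (no session resume)."""
    return ["claude", "-p", prompt, "--output-format", "json"]


with tempfile.TemporaryDirectory() as scratch:
    home, env = make_cell_env(scratch)
    print(env["HOME"] == str(home))
    print(worktree_command("/path/to/repo", Path(scratch) / "wt"))
    print(dispatch_command("<arm prompt>"))
```

Giving each cell its own worktree and HOME means no cell can read files, caches, or settings left behind by another, which is what makes the paired comparisons fair.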
Review lens (separate pass): because the experiment deletes each cell's
worktree, the produced code is not retained. For the review-grade lens, one
build-stage solution per (arm × large task) was regenerated and kept with the
harness's exact arm prompts (/tmp/gen_solutions.py), then graded by the 6-agent
review panel; results are in
data/3sizes-review-findings.json.
Fix landed: the mutation/coverage test runs had no wall-clock cap, so a
mutation that turned a roman-numeral subtractive loop infinite hung pytest
forever and stalled the whole cell. The runner now wraps every agent/mutant test
invocation in a timeout (PYTEST_TIMEOUT, 30 s), treating a timed-out run as a
failed run. The two affected cells were re-run cleanly under the fix; the final
data set has 0 timeouts and 0 errors across all 192 cells.
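A wall-clock cap of this kind can be sketched with the standard library's `subprocess.run` and its `timeout` argument. The constant name and the 30 s value come from the report; the function name and the surrounding structure are assumptions, not the runner's actual code.

```python
import subprocess
import sys

PYTEST_TIMEOUT = 30  # seconds; name and value as given in the report


def run_tests_with_timeout(cmd, cwd=None, timeout=PYTEST_TIMEOUT):
    """Return True only if the command exits 0 within the time limit.

    A timed-out run is treated as a failed run, so a mutant that turns a
    loop infinite is counted as killed instead of hanging the cell.
    """
    try:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Stand-ins for a hanging and a passing test run.
hung = run_tests_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout=0.5)
ok = run_tests_with_timeout([sys.executable, "-c", "pass"])
print(hung, ok)
```

In a real run the command would be the pytest invocation for the cell, for example `[sys.executable, "-m", "pytest", "-q"]` executed inside the cell's worktree.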
Reproducibility
pip install coverage pytest
TPL=/tmp/build-home-tpl; mkdir -p "$TPL/.claude"
cp ~/.claude/settings.json "$TPL/.claude/"; cp -r ~/.claude/plugins "$TPL/.claude/"
# per size, per task: instruction arms @3 trials + build-pipeline @2 trials
python3 scripts/run_tdd_experiment.py --arm test-first --arm test-after \
    --only "<task>" --trials 3 --model claude-sonnet-4-6 --out small.jsonl
python3 scripts/run_tdd_experiment.py --arm build-pipeline \
    --only "<task>" --trials 2 --model claude-sonnet-4-6 \
    --build-home-template /tmp/build-home-tpl --out small_build.jsonl
# analyze all three tiers together and write the cost/quality summary
python3 scripts/analyze_tdd_experiment.py \
    --data data/3sizes-small-sonnet-2026-06-22.jsonl \
    --data data/tdd-largetask-sonnet-2026-06-21.json \
    --data data/3sizes-large-sonnet-2026-06-22.jsonl \
    --json data/3sizes-3arms-summary.json
Recommendation
- For cost and correctness, the workflow is a cost lever, not a quality lever. Across 18 tasks and a fixed strong model, no workflow produced more-correct or better-covered/mutation-tested code; on those axes they differed only in price.
- But review-grade quality is the exception, and it favors the pipeline. On the large tier the /plan→/build pipeline's code carried the lowest review-grade defect density (cleaner on 5/6 tasks vs both hand-driven arms). Combined with its premium shrinking to ~1.3× on large work, this strengthens the case for the pipeline on large, multi-file features: you pay a third more and get measurably cleaner structure (directionally; n=6) at equal correctness. On small/medium katas the pipeline is still a 2.6–4.7× tax for code the review agents barely fault either way, which is hard to justify there.
- Default to writing the code and testing it for small/medium work. test-after is the cheapest arm everywhere and its code is no worse than test-first's on review (slightly better, in fact). Strict test-first is never cheaper and drew the most review findings here; use it for its design/iteration discipline, not an expectation of cheaper, better-tested, or cleaner output at this model strength.
- Next: the most interesting and least-settled result is the review-grade edge for the pipeline. A powered follow-up — multiple trials per (arm × task), several review passes averaged to beat reviewer variance, and ideally a human-graded subset — would confirm or kill it. The coverage/mutation axis, by contrast, needs harder tasks (where coverage drops further) or a longer change-chain to separate the arms at all; that, not more katas, is where any test-quality difference would surface.