Experiment Prompt: Which agentic workflow yields maintainable, well-tested code at minimum cost¶
Type: Reusable experiment prompt (hand this whole file to Claude to execute)
Harness: scripts/run_refactor_experiment.py
Lineage: the single, consolidated experiment for the refactor-cadence line. It supersedes
the earlier granularity / larger-corpus designs (former prompts 06 and the cadence-larger
prompt), fixes the variables they left open (refactoring on, specs clear), and re-scopes
everything to the one question below. Prior raw data lives in
data/refactor-granularity-merged.jsonl.
Status: specified, harness ready, NOT run at scale. New arms validated under
--skip-dispatch and a 1-task cost pilot.
The question¶
To achieve code that is maintainable, well structured, and tested with good tests at minimum cost, which agentic workflow works best?
Everything in the experiment exists to answer that and nothing else.
Two crossed factors (the variable matrix)¶
| factor | levels |
|---|---|
| A. Test/code ordering | test-first / test-after |
| B. Batch size | small (incremental, per-behavior) / big (all-at-once) |
| C. Authorship | 1 agent / 2 context-isolated agents |
The four named workflows are exactly the A×B cells:
| workflow | ordering × batch |
|---|---|
| W1 — Classic TDD (Kent Beck) | test-first × small |
| W2 — Code first, tests second, small batches | test-after × small |
| W3 — All code, then all tests | test-after × big |
| W4 — All tests, then code to pass | test-first × big |
Crossing the 4 workflows with the 2 authorship levels gives a 2×2×2 design. One cell — W1 × 2 agents — is dropped: Beck's RED-GREEN-REFACTOR loop is a single integrated cycle, and splitting it across two context-isolated agents is either heavy-handoff ping-pong or no longer TDD. That leaves 7 cells.
Held constant (eliminated as variables)¶
These were variables in earlier runs; the question above fixes them, so they are removed:
| held constant | fixed to | removes |
|---|---|---|
| Refactoring | always, after tests pass each iteration | the none / one-shot-only granularity arms as separate cadence conditions |
| Spec clarity | clear | the clear/vague crossing |
Refactoring being mandatory does not remove a refactor step — every arm refactors after green. "Each iteration" means per-behavior for the small-batch workflows and once-after-green for the big-batch workflows (which have a single iteration).
The 7 cells and the arms that implement them¶
| cell | workflow × authorship | harness arm | status |
|---|---|---|---|
| C1 | W1 TDD × 1 agent | tdd-refactor |
existing |
| C3 | W2 small × 1 agent | continuous-single |
existing |
| C4 | W2 small × 2 agents | continuous-split |
existing |
| C5 | W3 big × 1 agent | one-shot-single |
existing |
| C6 | W3 big × 2 agents | one-shot-split |
existing |
| C7 | W4 big × 1 agent | all-tests-first-single |
new |
| C8 | W4 big × 2 agents | all-tests-first-split |
new |
The existing one-shot arms already are W3: one agent (or coder+tester) writes everything,
then a separate, revertable refactor pass runs after green. Only W4 — write the whole test
suite first, then production code to pass it — was missing. Two new arms add it:
all-tests-first-single: one agent writes the full failing suite, then the production code to pass it, then a separate refactor pass.all-tests-first-split: an isolated tester writes the full failing suite, then an isolated coder writes production code to pass it, then a separate refactor pass.
Both use the one-shot refactor mechanism (separate dispatch, test edits reverted to the green
snapshot) so the tests-frozen-during-refactor invariant holds exactly as in every other arm.
The no-refactor-* arms are not used — refactoring is mandatory now.
Outcome measures (what "best" means)¶
All already emitted by the harness; map 1:1 to the three goals plus cost:
| goal | measured by |
|---|---|
| Maintainable / well structured | radon MI + CC, lizard CCN / tokens, blast radius across the 3-change chain (changeability) |
| Tested with good tests | mutation score, branch coverage, CORE + EDGE acceptance pass, test smells (assertless tests, mock density, sleeps) |
| Minimum cost | cost_usd, tokens, turns (raw and per quality unit) |
Report every quality figure raw and per-dollar, and name the efficient frontier: which workflow buys the most maintainability + test quality per dollar.
Cost (firmed by a 1-task pilot)¶
Per-cell costs: run-04 actuals (cross-task mean) for the 5 existing arms; a 1-task fare
pilot for the 2 new arms, nudged +5% since fare runs ~5% below the cross-task mean.
The pilot arms ran clean — core+edge pass, mutation 1.0, all 3 changes pass, 0 invariant
violations.
| cell | arm | $/cell | basis |
|---|---|---|---|
| C3 | continuous-single |
0.99 | run-04 actual |
| C1 | tdd-refactor |
1.57 | run-04 actual |
| C5 | one-shot-single |
2.01 | run-04 actual |
| C7 | all-tests-first-single |
~2.50 | pilot ($2.39 on fare) |
| C4 | continuous-split |
2.81 | run-04 actual |
| C6 | one-shot-split |
2.85 | run-04 actual |
| C8 | all-tests-first-split |
~3.77 | pilot ($3.60 on fare) |
| per task × trial (7 cells) | ~$16.5 |
| trials/cell | cells | est. campaign cost |
|---|---|---|
| 6 | 168 | ~$400 |
| 8 | 224 | ~$530 |
| 12 | 336 | ~$790 |
Recommended: start at 6 trials/cell (~$400) and sequentially extend only the cells whose maintainability/test-quality-per-dollar ranking is still ambiguous. The two test-first-split cells (C8) are the most expensive; the single-agent cells are 2–4× cheaper.
Secondary analyses (optional, same data — no extra run)¶
Earlier designs in this experiment line chased effects we have since either settled or deliberately scoped out. They need no separate campaign — each is either eliminated or computable from this experiment's existing output rows:
| earlier endpoint | status here |
|---|---|
| EDGE collapse under vague specs | eliminated — vague specs are a settled result (they degrade quality); not worth paying to re-measure |
| clarity × granularity interaction | eliminated — no clarity axis |
"does refactoring help?" (no-refactor arms) |
eliminated — refactoring is a settled premise |
| 20–40-task expansion for a ±5% blast main effect | eliminated — underpowered at four tasks; not this question |
| blast variance, free (big) vs frozen (small) | optional secondary — computable from these rows |
| test-churn → blast mediation | optional secondary — needs only a per-change test-churn series |
| ±5% TOST on blast | optional secondary — underpowered; report as inconclusive |
The three optional secondaries reuse arms already in this matrix (one-shot = big-batch,
continuous = small-batch, tdd-refactor), so they are extra statistics over the same rows,
not a second run. The only marginal harness work — if the mediation is wanted — is decomposing
test-LOC churn per change index, a measurement add rather than a new arm or factor.
Net: this is the one experiment. Run the 7-cell campaign; the secondaries are an analysis pass over its output.
Future scale-up (parked — same question, bigger corpus)¶
If the 4-task / 3-change result needs more external validity, scale the corpus, not the factors — run these same 7 cells on harder tasks. The good ideas from the retired cadence-larger design, kept here so they are not lost:
- Larger multi-file tasks — 3–5 source modules behind a public API, not single-module katas.
- Longer change horizon — an 8-change chain instead of 3, so changeability compounds.
- Held-out changeability probe — after the chain, apply two identical changes with refactoring disabled, to measure the changeability each workflow paid forward.
- Decomposed churn — record change-churn and refactor-churn separately per change index.
- More tasks — the task is the unit of inference; cross-task ranking stability is bounded by task count, and only more tasks (not more trials) tightens it.
This is a larger, more expensive campaign — specify N from a power calc on the prior per-task variance before funding it. It is not part of the base run.
Run plan¶
- Validate with
--skip-dispatch(free) — confirms all 7 arms run end-to-end. - 1-task cost pilot on the new arms (
--task fare --arm all-tests-first-single --arm all-tests-first-split --trials 1) to get real per-cell cost. - Campaign: one command —
scripts/run_workflow_matrix.py. It runs the 7 arms × 4 tasks × 6 base trials, then sequentially extends only the arms whose cost-efficiency (quality-per-dollar) rank is still ambiguous, up to a 12-trial ceiling. Defaults to a dry plan (no dispatch); pass--goto run,--analyze-onlyto print the efficiency frontier from existing data. Holds the model fixed (claude-sonnet-4-6).
Guardrails (carried from run 04)¶
- Hide acceptance suites during build/change; validate each against a reference first.
- The invariant is enforced by reverting test edits in refactor steps — it held at 0 violations in 364 cells.
- Hold the model fixed and report it.
- The task is the unit of inference; with 4 tasks, report per-task and pooled, and be honest about what 4 tasks can and cannot resolve.