Refactoring cadence & authorship — Summary¶

A condensed version of refactor-granularity-report.md, which has the full data, statistics, and method.

What we tested: one model (claude-sonnet-4-6) built four small Python components (fare, payroll, cart, grades) under clear specs, then made a chain of three follow-up changes to each. We varied how often the code was refactored and who wrote the tests, holding fixed the rule that refactoring never changes tests. 7 arms × 4 tasks × 13 trials = 364 cells.

Terms¶

Granularity — how often cleanup happens: none (never), one-shot (a single pass), or continuous (a little after every change).
Authorship — single (one agent writes code and tests) vs split (an independent agent writes the tests).
tdd-refactor — the reference arm: test-first, continuous cleanup, single author.
The invariant — a refactor restructures production code only; if a cleanup step touches a test file, that edit is reverted. "Refactoring is behavior-preserving, so it must not change the tests" — enforced, not assumed.
Cumulative blast radius — total lines of code touched to make the three follow-up changes. Lower means the code was easier to change.
CORE / EDGE — stated behavior vs inferred edge cases. MI — radon maintainability index.

Headline¶

On small, clear-spec modules over a three-change horizon, refactoring did not pay for itself, and an independent test author bought nothing for ~3× the cost. The result worth keeping is methodological: treating "tests don't change during refactoring" as an enforced invariant works cleanly — 0 violations in 364 cells — and the harness and tasks are reusable for a fairer follow-up.

Findings¶

1. The corrected methodology held perfectly¶

No refactor step changed a test across all 364 cells — any attempted churn was reverted, and the inline arms attempted none. The "tests stay fixed during a cleanup" rule is enforceable and was respected everywhere. This is the run's most solid result.

2. Refactoring didn't pay back within three changes¶

Cumulative blast rose monotonically with refactoring effort, consistently across all four tasks:

Granularity	Cumulative blast	Cost	MI
none	104.5	$1.13	77.3
continuous	138.5	$1.91	72.7
one-shot	155.0	$2.40	75.0

Maintainability (MI ≈ 72–80) is essentially flat across cadence. On small, already-simple modules, cleanup adds churn and cost without a measurable changeability or structural gain over three changes. This is by construction as much as by behavior: blast counts the refactoring's own lines, so refactor arms look "less changeable" partly because the metric charges them for the cleanup itself (see Limitations).

3. Split authorship cost ~3× single, with no quality gain¶

Authorship	Cost	Mutation	EDGE
single	$0.97	0.95	100%
split	$2.84	1.00	98%

An independent test author roughly tripled cost while moving quality nowhere meaningful — mutation up a hair, EDGE down a hair. On these tasks it did not earn its price.

4. Test quality barely moved¶

Coverage 90–100%, mutation median 0.92, CORE/EDGE 96–100% across every arm. There was little room for granularity or authorship to separate — the sensors were near-saturated on tasks this small and clear.

What this run can and can't tell you¶

It settles the method. "Tests unchanged during refactoring" is enforceable, and the harness/corpus are sound and reusable.

It does not yet settle whether refactoring pays off, for four reasons:

Blast conflates cost with benefit — it counts the cleanup's own churn, so the hypothesized payoff (smaller later changes) is swamped inside a three-change window.
Tasks were too small and clear — single-module features (MI ≈ 75, coverage ~100%) leave refactoring nothing to bite and the quality sensors nothing to discriminate.
Four tasks underpower small effects — the unit of inference is the task; four can't resolve a ~5% changeability difference (see refactor-granularity-power-analysis.md).
Clear specs only — this removes the vague-spec edge-inference signal, so EDGE ≈ CORE here by design, hiding the safety-net effect the earlier study found.

The large, consistent effects (refactor churn, split-authorship cost) are real; the small ones are not resolvable at this scale.

Recommendations¶

Situation	Guidance
Small, clear-spec, short-horizon changes	Neither extra refactoring cadence nor split authorship earns its cost — `none`/single is the efficient choice here.
Choosing a test-authorship model	Independent (split) test authoring is not worth ~3× cost on tasks like these; re-test only at larger scale.
Want to actually measure refactoring's payoff	Run a fairer follow-up: larger multi-file tasks, a longer change chain, and a blast metric that separates refactor-churn from change-churn (or a held-out no-refactor change).
Trust the invariant	Yes — revert-based enforcement of "tests don't change during refactoring" held at 0/364.

What's next¶

One consolidated successor is specified:

05-experiment-prompt-workflow-matrix.md — with refactoring fixed on and specs fixed clear, it asks which agentic workflow (test-first/after × small/big batch × 1/2 agents) yields maintainable, well-tested code at minimum cost. The free-vs-frozen blast-variance contrast and test-churn → blast mediation survive as optional secondary analyses over the same output rows; the larger multi-file / longer-horizon corpus is parked as a future scale-up inside that prompt.

Summary of refactor-granularity-report.md; numbers and method are in the full report.