Refactoring cadence & authorship: results

Status: complete. 7 arms × 4 tasks × 13 trials = 364 cells, clear specs, build + 3-change chain, model claude-sonnet-4-6. Raw data: data/refactor-granularity-merged.jsonl; machine summary: data/refactor-granularity-summary.json; reproduce with python3 scripts/analyze_refactor_experiment.py.

Plain-language summary: refactor-granularity-summary.md.

Design

This is a follow-up to the "When Does TDD Pay Off?" finding that two refactoring workflows tied on changeability. The original draft crossed a "tests free vs frozen during refactor" factor. That was corrected: refactoring is behavior-preserving and must not change tests, so "tests unchanged during refactor" is an invariant of every arm, not a variable. The invariant is enforced by reverting any test-file edit a refactor step makes back to the pre-refactor snapshot and recording the revert. We varied:

granularity: none / one-shot (single pass) / continuous (per increment)
authorship: single (one agent writes code+tests) / split (independent coder + tester)
plus tdd-refactor (test-first, continuous, single) as the reference.
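
As a rough sketch of how the factors above combine into the 7 arms and 364 cells, the grid can be enumerated as follows. The arm names follow the results table. The constant and function names are illustrative assumptions, not taken from the harness.

```python
from itertools import product

# Factor levels as named in the design; arm names follow the results table.
GRANULARITY = ["no-refactor", "one-shot", "continuous"]
AUTHORSHIP = ["single", "split"]
TASKS = ["fare", "payroll", "cart", "grades"]
TRIALS_PER_CELL = 13


def build_arms():
    """Return the tdd-refactor reference plus the six test-after grid arms."""
    grid = [f"{g}-{a}" for a, g in product(AUTHORSHIP, GRANULARITY)]
    return ["tdd-refactor"] + grid


arms = build_arms()
cells = [
    (arm, task, trial)
    for arm in arms
    for task in TASKS
    for trial in range(TRIALS_PER_CELL)
]
print(len(arms), len(cells))  # prints "7 364"
```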

Four clean-room tasks (fare, payroll, cart, grades), each with a hidden CORE/EDGE acceptance suite and a 3-change chain whose trap punishes non-modular code.
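
The test-unchanged invariant described in this section (snapshot the tests, run the refactor step, revert and record any test-file edit) might look roughly like the sketch below. It is not the harness's actual code. The function names, the `test_*.py` file pattern, and the choice to delete test files that a refactor step newly creates are all assumptions made for illustration.

```python
from pathlib import Path


def snapshot_tests(root, pattern="test_*.py"):
    """Map each test file's relative path to its bytes before the refactor step."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in sorted(root.rglob(pattern))
    }


def enforce_test_invariant(root, snapshot, pattern="test_*.py"):
    """Restore test files to the snapshot and return the paths that were reverted."""
    root = Path(root)
    reverted = []
    current = {p.relative_to(root).as_posix(): p for p in root.rglob(pattern)}
    # Assumption: test files created by the refactor step are removed.
    for rel, path in current.items():
        if rel not in snapshot:
            path.unlink()
            reverted.append(rel)
    # Edited or deleted test files are rewritten from the snapshot.
    for rel, original in snapshot.items():
        path = root / rel
        if not path.exists() or path.read_bytes() != original:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(original)
            reverted.append(rel)
    return sorted(reverted)
```

A caller would take the snapshot before the refactor step, run the step, then call `enforce_test_invariant` and log the returned list. An empty list means the step left the tests alone.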

Results (per-arm; blast and cost are medians of the 4 task-medians)

arm	cum blast	cost $	blast/$	mutation	cov %	MI	CORE	EDGE
tdd-refactor (ref)	110.5	1.56	70.8	0.78	100	76.4	100%	100%
no-refactor-single	103.5	0.73	141.8	0.95	100	75.0	100%	100%
one-shot-single	154.5	1.97	78.4	1.00	100	75.6	96%	96%
continuous-single	118.5	0.97	122.2	0.92	100	71.9	100%	100%
no-refactor-split	105.5	1.53	69.0	1.00	100	79.6	96%	96%
one-shot-split	155.5	2.84	54.8	1.00	100	74.4	98%	98%
continuous-split	158.5	2.85	55.6	1.00	100	73.5	98%	98%
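
The "median of the 4 task-medians" aggregation used in the table above can be sketched as follows. The record field names (`arm`, `task`, `cum_blast`, `cost_usd`) are hypothetical and may not match the schema of data/refactor-granularity-merged.jsonl. Computing blast/$ as the ratio of the two reported medians is an inference: it reproduces the table's column to within rounding (for example 110.5 / 1.56 ≈ 70.8), but the analysis script may compute it differently.

```python
from collections import defaultdict
from statistics import median


def arm_summary(cells):
    """Per arm: median over tasks of the per-task median across trials."""
    by_arm_task = defaultdict(list)
    for c in cells:
        by_arm_task[(c["arm"], c["task"])].append(c)

    # First level: one median per (arm, task) across its trials.
    task_medians = defaultdict(lambda: {"blast": [], "cost": []})
    for (arm, _task), rows in by_arm_task.items():
        task_medians[arm]["blast"].append(median(r["cum_blast"] for r in rows))
        task_medians[arm]["cost"].append(median(r["cost_usd"] for r in rows))

    # Second level: median of the task-medians, then the ratio of the two.
    summary = {}
    for arm, m in task_medians.items():
        blast, cost = median(m["blast"]), median(m["cost"])
        summary[arm] = {
            "cum_blast": blast,
            "cost": cost,
            "blast_per_dollar": blast / cost,
        }
    return summary
```

Taking the median per task first keeps a single noisy task from dominating an arm's figure, which fits the document's position that the task is the unit of inference.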

Factor main effects (test-after grid; reference excluded):

granularity	cum blast	cost $	MI
none	104.5	1.13	77.3
one-shot	155.0	2.40	75.0
continuous	138.5	1.91	72.7

authorship	cum blast	cost $	mutation	EDGE
single	118.5	0.97	0.95	100%
split	155.5	2.84	1.00	98%
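
The main-effect figures above are consistent with taking the median of the per-arm values across the arms that share a factor level. This is inferred from the numbers, not stated by the source. The sketch below checks that reading against the per-arm table; the dictionary layout and function name are illustrative.

```python
from statistics import median

# Per-arm values copied from the results table (test-after grid only).
# Keys are (granularity, authorship).
ARMS = {
    ("none", "single"): {"cum_blast": 103.5, "cost": 0.73, "mi": 75.0},
    ("one-shot", "single"): {"cum_blast": 154.5, "cost": 1.97, "mi": 75.6},
    ("continuous", "single"): {"cum_blast": 118.5, "cost": 0.97, "mi": 71.9},
    ("none", "split"): {"cum_blast": 105.5, "cost": 1.53, "mi": 79.6},
    ("one-shot", "split"): {"cum_blast": 155.5, "cost": 2.84, "mi": 74.4},
    ("continuous", "split"): {"cum_blast": 158.5, "cost": 2.85, "mi": 73.5},
}


def main_effect(factor_index, level, metric):
    """Median of `metric` over the grid arms at one level of one factor.

    factor_index 0 selects granularity, 1 selects authorship.
    """
    values = [v[metric] for k, v in ARMS.items() if k[factor_index] == level]
    return median(values)


for level in ("none", "one-shot", "continuous"):
    print(level, main_effect(0, level, "cum_blast"))
for level in ("single", "split"):
    print(level, main_effect(1, level, "cum_blast"), main_effect(1, level, "cost"))
```

With two arms per granularity level the median equals the mean, so this check cannot distinguish the two for that factor.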

Findings

1. The corrected methodology held: 0 invariant violations in 364 cells. No refactor step left a test file changed (attempted churn was reverted where it occurred; none was attempted in the inline arms). The "tests don't change during refactoring" rule is enforceable and held in every cell.

2. Refactoring did not pay back within a 3-change horizon on these tasks. Cumulative blast rises with refactoring effort, in the same order as cost: none (104.5) < continuous (138.5) < one-shot (155.0), consistently across all four tasks. Maintainability (MI ≈ 72–80) is essentially flat across granularity. On small, clear-spec modules the cleanup adds churn and cost without a measurable changeability or structural gain over three changes.

3. Split authorship cost ~3× single with no quality gain. Median cost per cell was $0.97 for single vs $2.84 for split (about 2.9×). Split's mutation score is marginally higher (1.00 vs 0.95) but its EDGE pass rate is marginally lower (98% vs 100%). An independent test author bought nothing measurable here while roughly tripling cost.

4. Test quality barely moved. Coverage was 90–100% and CORE/EDGE 96–100% across every arm; per-arm mutation medians were 0.92–1.00 in the test-after grid (0.78 for the tdd-refactor reference). That leaves little room for granularity or authorship to separate on these tasks.

Limitations (read before drawing strong conclusions)

Blast conflates change-cost with refactor-churn. Cumulative blast counts the refactoring's own line changes, so refactor arms score "less changeable" partly by construction. The hypothesized payoff — refactoring makes later changes smaller — is swamped by the cleanup churn within a 3-change window and is not separated in this metric. A cleaner test would measure a held-out change made without refactoring, or record change-churn and refactor-churn separately.
Tasks are small and clear-spec. Single-module features where the code is already simple (MI ~75) leave little room for refactoring to improve structure or for coverage/mutation to discriminate. The original premise likely needs larger, multi-file tasks and a longer change horizon (6–10 changes) to surface a payoff.
Four tasks → underpowered for small effects. The unit of inference is the task; four is too few to resolve a ~5% changeability difference (see refactor-granularity-power-analysis.md). The large, consistent effects here (refactor churn, split cost) are real; small ones are not resolvable.
Clear specs only. Using only clear specs removes the vague-spec edge-inference signal that separated arms in the prior study, so EDGE ≈ CORE here by design.
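
The first limitation above suggests recording change-churn and refactor-churn separately. A minimal sketch of that bookkeeping follows, assuming each step in a cell is logged with a kind label and a changed-line count. The `Step` type, the labels, and the exclusion of build steps from the totals are assumptions, and the numbers are illustrative, not measurements from this run.

```python
from dataclasses import dataclass


@dataclass
class Step:
    kind: str           # "build", "refactor", or "change" (hypothetical labels)
    lines_changed: int  # blast of this step, in changed lines


def split_blast(steps):
    """Return refactor-churn and change-churn as separate totals."""
    totals = {"refactor": 0, "change": 0}
    for step in steps:
        if step.kind in totals:  # build steps are ignored in this sketch
            totals[step.kind] += step.lines_changed
    totals["combined"] = totals["refactor"] + totals["change"]
    return totals


# Illustrative numbers only.
steps = [
    Step("build", 120),
    Step("refactor", 40),
    Step("change", 25),
    Step("refactor", 15),
    Step("change", 20),
]
print(split_blast(steps))  # {'refactor': 55, 'change': 45, 'combined': 100}
```

Comparing the `change` total across arms would test whether refactoring shrinks later changes, while `combined` corresponds to a metric that also charges the arm for its own cleanup.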

Recommendation

Methodologically, the correction is the headline: treating "tests unchanged during refactoring" as an enforced invariant works, and the harness and corpus are reusable.
On these tasks, neither refactoring cadence nor split authorship earns its cost. Refactoring small, already-simple modules over a short change horizon adds churn and spend without measurable benefit; split authorship triples cost for no quality gain.
To actually test refactoring's changeability payoff, the next run needs (a) larger multi-file tasks, (b) a longer change chain, and (c) a blast metric that separates refactor-churn from change-churn (or a held-out no-refactor change). Until then this run answers the methodology cleanly and the changeability payoff only weakly.