Refactoring cadence & authorship: results
Status: complete. 7 arms × 4 tasks × 13 trials = 364 cells, clear specs,
build + 3-change chain, model claude-sonnet-4-6. Raw data:
data/refactor-granularity-merged.jsonl;
machine summary: data/refactor-granularity-summary.json;
reproduce with python3 scripts/analyze_refactor_experiment.py.
Plain-language summary: refactor-granularity-summary.md.
Design
Follow-up to the When Does TDD Pay Off? finding that two refactoring workflows tied on changeability. The original draft crossed a "tests free vs frozen during refactor" factor. That was corrected: refactoring is behavior-preserving and must not change tests, so "tests unchanged during refactor" is an invariant of every arm, not a variable. The invariant is enforced by reverting any test-file edit made by a refactor step to the pre-refactor snapshot and recording the attempt. We varied:
- granularity: none / one-shot (single pass) / continuous (per increment)
- authorship: single (one agent writes code+tests) / split (independent coder + tester)
- plus tdd-refactor (test-first, continuous, single) as the reference.
Four clean-room tasks (fare, payroll, cart, grades), each with a hidden
CORE/EDGE acceptance suite and a 3-change chain whose trap punishes non-modular code.
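The harness code that enforces the test invariant is not shown in this report. The following is only a minimal sketch of the snapshot-and-revert rule described above, under stated assumptions: test files are assumed to match a hypothetical `test_*.py` pattern, and the function names `snapshot_tests` and `enforce_test_invariant` are illustrative, not the harness's own.

```python
from pathlib import Path


def snapshot_tests(root: Path, pattern: str = "test_*.py") -> dict[Path, bytes]:
    """Capture the bytes of every test file before a refactor step runs."""
    return {p.relative_to(root): p.read_bytes() for p in sorted(root.rglob(pattern))}


def enforce_test_invariant(root: Path, snapshot: dict[Path, bytes],
                           pattern: str = "test_*.py") -> list[str]:
    """Restore test files to the snapshot; return the paths that had drifted.

    A non-empty return value is what a harness could log as attempted churn.
    """
    reverted = []
    current = {p.relative_to(root) for p in root.rglob(pattern)}
    # Test files created by the refactor step are removed.
    for rel in sorted(current - set(snapshot)):
        (root / rel).unlink()
        reverted.append(str(rel))
    # Modified or deleted test files are rewritten from the snapshot.
    for rel, original in snapshot.items():
        path = root / rel
        if not path.exists() or path.read_bytes() != original:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(original)
            reverted.append(str(rel))
    return reverted
```

In this sketch a refactor step would be bracketed by `snapshot_tests(root)` before and `enforce_test_invariant(root, snapshot)` after. Non-test files are never touched, so the refactored production code survives the revert.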
Results (per-arm; blast/cost are medians of the 4 task-medians)
| arm | cum blast | cost $ | blast/$ | mutation | cov % | MI | CORE | EDGE | violations |
|---|---|---|---|---|---|---|---|---|---|
| tdd-refactor (ref) | 110.5 | 1.56 | 70.8 | 0.78 | 100 | 76.4 | 100% | 100% | 0 |
| no-refactor-single | 103.5 | 0.73 | 141.8 | 0.95 | 100 | 75.0 | 100% | 100% | 0 |
| one-shot-single | 154.5 | 1.97 | 78.4 | 1.00 | 100 | 75.6 | 96% | 96% | 0 |
| continuous-single | 118.5 | 0.97 | 122.2 | 0.92 | 100 | 71.9 | 100% | 100% | 0 |
| no-refactor-split | 105.5 | 1.53 | 69.0 | 1.00 | 100 | 79.6 | 96% | 96% | 0 |
| one-shot-split | 155.5 | 2.84 | 54.8 | 1.00 | 100 | 74.4 | 98% | 98% | 0 |
| continuous-split | 158.5 | 2.85 | 55.6 | 1.00 | 100 | 73.5 | 98% | 98% | 0 |
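The "median of the 4 task-medians" aggregation used for the blast and cost columns can be sketched as below. This is an illustration rather than the analysis script itself: the field names `arm` and `task` and the metric key passed in are assumptions about the JSONL schema, which this report does not document.

```python
import json
from collections import defaultdict
from statistics import median


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def arm_summary(rows, metric):
    """Median over tasks of the per-task median of `metric`, keyed by arm."""
    per_cell = defaultdict(list)  # (arm, task) -> values from each trial
    for row in rows:
        per_cell[(row["arm"], row["task"])].append(row[metric])
    per_arm = defaultdict(list)  # arm -> one median per task
    for (arm, _task), values in per_cell.items():
        per_arm[arm].append(median(values))
    return {arm: median(task_medians) for arm, task_medians in per_arm.items()}
```

Taking the median within each task first keeps one outlier trial from dominating, and taking the median across tasks treats the task as the unit of inference. The blast/$ column in the table equals the ratio of the two resulting medians, for example 110.5 / 1.56 ≈ 70.8 for tdd-refactor.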
Factor main effects (test-after grid; reference excluded):
| granularity | cum blast | cost | MI |
|---|---|---|---|
| none | 104.5 | 1.13 | 77.3 |
| one-shot | 155.0 | 2.40 | 75.0 |
| continuous | 138.5 | 1.91 | 72.7 |
| authorship | cum blast | cost | mutation | EDGE |
|---|---|---|---|---|
| single | 118.5 | 0.97 | 0.95 | 100% |
| split | 155.5 | 2.84 | 1.00 | 98% |
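The main-effect values above are consistent with taking the median of the matching per-arm values (with two arms per granularity level, the median is their midpoint). The sketch below recomputes blast, cost, and MI from the per-arm table as a check. The `ARMS` layout and the `main_effect` helper are illustrative, and the mutation and EDGE columns are left out for brevity.

```python
from statistics import median

# (granularity, authorship) -> (cum blast, cost $, MI), copied from the per-arm table.
ARMS = {
    ("none", "single"): (103.5, 0.73, 75.0),
    ("one-shot", "single"): (154.5, 1.97, 75.6),
    ("continuous", "single"): (118.5, 0.97, 71.9),
    ("none", "split"): (105.5, 1.53, 79.6),
    ("one-shot", "split"): (155.5, 2.84, 74.4),
    ("continuous", "split"): (158.5, 2.85, 73.5),
}


def main_effect(factor_index, level):
    """Median of each metric over the arms at one level of one factor.

    factor_index 0 selects granularity, 1 selects authorship.
    """
    rows = [vals for key, vals in ARMS.items() if key[factor_index] == level]
    return tuple(median(col) for col in zip(*rows))


for level in ("none", "one-shot", "continuous"):
    print(level, main_effect(0, level))
for level in ("single", "split"):
    print(level, main_effect(1, level))
```

The same numbers give the cost ratio behind the split-authorship finding: 2.84 / 0.97 is about 2.9.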
Findings
1. The corrected methodology held perfectly — 0 invariant violations in 364 cells. No refactor step changed a test (attempted churn was reverted where it occurred; none was attempted in the inline arms). The "tests don't change during refactoring" rule is enforceable and was respected.
2. Refactoring did not pay back within a 3-change horizon on these tasks. Cumulative blast is higher wherever refactoring happens: none (104.5) < continuous (138.5) < one-shot (155.0) in the granularity main effects, consistently across all four tasks. The gap between continuous and one-shot comes from the single-author arms (118.5 vs 154.5); the split arms are nearly tied (158.5 vs 155.5). Maintainability is essentially flat across granularity (MI ≈ 72–80). On small, clear-spec modules the cleanup adds churn and cost without a measurable changeability or structural gain over three changes.
3. Split authorship cost ~3× single with no quality gain. Median cost per cell was $0.97 for single versus $2.84 for split (about 2.9×). Split's mutation score is marginally higher (1.00 vs 0.95), but its EDGE pass rate is marginally lower (98% vs 100%). An independent test author bought nothing measurable here while roughly tripling cost.
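4. Test quality barely moved. Coverage 90–100% (median 100% in every arm), mutation median 0.92 (arm medians 0.92–1.00 on the test-after grid, 0.78 for the tdd-refactor reference), and CORE/EDGE 96–100% across every arm leave little room for granularity or authorship to separate on these tasks.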
Limitations (read before drawing strong conclusions)
- Blast conflates change-cost with refactor-churn. Cumulative blast counts the refactoring's own line changes, so refactor arms score "less changeable" partly by construction. The hypothesized payoff — refactoring makes later changes smaller — is swamped by the cleanup churn within a 3-change window and is not separated in this metric. A cleaner test would measure a held-out change made without refactoring, or record change-churn and refactor-churn separately.
- Tasks are small and clear-spec. Single-module features where the code is already simple (MI ~75) leave little room for refactoring to improve structure or for coverage/mutation to discriminate. The original premise likely needs larger, multi-file tasks and a longer change horizon (6–10 changes) to surface a payoff.
- Four tasks → underpowered for small effects. The unit of inference is the task; four is too few to resolve a ~5% changeability difference (see refactor-granularity-power-analysis.md). The large, consistent effects here (refactor churn, split cost) are real; small ones are not resolvable.
- Clear specs only. Using clear specs removes the vague-spec edge-inference signal that separated arms in the prior study, so EDGE ≈ CORE here by design.
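One way to address the first limitation is to tag every step's line churn with the reason for the step, so that change-churn and refactor-churn can be reported separately. The sketch below is hypothetical: the `ChurnLedger` class and the step kinds `change` and `refactor` are illustrative names, not part of the current harness, and the numbers in the usage lines are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ChurnLedger:
    """Lines touched per step, tagged by why the step happened."""
    steps: list = field(default_factory=list)  # list of (kind, lines_changed)

    def record(self, kind: str, lines_changed: int) -> None:
        if kind not in {"change", "refactor"}:
            raise ValueError(f"unknown step kind: {kind}")
        self.steps.append((kind, lines_changed))

    def total(self, *kinds: str) -> int:
        """Sum of lines changed over the given step kinds."""
        return sum(n for kind, n in self.steps if kind in kinds)


# Illustrative values only, not experiment data.
ledger = ChurnLedger()
ledger.record("change", 30)
ledger.record("refactor", 50)
ledger.record("change", 20)
print(ledger.total("change"), ledger.total("refactor"))
```

With such a split, `total("change", "refactor")` would correspond to a combined figure like the cumulative blast reported here, while `total("change")` alone would isolate the quantity the payoff hypothesis is about.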
Recommendation
- Methodologically, the correction is the headline: treating "tests unchanged during refactoring" as an enforced invariant works, and the harness and corpus are reusable.
- On these tasks, neither refactoring cadence nor split authorship earns its cost. Refactoring small, already-simple modules over a short change horizon adds churn and spend without measurable benefit; split authorship triples cost for no quality gain.
- To actually test refactoring's changeability payoff, the next run needs (a) larger multi-file tasks, (b) a longer change chain, and (c) a blast metric that separates refactor-churn from change-churn (or a held-out no-refactor change). Until then this run answers the methodology cleanly and the changeability payoff only weakly.