Test Evaluation and Architecture¶

This document explains how to evaluate an existing application's testing and design a path toward a fast, deterministic, config-free CI gate that fully validates behavior, including cross-service interaction, without standing up the rest of the system.

Purpose¶

The test evaluation workflow answers two questions: "how well is this application tested today?" and "what would a CD-aligned test architecture look like?" The result is an assessed gap list and a concrete migration path, not generated test code. Implementation of the migration goes to /plan or /build.

Tools and Their Altitudes¶

Five tools operate at different scopes. Use the one that matches what you need.

What you want	Tool	Altitude	Direction
Advise on how to test a specific module or hard-to-test unit	`test-design-advisor` skill	Unit / module	Forward (design)
Review test files in a changeset for smells, quality, and a suite-wide Farley Score	`/test-design`	Per-file / changeset	Backward (review)
Audit the whole test suite's strategy, quadrant coverage, and automation maturity	`test-health` skill	Whole suite	Strategic rollup
Assess the application's test strategy against CD: per-component types, pre-merge gate determinism	`cd-test-architecture` skill	Whole application	Architecture
Modernize a legacy repository's tests to hit CD targets (≥ 90% coverage, zero surviving mutants, 100% deterministic, fastest pre-merge wall-clock)	`/test-modernize`	Whole repository	Remediation

test-design-advisor works at the module level: assess testability blockers, place each behavior on the pyramid, choose the right test double, and produce a behavior-preserving refactor sequence to introduce seams. It does not write tests. Vocabulary is locked to MinimumCD (static analysis / unit / component / contract / integration / E2E); prefer "contract test" over "narrow integration test" and gloss the alias once if it must be used. The pyramid is a cost heuristic, not a target shape — the advisor never emits "current shape vs recommended shape" tables or per-layer target counts; placements are per-behavior with a two-direction justification (why not the layer above or below). E2E justification gate: E2E is recommended only when (1) a contract test cannot pin the boundary, (2) a component test with doubles cannot exercise the behavior, (3) a resilience test cannot cover the failure mode, AND (4) the behavior is a critical multi-component user journey. E2E is never pre-merge.

/test-design is the orchestrator command for the changeset-level workflow. It dispatches test-review (tactical quality: missing assertions, non-determinism mechanics, mock hygiene) and test-smell-review (design-level smells: xUnit smell taxonomy, double selection, pyramid placement) in parallel, scores every existing test in the suite with the Farley Score (via the farley-score skill: 8 properties, weighted 1–10), then optionally runs test-design-advisor for production code that has no tests or for hard-to-test units. The aggregated report carries the headline Farley Score independent of the changeset scope.

test-health is the strategic-altitude rollup over the whole repository. It maps coverage to the Agile Testing Quadrants, evaluates the suite's shape against the architecture, rolls up automation maturity and flaky-test signals, and produces an ordered improvement plan. It delegates rather than re-derives: CD-determinism + pipeline placement come from cd-test-architecture, per-file findings + Farley Score come from /test-design, assertion strength on critical-logic modules comes from mutation-testing. Use this for "audit our tests" / "test strategy review" / "is our testing healthy?".

cd-test-architecture works at the application level: inventory components and test suites, classify against the six MinimumCD test types, identify CD-fitness gaps, recommend a per-component target architecture (with the four-condition E2E justification gate applied to every E2E recommendation), and produce a migration path. It does not write tests or edit code.

/test-modernize is the remediation altitude — what to do with a cd-test-architecture assessment. It sequences five gated phases (analysis → public-interface Gherkin → audit + baseline coverage → no-refactor tests added → minimum refactor and converge on quality targets), holds the human gate between each phase, and writes Phase-tagged Stories to the tracker the parent issue URL points at (GitHub, ADO, GitLab, Jira — or local plan files when no URL or CLI is available). It does not invent its own assessment; Phase 1 invokes /cd-test-architecture directly. See the workflow diagram in Architecture.

How they compose. Start at the altitude that matches the question. test-health calls /test-design, cd-test-architecture, and mutation-testing internally — when the question is strategic, do not dispatch the lower-altitude tools yourself. /test-design calls test-design-advisor internally when --advise applies. /test-modernize calls /cd-test-architecture as its Phase 1 — when the question is "how do we get from this assessment to passing CD gates?", start with /test-modernize and let it dispatch the assessment itself. When two altitudes plausibly fit, prefer the higher one and let it delegate down.

The Evaluation Workflow¶

The cd-test-architecture skill follows these steps. Run it with /cd-test-architecture <path>.

Step 1: Inventory the application's components¶

Map each deployable or testable surface and assign it a pattern from knowledge/component-test-patterns.md:

UI — User Interface
Services — API Provider, API Consumer, Event Consumer, Event Producer, Stateful Service, CLI/Library
Batch — Scheduled Job

A real system is usually several of these. Each surface is assessed separately.

Step 2: Inventory and classify existing tests¶

Find every test suite in the repo. For each, record: MinimumCD type, what it actually exercises, whether it is deterministic, and what it requires to run (DB URL, broker, downstream service, secrets, sleep, real clock).

If in-repo tests are sparse, the application is not necessarily untested — see Step 2b before concluding.

Step 2b: Locate and harvest out-of-repo tests¶

When --external-tests <path-or-repo-or-description> is given, treat the external location as the current specification of intended behavior:

Other-repo suites — read and classify just like in-repo tests; note they can't gate this component's merges.
Postman/Insomnia/.http collections — extract each request + assertion as an API contract and scenario.
Manual scripts or spreadsheets — extract each step as a behavior to automate.

This produces a behavior inventory that becomes the basis for improvement, not the destination.
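As an illustration of the harvesting step for collections, the sketch below walks a Postman-style export and emits one behavior per request. It is not the skill's implementation. It assumes the Postman Collection v2.1 layout (nested `item` arrays, a `request` object, and `event` entries with `listen: "test"`); the `harvest` function, the output keys, and the sample collection are hypothetical.

```python
import json


def harvest(items, folder=""):
    """Walk a Postman-style item tree and yield one behavior per request."""
    for item in items:
        name = f"{folder}/{item['name']}" if folder else item["name"]
        if "item" in item:  # a folder: recurse into it
            yield from harvest(item["item"], name)
            continue
        request = item["request"]  # assumption: request is an object, not a bare URL string
        url = request["url"]
        # The URL may be a plain string or an object with a "raw" field.
        raw_url = url if isinstance(url, str) else url.get("raw", "")
        # Test scripts hold the assertions; keep their source lines as-is.
        assertions = []
        for event in item.get("event", []):
            if event.get("listen") == "test":
                exec_lines = event.get("script", {}).get("exec", [])
                if isinstance(exec_lines, str):
                    exec_lines = exec_lines.splitlines()
                assertions.extend(line.strip() for line in exec_lines if line.strip())
        yield {
            "behavior": name,
            "method": request["method"],
            "url": raw_url,
            "assertions": assertions,
        }


# Hypothetical sample collection, reduced to the fields the sketch reads.
collection = json.loads("""
{
  "item": [
    {"name": "Payments", "item": [
      {"name": "Create payment",
       "request": {"method": "POST", "url": {"raw": "{{base}}/payments"}},
       "event": [{"listen": "test", "script": {"exec": [
         "pm.test('created', () => pm.response.to.have.status(201));"
       ]}}]}
    ]}
  ]
}
""")

behaviors = list(harvest(collection["item"]))
```

Each harvested entry is a candidate scenario for a component or contract test; the original script lines are kept only as evidence of what the external suite asserted.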

Step 3: Diagnose CD-fitness gaps¶

Flag, with evidence:

Out-of-repo or third-party-runner testing (anti-pattern — see below)
Manual / non-repeatable testing
Tests mistyped as "unit" that require real dependencies
Configured-dependency tests that can't run in a clean CI gate
Coverage gaps (success + failure modes not covered at any deterministic layer)
Doubles with no validation loop (drift risk)
No consumer resilience tests (the component assumes the provider holds)
Inverted pyramid shape (integration/E2E doing what component tests should)

Step 4: Recommend a target architecture¶

For each component, recommend: which test types cover which layers, what to double to run pre-merge without configuration, which success scenarios and failure modes to cover, the double-validation loop, and the pipeline stage for each test type (pre-merge gate, Stage 1/2, out-of-band, or post-deploy).

The recommendation applies the E2E justification gate: every E2E test must document that (1) a contract test cannot pin the boundary, (2) a component test with doubles cannot exercise the behavior, (3) a resilience test cannot cover the failure mode, AND (4) the behavior is a critical multi-component user journey. E2E recommendations that fail any of (1)–(3) are replaced with the cheaper layer that can cover them. The pyramid is treated as a cost heuristic — no per-layer target counts are recommended; if the shape is pathological (ice-cream cone, hourglass, cupcake), the pathology and the behaviors that suffer from it are named, not a numeric redistribution.
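The four conditions above can be read as a placement function. The following is a minimal sketch under stated assumptions, not the skill's logic: the `E2ECandidate` fields and the `place` function are hypothetical, the order in which the cheaper layers are tried is an assumption, and the outcome when only condition (4) fails is not specified in the text, so it is returned here as "unjustified".

```python
from dataclasses import dataclass


@dataclass
class E2ECandidate:
    """A proposed E2E test and the answers to the four gate questions."""
    behavior: str
    contract_test_can_pin_boundary: bool
    component_test_can_exercise: bool
    resilience_test_can_cover_failure: bool
    critical_multi_component_journey: bool


def place(candidate: E2ECandidate) -> str:
    """Return the layer for a proposed E2E test under the four-condition gate."""
    # Conditions (1)-(3): a cheaper layer that can cover the behavior wins.
    if candidate.contract_test_can_pin_boundary:
        return "contract"
    if candidate.component_test_can_exercise:
        return "component"
    if candidate.resilience_test_can_cover_failure:
        return "resilience component"
    # Condition (4): only critical multi-component journeys justify E2E.
    if candidate.critical_multi_component_journey:
        return "e2e (never pre-merge)"
    # Assumption: the text does not say what happens here.
    return "unjustified: no E2E recommendation"


checkout = E2ECandidate("checkout journey", False, False, False, True)
refund_api = E2ECandidate("refund response shape", True, False, False, True)
```

Here `place(checkout)` yields an E2E placement, while `place(refund_api)` is pushed down to a contract test even though it sits on a critical journey.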

Step 5: Produce a migration path¶

Steps are ordered lowest-risk first, and each is independently shippable. The spine is baseline before refactor: get behavior under test at existing seams without changing code, then refactor under that green baseline. When tests are out-of-repo, the harvested behaviors feed that baseline. Typical full sequence:

Characterization baseline (no refactoring) — outside-in tests at the outermost reachable seam that lock in current behavior; reproduce any harvested out-of-repo/manual behaviors here. Get green.
Introduce owned adapters and seams under the baseline (DDD skills suggest where boundaries belong)
Add in-memory doubles + component tests reproducing the baselined behaviors
Add contract tests pinning request/response boundaries
Add consumer resilience tests (verify the component survives a provider break)
Add scheduled provider-contract verification against a test environment
Move real-dependency tests off the gate to adapter integration or out-of-band
Add post-deploy checks
Decommission out-of-repo/manual suites and the coarse characterization tests as their behaviors land in the deterministic gate

Step 6: Report¶

Output goes to reports/cd-test-architecture-<app>.md. Tables, not prose: components and patterns, current tests and their CD-fitness, gaps, target architecture, pre-merge gate composition, migration path, next steps.

When the Tests Aren't in the Repo¶

An application may have little or no in-repo testing and instead be covered by suites in another repo, a third-party runner, Postman or Insomnia collections, or manual scripts. This is an anti-pattern regardless of how thorough the external coverage is:

The tests cannot gate the component's own merges — the build can go green while behavior is broken.
The tests are not versioned with the code they verify; a code change and its test change can't move together.
External suites are usually non-deterministic and environment-coupled, so they could never serve as a pre-merge gate anyway.
Manual scripts are not repeatable — they're a checklist, not a regression net.

This does not mean the external coverage is worthless. It is the current specification of intended behavior — the best available basis for improvement.

To include it in the assessment, point the skill at it:

/cd-test-architecture <path> --external-tests <postman-collection.json>
/cd-test-architecture <path> --external-tests <path-to-other-repo>
/cd-test-architecture <path> --external-tests "manual regression scripts in Confluence, linked here: ..."

The skill harvests those sources as a behavior inventory (Step 2b) and builds the migration path around re-expressing each behavior as a deterministic, in-repo, gated test:

External source	Re-expressed as
Postman request + assertion	Component or contract test
Manual UI script	UI component test (real browser, network stubbed)
Other-repo E2E covering this component	In-repo component test + thin post-deploy smoke

Each external case is decommissioned once its behavior lands in the gate.

If in-repo tests are sparse but no --external-tests location is given, the skill will ask where the application is actually tested before drawing any conclusions.

Key Principles¶

Pre-merge gate: deterministic tests only¶

The gate that blocks a merge may contain only static analysis, unit, component, and contract tests. These are deterministic and need nothing configured. Integration and end-to-end tests are non-deterministic by nature and never gate a merge. A test that needs a database URL, broker, downstream service, or environment secrets to run is mis-typed — re-classify or convert it.

The corollary is that E2E is the last resort, not a quota. The four-condition E2E justification gate (Step 4) ensures recommendations don't propose E2E "for completeness" or "to round out the pyramid" — if a contract, component, or resilience test can cover a behavior, that's where it goes.
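The gate rule stated at the start of this section can be sketched as a small decision function. This is an illustration only; `GATE_TYPES`, `gate_decision`, and the returned labels are hypothetical names, and `requires` stands for whatever a suite needs configured to run (database URL, broker, downstream service, secrets).

```python
# The four test types the text allows in the pre-merge gate.
GATE_TYPES = {"static analysis", "unit", "component", "contract"}


def gate_decision(test_type: str, requires: list) -> str:
    """Decide whether a suite may block a merge."""
    if test_type not in GATE_TYPES:
        # Integration and E2E never gate a merge.
        return "off-gate"
    if requires:
        # Labeled as a gate type but needs real dependencies: mis-typed.
        return "mis-typed: re-classify or convert"
    return "pre-merge gate"
```

For example, `gate_decision("unit", ["DB URL"])` reports a mis-typed suite rather than admitting it to the gate.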

Run CI without configuring dependencies¶

The component test is the workhorse of a CD gate. The pattern is consistent across every component type:

Assemble the real component — actual handlers, domain logic, orchestration — in-process.
Replace only what the team doesn't control with in-memory doubles: in-memory repository for the database, in-memory bus for the broker, stubbed adapter for downstream services, injected fixed clock.
Drive it through its public interface — HTTP handlers, message handler, job entrypoint, UI via a real browser with the network stubbed.
Assert observable outcomes — status, persisted state, emitted event, rendered output — never internal call sequences.

The result: fast (no I/O), deterministic (no real systems, controlled clock), zero configuration of the surrounding system — while still validating real behavior end-to-end within the component boundary.
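A minimal sketch of the four-part pattern above, assuming a hypothetical `OrderService` component. None of these class names come from the source; they only illustrate assembling the real logic in-process, doubling the database, broker, and clock, driving the public handler, and asserting observable outcomes.

```python
from datetime import datetime, timezone


class InMemoryOrderRepository:
    """Double for the database: same interface, dict-backed."""
    def __init__(self):
        self.rows = {}

    def save(self, order_id, row):
        self.rows[order_id] = row


class InMemoryBus:
    """Double for the broker: records published events."""
    def __init__(self):
        self.published = []

    def publish(self, topic, payload):
        self.published.append((topic, payload))


class FixedClock:
    """Injected clock so timestamps are deterministic."""
    def __init__(self, now):
        self._now = now

    def now(self):
        return self._now


class OrderService:
    """The real component: validation and orchestration run in-process."""
    def __init__(self, repo, bus, clock):
        self.repo, self.bus, self.clock = repo, bus, clock

    def handle_create(self, request):
        # Public interface: takes a request dict, returns (status, body).
        if request.get("quantity", 0) <= 0:
            return 422, {"error": "quantity must be positive"}
        order_id = request["order_id"]
        row = {"quantity": request["quantity"], "created_at": self.clock.now()}
        self.repo.save(order_id, row)
        self.bus.publish("order.created", {"order_id": order_id})
        return 201, {"order_id": order_id}


def test_create_order_persists_and_emits_event():
    repo, bus = InMemoryOrderRepository(), InMemoryBus()
    clock = FixedClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
    service = OrderService(repo, bus, clock)

    status, body = service.handle_create({"order_id": "o-1", "quantity": 2})

    # Observable outcomes only: status, persisted state, emitted event.
    assert status == 201
    assert body == {"order_id": "o-1"}
    assert repo.rows["o-1"] == {"quantity": 2, "created_at": clock.now()}
    assert bus.published == [("order.created", {"order_id": "o-1"})]


def test_invalid_order_leaves_no_state():
    repo, bus = InMemoryOrderRepository(), InMemoryBus()
    clock = FixedClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
    service = OrderService(repo, bus, clock)

    status, _ = service.handle_create({"order_id": "o-2", "quantity": 0})

    assert status == 422
    assert repo.rows == {} and bus.published == []
```

Neither test inspects which methods were called or in what order, so the internals of `OrderService` can be restructured without touching the tests.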

The adapter rule¶

Wrap every third-party client (SDK, HTTP client, broker client, DB driver) in a thin adapter the team owns. Double the adapter in component tests — never mock the third-party SDK directly. Adapter integration tests then exercise the real adapter against a real container to confirm the adapter's correctness.
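One way the adapter rule might look in code, as a sketch rather than a prescribed design. `PaymentGateway`, `VendorPaymentAdapter`, `FakePaymentGateway`, and `checkout` are hypothetical names, and `create_charge` is an invented method standing in for a vendor SDK call, not the API of any real library.

```python
from typing import Protocol


class PaymentGateway(Protocol):
    """The team-owned port: the only payment API the component sees."""
    def charge(self, customer_id: str, amount_cents: int) -> str: ...


class VendorPaymentAdapter:
    """Thin adapter over a third-party SDK client.

    `create_charge` is a made-up vendor method used for illustration only.
    This class is what an adapter integration test would exercise.
    """
    def __init__(self, vendor_client):
        self._client = vendor_client

    def charge(self, customer_id: str, amount_cents: int) -> str:
        # Translate the team's vocabulary into the vendor's call and back.
        response = self._client.create_charge(
            customer=customer_id, amount=amount_cents, currency="usd"
        )
        return response["id"]


class FakePaymentGateway:
    """In-memory double of the adapter, used by component tests."""
    def __init__(self):
        self.charges = []

    def charge(self, customer_id: str, amount_cents: int) -> str:
        self.charges.append((customer_id, amount_cents))
        return f"fake-charge-{len(self.charges)}"


def checkout(gateway: PaymentGateway, customer_id: str, amount_cents: int) -> dict:
    """Component code depends on the port, never on the vendor SDK."""
    charge_id = gateway.charge(customer_id, amount_cents)
    return {"status": "paid", "charge_id": charge_id}


gateway = FakePaymentGateway()
receipt = checkout(gateway, "cust-42", 1999)
```

Because `checkout` only knows the port, swapping vendors changes `VendorPaymentAdapter` and its integration test while the component tests stay as they are.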

Do not depend on provider cooperation¶

Consumer-driven contract verification, where the provider runs your contract in their pipeline, only works with close collaboration and enforced tooling. Assume you do not have that. The defense you own:

Contract tests (pre-merge) pin the request you send and the response shape you depend on, against the adapter double.
Scheduled provider-contract verification in a test environment — you run your pinned contract against the provider's real non-prod endpoint on a schedule, out-of-band, owned by your team. This detects a provider break when it happens, not at your next unrelated deploy.
Resilience component tests (pre-merge) verify the consumer survives a broken contract: timeouts are enforced, retries and circuit breakers behave, malformed responses are handled, the caller gets a documented response with no partial state.

Provider-side verification of your contract is a bonus if they offer it — not the mechanism to rely on.
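The resilience tests described above could look like the following sketch. Everything here is hypothetical: `place_order` stands in for a consumer with a bounded retry policy, `StubInventoryAdapter` is an adapter double scripted with outcomes, and the status codes are example "documented responses", not values taken from the source.

```python
class ProviderTimeout(Exception):
    """Raised by the adapter when the provider exceeds the time budget."""


class StubInventoryAdapter:
    """Adapter double scripted with a list of outcomes, consumed in order."""
    def __init__(self, outcomes):
        self._outcomes = list(outcomes)
        self.calls = 0

    def reserve(self, sku):
        self.calls += 1
        outcome = self._outcomes.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def place_order(adapter, orders, sku, max_attempts=2):
    """Consumer under test: bounded retries, validation, no partial state."""
    for _ in range(max_attempts):
        try:
            reply = adapter.reserve(sku)
        except ProviderTimeout:
            continue  # retry within the attempt budget
        if not isinstance(reply, dict) or "reservation_id" not in reply:
            # Malformed response: fail closed, do not persist anything.
            return 502, {"error": "inventory provider returned an invalid response"}
        orders.append({"sku": sku, "reservation_id": reply["reservation_id"]})
        return 201, {"reservation_id": reply["reservation_id"]}
    return 503, {"error": "inventory provider unavailable"}


def test_timeouts_exhaust_retries_without_partial_state():
    adapter = StubInventoryAdapter([ProviderTimeout(), ProviderTimeout()])
    orders = []
    result = place_order(adapter, orders, "sku-1")
    assert result == (503, {"error": "inventory provider unavailable"})
    assert adapter.calls == 2
    assert orders == []


def test_retry_recovers_after_one_timeout():
    adapter = StubInventoryAdapter([ProviderTimeout(), {"reservation_id": "r-9"}])
    orders = []
    assert place_order(adapter, orders, "sku-1") == (201, {"reservation_id": "r-9"})
    assert orders == [{"sku": "sku-1", "reservation_id": "r-9"}]


def test_malformed_response_is_rejected():
    adapter = StubInventoryAdapter([{"unexpected": True}])
    orders = []
    status, _ = place_order(adapter, orders, "sku-1")
    assert status == 502
    assert orders == []
```

All three tests run against the adapter double, so they stay deterministic and can sit in the pre-merge gate even though they describe provider failures.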

Baseline before refactor (legacy code)¶

Legacy code is code without tests (regardless of age). When a component is poorly tested, do not lead with refactoring:

Find the testable seams — places where behavior can be observed or substituted without editing the code (HTTP handler, CLI entrypoint, message handler, exported function, existing injection points).
Write the best outside-in tests achievable now, without refactoring — characterization tests at the outermost reachable seam that lock in current behavior. This is a behavior baseline, not yet a clean gate.
Get the baseline green — your safety net.
Refactor to improve testability under green — introduce adapters and seams, push checks down to deterministic component/unit tests. Never change behavior and structure in the same step.
Let the domain guide the target — the domain-driven-design and domain-analysis skills suggest where boundaries and seams should land.

The mechanics live in the legacy-code skill; this workflow places it in the CD test architecture. An assessment of an under-tested component therefore returns two things: the outside-in baseline writable today, and the refactor sequence that improves testability afterward.
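A characterization test in this sense pins what the code does today, not what it ought to do. The sketch below is illustrative only: `legacy_shipping_fee` is an invented stand-in for untested legacy code, and the `RECORDED` values are what that function currently returns.

```python
def legacy_shipping_fee(weight_kg, country):
    """Stand-in for untested legacy code; left exactly as found."""
    fee = 5.0
    if weight_kg > 10:
        fee += (weight_kg - 10) * 0.5
    if country != "US":
        fee = fee * 2
    if country == "CA" and weight_kg > 10:
        fee -= 1  # unexplained discount; preserved, not "fixed"
    return round(fee, 2)


# Outputs recorded by running the current code, not derived from a spec.
RECORDED = [
    ((1, "US"), 5.0),
    ((12, "US"), 6.0),
    ((12, "DE"), 12.0),
    ((12, "CA"), 11.0),
]


def test_characterize_shipping_fee():
    for args, expected in RECORDED:
        assert legacy_shipping_fee(*args) == expected, args
```

Once this baseline is green, the function can be split or wrapped in an adapter, and any accidental behavior change (including to the odd discount) shows up as a failing case.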

Sample Invocations¶

# Full application assessment
/cd-test-architecture ./src

# Scope to one component
/cd-test-architecture ./src --component payment-service

# Include existing CI config in the assessment
/cd-test-architecture ./src --ci .github/workflows/ci.yml

# Application tested primarily via Postman collections
/cd-test-architecture ./src --external-tests ./test-collections/api-tests.postman_collection.json

# Application tested in another repo
/cd-test-architecture ./src --external-tests "../qa-repo/e2e/payment-service"

# Per-file / changeset review + suite-wide Farley Score (current working tree or staged changes)
/test-design

# Per-file review scoped to a directory
/test-design --path src/payments

# Per-file review of changes since a branch
/test-design --since main

# Force the advisor to run (also auto-triggers when production code has few/no tests)
/test-design --advise

# Unit/module design advice (advisory — does not write tests)
/test-design-advisor src/payments/PaymentProcessor.ts

# Strategic suite-wide audit — delegates to cd-test-architecture, /test-design, and mutation-testing
/test-health
/test-health --path src/payments

# Assertion strength on critical-logic modules
/mutation-testing

Reference Files¶

File	What it defines
`agents/qa-engineer.md`	The Senior SDET agent that routes strategic test requests to these skills
`agents/test-review.md`	The tactical per-file test-quality review agent
`agents/test-smell-review.md`	The smell-detection review agent
`knowledge/cd-test-architecture.md`	Six MinimumCD test types, the pre-merge gate rule, out-of-repo anti-pattern, component test pattern, adapter rule, double validation, determinism techniques
`knowledge/component-test-patterns.md`	Per-component patterns: UI, API Provider, API Consumer, Event Consumer, Event Producer, Stateful Service, CLI/Library, Scheduled Job
`knowledge/database-test-patterns.md`	Database test isolation + teardown: Database Sandbox, Transaction Rollback / Table Truncation Teardown, Fake-first rule for data-logic tests
`knowledge/dependency-breaking-techniques.md`	Feathers' full 24-technique catalog for getting legacy code under test (behavior-preserving seams, seam type + risk)
`knowledge/legacy-test-strategy.md`	Where to test legacy code: effect reasoning, effect sketches, interception/pinch points; plus editing-safety techniques
`knowledge/microservice-testing.md`	Contract and CDC testing across independently-deployable services
`knowledge/test-automation-maturity.md`	Maturity ladder consumed by `test-health` for the strategic rollup
`knowledge/test-automation-principles.md`	Goals + named Principles of Test Automation — the rubric for why a test is good or bad; grounds smell severity
`knowledge/test-doubles.md`	Dummy / stub / spy / mock / fake selection, Configurable vs. Hard-Coded form, Test-Specific Subclass, state-vs-behavior verification
`knowledge/test-matrix-examples/`	Worked, stack-specific placement matrices the advisor adapts (Spring Boot, Django batch, React/Node SPA, SSR + HTMX, .NET API fronting gRPC)
`knowledge/test-pyramid.md`	Pyramid layer responsibilities and shape anti-patterns
`knowledge/test-smells.md`	xUnit smell taxonomy: code, behavior, and project smells
`knowledge/testing-quadrants.md`	Agile Testing Quadrants — what each quadrant protects; consumed by `test-health`
`knowledge/value-patterns.md`	Test-data sourcing: Literal / Derived / Generated Value + Dummy Object
`skills/cd-test-architecture/SKILL.md`	The application-level assessment skill
`skills/domain-driven-design/SKILL.md`	Suggests target boundaries/seams for the post-baseline refactor
`skills/legacy-code/SKILL.md`	Characterization testing + dependency-breaking: the baseline-before-refactor procedure
`skills/mutation-testing/SKILL.md`	Assertion-strength check (do tests catch real bugs?); folded into `test-health`
`skills/test-design/SKILL.md`	The `/test-design` orchestrator skill — dispatches review agents, scores with Farley, optionally invokes the advisor
`skills/test-design-advisor/SKILL.md`	The unit/module design advisor skill
`skills/farley-score/SKILL.md`	Farley Score — Dave Farley's 8 properties scored 1–10, called by `/test-design` Step 3
`skills/test-health/SKILL.md`	Strategic suite-wide rollup; delegates to `cd-test-architecture`, `/test-design`, `mutation-testing`