This round of five workstreams revealed a reusable SOP. Not pre-designed — grown from practice.
Dual-path entry: explore or execute
Every workstream starts at one of two entry points, depending on how well the problem is understood.
    Unknown territory              Known territory
            │                             │
            ▼                             ▼
       Battle-test                 Thesis directly
        (ws-009)                      (ws-013)
            │                             │
            ▼                             │
    Observe failure → Form thesis         │
            │                             │
            └──────────┬──────────────────┘
                       ▼
                   Challenge
Battle-test first. ws-009 built a deliberately fragile coding agent and observed three specific failure modes before forming the thesis that became ws-011: every “hi” cost 5 LLM calls, error handling lived exclusively in CALL_LLM, and JSON contracts were a poor fit for conversational turns.
Thesis first. ws-013 knew that the ToolPipeline protocol existed but had never been consumed. The problem was clear, so no battle-test was needed.
Rule of thumb: If you are not sure where the architecture will break, battle-test first. If the problem is already well-defined, skip straight to thesis.
Challenge: oracle reads source, gives line-level verdict
The thesis enters a formal challenge. The oracle receives the thesis plus the critical source code paths and returns a structured verdict.
    Thesis → [dispatch oracle with: thesis + key source paths]
                    │
                    ▼
    Oracle output:
      - Premise-by-premise: HOLDS / BREAKS (with line numbers)
      - Where It Breaks (specific code paths)
      - Blind Spots (what the thesis didn't cover)
      - Category Assessment (demo vs production)
                    │
                    ▼
              challenge.md
                    │
          breaks > 0 → Response phase
          breaks = 0 → Contract phase
ws-011’s challenge produced an 11-page oracle verdict identifying 5 breaks and 7 blind spots. ws-010’s challenge proved the direction was fundamentally infeasible, so the contract was never written.
What makes this work: The oracle is not asked “is this design good?” It is given source code and asked “prove why and where it breaks.” Line-number citations mean the response can target fixes precisely.
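The verdict structure and the routing rule from the diagram can be sketched as data. This is a minimal illustration, not the oracle's actual output format: the class names, the `path:line-range` citation format, and the sample premises are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    HOLDS = "HOLDS"
    BREAKS = "BREAKS"


@dataclass
class PremiseVerdict:
    premise: str
    verdict: Verdict
    # Hypothetical citation format, e.g. "kernel.py:88-102".
    citations: list[str] = field(default_factory=list)


@dataclass
class Challenge:
    premises: list[PremiseVerdict]
    blind_spots: list[str] = field(default_factory=list)
    category: str = "demo"  # "demo" or "production"

    def breaks(self) -> list[PremiseVerdict]:
        return [p for p in self.premises if p.verdict is Verdict.BREAKS]

    def uncited_breaks(self) -> list[PremiseVerdict]:
        # A BREAKS verdict without line numbers cannot be targeted by a fix.
        return [p for p in self.breaks() if not p.citations]

    def next_phase(self) -> str:
        # Routing rule from the diagram: breaks > 0 -> Response, else Contract.
        return "response" if self.breaks() else "contract"


# Placeholder premises and file name, purely illustrative.
challenge = Challenge(
    premises=[
        PremiseVerdict("Premise A (placeholder)", Verdict.HOLDS),
        PremiseVerdict("Premise B (placeholder)", Verdict.BREAKS, ["kernel.py:88-102"]),
    ],
    blind_spots=["Blind spot (placeholder)"],
)
print(challenge.next_phase())
```

Checking `uncited_breaks()` before accepting a challenge.md is one way to enforce the line-number requirement: a break the oracle cannot point to is a break the response cannot fix precisely.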
Response → Synthesis → Contract: three layers of translation
A single jump from “challenge found problems” to “start building” is where most processes collapse. The SOP inserts two intermediate translations.
    Challenge breaks + blind spots
            │
            ▼
       Response.md              Each gap gets a solution.
       (tactical layer)         Specific schemas, signatures, contracts.
            │
            ▼
       Synthesis.md             Extract shared patterns from
       (architectural layer)    the solutions. Check internal
                                consistency across all responses.
            │
            ▼
       Contract.md              Verifiable scope: file boundaries,
       (execution layer)        acceptance criteria with before/action/after.
| Layer | Artifact | Purpose |
|---|---|---|
| Tactical | Response.md | Per-break design decisions, concrete schema/signatures |
| Architectural | Synthesis.md | Design philosophy extraction, cross-response consistency check |
| Execution | Contract.md | File boundaries, verifiable acceptance criteria |
ws-011’s Response resolved all 5 breaks and 7 blind spots into 11 protocol definitions. Synthesis checked internal consistency — specifically whether the Policy-at-kernel vs kernel/tool abstraction tradeoff was handled correctly. Contract produced 12 verifiable acceptance criteria.
Why the layering works: Response ensures every identified problem has a concrete solution. Synthesis ensures those solutions don’t contradict each other. Contract ensures the scope is bounded and testable.
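Two of these guarantees are mechanical enough to check in code: every gap has a response, and every acceptance criterion carries its before/action/after triple. The sketch below assumes hypothetical gap IDs and placeholder text. It does not model the Synthesis consistency check, which requires reading the solutions against each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriterion:
    ac_id: str
    before: str  # observable state before the change
    action: str  # the step that is performed
    after: str   # observable state that must hold afterwards

    def is_verifiable(self) -> bool:
        # An AC missing any of the three parts cannot be audited.
        return all(part.strip() for part in (self.before, self.action, self.after))


def unresolved_gaps(gaps: set[str], responses: dict[str, str]) -> set[str]:
    """Breaks or blind spots that have no non-empty entry in the response."""
    return {g for g in gaps if not responses.get(g, "").strip()}


# Hypothetical gap IDs and placeholder solution text.
gaps = {"break-1", "break-2", "blind-spot-1"}
responses = {
    "break-1": "solution text (placeholder)",
    "blind-spot-1": "solution text (placeholder)",
}
ac = AcceptanceCriterion(
    "AC-1",
    before="state before (placeholder)",
    action="step performed (placeholder)",
    after="expected state (placeholder)",
)

missing = unresolved_gaps(gaps, responses)
contract_ready = not missing and ac.is_verifiable()
print(sorted(missing), contract_ready)
```

Here `break-2` has no response, so the contract is not ready even though the single AC is well-formed.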
Build → Audit → Retry: dual-verification loop
Execution runs inside a two-role verification loop. Build and Audit are never the same agent.
    Contract
       │
       ▼
    Build                       Produces handoff.md + code
       │
       ▼
    Audit (independent role)    Reads contract + handoff + source
       │                        Verifies every AC
       ├─ PASS + deeper insight → Spiral turn (depth++)
       ├─ PASS + no insight     → Archive
       └─ FAIL                  → Retry (attempt++, back to Build)
                                      │
                                      ▼
                                   Re-Audit
ws-011’s attempt 0 returned FAIL: 4 ACs passed, 8 failed. Every failure was the same class: “test does not exist” or “test was not updated.” Attempt 1 fixed the P0s (removed a forbidden import, added deprecation warnings, built the missing tests) and passed 12/12.
Why the loop holds: Build and Audit are different roles with different incentives. Build naturally believes “it’s done.” Audit reads only the contract and the evidence — promises are not evidence. ws-011 attempt 0’s handoff claimed “all ACs ready to test.” The audit report said: “zero tests were updated.”
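The loop itself is small enough to sketch. In this hedged version, `build` and `audit` are separate callables so neither can approve its own work, and the audit sees only the contract's ACs plus the evidence Build produced. The `max_attempts` cap and the `"escalate"` outcome are assumptions added for the example; the workstreams do not specify a retry limit. The toy data mirrors the shape of ws-011's 4-then-12 result, and the final `archive` outcome is just the toy's default.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AuditReport:
    passed: set[str]
    failed: set[str]
    deeper_insight: bool = False

    @property
    def ok(self) -> bool:
        return not self.failed


def run_loop(
    acs: set[str],
    build: Callable[[int], set[str]],
    audit: Callable[[set[str], set[str]], AuditReport],
    max_attempts: int = 3,  # assumed cap, not part of the original SOP
) -> tuple[str, int]:
    """Return (outcome, attempt). `build` returns the evidence it produced."""
    for attempt in range(max_attempts):
        evidence = build(attempt)
        report = audit(acs, evidence)  # audit never sees Build's claims
        if report.ok:
            return ("spiral" if report.deeper_insight else "archive", attempt)
    return ("escalate", max_attempts - 1)


acs = {f"AC-{i}" for i in range(1, 13)}


def toy_build(attempt: int) -> set[str]:
    # Attempt 0 has tests for only 4 ACs; attempt 1 covers all 12.
    return {f"AC-{i}" for i in range(1, 5)} if attempt == 0 else set(acs)


def toy_audit(contract_acs: set[str], evidence: set[str]) -> AuditReport:
    # An AC passes only if a test for it actually exists in the evidence.
    return AuditReport(passed=contract_acs & evidence, failed=contract_acs - evidence)


print(run_loop(acs, toy_build, toy_audit))  # ('archive', 1)
```

The handoff's claim that everything is ready never enters `toy_audit`. Only the set of tests that exist does, which is the sense in which promises are not evidence.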
What it adds up to
| Move | What it prevents |
|---|---|
| Battle-test | Building on an unvalidated assumption |
| Challenge | Letting a weak thesis enter execution |
| Response → Synthesis → Contract | Jumping from critique straight to code |
| Build → Audit → Retry | Self-approval without independent verification |
None of this was designed upfront. It crystallized across five workstreams because each missing step hurt. The SOP is just a record of which gaps were painful enough to fix.