Battle-test, challenge, translate, verify: an emergent SOP from five workstreams

This round of five workstreams revealed a reusable standard operating procedure (SOP). Nobody designed it in advance; it grew out of practice.

Dual-path entry: explore or execute

Every workstream starts at one of two entry points, depending on how well the problem is understood.

Unknown territory                    Known territory
       │                                    │
       ▼                                    ▼
  Battle-test                          Thesis directly
  (ws-009)                             (ws-013)
       │                                    │
       ▼                                    │
  Observe failure → Form thesis             │
       │                                    │
       └──────────────────┬─────────────────┘
                          ▼
                      Challenge

Battle-test first. ws-009 started by building a deliberately fragile coding agent. Before forming the thesis that became ws-011, it observed three specific failure modes: every “hi” cost 5 LLM calls, error handling lived exclusively in CALL_LLM, and JSON contracts were a poor fit for conversational turns.

Thesis first. ws-013 started from a known gap: the ToolPipeline protocol existed, but nothing consumed it yet. The problem was clear, so no battle-test was needed.

Rule of thumb: If you are not sure where the architecture will break, battle-test first. If the problem is already well-defined, skip straight to thesis.
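
The rule of thumb amounts to a single decision at the start of a workstream. A minimal sketch, assuming hypothetical names (`Entry`, `choose_entry`, `problem_well_defined`) that do not come from the workstreams themselves:

```python
from enum import Enum


class Entry(Enum):
    BATTLE_TEST = "battle-test"  # unknown territory: build something fragile and watch it fail
    THESIS = "thesis"            # known territory: state the thesis directly


def choose_entry(problem_well_defined: bool) -> Entry:
    """Pick the entry point; both paths converge on Challenge afterwards."""
    return Entry.THESIS if problem_well_defined else Entry.BATTLE_TEST


# ws-009 had not yet seen where the architecture breaks; ws-013 had a clear problem.
print(choose_entry(problem_well_defined=False).value)  # battle-test
print(choose_entry(problem_well_defined=True).value)   # thesis
```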

Challenge: oracle reads source, gives line-level verdict

The thesis enters a formal challenge. The oracle receives the thesis plus the critical source code paths and returns a structured verdict.

Thesis → [dispatch oracle with: thesis + key source paths]
                │
                ▼
          Oracle output:
          - Premise-by-premise: HOLDS / BREAKS (with line numbers)
          - Where It Breaks (specific code paths)
          - Blind Spots (what the thesis didn't cover)
          - Category Assessment (demo vs production)
                │
                ▼
          challenge.md
                │
          breaks > 0  →  Response phase
          breaks = 0  →  Contract phase

ws-011’s challenge produced an 11-page oracle verdict identifying 5 breaks and 7 blind spots. ws-010’s challenge showed that the direction was fundamentally infeasible, so its contract was never written.

What makes this work: The oracle is not asked “is this design good?” It is given source code and asked “prove why and where it breaks.” Line-number citations mean the response can target fixes precisely.
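
One way to picture the verdict and the routing rule is as a small data structure. This is a sketch under assumptions, not the format the oracle actually emits: the class names, the `citation` string format, and the example premises are all hypothetical. It rejects a BREAKS verdict that carries no line citation, since such a verdict cannot be targeted by a fix.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    HOLDS = "HOLDS"
    BREAKS = "BREAKS"


@dataclass
class PremiseVerdict:
    premise: str
    status: Status
    citation: str = ""  # hypothetical format, e.g. "some/file.py:120-134"


@dataclass
class OracleVerdict:
    premises: list[PremiseVerdict]
    blind_spots: list[str] = field(default_factory=list)
    category: str = "demo"  # category assessment: "demo" or "production"

    def breaks(self) -> list[PremiseVerdict]:
        return [p for p in self.premises if p.status is Status.BREAKS]


def next_phase(verdict: OracleVerdict) -> str:
    """Route as in the diagram: any break goes to Response, none goes to Contract."""
    for p in verdict.breaks():
        if not p.citation:
            raise ValueError(f"BREAKS without a citation: {p.premise!r}")
    return "response" if verdict.breaks() else "contract"


# Illustrative premises only; they are not taken from any real challenge.md.
verdict = OracleVerdict(
    premises=[
        PremiseVerdict("retries are handled centrally", Status.BREAKS, "some/file.py:88"),
        PremiseVerdict("tools are stateless", Status.HOLDS),
    ],
    blind_spots=["no timeout policy"],
)
print(next_phase(verdict))  # response
```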

Response → Synthesis → Contract: three layers of translation

A single jump from “challenge found problems” to “start building” is where most processes collapse. The SOP inserts two intermediate translations.

Challenge breaks + blind spots
         │
         ▼
    Response.md                      Each gap gets a solution.
    (tactical layer)                 Specific schemas, signatures, contracts.
         │
         ▼
    Synthesis.md                     Extract shared patterns from
    (architectural layer)            the solutions. Check internal
                                     consistency across all responses.
         │
         ▼
    Contract.md                      Verifiable scope: file boundaries,
    (execution layer)                acceptance criteria with before/action/after.

Layer          Artifact      Purpose
Tactical       Response.md   Per-break design decisions, concrete schema/signatures
Architectural  Synthesis.md  Design philosophy extraction, cross-response consistency check
Execution      Contract.md   File boundaries, verifiable acceptance criteria

ws-011’s Response resolved all 5 breaks and 7 blind spots into 11 protocol definitions. Synthesis checked internal consistency, specifically whether the Policy-at-kernel vs kernel/tool abstraction tradeoff was handled correctly. Contract produced 12 verifiable acceptance criteria (ACs).

Why the layering works: Response ensures every identified problem has a concrete solution. Synthesis ensures those solutions don’t contradict each other. Contract ensures the scope is bounded and testable.
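
Each layer's guarantee can be expressed as a mechanical check. The sketch below is illustrative only: the function names, the finding ids, and the idea of comparing signatures as plain strings are assumptions, and the real Synthesis review is a design judgment that a string comparison only approximates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceptanceCriterion:
    ac_id: str
    before: str  # observable state before the action
    action: str  # what is run or changed
    after: str   # observable state that proves the criterion


def unresolved(findings: set[str], responses: dict[str, str]) -> set[str]:
    """Response check: every break and blind spot needs a concrete solution."""
    return {f for f in findings if not responses.get(f, "").strip()}


def conflicts(definitions: list[tuple[str, str]]) -> set[str]:
    """Synthesis check (simplified): one name defined with two different signatures."""
    seen: dict[str, str] = {}
    clashing: set[str] = set()
    for name, signature in definitions:
        if seen.setdefault(name, signature) != signature:
            clashing.add(name)
    return clashing


def untestable(criteria: list[AcceptanceCriterion]) -> list[str]:
    """Contract check: an AC missing before, action, or after is not verifiable."""
    return [c.ac_id for c in criteria if not (c.before and c.action and c.after)]


# Hypothetical inputs for illustration.
findings = {"break-1", "break-2", "blind-spot-1"}
responses = {"break-1": "add a retry schema", "break-2": ""}
print(sorted(unresolved(findings, responses)))  # ['blind-spot-1', 'break-2']
```

An empty result from all three checks is the point at which the Contract can be handed to Build.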

Build → Audit → Retry: dual-verification loop

Execution runs inside a two-role verification loop. Build and Audit are never the same agent.

Contract
    │
    ▼
Build                          Produces handoff.md + code
    │
    ▼
Audit (independent role)       Reads contract + handoff + source
    │                           Verifies every AC
    ├─ PASS + deeper insight → Spiral turn (depth++)
    ├─ PASS + no insight     → Archive
    └─ FAIL                  → Retry (attempt++, back to Build)
                                   │
                                   ▼
                              Re-Audit

ws-011’s attempt 0 returned FAIL: 4 ACs passed, 8 failed. Every failure was the same class: “test does not exist” or “test was not updated.” Attempt 1 fixed the P0 (highest-priority) issues: it removed a forbidden import, added deprecation warnings, and built the missing tests. It then passed 12/12.

Why the loop holds: Build and Audit are different roles with different incentives. Build naturally believes “it’s done.” Audit reads only the contract and the evidence — promises are not evidence. ws-011 attempt 0’s handoff claimed “all ACs ready to test.” The audit report said: “zero tests were updated.”
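
The loop can be sketched as a small driver in which Audit scores only the evidence and never the claim. Everything here is an assumption made for illustration: the `Handoff` and `AuditReport` shapes, the `max_attempts` cap (the workstreams do not state one), and which 4 of the 12 ACs have evidence on attempt 0.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Handoff:
    claim: str          # what Build says about its own work
    evidence: set[str]  # AC ids that actually have a passing test


@dataclass
class AuditReport:
    results: dict[str, bool]  # AC id -> verified from evidence
    deeper_insight: bool = False

    @property
    def passed(self) -> bool:
        # An empty report verifies nothing, so it does not count as a pass.
        return bool(self.results) and all(self.results.values())


def audit(contract_acs: list[str], handoff: Handoff) -> AuditReport:
    # The claim is deliberately ignored: promises are not evidence.
    return AuditReport({ac: ac in handoff.evidence for ac in contract_acs})


def run_loop(
    contract_acs: list[str],
    build: Callable[[int], Handoff],
    max_attempts: int = 3,  # assumed cap; the source does not state one
) -> tuple[str, int]:
    """Return (outcome, attempt); outcome is 'spiral', 'archive', or 'fail'."""
    for attempt in range(max_attempts):
        report = audit(contract_acs, build(attempt))
        if report.passed:
            return ("spiral" if report.deeper_insight else "archive", attempt)
    return ("fail", max_attempts - 1)


acs = [f"AC-{i}" for i in range(1, 13)]


def build(attempt: int) -> Handoff:
    # Attempt 0 claims readiness but backs only 4 of 12 ACs; attempt 1 backs all 12.
    evidence = set(acs[:4]) if attempt == 0 else set(acs)
    return Handoff(claim="all ACs ready to test", evidence=evidence)


print(run_loop(acs, build))  # ('archive', 1)
```

Keeping `audit` as a separate function that never reads `claim` mirrors the separation of roles: the same handoff text fails on attempt 0 and passes on attempt 1 purely because the evidence changed.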

What it adds up to

Move                             What it prevents
Battle-test                      Building on an unvalidated assumption
Challenge                        Letting a weak thesis enter execution
Response → Synthesis → Contract  Jumping from critique straight to code
Build → Audit → Retry            Self-approval without independent verification

None of this was designed upfront. It crystallized across five workstreams because each missing step hurt. The SOP is just a record of which gaps were painful enough to fix.