The pipeline was doing too much work
agents-stack v1.6.3 shipped with 8 distinct LLM worker phases. Each phase had its own standalone SKILL.md, a bounded context window, and a formal handoff protocol. The assumption was that each stage of an agentic workflow (brainstorm, propose, review, execute, audit, reconcile) deserves its own dedicated agent with its own prompt and its own tool wall.
That assumption is wrong. Some stages don’t need an LLM at all. Some pairs of stages are the same thing split in half. And some problems only become visible when you stop treating the pipeline as a linear assembly line.
v2.0 tears 16 phases (8 worker phases + 8 infrastructure phases) down to 7. The reduction is not from cutting features. It is from removing ceremony.
What we removed and why
| v1.6.3 | v2.0 | Verdict |
|---|---|---|
| project-initializer (187-line LLM prompt) | init.sh (bash script) | Doesn’t need AI: it creates directories |
| generator-brainstorm + generator-proposal | thesis | Two phases doing two halves of the same thing |
| evaluator-contract-review | challenge | Same adversarial gate, but delegates judgment to oracle/council |
| generator-execution | contract + build | One phase was doing two distinct mental operations |
| adversarial-live-review (449 lines) | audit (107 lines) | Same invariant, 76% shorter |
| state-update (419 lines) | Orchestrator routing logic | Mechanical state sync doesn’t need an LLM |
| compound-capture | plan.md Lessons section | Cross-sprint learning is a paragraph, not a phase |
The biggest deletion is project-initializer. v1.6.3 used a full LLM worker to create directories and write default JSON. Three thousand tokens for mkdir and template writes. v2.0 runs a bash script.
state-update was 419 lines of SKILL.md dedicated to: read a verdict, update JSON fields, move files to an archive folder. All mechanical. All deterministic. All now embedded in the orchestrator’s post-audit routing logic — a few if branches, not an agent.
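To make "a few if branches" concrete, here is a minimal sketch of what such post-audit routing could look like. It is not the orchestrator's actual code: the function name `route_after_audit`, the state fields `status` and `next_phase`, the verdict strings, and the choice to send a FAIL back to build are all assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path
from typing import Optional


def route_after_audit(verdict: str, state_path: Path,
                      artifact: Path, archive_dir: Path) -> Optional[str]:
    """Sync state after an audit verdict and return the next phase name.

    Hypothetical sketch: field names and verdict strings are assumptions.
    Returns None when there is nothing left to run.
    """
    state = json.loads(state_path.read_text())

    if verdict == "PASS":
        # Accepted work: move the artifact into the archive folder.
        archive_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(artifact), str(archive_dir / artifact.name))
        state["status"] = "done"
        next_phase = None
    elif verdict == "FAIL":
        # Assumption: a failed audit routes back to the build phase.
        state["status"] = "rework"
        next_phase = "build"
    else:
        raise ValueError(f"unknown verdict: {verdict!r}")

    state["next_phase"] = next_phase
    state_path.write_text(json.dumps(state, indent=2))
    return next_phase
```

Nothing in this function requires judgment: every branch is a lookup on the verdict followed by a file operation, which is the argument for keeping it out of an agent.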
The new architecture: Direction → Method → Action
v1.6.3 was a flat pipeline. Every phase ran in sequence. Every phase had the same structural weight. If execution failed, you retried the same contract. If review found a defect, you fixed the code and ran review again.
v2.0 replaces this with three layers and a spiral:
Direction Layer: thesis → challenge
Method Layer: response → synthesis ★ new
Action Layer: contract → build → audit
↓ (deeper insight found?)
return to thesis (depth++) ★ new
Two new capabilities that v1.6.3 never had:
Method Layer: In v1.6.3, the gap between “challenge found a problem” and “here’s the build contract” was invisible. The pipeline jumped straight from adversarial review to execution. v2.0 inserts response → synthesis, a design stage for deciding how to respond to gaps revealed by the challenge. You can change your approach before you commit to a contract.
Spiral turn: In v1.6.3, audit could only say PASS or FAIL. v2.0’s audit can also surface deeper insight: “this result reveals a problem with the underlying thesis.” That triggers a return to the Direction Layer at increased depth: reform the proposition, re-challenge, re-build. This is the difference between fixing bugs and discovering that the bugs trace back to a wrong assumption.
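The layered order and the spiral turn can be pictured as a small state machine. The sketch below is only an illustration under stated assumptions: the `Cursor` type, the `advance` function, and the verdict strings `"PASS"`, `"FAIL"`, and `"INSIGHT"` are hypothetical names, and sending a FAIL back to build is a guess rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional

# Phase order taken from the three layers described above.
LAYERS = {
    "direction": ["thesis", "challenge"],
    "method": ["response", "synthesis"],
    "action": ["contract", "build", "audit"],
}
PHASES = [phase for layer in LAYERS.values() for phase in layer]


@dataclass
class Cursor:
    phase: str = "thesis"
    depth: int = 0  # incremented on every spiral turn


def advance(cursor: Cursor, audit_verdict: Optional[str] = None) -> Optional[Cursor]:
    """Return the next cursor, or None when the workstream is finished."""
    if cursor.phase != "audit":
        # Inside the pipeline: simply step to the next phase.
        nxt = PHASES[PHASES.index(cursor.phase) + 1]
        return Cursor(nxt, cursor.depth)
    if audit_verdict == "INSIGHT":
        # Spiral turn: back to the Direction Layer, one level deeper.
        return Cursor("thesis", cursor.depth + 1)
    if audit_verdict == "FAIL":
        # Assumption: an ordinary defect is fixed by rebuilding.
        return Cursor("build", cursor.depth)
    if audit_verdict == "PASS":
        return None
    raise ValueError(f"unknown audit verdict: {audit_verdict!r}")
```

The only branch that changes `depth` is the insight branch, which is what separates a spiral turn from an ordinary retry.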
Why phases got merged, not just deleted
The merge of brainstorm + proposal into thesis is worth examining. These were the most popular phases in v1.6.3. Brainstorm explored the problem space; proposal formed a claim about the solution.
But in practice, they were the same cognitive operation performed twice. First you clarify the problem. Then you form a claim about it. Separating them into two phases meant paying for two context windows, two tool walls, and two handoff protocols — for a mental process that, in a human, happens in one sitting.
The thesis worker does both: ground in prior evidence, form the narrowest honest claim, and self-attack before handing off to challenge. One phase. One cost. No handoff gap where information leaks out.
The same logic applies in reverse to generator-execution. That phase was 337 lines of SKILL.md doing two genuinely different things: defining what to build (contract boundaries, allowed files, acceptance criteria) and actually building it (writing code, build/startup triage, writing handoff.md). v2.0 splits them into contract and build because the mental operations are different, and because contract can fail independently: if the boundaries are wrong, you don’t want to discover that after spending tokens on code.
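One way to picture a contract that "can fail independently" is a cheap check that runs before any build tokens are spent. The sketch below is an assumption-laden illustration, not the real contract format: the `Contract` fields, the use of glob patterns for allowed files, and both helper names are invented here.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import List


@dataclass
class Contract:
    goal: str
    allowed_files: List[str] = field(default_factory=list)        # glob patterns (assumed)
    acceptance_criteria: List[str] = field(default_factory=list)


def contract_problems(contract: Contract) -> List[str]:
    """List reasons to reject the contract before build starts."""
    problems = []
    if not contract.goal.strip():
        problems.append("goal is empty")
    if not contract.allowed_files:
        problems.append("no allowed files declared")
    if not contract.acceptance_criteria:
        problems.append("no acceptance criteria declared")
    return problems


def outside_boundary(contract: Contract, touched: List[str]) -> List[str]:
    """Return the touched paths that match none of the allowed patterns."""
    return [
        path for path in touched
        if not any(fnmatch(path, pattern) for pattern in contract.allowed_files)
    ]
```

A build would only be dispatched when `contract_problems` returns an empty list, and the same `Contract` can later be used to flag files the build touched outside its boundary.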
The numbers
| Metric | v1.6.3 | v2.0 |
|---|---|---|
| Worker phases | 8 | 7 |
| Pipeline phases (including infrastructure) | 16 | 7 |
| Python maintenance scripts | 6 (3,138 lines) | 0 |
| adversarial-live-review SKILL.md | 449 lines | 107 lines |
| Evidence priority tiers | 12 | 4 |
| State files (roadmap.md + current-focus.md + ideas.md) | 3 separate files | 3 sections of plan.md |
| Dispatch packet fields | 5 | 3 (SKILL.md path + workstream ID + artifact path) |
The largest single deletion: using-agents-stack/scripts/ — 3,138 lines of Python validation and dispatch scripts. Every one of them was doing something the orchestrator can do directly. If the orchestrator already reads the state and makes the routing decision, writing Python scripts to validate that decision is just paying twice for the same logic.
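The three-field dispatch packet listed in the table above is small enough to model directly. This is a sketch under assumptions: the class name, field names, the `make_packet` helper, and the `skills/<phase>/SKILL.md` layout are all hypothetical; only the three kinds of field come from the table.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DispatchPacket:
    skill_path: str      # path to the phase's SKILL.md
    workstream_id: str   # which workstream the worker acts on
    artifact_path: str   # where the worker reads or writes its artifact


def make_packet(phase: str, workstream_id: str, artifact_path: str,
                skills_root: str = "skills") -> DispatchPacket:
    """Build a packet for one phase (directory layout is an assumption)."""
    return DispatchPacket(
        skill_path=f"{skills_root}/{phase}/SKILL.md",
        workstream_id=workstream_id,
        artifact_path=artifact_path,
    )


packet = make_packet("audit", "ws-001", "workstreams/ws-001/handoff.md")
print(json.dumps(asdict(packet), indent=2))
```

With only three fields there is little left to validate separately, which fits the argument that the orchestrator's routing decision does not need a second layer of scripts checking it.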
The principle behind every cut
Every deletion in v2.0 traces back to the same question: does this step need an LLM agent, or is it just ceremony?
Moving files to an archive folder? Ceremony. Appending a lesson to plan.md? Ceremony. Writing a dispatch packet when the orchestrator already decided where to route? Ceremony.
v1.6.3 treated agent phases as the default unit of work. If something needed doing, it got a phase. v2.0 inverts that: the default is no phase. An LLM worker only exists when the task genuinely requires reasoning under uncertainty — forming a thesis, challenging a claim, synthesizing a response, writing code, auditing results.
Everything else is a bash script, a routing rule, or a paragraph appended to a file.
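As an illustration of "a paragraph appended to a file", the sketch below adds a lesson to the Lessons section of plan.md with plain string handling. The `## Lessons` heading text, the bullet format, and the function name `append_lesson` are assumptions; the real layout of plan.md is not specified here.

```python
from pathlib import Path

LESSONS_HEADING = "## Lessons"  # assumed heading text inside plan.md


def append_lesson(plan_path: Path, lesson: str) -> None:
    """Append one bullet to the Lessons section, creating it if needed."""
    text = plan_path.read_text() if plan_path.exists() else ""
    lines = text.splitlines()
    bullet = f"- {lesson.strip()}"

    if LESSONS_HEADING not in lines:
        # No section yet: add it at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [LESSONS_HEADING, "", bullet]
    else:
        start = lines.index(LESSONS_HEADING) + 1
        # The section ends at the next "## " heading or at end of file.
        end = next(
            (i for i in range(start, len(lines)) if lines[i].startswith("## ")),
            len(lines),
        )
        # Insert before any blank lines that trail the section.
        while end > start and not lines[end - 1].strip():
            end -= 1
        lines.insert(end, bullet)

    plan_path.write_text("\n".join(lines) + "\n")
```

The whole operation is deterministic text editing, so it can live in the orchestrator or a helper script rather than in a worker with its own SKILL.md.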
The lesson
Agent pipelines tend toward inflation. Every step looks like it should be its own agent. Every gap between steps looks like it should be a handoff protocol. Every state change looks like it should be a separate phase with its own SKILL.md.
v2.0 is an exercise in pushing back against that gravity. The result is not a smaller product. It is a sharper one — fewer handoffs, fewer context window resets, fewer places where mechanical work is dressed up as reasoning.
The surprising finding from building this: most of what makes an agent pipeline feel “professional” is just ceremony. The actual work — clarifying problems, forming claims, writing code, auditing results — happens in far fewer steps than the pipeline architecture suggests.
Sometimes the best thing you can do for an agent system is remove the parts that made it look like a system.