We built an agent framework 6 times before keeping the first version

We set out to build an internal agent framework. A small thing. Governance built into the runtime — budget enforcement, PII scanning, audit chains. Not a platform. Not a product. Just something our team could use to ship AI agents fast.

The first design was simple.

The answer was right there

while True:
    budget.check()
    response = llm.call(messages, tools)
    budget.record(response.cost)
    if not response.tool_calls:
        pii.scan(response)
        audit.emit(response)
        memory.append(response)
        return response.content
    results = execute_tools(response.tool_calls)
    messages += results

Seven files. One while loop. Every LLM call budget-checked before it fires. Every response PII-scanned before storage. Every turn audit-logged with hash-chain integrity. print() works. You can debug it with a breakpoint.

We looked at it and thought: this is enough.

Then we spent weeks making it wrong

This is where the story should end. It didn’t.

We asked an Oracle agent to review the design. It found gaps — missing patterns. Reflection. Human-in-the-loop. Multi-agent collaboration. Pipeline orchestration. FanOut.

We took every gap and turned it into a feature. Not deferred. Not questioned. Added.

Iteration	Files	What happened
Original	6	while loop + governance hooks
v1.0	14	+RAG, parallel execution, evaluate
v1.1	14	Fixed 4 P0 + 7 P1 issues
v1.2	17	Crammed all v2 features into v1
v1.3	17	God Object. Everything in one file
v2.0	20	Clean Architecture: 4 layers, 8 ABCs, DI container

At v2.0, to understand what agent.ask() did, you had to trace through:

AgentFacade → DIContainer → AskAgentUseCase → TurnPipeline
→ LLMPort → LLMAdapter → OpenAIProvider

Seven layers. For a while loop.

The design doc grew from 555 lines to 1423 lines. The number of files tripled. The core never changed — the while loop was still there, just buried under eight abstract base classes and five use cases.

Nobody stopped to ask: can we cut something?

Why reviews made it worse

Six rounds of Oracle review. Every round said: add this, fix that, cover this edge case. Not one round said: remove this. It shouldn’t exist.

The review mechanism itself has a bias toward addition. Finding “what’s missing” is easy. Finding “what’s unnecessary” is hard. The agent defaults to pattern-matching — it sees a gap and maps it to a known solution. It never asks whether the gap matters.

We were reviewing our way into a worse system. More complete. More architecturally correct. Less useful.

The user said “this is too complex” three times across those weeks. We heard it every time. We nodded. And we kept adding layers.

When the user says “too complex,” the right response is not “but this is architecturally correct.” The right response is: stop. Remove something.

What we got back to

Seven files:

framework/
├── __init__.py
├── agent.py       # The while loop
├── tools.py       # @tool decorator, timeout, retry
├── memory.py      # Hash-chain with physical deletion
├── budget.py      # Pre-call check, usage tracking
├── audit.py       # PII scanner + structured audit log
├── context.py     # Sliding window, rolling summary
└── llm.py         # OpenAI, Anthropic dispatcher

from framework import Agent, tool

@tool(timeout=10, retries=2)
def web_search(query: str) -> str:
    """Search the web. Use for factual lookups."""
    ...

agent = Agent(
    model="anthropic/claude-sonnet-4-20250514",
    tools=[web_search],
    budget={"max_per_call": 0.50, "max_per_hour": 10.0},
)

reply = agent.ask("Tell me about AI agents.")
print(reply)
print(agent.usage)  # {"cost": 0.032, "tokens": 1240}

No DI container. No ABCs. No use cases. One decorator is the only abstraction for tools. The stack is one layer deep. If something breaks, you can find it.

What we learned

The principles that survived are not about architecture. They’re about discipline.

Reviews must include a “cut” step. Every review defaults toward addition. After every review, ask: what did we add? What can we remove? If the answer to the second question is nothing, the review wasn’t finished.

AI adds. Humans subtract. LLMs — including review agents — over-design, pattern-match, and never proactively clean up. The human’s core value in an agentic workflow is not generating more designs. It’s judging which designs shouldn’t exist.

Docs getting longer means something is wrong. Good design makes documentation shorter. Our spec went from 555 to 1423 lines. That wasn’t progress. That was accumulating explanations for complexity that shouldn’t exist.

Useful beats correct. v2.0 Clean Architecture was architecturally correct. DIP, SRP, dependency direction — all textbook. But for a 1-3 person internal tool, a single while loop is what’s useful. Correctness follows usage, not design principles.

Evidence, not guesses. “Future need” is not need. Eight ABCs built for “someday we might switch providers” when the probability of switching providers is approximately zero. Wait until the evidence arrives. It will be more specific than your guesses.

The user is usually right. When someone says “this is too complex,” believe them. Don’t respond with architecture justifications. Remove something and see if it still works.

These aren’t lessons about AI. They’re lessons about what happens when a system has to be simple enough for an agent to build in, and clear enough for a human to review. That tension — between completeness and clarity, between correctness and usefulness — exposes every unnecessary abstraction you’ve ever added.

We designed six versions of the same framework. The first one was right. The fifth one was Clean Architecture. The first one won.

For an internal tool, it almost always does.