June 3, 2026
Orchestration is the architecture now
Split the god-agent into ten focused ones and you trade a model problem for a systems problem: now they have to work together, and coordination is harder than any single agent. Most teams treat that wiring as plumbing. It isn't — it's the architecture, it's a distributed system, and it fails like one. Here's what orchestration actually is, how it breaks, and why you shouldn't reach for it until you can name the bottleneck.
Last time I argued that you should split the god-agent into small, focused agents that each do one thing well — and I ended on a catch: doing that doesn't delete your problem, it trades it for a new one. You now have ten agents that have to work together, and getting them to do that turns out to be harder than anything a single agent was struggling with.
That coordination is the subject of this post, because the way most teams handle it is the quiet reason their multi-agent systems fall over.
Coordination isn't the plumbing. It's the system.
When people build a multi-agent setup, they think of the agents as the system and the wiring between them as plumbing — connect the outputs to the inputs and you're done. That's backwards. Once the work is split across specialists, the agents are the easy part. The coordination is the architecture now — and it's the part almost nobody designs on purpose.
There's a name for the layer: orchestration. And it's not a wire, it's a set of real responsibilities — task decomposition, routing, state management, result aggregation, and error handling and escalation. The dominant shape in production is the orchestrator-worker pattern — about 70% of deployments — where one capable model receives the task, breaks it down, dispatches each piece to a cheap specialist worker, and assembles the results. (That split, a smart planner over cheap doers, is also where a lot of the cost savings come from — 40–60% in reported cases.) That orchestrator is not glue. It's the most important component you'll write.
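To make those five responsibilities concrete, here is a minimal sketch of the orchestrator-worker shape. Everything in it is hypothetical: the `summarize` and `translate` workers stand in for cheap specialist models, `decompose` stands in for the planner model, and the "kind: payload; kind: payload" task format is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Subtask:
    kind: str      # which specialist should handle it
    payload: str   # the input for that specialist


@dataclass
class RunState:
    # Observable state: every result and every failure is recorded here.
    results: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)


# Hypothetical workers; in a real system each would call a cheap model.
def summarize(text: str) -> str:
    return f"summary({text})"


def translate(text: str) -> str:
    return f"translation({text})"


WORKERS: Dict[str, Callable[[str], str]] = {
    "summarize": summarize,
    "translate": translate,
}


def decompose(task: str) -> List[Subtask]:
    # Stand-in for the planner model: parse "kind: payload; kind: payload".
    subtasks = []
    for part in task.split(";"):
        kind, _, payload = part.strip().partition(":")
        subtasks.append(Subtask(kind=kind.strip(), payload=payload.strip()))
    return subtasks


def orchestrate(task: str) -> RunState:
    state = RunState()
    for sub in decompose(task):                    # task decomposition
        worker = WORKERS.get(sub.kind)             # routing
        if worker is None:
            # Escalation point: nothing can handle this piece.
            state.failures.append(f"no worker for {sub.kind!r}")
            continue
        try:
            state.results.append(worker(sub.payload))   # state management
        except Exception as exc:                   # error handling
            state.failures.append(f"{sub.kind} failed: {exc}")
    return state


def aggregate(state: RunState) -> str:
    # Result aggregation: combine whatever the workers produced.
    return " | ".join(state.results)
```

Even at toy scale, every decision that matters (what gets split, who gets it, what happens when nobody can take it) lives in `orchestrate`, not in the workers.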
It's a distributed system, and it fails like one
Here's the mental shift that makes the whole thing tractable: a multi-agent system is a distributed system. Independent components, passing messages, holding state, failing partially. And the moment you see it that way, its scary failure modes stop being mysterious "AI problems" and become the classic distributed-systems problems you already know:
- Cascades. One agent hallucinates, hands its wrong answer to the next as if it were fact, and the error compounds down the chain — a corrupted message propagating through a pipeline.
- Runaway loops. Two agents bounce a task back and forth, or one retries forever, quietly running up your API bill — an unbounded retry with a meter attached.
- Silent message loss. A handoff overflows a context window, critical information gets truncated without anyone noticing, and a downstream agent acts on a half-message — dropped packets, no error raised.
- Dead escalation. The "escalate to a human" path that everyone designed and no one tested never actually fires — the error handler that was never on the happy path.
This isn't hypothetical fragility. Multi-agent systems have been measured failing in production at rates between 41% and nearly 87%, with coordination breakdowns alone accounting for 36.9% of all failures — not the models being dumb, the coordination being unmanaged. The fix is to bring the discipline the failure modes are begging for: explicit contracts between agents, bounded loops and retries, state you can observe, and error and escalation paths you actually exercise. Distributed-systems problems want distributed-systems answers.
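As one illustration of "bounded loops and retries" plus an escalation path you can actually exercise, here is a small sketch. The names `run_bounded`, `run_with_escalation`, and `EscalationRequired` are invented for this example, and it assumes a step signals "no usable answer" by returning `None`.

```python
from typing import Callable, Optional


class EscalationRequired(Exception):
    """Raised when an agent step exhausts its retry budget."""


def run_bounded(step: Callable[[], Optional[str]], max_attempts: int = 3) -> str:
    # Call `step` at most `max_attempts` times; None means "no usable answer".
    for _ in range(max_attempts):
        result = step()
        if result is not None:
            return result
    raise EscalationRequired(f"gave up after {max_attempts} attempts")


def run_with_escalation(
    step: Callable[[], Optional[str]],
    escalate: Callable[[str], None],
    max_attempts: int = 3,
) -> Optional[str]:
    try:
        return run_bounded(step, max_attempts)
    except EscalationRequired as exc:
        # Hand off to whatever the human path is: a queue, a pager, a ticket.
        escalate(str(exc))
        return None
```

Because `escalate` is just a callable, a test can pass in a step that always fails and assert that the escalation fired, which is the cheapest way to keep that path from going dead.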
Don't reach for it until you can name the bottleneck
There's a second trap, the opposite of treating coordination as plumbing: reaching for multi-agent because it sounds powerful. It is powerful, and it is expensive. Coordination has a real cost — multi-agent systems can burn 15× the tokens of a single-agent interaction, plus large overhead just shuttling state around. The 2024 dream that "more agents = more intelligence" mostly died in production.
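One way to keep that cost visible is a hard token budget that every agent call charges against. This is a sketch under stated assumptions: `TokenBudget` and `BudgetExceeded` are hypothetical names, and the token counts are whatever your model client reports per call.

```python
from typing import Dict


class BudgetExceeded(Exception):
    """Raised when total spend across all agents passes the limit."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.spent = 0
        self.by_agent: Dict[str, int] = {}   # per-agent spend, for attribution

    def charge(self, agent: str, tokens: int) -> None:
        # Record the spend first so the overrun itself is visible in the totals.
        self.spent += tokens
        self.by_agent[agent] = self.by_agent.get(agent, 0) + tokens
        if self.spent > self.limit:
            raise BudgetExceeded(f"{self.spent} tokens spent, limit is {self.limit}")
```

Setting `limit` as a multiple of what a single-agent run costs turns "is the coordination worth it?" into a number you have to defend.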
The consensus that replaced it, which five major players (Anthropic, OpenAI, Cognition, LangChain, and the AutoGen project) converged on, is a useful rule: the burden of proof is on multi-agent, not single-agent. Start with one agent. Add another only when you can name the specific bottleneck that forces it: domain isolation, genuinely parallel work, a compliance boundary. "It feels more sophisticated" is not a bottleneck. Every agent you add buys capability and pays for it in coordination, and that trade is only worth it when something concrete demands it.
The real failure is doing it by accident
Put the two traps together and the actual problem comes into focus. Most teams don't decide to build a multi-agent system; they grow one. They bolt an agent onto an agent to patch a gap, then another, and the orchestration emerges implicitly: no chosen pattern, no contracts, no failure handling, just accumulated wiring nobody designed. That's exactly how you land at the high end of that failure range, near 87%. Inquiries into multi-agent systems have exploded, but the coordination is mostly improvised, and improvised distributed systems fail.
The fix isn't a framework. It's a stance: treat orchestration as a first-class design artifact. Decide the pattern on purpose. Write down the contract each agent honors. Decide what happens when a step loops, truncates, or fails — before it does. The orchestrator deserves the same deliberate design you'd give any critical system, because that's what it is.
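A written-down contract can be as small as a typed handoff plus a validator. The sketch below is one possible shape, not a standard: the `Handoff` fields and the `validate` rules are assumptions, and it measures size in characters rather than tokens to stay dependency-free.

```python
from dataclasses import dataclass
from typing import List
import hashlib


@dataclass(frozen=True)
class Handoff:
    sender: str
    recipient: str
    body: str
    body_chars: int   # length the sender intended to send
    checksum: str     # SHA-256 of the body the sender intended to send


def make_handoff(sender: str, recipient: str, body: str) -> Handoff:
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return Handoff(sender, recipient, body, len(body), digest)


def validate(handoff: Handoff, max_chars: int) -> List[str]:
    """Return contract violations; an empty list means the handoff is acceptable."""
    problems = []
    if handoff.body_chars > max_chars:
        problems.append("body exceeds recipient budget")
    if len(handoff.body) != handoff.body_chars:
        problems.append("body was truncated or padded in transit")
    elif hashlib.sha256(handoff.body.encode("utf-8")).hexdigest() != handoff.checksum:
        problems.append("body was altered in transit")
    return problems
```

The recipient runs `validate` before acting, so a truncated message produces an error instead of a confident answer built on half the input.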
When building got cheap and you split the work into specialists, you didn't remove the architecture — you relocated it. It used to live inside one agent's prompt; now it lives in the coordination between many. Design that coordination like the distributed system it is, on purpose — or watch it fail like one you pretended was just plumbing.