AGENTS · June 3, 2026

A cheap model can do 90% of the work

The default move is to point the biggest, smartest model at everything. It works in the demo and quietly bankrupts you at scale, because most of what an agent does isn't reasoning, it's mechanical, and you're paying genius wages to read a form. The fix is boring and can cut the bill by up to about 90%: let a smart model plan, and let cheap models do the work. Here are the economics, and the one architectural rule that makes it possible.

When people build their first agent, they reach for the best model for everything. Of course they do — it's the safe choice, it gives the best answers, and in a demo the cost is a rounding error. Point the smartest model at the whole problem and watch it work.

Then it goes to production, the call volume goes up a few orders of magnitude, and the bill arrives. And it turns out "just use the best model" was not a neutral default. It was a decision to pay top price for every trivial step — and most of the steps are trivial.

The price spread is enormous, and nobody looks

Here's the number that should reframe the whole thing. In 2026, LLM API prices span from about $0.10 per million input tokens for small models to $30 per million for frontier reasoning models — a spread of roughly 300x for tokens that, on an easy task, produce indistinguishable output.
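To make that spread concrete, here is a back-of-the-envelope sketch in Python. The two prices are the ones quoted above; the workload (one million calls a month at 2,000 input tokens each) is a made-up assumption for illustration, and output tokens are ignored entirely.

```python
# Prices in USD per million input tokens, from the figures quoted in the text.
SMALL_PRICE = 0.10
FRONTIER_PRICE = 30.00


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


# Hypothetical workload (an assumption, not a measurement):
# 1,000,000 calls per month, 2,000 input tokens per call.
monthly_tokens = 1_000_000 * 2_000

small = input_cost(monthly_tokens, SMALL_PRICE)
frontier = input_cost(monthly_tokens, FRONTIER_PRICE)
ratio = frontier / small

print(f"small: ${small:,.0f}/month")
print(f"frontier: ${frontier:,.0f}/month")
print(f"ratio: {ratio:.0f}x")
```

Under those assumptions the same traffic costs about $200 a month on the small model and about $60,000 on the frontier one. Real bills also depend on output tokens and caching, so treat this as the shape of the problem rather than a forecast.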

CloudZero put the practical version of this bluntly: using a frontier model for a workload that doesn't need frontier reasoning can cost 16x more with no meaningful improvement in quality. Sixteen times the price for the same answer. You would never accept that anywhere else in your stack, but with models most teams never even check — they pick the best one once and route everything through it.

Most of what an agent does isn't thinking

The reason this is so wasteful is that an agent task is not one big act of genius. Break a real one open and you find a handful of genuinely hard steps buried in a pile of mechanical ones: extract a field from this JSON, classify this message, reformat that list, decide which tool to call, summarize a paragraph, fill in a template. The hard part — understanding the goal and planning the approach — might be one step in fifty.

Running a $30-per-million reasoning model on "pull the order ID out of this text" is paying a senior engineer's hourly rate to alphabetize a drawer. It's not wrong because it fails — it works fine. It's wrong because you're buying capability you're not using, fifty times per task, forever.

The pattern: a smart model plans, cheap models do

The fix has a name — plan-and-execute — and it's exactly what it sounds like. One capable (expensive) model looks at the request and produces a plan: the reasoning, the decomposition, the strategy. Then cheap, fast models execute each step of that plan, because once the thinking is done, the steps are mechanical and a small model does them just as well.
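Here is a minimal sketch of that shape, not a production implementation. It assumes a model is just a function from prompt to text (the `Model` alias is my own hypothetical interface, not any library's API), and it uses stub models so it runs without calling anything.

```python
from typing import Callable, List

# Hypothetical interface for this sketch: prompt in, text out.
Model = Callable[[str], str]


def plan_and_execute(request: str, planner: Model, executor: Model) -> List[str]:
    """Call the expensive planner once, then the cheap executor once per step."""
    plan = planner(f"Break this request into short numbered steps:\n{request}")
    # One non-empty line of the plan is treated as one step.
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [executor(step) for step in steps]


# Stub models so the sketch is runnable; swap in real clients here.
def fake_planner(prompt: str) -> str:
    return "1. look up the order\n2. issue the refund\n3. draft the reply"


def fake_executor(step: str) -> str:
    return f"done: {step}"


results = plan_and_execute("Refund order 1234", fake_planner, fake_executor)
for line in results:
    print(line)
```

The point of the structure is the call count: the planner runs once per request, the executor runs once per step, so the expensive model's share of the bill shrinks as the plan gets longer.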

LangChain's writeup and others put the savings at up to 90% versus using a frontier model for everything. This isn't a fringe trick. It's close to how Klarna runs: a frontier model analyzes the customer's intent and maps out the resolution steps, and smaller models do the actual work — pulling account data, processing the refund, generating the reply. Spend the genius once, on the part that needs it; buy the rest in bulk.

And here's the part that surprises people: quality usually doesn't drop. On a narrow, well-defined step, a small model is often better — faster, more predictable, and less likely to "get creative" with a task that wanted exactly one boring thing. Matching the model to the job isn't a sacrifice. The oversized model was never adding anything on those steps in the first place.

The rule that makes it possible: never hard-code the model

There's a catch, and it's architectural. You can only route by task if your code isn't welded to one model. If gpt-whatever is hard-coded in the middle of your business logic, you can't swap a cheap model in for the mechanical steps without surgery — so you don't, and you keep overpaying.

This is the same discipline I keep coming back to: the model is a dependency, and dependencies belong behind a boundary, injected, not baked in. A hard-coded model name in your logic is a smell for the same reason a hard-coded price or a hard-coded API key is — it's a swappable detail masquerading as a fixed fact. Get it behind a clean seam and "use a cheaper model here" becomes a config change, not a refactor. (I swapped the entire provider behind one of my products by changing a single value — that's not luck, it's just not hard-coding the thing that was always going to change.)
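One way that seam can look, as a sketch under assumptions: the environment variable names, the placeholder model names, and the `complete(model=..., prompt=...)` signature are all hypothetical, standing in for whatever client your provider gives you.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass(frozen=True)
class ModelConfig:
    planner_model: str
    worker_model: str


def load_config(env: Mapping[str, str] = os.environ) -> ModelConfig:
    """Read model names from configuration instead of baking them into logic."""
    return ModelConfig(
        # Placeholder defaults; the names are illustrative, not real models.
        planner_model=env.get("PLANNER_MODEL", "frontier-model-placeholder"),
        worker_model=env.get("WORKER_MODEL", "small-model-placeholder"),
    )


def extract_order_id(
    text: str,
    complete: Callable[..., str],
    config: ModelConfig,
) -> str:
    """Business logic asks for 'the worker model'; it never names one."""
    return complete(
        model=config.worker_model,
        prompt=f"Extract the order ID from: {text}",
    )


# Stand-in for a real client call so the sketch runs offline.
def fake_complete(model: str, prompt: str) -> str:
    return f"[{model}] ORD-1234"


config = load_config()
print(extract_order_id("Hi, my order ORD-1234 is late", fake_complete, config))
```

With this in place, moving the mechanical steps to a cheaper model means setting `WORKER_MODEL` to something else. Nothing in `extract_order_id` changes.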

Once the model is a swappable parameter, routing falls out naturally: a light classifier or a few heuristic rules estimate each step's complexity and send it to the right tier. Routing alone (cheap model for the simple majority, frontier model for the hard minority) is commonly reported to cut cost by 60–80% while keeping quality on the hard cases.
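The heuristic version of that router can be very small. This sketch assumes each step arrives tagged with a kind. The kind names and the 4,000-token cutoff are illustrative guesses of mine, not tuned values, and a real system might replace the rule with a small classifier.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"
    FRONTIER = "frontier"


# Step kinds treated as mechanical (an assumption for this sketch).
MECHANICAL_KINDS = {"extract", "classify", "reformat", "summarize", "fill_template"}


def route(step_kind: str, input_tokens: int, max_cheap_tokens: int = 4000) -> Tier:
    """Send short, mechanical steps to the cheap tier and everything else up."""
    if step_kind in MECHANICAL_KINDS and input_tokens <= max_cheap_tokens:
        return Tier.CHEAP
    # Unknown kinds and very long inputs default to the frontier tier.
    return Tier.FRONTIER


print(route("extract", 200))
print(route("plan", 200))
print(route("extract", 10_000))
```

Defaulting unknown step kinds to the frontier tier is a deliberate choice here: a misroute then costs money rather than quality, which is the safer failure while the rules are still being tuned.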

Why the wasteful default wins anyway

If this is so obviously better, why doesn't everyone do it? The same reason cheap architecture wins the meeting: the waste is invisible exactly when you're deciding. In the demo, "best model for everything" is simplest to build and costs cents. The bill that makes it a mistake doesn't arrive until you've scaled — at which point it's a five- or six-figure line item attached to code that hard-coded the model everywhere, so fixing it is now a project. The lazy default is cheap to choose and expensive to live with. You didn't save effort; you deferred a bill and let it compound.

There's also a status thing: reaching for the biggest model feels like the serious, safe choice. It isn't. Spending frontier money on a formatting step isn't rigor — it's just not having looked. The actual engineering is knowing which 10% of your pipeline needs the expensive brain and having the architecture to send only that 10% there.

The frontier model isn't the flex. Knowing where you don't need it is.
