Reach for the small model first

June 13, 2026

The reflex is to send every task to the biggest, smartest model. The numbers say that's usually the wrong default. Serving a 7-billion-parameter small model is 10–30× cheaper than serving a 70–175B one, Microsoft's Phi-3.5-Mini matches GPT-3.5-class quality with about 98% less compute, and over two billion phones already run capable models locally with no cloud at all. Gartner expects task-specific small models to be used three times more than general-purpose LLMs by 2027. Here's why 'small first' is becoming the smart default, and when to still reach for the big one.

Most people building with AI have one reflex: send the task to the biggest, smartest model available. It feels safe — why give the job to a weaker model? But the 2026 numbers say that reflex is usually the wrong default, and reaching for the small model first is the move that's quietly winning.

Start with the economics, because they're not subtle. Serving a 7-billion-parameter small model is 10–30× cheaper than running a 70–175B large one, and enterprises moving the right work to small models are cutting AI costs by up to 75%. This isn't a quality sacrifice for a discount, either: Microsoft's Phi-3.5-Mini matches GPT-3.5-class performance while using about 98% less compute. The gap between "small" and "good enough" closed while everyone was watching the frontier.

Let me make the case for flipping your default, because it changes both your bill and your architecture.

"Small" stopped meaning "weak"

A couple of years ago, picking a small model meant accepting visibly worse output. That trade is mostly gone for the work most apps actually do. Today's small models handle classification, extraction, routing, summarization, structured-data tasks, and straightforward coding at a quality that's hard to distinguish from the giant — on exactly the kinds of tasks that make up the bulk of a real product.

The frontier model is still better at the genuinely hard stuff: deep multi-step reasoning, novel problems, the long tail of edge cases. But here's the thing most apps get wrong: the hard stuff is a minority of the calls. You're paying frontier prices to classify support tickets and reformat JSON. Put differently, a cheap model can do roughly 90% of the work, so the expensive model is the exception you escalate to, not the default you start from.
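To see how much the starting default matters, here is a rough back-of-the-envelope sketch in Python. It reuses the 10–30× cost ratio quoted above; the 90% small-model share, the function names, and the normalized price of 1.0 per big-model call are illustrative assumptions, not measurements.

```python
def blended_cost(small_share: float, cost_ratio: float, big_cost: float = 1.0) -> float:
    """Average cost per call when `small_share` of calls go to a small model
    that is `cost_ratio` times cheaper than the big one.

    All inputs are hypothetical; `big_cost` is normalized to 1.0 by default.
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    small_cost = big_cost / cost_ratio
    return small_share * small_cost + (1.0 - small_share) * big_cost


def savings(small_share: float, cost_ratio: float) -> float:
    """Fraction of the all-big-model bill that is saved."""
    return 1.0 - blended_cost(small_share, cost_ratio)


if __name__ == "__main__":
    for ratio in (10.0, 30.0):
        print(f"ratio {ratio:>4.0f}x: savings {savings(0.9, ratio):.0%}")
```

Under these assumptions the blended bill drops by roughly 81% to 87%. Most of what remains is the escalated 10% of calls, so the size of the hard minority matters more than the exact cost ratio.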

The new superpower: it runs on the device

There's a second reason "small first" matters, and it's not just cost. Small models run where big ones can't. Over two billion smartphones now run capable local models, and a 1-billion-parameter model fits in about 650MB of RAM and runs on a phone at reading speed. A small enough model means no cloud round-trip at all.
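The 650MB figure is easy to sanity-check with a little arithmetic. The post doesn't say how that number was derived, so this Python sketch assumes memory is dominated by the weights and that 1MB means 10^6 bytes; the function names are illustrative.

```python
def weight_memory_mb(n_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in MB (10^6 bytes).

    Ignores activations, KV cache, and runtime overhead (an assumption).
    """
    return n_params * bits_per_weight / 8 / 1e6


def implied_bits_per_weight(n_params: int, memory_mb: float) -> float:
    """Bits per parameter implied by a given memory footprint."""
    return memory_mb * 1e6 * 8 / n_params


if __name__ == "__main__":
    one_billion = 1_000_000_000
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_mb(one_billion, bits):.0f} MB")
    print(f"650 MB implies {implied_bits_per_weight(one_billion, 650):.1f} bits per weight")
```

At 16-bit precision a 1-billion-parameter model needs about 2,000MB for weights alone, and at 4-bit about 500MB. A 650MB footprint works out to about 5.2 bits per parameter, which would be consistent with roughly 4-bit quantized weights plus some overhead, though that reading is an assumption rather than something the figure confirms.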

That unlocks things a cloud API never can. The data never leaves the device, which is a privacy and compliance answer, not just a latency one. There's no per-token bill, no rate limit, no outage to ride out, and it works on a plane. For a whole class of features — on-device assistants, private extraction, anything latency- or privacy-sensitive — the small local model isn't the budget option, it's the only option that has those properties. Gartner expects task-specific small models to be used three times more than general-purpose LLMs by 2027, and this is a big part of why.

When to still reach for the big one

"Small first" is a default, not a religion. Reach for the frontier model when the task actually demands it:

  • Hard, open-ended reasoning — multi-step problems, ambiguous goals, novel work with no clean pattern to follow.
  • Breadth you can't predict — a general assistant fielding anything a user might ask, where you can't scope the task in advance.
  • When you haven't measured yet — prototype on the big model to learn what "good" looks like, then move the settled, repetitive paths down to a small one.
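The "measure first" step in the last bullet can be as simple as a pass rate over a handful of labeled examples. This is a minimal Python sketch, not a full evaluation harness: the exact-match scoring, the 0.95 threshold, and the stub model are all hypothetical choices you would replace with your own.

```python
from typing import Callable, Iterable, Tuple

Model = Callable[[str], str]  # any function from prompt to output text


def pass_rate(model: Model, examples: Iterable[Tuple[str, str]]) -> float:
    """Share of (prompt, expected) pairs the model answers exactly right."""
    pairs = list(examples)
    if not pairs:
        raise ValueError("need at least one example")
    hits = sum(1 for prompt, expected in pairs if model(prompt).strip() == expected)
    return hits / len(pairs)


def good_enough(model: Model, examples: Iterable[Tuple[str, str]],
                threshold: float = 0.95) -> bool:
    """True when the model's pass rate meets the (assumed) quality bar."""
    return pass_rate(model, examples) >= threshold


if __name__ == "__main__":
    # Stand-in for a real small model; labels tickets by keyword.
    def stub_small(prompt: str) -> str:
        return "billing" if "refund" in prompt.lower() else "other"

    labeled = [
        ("I want a refund", "billing"),
        ("Refund my order please", "billing"),
        ("The app crashes on launch", "bug"),
        ("How do I change my avatar?", "other"),
    ]
    print(pass_rate(stub_small, labeled), good_enough(stub_small, labeled))
```

The expected answers would typically come from the prototype phase on the big model, once you've reviewed them and settled what "good" looks like.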

The pattern that wins is routing by difficulty: small model by default, escalate to the big one only when the task earns it. What flips is the starting assumption — from "use the best model and economize later" to "use the smallest model that clears the bar and escalate when it doesn't."
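One way to sketch routing by difficulty is "try small, check, escalate". The Python below is a minimal illustration under stated assumptions: the models are plain callables, the stubs stand in for real model calls, and the check (valid JSON with an allowed label) is a hypothetical bar for a ticket-classification task.

```python
import json
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]   # prompt -> raw output text
Check = Callable[[str], bool]  # raw output -> does it clear the bar?


@dataclass
class Answer:
    text: str
    model: str  # "small" or "big", useful for tracking the escalation rate


def route(prompt: str, small: Model, big: Model, clears_bar: Check) -> Answer:
    """Use the small model by default; escalate only when its output fails the check."""
    draft = small(prompt)
    if clears_bar(draft):
        return Answer(draft, "small")
    return Answer(big(prompt), "big")


ALLOWED_LABELS = {"billing", "bug", "other"}  # hypothetical label set


def is_valid_label(output: str) -> bool:
    """Accept only a JSON object whose "label" is one of the allowed strings."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    label = data.get("label")
    return isinstance(label, str) and label in ALLOWED_LABELS


# Stubs standing in for real model calls.
def stub_small(prompt: str) -> str:
    return '{"label": "billing"}' if "refund" in prompt.lower() else "not sure"


def stub_big(prompt: str) -> str:
    return '{"label": "other"}'


if __name__ == "__main__":
    for ticket in ("I want a refund", "Something odd happened"):
        print(ticket, "->", route(ticket, stub_small, stub_big, is_valid_label))
```

A cheap structural check like this only catches malformed output, not confidently wrong answers, so it suits tasks with a verifiable shape. Logging the `model` field gives you the escalation rate, which tells you whether the small model is really carrying the load.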

The bottom line

The instinct to always grab the most powerful model is a holdover from when small models were genuinely bad. They aren't anymore. For the bulk of real-world tasks, a small model is 10–30× cheaper, often runs on the device with no cloud at all, and produces output that's hard to tell apart from the giant's. Defaulting to the frontier for everything is now mostly a way to overpay and over-engineer.

So flip the reflex. Reach for the small model first, measure whether it clears the bar — it usually does — and save the expensive model for the minority of tasks that genuinely need it. The teams doing this aren't sacrificing quality for cost. They're matching the tool to the job, and getting a smaller bill, lower latency, and better privacy as the reward for it.
