June 8, 2026
Route by difficulty, not by default
When Apple rebuilt Siri, it didn't pick one model and send everything to it. A timer request stays on your phone. A medium query goes to Apple's private servers. Only the hardest reasoning reaches Google's giant model. That three-tier split isn't an Apple quirk — it's the pattern every serious AI product is converging on, because sending every request to one big model overpays on the easy ones and over-exposes the sensitive ones. The fix is routing, and most builders skip it.
Buried in the Siri rebuild is an architecture decision worth more attention than the Gemini headline. Apple's new Siri uses a three-tier routing system that decides where each request is handled. Simple stuff — set a timer, play a song — runs entirely on the phone, no data leaving the device. Moderately complex requests go to Apple's own Private Cloud Compute, processed and immediately forgotten. Only the heaviest reasoning is sent out to the giant Gemini model in Google's cloud.
Notice what Apple did not do: pick one model and route everything to it. And that's the lesson, because picking one model for everything is exactly what most people building AI products do — and it's quietly the wrong default on two different axes at once.
One model for everything is wrong twice
Send every request to a single big model and you make two mistakes simultaneously.
The first is cost. Most requests are easy. "Reformat this date," "is this email spam," "summarize one paragraph": these don't need a frontier model any more than addition needs a supercomputer. Routing them to your most expensive model means paying premium prices for trivial work, on every single call, forever. Published research on difficulty-based routing has reported cutting calls to the big model by around 40% with no drop in response quality, just by sending easy work to a small model and escalating only when it's actually hard. That's the cheap-model-for-most-of-the-work idea, made into infrastructure.
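To make the cost argument concrete, here is a minimal sketch of the arithmetic. The per-request prices and the request volume are made-up numbers chosen for illustration, not real vendor pricing, and `blended_cost` is a hypothetical helper.

```python
def blended_cost(requests: int, big_fraction: float,
                 small_price: float, big_price: float) -> float:
    """Total cost when `big_fraction` of requests go to the big model."""
    if not 0.0 <= big_fraction <= 1.0:
        raise ValueError("big_fraction must be between 0 and 1")
    big_calls = requests * big_fraction
    small_calls = requests - big_calls
    return big_calls * big_price + small_calls * small_price


# Hypothetical per-request prices, for illustration only.
SMALL_PRICE = 0.001
BIG_PRICE = 0.02

# Everything to the big model vs. 40% of traffic moved to the small one.
everything_big = blended_cost(100_000, 1.0, SMALL_PRICE, BIG_PRICE)
routed = blended_cost(100_000, 0.6, SMALL_PRICE, BIG_PRICE)
print(f"all big: ${everything_big:,.2f}  routed: ${routed:,.2f}")
```

With these invented prices, moving 40% of calls off the big model lowers the bill by 38%. The wider the price gap between the two models, the closer the saving gets to the share of traffic you reroute.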
The second mistake is exposure. Some of your requests contain sensitive data — health details, financial records, private messages. Sending those to a third-party cloud model is fine for a recipe lookup and a serious problem for a medical record. One model for everything means your most sensitive data rides the same path as your most trivial, out to whoever hosts the model. Apple's whole point is that the timer and the private query should not travel the same road.
Routing fixes both at once. And the two axes — how hard is this, and how sensitive is this — are the whole design.
The two questions that decide the route
Before a request hits a model, ask two things about it:
How hard is it? Route by difficulty. Default everything to the cheapest, fastest model that can plausibly handle it, and escalate to a bigger one only when the small one isn't good enough. This is the "cascade" pattern: try local or cheap first, promote to the expensive model on failure — not the other way around. The expensive model becomes the exception, not the default, and your bill follows.
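The cascade can be sketched in a few lines. This is an illustration under assumptions, not a reference implementation: the `Answer` type, the 0.8 threshold, and the two stand-in models are all hypothetical, and how you score confidence (a self-reported score, a verifier, a format check) is left open.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0, however the model or a checker scores it


Model = Callable[[str], Answer]


def cascade(prompt: str, small: Model, big: Model,
            threshold: float = 0.8) -> tuple[Answer, str]:
    """Try the cheap model first; escalate only if its answer looks weak."""
    first = small(prompt)
    if first.confidence >= threshold:
        return first, "small"
    return big(prompt), "big"


# Stand-in models so the sketch runs without calling any real API.
def small_model(prompt: str) -> Answer:
    # Pretend short prompts are easy and long ones are hard.
    confidence = 0.95 if len(prompt.split()) <= 8 else 0.4
    return Answer(f"[small] {prompt}", confidence)


def big_model(prompt: str) -> Answer:
    return Answer(f"[big] {prompt}", 0.99)


_, tier_easy = cascade("Reformat this date: 8/6/26", small_model, big_model)
_, tier_hard = cascade(
    "Compare these three contract drafts clause by clause "
    "and flag every conflicting obligation",
    small_model, big_model,
)
print(tier_easy, tier_hard)
```

The escalated request pays for both calls, which is the trade-off of a cascade: it wins only when most traffic stops at the first tier.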
How sensitive is it? Route by data, not just cost. Genuinely sensitive requests should stay on the most private tier you have, whether that is on-device or your own infrastructure, and should never silently fall back to a public cloud when the private path is busy. The discipline here is "fail closed": if you can't process sensitive data privately, you refuse; you don't quietly ship it to a third party. For the traffic that does leave its own systems, Apple relies on anonymization and contracts that stop Google from training on user queries. Your version might be simpler, but the principle is the same: sensitivity decides the path, and the safe failure is "don't," not "send it anyway."
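Fail closed is easy to state and easy to get wrong in code, because the tempting move is a `try`/`except` that retries on the public model. A minimal sketch of the refusing version follows; the handler names and the `PrivatePathUnavailable` exception are hypothetical, and deciding which requests count as sensitive is assumed to happen upstream.

```python
from typing import Callable, Optional


class PrivatePathUnavailable(Exception):
    """Raised instead of sending sensitive data to a third party."""


Handler = Callable[[str], str]


def route_by_sensitivity(prompt: str, sensitive: bool,
                         private: Optional[Handler], public: Handler) -> str:
    if sensitive:
        if private is None:
            raise PrivatePathUnavailable("no private tier available; refusing")
        try:
            return private(prompt)
        except Exception as exc:
            # Fail closed: a private-tier failure becomes a refusal,
            # never a quiet fallback to the public model.
            raise PrivatePathUnavailable("private tier failed; refusing") from exc
    return public(prompt)


# Stand-ins: a public model that records what it sees, and a private
# tier that is currently overloaded.
public_calls: list[str] = []


def public_model(prompt: str) -> str:
    public_calls.append(prompt)
    return f"[public] {prompt}"


def busy_private_model(prompt: str) -> str:
    raise TimeoutError("private tier is busy")


print(route_by_sensitivity("Find a pancake recipe", False,
                           busy_private_model, public_model))
try:
    route_by_sensitivity("Summarize my lab results", True,
                         busy_private_model, public_model)
    refused = False
except PrivatePathUnavailable:
    refused = True
print(refused, public_calls)
```

The property worth testing is the last line: after a private-tier failure, the public model's call log contains only the non-sensitive request.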
Why most builders skip this
Routing is more work than calling one endpoint, so the honest reason people skip it is that one model for everything is easy to ship. You wire up the frontier model, it handles every case, done. It works — it's just expensive and leaky in ways you don't see until the bill or the breach arrives.
But you don't need Apple's three tiers to get the benefit. Even a crude version pays off: a cheap model as the default, an escalation to a strong model when a confidence check or the task type says "this is hard," and a hard rule that flagged-sensitive requests stay on a path you control. That's a few hours of plumbing that cuts cost meaningfully and shrinks your exposure surface at the same time. The sophistication can come later; the shape — default cheap and private, escalate on purpose — is what matters.
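The crude version described above fits in one decision function that runs before any model is called. This sketch assumes each request already carries a task label and a sensitivity flag; the task names in `HARD_TASKS` and the three route names are hypothetical placeholders.

```python
from enum import Enum


class Route(Enum):
    CHEAP = "cheap"
    STRONG = "strong"
    PRIVATE = "private"


# Hypothetical task labels; a real product would define its own.
HARD_TASKS = {"multi_step_reasoning", "code_review", "long_document_analysis"}


def pick_route(task_type: str, sensitive: bool) -> Route:
    # Sensitivity is checked first so no difficulty rule can override it.
    if sensitive:
        return Route.PRIVATE
    if task_type in HARD_TASKS:
        return Route.STRONG
    return Route.CHEAP


for task, sensitive in [("spam_check", False),
                        ("code_review", False),
                        ("code_review", True)]:
    print(task, sensitive, "->", pick_route(task, sensitive).value)
```

The order of the checks is the design choice: a hard but sensitive request still goes to the private path, even if that path is the weaker model.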
The bottom line
The flashy part of the Siri story is that Apple rented Google's brain. The useful part is what Apple put in front of that brain: a router that sends each request to the smallest, most private place that can handle it, and reaches for the big expensive cloud model only when it has to. That's not an Apple luxury. It's the pattern that falls out the moment you take cost and privacy seriously, and it scales all the way down to a solo project.
So stop sending everything to one model by default. Ask the two questions — how hard, how sensitive — and let the answers pick the route. Most of your traffic is easy and unremarkable, and routing it accordingly is the difference between an AI product that's cheap and safe by design and one that's expensive and exposed by accident. One model for everything isn't simplicity. It's a default you never actually chose.