Express course · No. 36
Most AI runs in a data center, reached over the network. But a fast-growing class runs right on the phone, laptop, or sensor, with no cloud round-trip at all. Models made small enough to run locally unlock privacy, offline use, instant response, and zero per-call cost. Learn what edge AI makes possible, how a model gets small enough to fit, and the trade-offs you take on.
Essence only · One picture per idea · Small, local, and yours
The whole topic comes down to one choice: where does the model actually run? Understanding that fork — and what each side costs — frames everything else.
Most AI runs in a data center, over the network
Phoning a distant office for every answer — your question travels there, an expert replies, and the answer travels back, every single time.
By default, an AI model runs in the cloud — on powerful servers in a data center — and your device reaches it over the internet. You send the request, it's processed remotely, and the response comes back. This is how most AI works, and it's why the biggest, smartest models are possible: they need hardware far beyond a phone. But it also means every single call makes a network round-trip to someone else's computer.
On-device AI runs right where you are
An expert who lives in your house instead of across town — you just ask, and the answer comes immediately, no phone call, no travel.
On-device (or edge) AI runs the model locally — on the phone, the laptop, the sensor, the car — with no trip to a server. The computation happens right where the data is, on the hardware in your hand. "Edge" means the edge of the network, away from the central data center, out where the users and devices are. This is the alternative to the cloud: instead of sending your request away to be answered, the answer is computed on the spot.
The round-trip is the difference
Cooking at home versus ordering delivery — the meal can be the same, but one involves a courier crossing town each time and the other doesn't.
The core difference between cloud and on-device isn't the AI itself. It's whether your data leaves the device and whether you wait for a network round-trip. That single distinction drives every benefit and trade-off in this course: keep the work local and you gain privacy, offline ability, and speed, but you're limited to what the device can run; send it to the cloud and you get the biggest models, but you pay with a round-trip on every call, a per-call bill, and data that leaves the device. Where the model runs decides almost everything.
Cloud AI runs on distant servers reached over the network; on-device (edge) AI runs locally with no round-trip. Whether the data leaves and whether you wait for the network drives every trade-off.
Running a model on the device isn't just a technical curiosity — it unlocks four concrete benefits the cloud can't match, each of which can be the deciding reason to go local.
Privacy: the data never leaves
Keeping your diary in a locked drawer at home versus mailing each page to a company to read — one keeps the secret with you, the other doesn't.
The biggest reason to run locally is privacy: if the model is on the device, the data it processes never has to leave. No personal photos, private messages, or sensitive records get sent to a server. For anything users would rather keep on their own hardware, or that regulation says can't be shipped to a third party, on-device is the answer. Processing data where it lives, instead of sending it to someone else's computer, is a privacy guarantee no cloud service can fully match.
Offline: it works with no connection
A paper map keeps working in a tunnel where the online one goes blank — local doesn't depend on a signal.
An on-device model works offline — on a plane, in a tunnel, in a remote area, anywhere with no reliable connection — because it needs no server. Cloud AI simply stops when the network does. For features that have to work everywhere, or in places connectivity can't be assumed, running locally isn't an optimisation, it's the only option. Independence from the network is a capability the cloud fundamentally can't offer.
Speed and cost: no round-trip, no bill
Answering from memory versus calling someone each time — instant, and free, instead of delayed and metered.
Two more benefits come for free with local. Latency: with no network round-trip, an on-device model can respond almost instantly, with none of the delay of reaching a distant server. Cost: the model runs on hardware the user already owns, so there's no per-call bill — you're not paying a provider for every request. For a high-volume or latency-sensitive feature, "instant and free per call" is a powerful combination the cloud, with its round-trip and its meter, can't offer.
Running locally unlocks four things the cloud can't match: privacy (data never leaves), offline use (no network needed), instant response (no round-trip), and zero per-call cost (the user's own hardware).
A phone can't run a giant frontier model. So on-device AI depends on making models small enough to fit — and there are a few standard ways to shrink one.
Small language models fit where big ones can't
A pocket reference instead of a wall of encyclopedias — far less comprehensive, but it fits in your pocket and is there when you need it.
A device has limited memory and processing power, so on-device AI uses small language models (SLMs) — models with far fewer parameters than the giants, deliberately built to be compact. They can't know or do everything the largest models can, but they're small enough to run on a phone or laptop. The whole field of edge AI rests on these: the trend toward capable small models is what made running real AI locally practical at all.
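To make "small enough to fit" concrete: the storage a model's weights need is roughly its parameter count times the bytes used per parameter. The sketch below works through that arithmetic. The function name `weight_footprint_gb` and both parameter counts are assumptions chosen for illustration, not figures for any particular model, and the estimate ignores everything else a running model needs in memory.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical parameter counts, chosen only to show the scale of the gap.
SMALL_MODEL_PARAMS = 3e9
LARGE_MODEL_PARAMS = 70e9

for label, params in [("small", SMALL_MODEL_PARAMS), ("large", LARGE_MODEL_PARAMS)]:
    size = weight_footprint_gb(params, bits_per_param=16)
    print(f"{label} model: about {size:.0f} GB of weights at 16 bits each")
```

Under these assumed numbers the small model needs about 6 GB for its weights and the large one about 140 GB, which is why only the small one is a candidate for a phone or laptop.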
Quantization: less precision, much smaller
Storing a measurement as "about 3.1" instead of "3.14159265" — you lose a little accuracy but the number takes far less space, and for most purposes it's just as good.
A key technique for shrinking a model is quantization: storing its internal numbers (its weights) at lower precision, for example as 8-bit or 4-bit integers instead of 16- or 32-bit floating-point values, so the whole model takes dramatically less memory and often runs faster. You sacrifice a small amount of accuracy for a large reduction in size, which is usually a great trade for fitting on a device. Quantization is how a model that wouldn't fit on a phone gets squeezed down to one that does, often with barely noticeable quality loss.
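The precision-for-size trade can be seen on a handful of numbers. This is a minimal sketch of one simple scheme, symmetric 8-bit quantization with a single shared scale. The function names and sample weights are made up for illustration, and real toolchains use more careful variants.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights standing in for a tiny slice of a model.
weights = [0.82, -1.27, 0.031, 0.5, -0.004]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("stored integers:", quantized)
print("restored floats:", restored)
print("largest rounding error:", max_error)
```

Each stored value now fits in one byte instead of the four a 32-bit float takes. The price is a rounding error of at most half the scale per weight.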
Distillation: a small model learns from a big one
An apprentice who learns the master's craft for one specific job — not everything the master knows, but enough to do that job nearly as well, in a fraction of the size.
Another technique is distillation: training a small model to imitate a large one, so the small model captures much of the big one's ability in a far more compact form. The large "teacher" model's behaviour is transferred into a small "student" that's cheap enough to run locally. Between small models built compact from the start, quantization to shrink them further, and distillation to transfer capability, a useful model can be made small enough to live on a device.
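One common way to express "imitate the teacher" as a number to minimise is to compare the two models' output probabilities. The sketch below assumes that formulation: a temperature-softened softmax and a KL divergence. The logits are invented, the function names are hypothetical, and a real training loop would feed this loss to an optimiser.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Invented scores over three possible outputs.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 4.0, 1.0]    # prefers a different output

print("loss, close student:", distillation_loss(teacher, close_student))
print("loss, far student:  ", distillation_loss(teacher, far_student))
```

The loss is zero when the student's distribution matches the teacher's and grows as they diverge, so training the student to reduce it pulls its behaviour toward the teacher's.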
On-device AI relies on small models: small language models built compact, quantization to store them at lower precision, and distillation to transfer a big model's ability into a small one that fits on a device.
Shrinking a model isn't free — a small model genuinely can't do everything a giant can. Being honest about that gap is how you decide what belongs on the device.
A small model knows and reasons less
A pocket calculator versus a research lab — the calculator is instant and always with you, but you wouldn't ask it to design a bridge.
There's no free lunch: a small model is genuinely less capable than a large one. It has less general knowledge, handles complex reasoning less reliably, and is weaker on hard, open-ended tasks. The compactness that lets it run on a phone is paid for in raw ability. So on-device AI isn't simply cloud AI made local — it's a deliberately weaker model in exchange for privacy, offline use, and speed. Pretending the small model is just as smart is how edge features disappoint.
It's great for routine, not frontier-hard work
A skilled local handyman handles most jobs around the house perfectly, and you call the specialist only for the rare, genuinely difficult one.
A small on-device model is well-suited to routine, well-scoped tasks — classifying text, simple extraction, transcription, autocomplete, straightforward assistance — the same kinds of work that don't need a frontier model anyway. It struggles on the genuinely hard, novel, multi-step problems where the biggest models earn their keep. This maps neatly onto difficulty: most everyday tasks are well within a small model's reach, and only the hard minority truly needs the cloud giant.
Match the task to what the device can do
You bring the right tool for the job — the small one for the common task, the big one only when the job genuinely demands it.
The discipline is matching the task to the model the device can run. If a feature's work is routine enough for a small model, on-device wins on privacy, offline, latency, and cost. If it genuinely needs frontier-level reasoning, the device can't deliver it and you need the cloud. Knowing where that line falls, meaning what a small local model can and can't do well, is the core judgement of edge AI. Push too-hard work onto a small model and quality suffers; send routine work to the cloud and you pay for round-trips, per-call cost, and data exposure the feature never needed.
A small model is genuinely less capable — less knowledge, weaker on hard reasoning. It's great for routine, well-scoped work and poor at frontier-hard problems, so match the task to what the device can actually run.
Here's the insight that makes edge AI far more powerful than "small means weak" suggests: a small model focused on one job can rival a giant generalist at that job.
A focused small model can match a big general one
A local specialist who does one operation thousands of times beats a brilliant generalist who rarely does it — narrow mastery outperforms broad knowledge on the specific task.
A small model is weak as a generalist, but on one specific, well-defined task it can match or even beat a much larger general model. A giant model spreads its capacity across everything; a small model fine-tuned for a single job concentrates its limited capacity exactly there. So for a narrow task — your specific classification, your particular extraction — a small specialised model can be both good enough and tiny enough to run locally. Specialization recovers much of what shrinking gave up.
Fine-tuning sharpens a small model for its job
Training an apprentice intensively on the one task they'll do every day — they become excellent at that, even without the master's broad expertise.
The way you get a small model to punch above its weight is to fine-tune it for your specific task (covered in the fine-tuning course): train it on examples of exactly the job it'll do, baking that one skill in deeply. A small model tuned for your narrow use can outperform a much larger general model on that use, and it stays small enough for the device. This is the combination that makes edge AI genuinely competitive: not a weak generalist, but a sharp specialist that happens to be tiny.
Narrow and local is a powerful combination
A tool built for exactly one job, kept right at hand — not the most versatile thing you own, but the fastest and most reliable for that task.
The winning pattern for on-device AI is narrow plus local: a small model that does one thing very well, running right on the device. You give up generality, which a single-purpose feature didn't need anyway, and gain privacy, offline ability, instant response, and zero per-call cost. For a focused feature, this combination can beat a cloud giant outright: faster, cheaper, more private, and just as good at the one thing it's for. Specialization is what turns "small and weak" into "small and excellent, where it counts."
A small model fine-tuned for one specific task can rival a giant generalist at that task. Narrow plus local — a sharp specialist running on the device — turns "small and weak" into "small and excellent, where it counts."
You don't have to choose cloud or device for everything. The most powerful designs use both — handling what they can locally and reaching for the cloud only when they must.
Local for the common case, cloud for the hard one
A clinic where the nurse handles the routine visits on site and refers only the complicated cases to the distant specialist hospital — most needs met locally, the rare hard ones escalated.
The strongest pattern is hybrid: run a small model on the device for the common, routine, or private work, and escalate to a powerful cloud model only for the genuinely hard cases. Since most requests are easy, most are handled locally — fast, free, and private — and only the difficult minority makes the round-trip to the cloud giant. This is the routing idea from model economics, applied across the device-cloud boundary: the cheapest, most local option by default, the heavyweight only when earned.
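A minimal sketch of that escalation, assuming the local model can report some confidence score alongside its answer. That score, the 0.8 threshold, and both stand-in model functions are assumptions for illustration; in practice the "is this hard?" signal could come from many places.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Answer:
    text: str
    source: str  # "device" or "cloud"


def hybrid_answer(
    prompt: str,
    local_model: Callable[[str], Tuple[str, float]],
    cloud_model: Callable[[str], str],
    min_confidence: float = 0.8,
) -> Answer:
    """Try the device first; escalate only when the local answer looks shaky."""
    text, confidence = local_model(prompt)
    if confidence >= min_confidence:
        return Answer(text, "device")
    return Answer(cloud_model(prompt), "cloud")


# Stand-in models: the local one pretends to be confident only on short prompts.
def tiny_local_model(prompt: str) -> Tuple[str, float]:
    confidence = 0.95 if len(prompt.split()) <= 8 else 0.40
    return f"[local] {prompt}", confidence


def big_cloud_model(prompt: str) -> str:
    return f"[cloud] {prompt}"


easy = hybrid_answer("Is this email spam?", tiny_local_model, big_cloud_model)
hard = hybrid_answer(
    "Plan a three-week migration of our billing system with rollback steps",
    tiny_local_model,
    big_cloud_model,
)
print(easy.source, "|", hard.source)
```

The cloud function is only ever called on the escalation path, so the common case pays no round-trip and no per-call bill.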
Keep the private and offline parts on the device
You handle your own sensitive paperwork at home and only send out the parts that genuinely need an outside expert — keeping private what can stay private.
A hybrid design lets you put the privacy-sensitive and must-work-offline parts on the device, while still using the cloud for the heavy reasoning that needs it. The personal data can be processed locally and never leave; only non-sensitive, genuinely hard work goes out. So you don't have to trade away privacy to get capability, or capability to get privacy — you architect the system so each piece runs in the place that fits its needs. The boundary itself becomes a design tool.
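One way to read that rule as code: tag each piece of work with its needs, then let the tags decide where it runs. The `WorkItem` fields, the `place` function, and the sample pieces below are all hypothetical, a sketch of the placement logic rather than a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    name: str
    sensitive: bool              # touches personal data
    must_work_offline: bool      # has to run with no connection
    needs_heavy_reasoning: bool  # beyond what a small model does well


def place(item: WorkItem) -> str:
    """Return "device" or "cloud" for one piece of work."""
    # Private or offline-critical work stays local no matter what.
    if item.sensitive or item.must_work_offline:
        return "device"
    # Only non-sensitive, genuinely hard work goes out.
    if item.needs_heavy_reasoning:
        return "cloud"
    # Everything else takes the cheapest, most local option by default.
    return "device"


pieces = [
    WorkItem("summarise private messages", True, False, False),
    WorkItem("voice commands in a tunnel", False, True, False),
    WorkItem("draft a long research report", False, False, True),
    WorkItem("autocomplete a search box", False, False, False),
]

plan = {item.name: place(item) for item in pieces}
for name, where in plan.items():
    print(f"{where:>6}  {name}")
```

A piece that is both sensitive and too hard for the small model still lands on the device here. Under this rule the fix is to narrow or rescope that piece, not to ship its data out.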
The device-cloud split is an architecture decision
Deciding which work stays in the local branch and which goes to head office — a deliberate division of labour, not an all-or-nothing choice.
Treating "where does this run" as a per-feature decision — like the layers of cloud architecture or the rungs of the LLM ladder — is the mature approach. Some work belongs on the device for privacy, offline, latency, or cost; some belongs in the cloud for capability; and a good system places each where it fits. The device-cloud split isn't a single global choice but an architecture you design, putting each piece of work where its particular needs are best served.
The strongest pattern is hybrid: a small local model for the common, private, offline work, escalating to a cloud giant only for the hard cases. The device-cloud split is an architecture you design, piece by piece.
Using edge AI well comes down to recognising when local genuinely wins, and being honest about the capability you trade for it.
Reach for on-device when local genuinely wins
You choose to keep a task in-house when privacy, speed, offline, or cost makes it clearly better there — and outsource only when you truly need the outside expert.
Go on-device when one of its benefits is decisive: the data must stay private, the feature must work offline, the response must be instant, or the per-call cost must be zero — and the task is routine enough (or specialised enough) for a small model to handle. Don't force a genuinely hard, general task onto a weak local model just to avoid the cloud, and don't ship private data to the cloud when it could have stayed home. Match the location to what actually matters for the feature.
Be honest about the capability trade
You accept that the pocket tool isn't the workshop — and you only choose it for jobs it can genuinely do, not by pretending it's something it isn't.
The discipline is honesty about the trade: a local model gives you privacy, offline use, speed, and lower cost, and you pay for them in real capability. Don't pretend a small model is as smart as a giant; instead, scope the on-device feature to what the small model can genuinely do well (narrow it, specialise it, or keep it routine) and route the hard parts to the cloud. Used where its strengths line up with the need, edge AI is transformative; pushed past what a small model can do, it just disappoints. Choose it clear-eyed.
- Is a benefit decisive (privacy, offline, latency, or zero cost) that the cloud can't match?
- Is the task routine or specialised enough for a small model to do well?
- Can a model fit (small by design, quantized, or distilled) on the target device?
- Would fine-tuning make a small specialist match a giant on this narrow task?
- Should it be hybrid: local for the common case, cloud for the hard one?
- Am I honest about the capability I'm trading for local benefits?
- cloud / on-device / edge: where the model runs, a distant data center or locally.
- round-trip: the network trip to a server that on-device avoids.
- small language model (SLM): a compact model built to run on limited hardware.
- quantization: storing a model at lower precision to shrink it.
- distillation: training a small model to imitate a large one.
- specialization / fine-tuning: a narrow small model rivaling a giant at one task.
- hybrid: local for the common case, cloud for the hard one; the device-cloud split.
- You go local when a benefit is decisive: privacy, offline, latency, or cost.
- You use a small enough model (by design, quantized, or distilled) that fits the device.
- You specialise or fine-tune a small model to rival a giant on a narrow task.
- You design a hybrid split, keeping private and offline work local, escalating the hard cases.
- You're honest about the capability trade and scope the feature to what the small model can do.
Edge AI runs a small model right on the device, trading raw capability for privacy, offline use, instant response, and zero per-call cost. Specialise it to rival a giant on a narrow task, go hybrid to keep both, and be honest about the trade.