Express course · No. 31

Every call to a model costs money and time, metered by the token — and the bill scales with how much you send and how smart a model you use. Treating cost and speed as design constraints, not afterthoughts, is what separates a demo that works once from a product that runs affordably at scale. Learn the words and the levers, and 'make it cheaper' becomes routine instead of a crisis.

Essence only · One picture per idea · Cost is a design constraint

§ 01

To control the cost of an AI feature, you first have to understand the meter. Models are billed by the token, and almost every economic decision follows from what that means.

The token is the unit of cost

A taxi meter charging by the mile — the trip's price isn't a flat fee, it's the distance travelled, ticking up the whole way.

A token is a chunk of text — roughly a word or a piece of one — and it's the unit you're billed in. Every model call costs an amount based on the number of tokens involved, the way a taxi charges by distance. This is the foundational fact of AI economics: you're not paying per request at a flat rate, you're paying for the volume of text the model reads and writes. Once you see the token as the meter, the cost of everything becomes legible.

You pay for both what goes in and what comes out

A phone call where you're billed for both what you say and what you hear — the whole conversation counts, not just your half.

You're charged for input tokens (everything you send — the prompt, the context, the history) and output tokens (everything the model generates). Both halves count, and a long context you send on every call costs every time, even if the answer is short. This is why a feature that stuffs huge context into each request is expensive regardless of output size — you pay for the whole conversation, in both directions, on every single call.
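A minimal sketch of the two-sided meter. The prices are placeholders chosen to keep the arithmetic readable, not any provider's actual rates, and the `Price` and `call_cost` names are invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical prices in dollars per million tokens (placeholders)."""
    input_per_mtok: float
    output_per_mtok: float


def call_cost(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Dollar cost of one call: both directions are billed."""
    billed = (input_tokens * price.input_per_mtok
              + output_tokens * price.output_per_mtok)
    return billed / 1_000_000


# Placeholder rates, not real pricing.
EXAMPLE_PRICE = Price(input_per_mtok=1.0, output_per_mtok=4.0)

# Same short answer, very different context sizes.
long_context = call_cost(input_tokens=20_000, output_tokens=100, price=EXAMPLE_PRICE)
short_context = call_cost(input_tokens=500, output_tokens=100, price=EXAMPLE_PRICE)
```

With these assumed rates the long-context call costs over twenty times the short one, even though the answer is the same length.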

Cost scales with context and number of calls

A delivery service billing per package and per mile — your total is how many trips times how far each one goes. Cut either and the bill drops.

Your total cost has two multipliers: how many tokens per call (driven mostly by how much context you send) and how many calls you make (driven by how chatty your design is — a single call, a fixed chain, or an agent looping many times). A long context multiplied across many calls is how AI bills explode. Both levers are in your control, and almost every cost-saving technique in this course is really about pulling one of them: send less per call, or make fewer calls.
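The two multipliers can be worked through as plain arithmetic. This is a sketch under stated assumptions: a single blended rate per million tokens (a simplification, since input and output are usually priced separately) and made-up traffic numbers.

```python
def monthly_cost(tokens_per_call: int, calls_per_task: int,
                 tasks_per_month: int, dollars_per_mtok: float) -> float:
    """Total = tokens per call x calls per task x tasks, at a blended rate."""
    total_tokens = tokens_per_call * calls_per_task * tasks_per_month
    return total_tokens * dollars_per_mtok / 1_000_000


RATE = 2.0  # hypothetical blended dollars per million tokens

# One lean call per task versus an agent that loops ten times with a fat context.
single_call = monthly_cost(tokens_per_call=2_000, calls_per_task=1,
                           tasks_per_month=100_000, dollars_per_mtok=RATE)
agent_loop = monthly_cost(tokens_per_call=8_000, calls_per_task=10,
                          tasks_per_month=100_000, dollars_per_mtok=RATE)
```

Four times the context and ten times the calls compound to forty times the bill, which is why both levers matter.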

Models bill by the token, for both input and output. Your cost is tokens-per-call times number-of-calls — so every saving comes down to sending less or calling less.

§ 02

The instinct is to use the smartest model for everything. That instinct is expensive, because the price gap between models is enormous and most tasks don't need the top of the range.

Models vary hugely in price and power

A vehicle fleet from a bicycle to a freight truck — wildly different costs per trip, and using the truck to deliver a single letter is a lot of money for the job.

There's a wide spectrum of models, from small and cheap to large and expensive, and the price difference between the top and the bottom is often enormous — many times over per token. The biggest, smartest frontier model is dramatically more costly than a small one. So the choice of model is one of the largest cost decisions you make, and defaulting to the most powerful one for every task is defaulting to the most expensive one.

Most tasks don't need the smartest model

You don't hire a top surgeon to apply a bandage — the routine task is done just as well by someone far less expensive, and the specialist is wasted on it.

The frontier model is genuinely better at hard, open-ended reasoning — but most of what real products do isn't that. Classifying a message, extracting a field, summarising a paragraph, routing a request: a small, cheap model handles these at a quality you can't distinguish from the giant. Paying frontier prices for routine work is the most common waste in AI products. The smartest model is the exception you reach for, not the default you start from.

Match the model to the task's difficulty

A good manager assigns the hard problem to the expert and the routine work to the junior — matching the talent to the task, not the other way round.

The principle is to fit the model to what the task actually demands. Hard, novel, multi-step reasoning earns a powerful model; simple, well-scoped, repetitive work goes to a small one. This isn't a quality compromise: it's matching capability to need, so you stop overpaying for routine work while still bringing the heavy model where it's genuinely required. Getting this match right, task by task, is one of the biggest levers on an AI product's bill.

Model prices vary enormously, and most tasks don't need the frontier. Match the model to the task's difficulty — small for routine, powerful for genuinely hard — and stop overpaying for the easy majority.

§ 03

If most work is easy and some is hard, you don't pick one model for everything — you send each request to the right one. That dynamic choice is routing, and it's one of the highest-leverage patterns there is.

Default to the small model, escalate when needed

A support desk where the front-line staff handle most questions and only pass the genuinely tricky ones up to a specialist — most issues never need to escalate.

The core pattern is to route by difficulty: send each request to a small, cheap model by default, and escalate to a larger, pricier one only when the task actually needs it. Since the easy cases are the majority, most requests are handled cheaply, and you pay the premium only for the few that earn it. This flips the default from "use the best model and economise later" to "use the cheapest model that clears the bar, and escalate when it doesn't."
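One way to sketch the escalate-when-needed pattern. It assumes each model call returns an answer plus a confidence score; how that score is produced (a self-check, a validator, an eval) is left open here, and the stub models simply stand in for real API calls.

```python
from typing import Callable, Tuple

# A model here is any callable returning (answer, confidence in [0, 1]).
# Where the confidence comes from is an assumption of this sketch.
Model = Callable[[str], Tuple[str, float]]


def answer_with_escalation(request: str, small: Model, large: Model,
                           min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only if it is not confident."""
    answer, confidence = small(request)
    if confidence >= min_confidence:
        return answer, "small"
    answer, _ = large(request)
    return answer, "large"


def stub_small(request: str) -> Tuple[str, float]:
    # Pretend the small model is only sure about short requests.
    confidence = 0.95 if len(request) < 40 else 0.3
    return f"small:{request}", confidence


def stub_large(request: str) -> Tuple[str, float]:
    return f"large:{request}", 0.99
```

Note that an escalated request pays for both calls, so the pattern only saves money when most traffic stops at the small model.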

A router decides which model gets each request

A triage nurse who quickly assesses each patient and sends them to the right level of care — a fast, cheap judgement that directs the expensive resources where they're needed.

To route, something has to decide each request's difficulty — a router. It can be simple rules (short, structured tasks go small; long, open-ended ones go large), a cheap classifier, or even a small model judging the difficulty. The router itself must be cheap and fast, since it runs on everything. A good router quietly sends the bulk of traffic to the cheap model and reserves the frontier for the genuinely hard slice — turning the price spread between models into savings.
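The simplest router is a handful of rules. The sketch below is illustrative only: the keyword list and the word limit are invented for the example, and a real router would be tuned against your own traffic.

```python
# Crude, hypothetical signals that a request is open-ended.
OPEN_ENDED_HINTS = ("why", "design", "compare", "trade-off", "strategy")


def route(request: str, max_small_words: int = 60) -> str:
    """Return 'small' or 'large' from cheap rules; no model call needed."""
    text = request.lower()
    is_open_ended = any(hint in text for hint in OPEN_ENDED_HINTS)
    is_long = len(text.split()) > max_small_words
    return "large" if (is_open_ended or is_long) else "small"
```

Because it is just string checks, this router adds essentially nothing to cost or latency, which is the property that matters for something that runs on every request.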

Let a strong model plan, cheap models execute

An architect designs the building, but the construction crew does the bulk of the labour — you pay the expensive expert for the thinking and cheaper hands for the work.

A powerful version of routing is plan-and-execute: use one call to a strong model to break a hard task into concrete steps, then run those steps with a cheaper, smaller model. The expensive reasoning happens once; the bulk of the work runs cheap. This captures the frontier model's planning ability where it matters while keeping the per-step cost low — a heterogeneous design that can cut the bill dramatically versus running everything on the big model.
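A minimal sketch of plan-and-execute, with stub functions standing in for the strong and cheap models. The `Planner` and `Executor` types and the call counter are assumptions made for the illustration.

```python
from typing import Callable, Dict, List

Planner = Callable[[str], List[str]]  # strong model: task -> list of steps
Executor = Callable[[str], str]       # cheap model: step -> result


def plan_and_execute(task: str, planner: Planner, executor: Executor) -> List[str]:
    """One expensive planning call, then one cheap call per step."""
    steps = planner(task)
    return [executor(step) for step in steps]


# Stubs standing in for real model calls; they count how often they run.
calls: Dict[str, int] = {"planner": 0, "executor": 0}


def stub_planner(task: str) -> List[str]:
    calls["planner"] += 1
    return [f"{task}: step {i}" for i in (1, 2, 3)]


def stub_executor(step: str) -> str:
    calls["executor"] += 1
    return f"done({step})"


results = plan_and_execute("migrate the report", stub_planner, stub_executor)
```

The counter makes the cost shape visible: the strong model runs once, and the cheap model runs once per step.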

Route by difficulty: small model by default, escalate to the big one only when the task earns it. A cheap router and a plan-and-execute split turn the price spread between models into savings.

§ 04

The cheapest model call is the one you never make. When the same or similar work comes up repeatedly, remembering the answer instead of recomputing it is one of the biggest savings available.

Don't pay twice for the same answer

A clerk who keeps the most-asked answers on a card at the desk, instead of looking each one up from scratch every single time someone asks.

If your app produces the same model output repeatedly (the same question, the same document summarised, the same lookup), you're paying for identical work again and again. Caching stores the result the first time and serves it almost instantly, at near-zero cost, on the repeats (the caching course covers the mechanics). The expensive call happens once; the cheap reads happen many times. Wherever the same input recurs, a cache turns repeated cost into a single one.
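An exact-match cache can be sketched in a few lines. This version keeps results in an in-memory dictionary keyed by a hash of the prompt; the `ResponseCache` class and the stub model are assumptions for illustration, and a production cache would also need expiry and invalidation.

```python
import hashlib
from typing import Callable, Dict, List


class ResponseCache:
    """Serve repeated prompts from memory instead of calling the model again."""

    def __init__(self, model: Callable[[str], str]) -> None:
        self._model = model
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def ask(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._model(prompt)  # the only place money is spent
        self._store[key] = result
        return result


model_calls: List[str] = []


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; records each invocation.
    model_calls.append(prompt)
    return prompt.upper()


cache = ResponseCache(stub_model)
first = cache.ask("summarise doc 42")
second = cache.ask("summarise doc 42")
```

The second request never reaches the model, so only the first one is billed.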

Prompt caching reuses the unchanging part

A form letter where the long standard preamble is pre-printed, and you only fill in the few lines that change — you don't rewrite the whole thing each time.

Often a big chunk of your context is the same on every call — a long system prompt, fixed instructions, a shared document. Prompt caching lets the model reuse that unchanging prefix instead of reprocessing it every time, charging far less for the repeated part. Since that fixed context is often the bulk of your input tokens, caching it can cut input cost substantially on high-volume features. It's a near-free saving you get just by structuring the stable part of your prompt so it can be cached.
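To get a feel for the saving, here is a rough cost model. It assumes the first call pays full price for the prefix and later calls pay a discounted rate for it; the 0.1 multiplier is a placeholder, and real pricing rules (including any charge for writing to the cache, or cache expiry) differ by provider.

```python
from typing import Optional


def input_cost(prefix_tokens: int, variable_tokens: int, calls: int,
               dollars_per_mtok: float,
               cached_read_multiplier: Optional[float] = None) -> float:
    """Input-side cost over many calls, with or without a cached prefix."""
    full_call = prefix_tokens + variable_tokens
    if cached_read_multiplier is None:
        billable = float(full_call * calls)
    else:
        # Assumption: first call at full price, the rest read the prefix cheaply.
        cached_call = prefix_tokens * cached_read_multiplier + variable_tokens
        billable = full_call + (calls - 1) * cached_call
    return billable * dollars_per_mtok / 1_000_000


# Hypothetical feature: 10,000-token fixed prefix, 200 changing tokens, 1,000 calls.
uncached = input_cost(10_000, 200, 1_000, dollars_per_mtok=1.0)
cached = input_cost(10_000, 200, 1_000, dollars_per_mtok=1.0,
                    cached_read_multiplier=0.1)
```

Under these assumptions the input bill falls from about $10.20 to about $1.21, because the fixed prefix was nearly all of the input.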

Semantic caching catches near-duplicates

A help desk that recognises "how do I reset my password" and "I forgot my password" as the same question — and gives the same prepared answer to both.

Beyond exact matches, semantic caching uses meaning (via embeddings) to recognise when a new request is close enough to a previous one to reuse the answer — "what's your refund policy" and "how do returns work" need not be recomputed separately. This catches the common case where users ask the same thing in different words, extending the cache's reach far beyond identical inputs. Used carefully, it turns a long tail of rephrased-but-equivalent questions into cheap cache hits.
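A sketch of the lookup logic, assuming some embedding function is available. The `toy_embed` below is a hand-made stand-in that maps a few keywords to concepts so the example runs on its own; it is not a real embedding model, and the 0.9 threshold is an arbitrary choice.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Reuse an answer when a new query is close enough in meaning."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Vector, str]] = []

    def store(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))

    def lookup(self, query: str) -> Optional[str]:
        query_vec = self._embed(query)
        best_answer, best_score = None, 0.0
        for vec, answer in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self._threshold else None


# Toy stand-in for an embedding model: keyword -> concept index.
CONCEPTS: Dict[str, int] = {"refund": 0, "returns": 0,
                            "password": 1, "reset": 1, "forgot": 1}


def toy_embed(text: str) -> Vector:
    vec = [0.0, 0.0]
    for word in text.lower().replace("?", "").split():
        if word in CONCEPTS:
            vec[CONCEPTS[word]] += 1.0
    return vec


faq_cache = SemanticCache(toy_embed)
faq_cache.store("what's your refund policy", "Refunds within 30 days.")
```

The threshold is the "used carefully" part: set it too low and users get a cached answer to a question they did not ask.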

The cheapest call is the one you skip. Cache repeated answers, reuse the unchanging prompt prefix, and catch rephrased duplicates semantically — repeated work is repeated cost you don't have to pay.

§ 05

When you do make a call, you control how much it costs by controlling how much you send. Sending less and grouping work are the everyday levers that keep per-call cost down.

Send less context

Packing only what you'll actually use for the trip instead of your whole wardrobe — every extra item costs to carry, and most of it you'd never touch.

Since you pay per input token, the most direct saving is to send less context. Trim the conversation history, summarise old turns instead of carrying them verbatim, retrieve only the few most relevant chunks rather than everything. This is the context-engineering discipline paying off twice: a tighter context isn't just better for quality, it's cheaper on every call. Most bloated AI bills have bloated prompts underneath them — cut the padding and the cost falls with it.
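Trimming history to a budget might look like the sketch below. The token counter is a crude word-count proxy used only so the example runs without a tokenizer; real token counts depend on the model's tokenizer and will differ.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude proxy: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def trim_history(turns: List[str], budget: int,
                 count_tokens: Callable[[str], int] = rough_token_count) -> List[str]:
    """Keep the most recent turns that fit within the token budget."""
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))


history = ["a b c d", "e f", "g h i"]
trimmed = trim_history(history, budget=5)
```

Here the oldest turn is dropped and the two most recent survive in their original order. Summarising the dropped turns instead of discarding them is the natural next step.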

Ask for less output, structured

Asking a question that wants a yes-or-no gets you a quick answer; asking one that invites an essay gets you an essay you have to pay for and then trim.

You pay for output tokens too, so don't make the model generate more than you need. Ask for concise, structured output — the specific fields, a short answer, no rambling preamble — rather than a long essay you'll only parse a number out of. Structured output (the structured-output course) does double duty here: it's more reliable and it's cheaper, because a tight schema produces fewer tokens than free prose. Constrain what comes back, and you constrain the output half of the bill.

Batch work where you can

Running a full load of laundry instead of one shirt at a time — grouping the work spreads the fixed overhead and gets more done per run.

When you have many similar requests that don't each need an instant answer, batching them (processing many together) is often cheaper than firing them one at a time, and providers frequently offer a discount for batch jobs that can run when convenient. This trades latency (each item waits for the batch) for lower cost, which is the right deal for background or bulk work, such as processing a backlog or enriching a dataset, where speed per item doesn't matter. If the answer isn't needed right now, batching makes it cheaper.
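Grouping the work is the easy part, and can be sketched with a small helper. How a batch is then submitted, and what discount applies, depends on the provider and is not shown here; `batches` is a name invented for the example.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Split a backlog into fixed-size groups; the last may be smaller."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


backlog = list(range(7))
groups = list(batches(backlog, size=3))
```

Seven items in groups of three become three jobs instead of seven separate calls.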

Control per-call cost by sending less: trim context, ask for concise structured output, and batch non-urgent work. A tighter prompt is cheaper as well as better.

§ 06

Cost never lives alone — it trades against speed and quality. Seeing the three as a triangle you balance per use case keeps you from optimising one into ruining another.

The three pull against each other

The old workshop sign — "fast, cheap, good: pick two" — because pushing hard on one usually costs you another.

Cost, latency (speed), and quality form a triangle, and they trade off. The biggest, smartest model gives the best quality but costs the most and is often slowest; a small model is cheap and fast but weaker on hard tasks; heavy context improves quality but raises both cost and latency. You rarely get all three maxed at once. So you don't optimise cost in isolation — you decide, for each use case, which corner matters most and what you'll trade for it.

Pick the balance per use case

You choose a sports car for the race and a cargo van for the move — same question, opposite answers, because the job decides what matters.

The right balance depends entirely on the feature. A user-facing chat lives or dies on latency and needs a fast, capable model. A nightly batch job cares mostly about cost and can use the cheapest option that still clears the quality bar, however slow. A high-stakes legal or medical answer prioritises quality above both. There's no single right point on the triangle. Naming which corner this particular feature must win, and which it can sacrifice, is the decision that drives every model and design choice around it.

Streaming buys perceived speed cheaply

A kitchen that brings out each dish as it's ready instead of making you wait for the whole meal — the total time is the same, but the wait feels far shorter.

One trick sidesteps the triangle: streaming the model's output token by token as it's generated, so the user sees words appearing immediately instead of staring at a blank screen until the whole answer is done. The total time is unchanged, but the perceived latency drops dramatically, because something is happening right away. This is a cheap way to make a feature feel fast without a faster model — managing the experience of latency rather than the latency itself.
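The difference between total time and time to first token can be shown with a simulated stream. The generator below fakes generation with a fixed, made-up time per token rather than calling a real model.

```python
from typing import Iterator, List, Tuple


def stream_tokens(tokens: List[str],
                  seconds_per_token: float) -> Iterator[Tuple[float, str]]:
    """Yield (simulated elapsed seconds, token) as each token is 'generated'."""
    elapsed = 0.0
    for token in tokens:
        elapsed += seconds_per_token
        yield elapsed, token


ANSWER = "Streaming shows words as they arrive".split()
events = list(stream_tokens(ANSWER, seconds_per_token=0.05))

time_to_first_token = events[0][0]  # what the user feels when streaming
total_time = events[-1][0]          # what the user waits for without it
```

The total is the same either way; streaming only changes how soon the user sees the first word.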

Cost, latency, and quality trade off, and you rarely max all three. Pick which corner each feature must win, choose deliberately which one to sacrifice, and use streaming to buy perceived speed cheaply.

§ 07

AI economics comes down to a simple posture: know what things cost, and deliberately spend the least that does the job. The levers are few, and most of the savings come from refusing to overpay by default.

Measure cost per task, not just the total bill

An itemised bill that shows which dish cost what, instead of one big number — only the breakdown tells you where to cut.

You can't control a cost you don't measure. Track what each feature, each call type, each task actually costs in tokens, not just the lump monthly bill, so you can see where the money goes and which part to optimise. Cost is something you instrument and watch, like performance — most expensive AI features have one or two cost hotspots that a breakdown reveals instantly and a total hides completely. Measure per task, and the place to save announces itself.
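Per-task measurement can start as a simple tally. This sketch assumes you can read input and output token counts for each call and tag the call with a feature name; `CostTracker` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class CostTracker:
    """Accumulate token usage per feature so hotspots are visible."""

    def __init__(self) -> None:
        self._tokens: DefaultDict[str, int] = defaultdict(int)

    def record(self, feature: str, input_tokens: int, output_tokens: int) -> None:
        self._tokens[feature] += input_tokens + output_tokens

    def hotspots(self) -> List[Tuple[str, int]]:
        """Features sorted by total tokens, most expensive first."""
        return sorted(self._tokens.items(), key=lambda item: item[1], reverse=True)


tracker = CostTracker()
tracker.record("chat", input_tokens=1_200, output_tokens=300)
tracker.record("summarise", input_tokens=9_000, output_tokens=400)
tracker.record("chat", input_tokens=800, output_tokens=200)
```

Even though chat is called more often here, the breakdown shows summarisation using far more tokens, which is exactly what a single total would hide.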

Use the cheapest thing that clears the bar

You buy the tool that's good enough for the job, not the most expensive one on the shelf — capability beyond what you need is just money spent.

The governing principle is to use the cheapest model, the smallest context, and the fewest calls that still meet your quality bar — and no more. This isn't cheapness for its own sake; it's refusing to pay for capability the task doesn't use. Set the quality bar with evals, then find the lightest configuration that clears it. Most cost savings aren't clever tricks — they're just declining to default to the most powerful, most context-heavy, chattiest option when a leaner one passes the bar.
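Once evals give each configuration a score and measurement gives it a cost, the choice is mechanical. The configurations, scores, and costs below are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Config:
    name: str
    cost_per_task: float  # dollars, from your own measurements
    eval_score: float     # 0..1, from your own evals


def cheapest_passing(configs: List[Config], quality_bar: float) -> Optional[Config]:
    """Return the lowest-cost configuration that meets the bar, if any."""
    passing = [c for c in configs if c.eval_score >= quality_bar]
    return min(passing, key=lambda c: c.cost_per_task, default=None)


# Hypothetical candidates.
CANDIDATES = [
    Config("small model, short context", cost_per_task=0.001, eval_score=0.82),
    Config("small model, long context", cost_per_task=0.004, eval_score=0.91),
    Config("large model, long context", cost_per_task=0.030, eval_score=0.95),
]

choice = cheapest_passing(CANDIDATES, quality_bar=0.90)
```

With a bar of 0.90 the middle option wins: the cheapest fails the bar, and the most powerful clears it at several times the cost.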

Before you ship an AI feature at scale

- Do you know the cost per task, measured in tokens, not just the total bill?
- Is the model matched to difficulty: small for routine, frontier only where needed?
- Are you routing: easy majority to a cheap model, escalating the hard cases?
- Are you caching: repeated answers, the fixed prompt prefix, near-duplicates?
- Are you sending the minimum: trimmed context, concise structured output, batched where non-urgent?
- Which triangle corner wins (cost, latency, or quality), and is the design tuned for it?

The words you now own

- token / input / output: the unit of billing, charged for what you send and what comes back.
- frontier model: the biggest, smartest, and most expensive option.
- routing / router: sending each request to the right model by difficulty.
- plan-and-execute: a strong model plans, cheap models do the bulk.
- caching / prompt caching / semantic caching: reusing answers, fixed prefixes, and near-duplicates.
- batching: grouping non-urgent work to lower cost, trading latency.
- latency / cost / quality: the triangle you balance per use case; streaming for perceived speed.

Signs you manage cost well

- You measure cost per task and know your hotspots, not just the monthly total.
- You match the model to difficulty and route instead of using the frontier for everything.
- You cache repeated work and the fixed prompt prefix.
- You send the minimum context and ask for concise output.
- You balance the triangle per feature and use streaming to make things feel fast.

AI economics is deliberate thrift: bill by the token, match the model to difficulty, route, cache, send the minimum, and balance cost against latency and quality — the cheapest thing that clears the bar wins.