Express course · No. 26

An LLM is brilliant and blank: between calls it remembers nothing, and on each call it knows only the text you put in front of it. Context engineering is the discipline of assembling exactly the right text into that window — the instructions, the facts, the history, the tools, no more and no less. It's where LLM quality actually lives, and it's mostly not about clever wording.

Essence only · One picture per idea · Engineering over magic

§ 01

Everything in this course follows from one fact about how a language model works: it has no memory, and its entire awareness on any call is the text you send it. Get this and the rest is obvious.

The model remembers nothing between calls

A brilliant consultant with total amnesia — every meeting, you must re-hand them every document, because they remember nothing from last time.

A language model is stateless: it keeps nothing between requests. The illusion of a chatbot remembering your conversation is just the whole history being re-sent every time. The model itself starts fresh on each call, knowing only what's in front of it right now. So "what does the model know?" is never about the model — it's entirely about what you chose to include this time.

The context window is its only working memory

A worker who can only act on the notes pinned to the board in front of them — clear the board and it's as if nothing ever happened.

The context window is the block of text the model sees on a call — your instructions, the conversation, any data, the tools. It is the model's entire working memory, and it has a finite size. If a fact, a rule, or a past message isn't in the window, the model simply doesn't know it. Everything the model can reason about, it reasons about from the window — which makes the window the thing you actually engineer.

So quality is decided by what you put in

A genius answering only from the single folder you hand them — give them the right folder and the answer is brilliant; give them the wrong one and it's confidently useless.

Because the model works only from the window, the quality of its output is mostly decided before the model runs — by what you assembled. The same model gives a great answer or a hopeless one depending entirely on whether the right context was in front of it. This reframes the whole job: you're not coaxing intelligence out of the model, you're curating what it gets to see. That curation is context engineering.

The model is stateless and knows only its context window. Output quality is decided by what you assemble into that window — so the window, not the model, is what you engineer.

§ 02

People talk about "prompt engineering," but as apps get serious the real craft moves elsewhere — from wording a request to assembling the right information around it. That shift is the heart of this course.

Wording the request versus assembling the context

A lawyer wins not by clever phrasing but by handing the judge exactly the right documents, in the right order, with nothing irrelevant — the case is in the papers, not the speech.

Prompt engineering is about how you phrase the instruction. Context engineering is the bigger job: assembling all the information the model needs — the relevant facts, the history, the examples, the tools — into the window for each call. As applications grow, wording matters less and assembly matters more. Most "the AI gave a bad answer" moments are not bad phrasing; they're the model missing context it was never given.

Most quality problems are context problems

An assistant who gives a wrong answer because you forgot to mention the one constraint that changed everything — not stupid, just uninformed.

When an LLM feature underperforms, the instinct is to tweak the prompt wording or blame the model. Far more often the real cause is missing or wrong context: the relevant document wasn't retrieved, a key rule wasn't included, stale history confused it. Before you reword anything, ask what the model actually had in front of it. Fixing the context fixes more bugs than fixing the phrasing ever will.

The model is only as good as its inputs

The finest chef can only cook with the ingredients on the counter — hand them the wrong ones and skill can't save the dish.

A capable model with poor context loses to a weaker model with excellent context. This is why context engineering, not model choice, is usually the highest-leverage place to spend effort: the same model's output swings wildly with what you feed it. You don't get a better answer by demanding harder — you get it by putting better, sharper, more relevant material in the window. The inputs are the lever.

Prompt engineering is the wording; context engineering is assembling the right information. The second is where quality lives — and most bad answers are missing context, not bad phrasing.

§ 03

Some of what goes in the window is durable and some is per-request, and a few specific ingredients do most of the work. Knowing what they are makes assembling context concrete.

The system prompt sets standing behaviour

A restaurant's standing policy — "we're vegetarian, we close at ten" — versus tonight's specific order. One frames everything; the other is the ask.

The system prompt holds durable instructions that apply across the whole interaction: the model's role, the rules it must follow, the tone, the output format. The user message is the specific request. Putting your persona, guardrails, and standing rules in the system prompt keeps them stable while the user's questions vary. It's the difference between who the model is on every call and what it's being asked this time.

Show, don't just tell: few-shot examples

Teaching someone a format by showing three finished examples is faster and clearer than describing it in words and hoping they picture it right.

Often the most reliable way to get the shape of output you want is to include a few examples of input and ideal output right in the context — called few-shot prompting. The model pattern-matches examples far better than it follows an abstract description. Two or three sharp examples can beat a paragraph of instructions, and they're often the fastest fix when the model almost does what you want but not quite.

Assemble the window from parts

Packing a briefcase for a meeting: the agenda, the two relevant reports, a note of the constraints — chosen on purpose, not your whole filing cabinet tipped in.

A real context window is assembled from pieces every call: the system prompt, the relevant history, retrieved facts (RAG), examples, the available tools, and finally the user's request. Context engineering is deciding — deliberately, each time — which pieces go in and which stay out. Thinking of the window as something you build from parts, rather than just a prompt you type, is the mental shift that makes everything else click.

The system prompt sets durable behaviour; the user message is the request; few-shot examples show the shape you want. A window is assembled from these parts, deliberately, every call.

§ 04

The biggest beginner mistake in context engineering is thinking more is better. The opposite is true: every irrelevant token you add actively makes the answer worse.

More context is not better context

A briefing that's one sharp page beats a 200-page dump — the reader finds the signal instead of drowning in it, and decides faster and better.

It's tempting to stuff everything possibly relevant into the window "just in case." But more isn't better — it's usually worse. Every token competes for the model's attention, and irrelevant material dilutes the signal, pulls the model toward whatever's nearby, and makes it likelier to latch onto the wrong thing. The skill is selecting the few things that matter, not including everything that might.

Noise raises the chance of a wrong answer

Hide the one relevant fact inside a mountain of mostly-irrelevant paper and even a careful reader starts citing the wrong page — the noise itself causes the error.

Long, padded context doesn't just waste space; it actively degrades quality. Buried in irrelevant material, the model's attention spreads thin, and in long contexts this measurably raises the rate of hallucination and mistakes — it confidently uses something that was never meant to be the answer. Adding "just in case" context can be the very thing that causes the failure. Less, sharper context is more reliable context.

Curate ruthlessly for this call

A good editor doesn't add — they cut, handing you only the lines that earn their place on this page.

The discipline is ruthless curation: for this specific call, include the few things the model genuinely needs and leave everything else out. That means retrieving only the most relevant chunks, trimming history to what matters, dropping context that's done its job. Relevance over completeness is the governing principle — you're not trying to give the model everything, you're trying to give it exactly enough.

More context is not better — irrelevant tokens dilute attention and raise hallucination. Curate ruthlessly: give the model exactly what this call needs, and nothing else.

§ 05

Beyond quality, the window is a finite, paid resource. Treating it as a scarce budget rather than an infinite dumping ground is what separates a toy from a product.

The window is finite and every token costs

A suitcase with a weight limit and a fee per kilo — you pack what you'll actually need, not your whole wardrobe, because there's a hard cap and a price.

The context window has a maximum size, and every token in it costs latency and money — bigger contexts are slower and more expensive on every single call. So context isn't free space to fill; it's a budget you spend. A feature that stuffs the window on every request is slow and costly at scale, even before quality suffers. The window is a resource to allocate, not a void to fill.

Spend it on what earns its place

A tight travel budget forces real choices — you spend on what matters for the trip and skip the rest, and the trip is better for the discipline.

Treating the window as a budget changes how you build: you trim the conversation history, summarise old turns instead of carrying them verbatim, and retrieve only what's relevant now rather than everything you have. Each token should earn its slot. This is the same discipline as performance or caching — spend the scarce resource deliberately, on the things that actually move the outcome.

A bigger window doesn't end the discipline

Renting a bigger truck doesn't mean you should fill it with junk — more capacity is room for more useful things, not licence to stop packing carefully.

Models keep shipping larger context windows — a million tokens, more — and it's tempting to think the budget problem goes away. It doesn't. A bigger window still costs more and slower per token, and crucially, a fuller window still dilutes attention and invites mistakes. More capacity is not permission to dump; it's just more room you still have to spend wisely. The discipline of curating the window survives any size increase.

The window is finite and every token costs latency and money. Spend it as a budget — trim, summarise, retrieve only what's needed — and don't let a bigger window end the discipline.

§ 06

In long, multi-step interactions, the context doesn't just sit there — it accumulates and decays. Managing that decay is the hardest and most important part of context engineering at scale.

Long contexts accumulate stale, contradictory state

A game of telephone down a long line — by the end the message has quietly mutated, and everyone is confidently repeating something slightly wrong.

Across a long conversation or a multi-step agent task, the window fills with history: earlier turns, tool outputs, half-finished reasoning. Over time this pile grows messy — stale facts sit next to fresh ones, contradictions accumulate, the original goal slides out of focus. The industry calls the resulting quality decay context rot: the same model gets less reliable as its own accumulated context gets noisier.

Compress as you go

A good note-taker doesn't keep every word of a long meeting — they distil it to the decisions and open questions, carrying forward the conclusion, not the transcript.

The fix is to actively compress the context as it grows: summarise finished steps into a short running state instead of dragging the full transcript forward, drop tool outputs once you've extracted what mattered, keep a tight working summary rather than the raw history. The model carries the conclusion, not every word that produced it. This is how long-running agents stay coherent — they manage their own context instead of letting it pile up.

Re-anchor the goal

On a long, winding hike you re-check the map and the destination regularly — otherwise you drift, one reasonable-looking turn at a time, away from where you meant to go.

Over many steps, the original objective erodes under the weight of everything that's happened since. So you re-anchor it: restate the goal and the key constraints in the window at each step, so the model stays pointed at the real target instead of drifting with the conversation. A bigger window makes this worse, not better — it gives drift more room to accumulate. Active management, not raw capacity, is what keeps a long task on track.

Long contexts rot — stale, contradictory state accumulates and quality decays. Compress finished steps into a running summary and re-anchor the goal, because a bigger window only gives drift more room.

§ 07

Context engineering done well is deliberate assembly, measured like everything else. The whole practice comes down to building the window on purpose and checking that it helped.

Build the window deliberately, not by accident

A chef plates a dish on purpose — each element placed for a reason — instead of scraping whatever's on the counter onto the plate.

The core habit is treating the window as something you construct deliberately: decide what the system prompt holds, what history to keep or summarise, what to retrieve, which examples to show, which tools to expose. Many LLM apps assemble context by accident — whatever happened to be lying around gets sent. Building it on purpose, with each part justified, is most of what separates a reliable feature from a flaky one.

Measure context the way you measure code

You don't guess whether a change helped — you test it. The same goes for what you put in the window: an answer got better or worse, and you should know which.

Context changes are real changes, so verify them: when you adjust what goes in the window — more examples, different retrieval, trimmed history — check the effect with evals, don't just eyeball one output. Context engineering without measurement is tuning by vibes, and it's how a "harmless" addition quietly degrades quality. Treat the window's contents as a thing you test, the same discipline as the evals course.

Before you ship an LLM feature
  • Is everything the model needs in the window — or are you expecting it to know what you never gave it? - System versus user — durable rules in the system prompt, the request in the user message? - Would examples help — is this a case where few-shot beats more instructions? - Is every piece relevant — or is padding diluting attention and raising errors? - Are you within budget — trimming and summarising rather than dumping everything? - For long tasks, are you compressing and re-anchoring against context rot?
The words you now own
  • stateless / context window — the model remembers nothing; the window is its only working memory. - prompt engineering / context engineering — wording the request, versus assembling the information. - system prompt / user message — durable standing behaviour, versus the specific request. - few-shot — teaching the shape of output by showing examples in the context. - relevance over completeness — curating the few things that matter, not stuffing everything. - context budget — the window is finite and every token costs latency and money. - context rot / drift — long contexts accumulate stale state and decay; compress and re-anchor.
Signs you engineer context well
  • You fix bad answers by checking what's in the window before rewording the prompt. - Durable rules live in the system prompt; you reach for few-shot examples when shape matters. - You curate ruthlessly — relevance over completeness — instead of padding the window. - You treat the window as a budget, trimming and summarising rather than dumping. - For long tasks you compress and re-anchor, and you measure context changes instead of guessing.

Context engineering is deliberate assembly of the window: the right rules, the right facts, the right examples, curated to relevance, kept in budget, and managed against rot — built on purpose and measured, not typed and hoped.

End of express course · 7 chapters · engineering over magic

Next comes practice: take an LLM feature that sometimes gives bad answers, and before touching the prompt wording, log exactly what was in the window on a bad call — you'll usually find missing, stale, or irrelevant context, not bad phrasing. Then fix the assembly: add the missing fact, cut the noise, summarise the history. The discipline clicks the moment a better-built window beats a cleverer prompt. But hold one idea above the rest: the model only knows what's in the window. Stop trying to coax intelligence out of it, and start engineering what it gets to see — that's where the quality was the whole time.