Express course · No. 07
A language model is a brilliant next-word guesser with no memory and a habit of making things up. Building with it isn't about magic prompts — it's the engineering you wrap around a powerful, unreliable component to make it trustworthy.
Essence only · One picture per idea · Engineering over magic
Before building with a language model, understand what it really is — because almost every mistake comes from expecting it to be something it isn't.
It predicts the next word, very well
Your phone's keyboard suggesting the next word — but scaled up so far it can write an essay, debug code, or explain physics, one plausible word at a time.
An LLM is a next-token predictor trained on a huge slice of human text. It doesn't look things up; it generates what's statistically likely to come next. That's why it's astonishing at language and reasoning patterns — and why it isn't a database of facts. It produces what sounds right, which is usually right, and sometimes confidently wrong.
It has no memory between calls
A brilliant consultant with total amnesia — every meeting, you must re-hand them every document, because they remember nothing from last time.
A model remembers nothing between requests. Its only working memory is the context window — the text you send in this call. Chat history, your data, your instructions: if it isn't in the context, the model doesn't know it. Everything in this course is, in some sense, about what you put in that window.
It will make things up — confidently
A charming know-it-all at a dinner party who would rather invent a convincing answer than admit they don't know.
When the model doesn't know, it doesn't stop — it hallucinates, producing fluent, plausible, wrong text with the same confidence as the truth. This isn't a bug you can fully patch; it's the nature of a guesser. Most of the engineering in this course exists to manage that one fact.
It's non-deterministic — treat it that way
Asking the same question to ten experts — you get ten slightly different good answers, not one identical reply.
The same prompt can give different output each time. So you can't treat an LLM like a function with a fixed return; you treat it as smart, fallible, non-deterministic input — to be constrained, validated, and checked, never blindly trusted. That mindset is the whole game.
An LLM doesn't know things. It predicts them. Build everything else around that one truth.
The prompt is how you tell the model what you want. The clearer the instruction, the better the output — and most "the AI is dumb" moments are really unclear prompts.
A prompt is a brief, not a spell
Handing a task to a sharp new hire: who they're acting as, what you want, the constraints, and an example of "good." Vague brief, vague result.
A good prompt states the role ("you are a careful editor"), the task, the constraints (length, tone, format), and ideally an example. There are no magic words — just clear instructions, the same as you'd give a capable person. Most prompt problems are really specification problems.
System vs user: standing rules vs the request
A restaurant's standing policy — "we're vegetarian, we close at ten" — versus tonight's specific order. One frames everything; the other is the ask.
The system prompt sets durable behaviour — role, rules, tone — that holds across the whole conversation. The user prompt is the specific request. Putting your persona and guardrails in the system prompt keeps them stable while the user messages vary.
Show, don't just tell: few-shot examples
Teaching someone a format by showing three finished examples is faster and clearer than describing it in words.
Often the best way to get the shape of answer you want is to include a few examples of input and desired output right in the prompt ("few-shot"). The model pattern-matches them better than it follows an abstract description. Two or three good examples can beat a paragraph of instructions.
Ask for structure your code can use
A form with labelled boxes gets you fillable data; "write me a paragraph about it" gets you prose you then have to parse.
If your program will consume the output, ask for structured output — JSON matching a schema, not free text. Modern models can be constrained to valid JSON, turning the LLM from a chatbot into a component your code can rely on. Structured output is the bridge between language and software.
There are no magic words. A good prompt is a clear brief — role, task, constraints, examples.
As apps get serious, the craft shifts from wording the prompt to assembling the right context. This is the real discipline — "context engineering."
The hard part is what you put in the window
A lawyer wins not by clever phrasing but by handing the judge exactly the right documents, in the right order, with nothing irrelevant.
"Prompt engineering" was about phrasing; context engineering is about assembling the right information — the relevant facts, history, and tools — into the window for each call. The model's answer is only as good as what you put in front of it. Most quality problems are context problems, not wording problems.
Relevance beats completeness
A briefing that's one sharp page beats a 200-page dump — the reader finds the signal instead of drowning in it.
More context is not better. Every token should earn its place: irrelevant text dilutes the model's attention, costs money, and actually raises the chance of hallucination in long contexts. The skill is selecting the few things that matter, not stuffing in everything that might.
The context window is a budget
A suitcase with a weight limit — you pack what you'll actually need, not your whole wardrobe, because there's a hard cap and a fee.
The window is finite, and every token costs latency and money. So you spend it deliberately: trim the history, summarise the old, retrieve only what's relevant now. Treating context as a scarce budget — not an infinite dumping ground — is what separates a toy from a product.
Prompt engineering is the wording. Context engineering is the documents. The second one is where quality lives.
The model doesn't know your data, or anything after its training cutoff. RAG is how you hand it the facts to answer from — the single most important pattern in serious LLM apps.
RAG: retrieve first, then answer
An open-book exam — instead of trusting memory, you look up the relevant pages first, then write the answer from what's in front of you.
Retrieval-Augmented Generation means: take the user's question, fetch the most relevant chunks of your documents, put them in the context, and ask the model to answer from them. Now it works from your actual data, not its fuzzy memory. This is how assistants answer about your product, your docs, your knowledge base.
Embeddings and vector search find the right chunks
A librarian who finds books by what they're about, not by exact title — "things like this," by meaning.
To find relevant chunks, you store your documents as embeddings in a vector database and search by similarity to the question (the engine from the databases course). Good retrieval is the heart of RAG: get the right chunks in, and the answer is grounded; get junk in, and the model confidently answers from junk.
Grounding kills hallucination — mostly
A journalist who must cite a source for every claim writes far fewer fabrications than one writing from memory.
The big win of RAG is grounding: because the answer is drawn from retrieved text, you can demand citations and check that claims are actually supported. It doesn't erase hallucination, but it slashes it — and lets you catch what slips through. An answer you can trace to a source is one you can trust.
Garbage in, confident garbage out
Hand someone the wrong file and ask them to summarise it — they'll give you a perfect summary of the wrong thing.
RAG only works if retrieval works. Bad chunking, a weak search, stale data — and the model faithfully answers from the wrong context, sounding just as sure. So most of the effort in RAG isn't the model; it's chunking, indexing, and measuring retrieval quality. Fix retrieval before you blame the model.
Don't ask the model what it knows. Hand it the facts and ask it to answer from those.
A prompt produces text. To make the model do things — search, calculate, send, look up — you give it tools; to make it pursue a goal over many steps, you put it in a loop. That's an agent.
Tool use: let it call your functions
A smart assistant who can't reach the filing cabinet themselves — but can ask you to, and tell you exactly which file to pull.
With tool use (function calling), you describe functions the model may request — search_orders, send_email, get_weather — and when it wants one, it returns a structured call, your code runs it, and you feed the result back. The model decides what to do; your code stays in control of doing it. This is how an LLM reaches beyond text into the real world.
An agent is the model in a loop
A person solving a problem: think, take an action, look at the result, think again — repeating until it's done, not in one shot.
An agent wires the model into a loop — reason, call a tool, observe the result, reason again — with the LLM as the orchestrator deciding each next step. This lets it handle open-ended tasks ("research this and draft a reply") that no single prompt could. Coding assistants and research bots work this way.
Memory beyond the window
A long project needs a notebook — you can't hold months of work in your head, so you write down what matters and look it back up.
Because the window is finite, agents need memory: notes, summaries, and retrievable history kept outside the context and pulled back in when relevant. Without it, an agent forgets the start of a long task by the end. Memory is what turns a chat into something that can work over time.
Loops are powerful — and a liability
A robot vacuum that mostly cleans the house, but occasionally gets stuck spinning in a corner or wanders into the pool.
An agent loop is unpredictable: it can take wrong turns, repeat itself, run up cost, or act on a bad decision. So you keep it on a leash — step limits, tool permissions, human approval for risky actions, full logging. Reach for an agent only when the task genuinely needs many adaptive steps; a single call or a fixed chain is cheaper and safer when it'll do.
Tools let the model act. A loop lets it persist. Both need a leash — capability without limits is a liability.
A demo that works once is easy. A system you can trust takes the unglamorous work — measuring, bounding, and watching a component that's non-deterministic by nature.
Evals: you can't improve what you don't measure
A school that never grades anyone has no idea who's learning — and no way to get better.
An eval is a test suite for your AI: inputs with known-good expectations, scored automatically — did it retrieve the right docs, call the right tool, answer faithfully? Without evals you're tuning prompts by vibes. And expect a hard truth: reaching 80% quality is quick; grinding from there to 95% is most of the work.
Guardrails: boundaries on input and output
Bumpers on a bowling lane — they don't roll the ball for you, but they keep it out of the gutter.
Guardrails are checks around the model: on the way in, block prompt-injection and out-of-scope requests; on the way out, filter unsafe content and validate the format before it reaches a user or another system. They run before and after the model, in your code — because you can't trust the model to police itself.
Cost and latency are design constraints
A taxi meter running the whole ride — every extra mile of context, every extra step, adds to the fare and the wait.
Every token costs money and time, and cost scales with how much context you send and how many calls you make. So you design for it: trim and cache context, pick the smallest model that's good enough, and don't run an agent loop where one call works. A great answer that's too slow or too expensive isn't a great answer.
Hallucination is the risk you design around
You don't hand a brilliant but unreliable narrator the microphone unsupervised — you fact-check before it goes to print.
Every technique here — grounding with RAG, structured output, evals, guardrails, citations — exists to manage one core risk: the model stating falsehoods convincingly. You never fully remove it; you constrain where it can happen, detect it when it does, and never let an unverified claim reach a place where being wrong is expensive.
A demo trusts the model. A product measures it, bounds it, and watches it — because it's non-deterministic by nature.
The patterns stack from simple to complex. The skill is using the least powerful one that solves your problem — and treating the model as a component, not the architecture.
Climb the ladder only as far as you must
You don't book a moving truck to carry one box across the room. You scale the effort to the job.
There's a ladder: a single good prompt, then structured output, then RAG when it needs your facts, then tools, then a full agent loop when the task truly needs many adaptive steps. Each rung adds power, cost, and new ways to fail. Start at the bottom and climb only when the problem forces you — most features never need the top.
The LLM is a component, not the architecture
A car has a powerful engine, but it's bolted behind a firewall, fed clean fuel, and surrounded by brakes — it isn't the whole car.
Put the model behind an interface, like any other dependency, with validation around its inputs and outputs and the freedom to swap models or providers. The LLM is one powerful, unreliable part of your system — not its foundation. (The same "LLM as an adapter, not the architecture" lesson from system design.)
- What's the simplest rung — prompt, structured output, RAG, tools, agent — that solves this?
- Where do the facts come from — the model's memory (risky), or retrieved, citable data? - What happens when it's wrong, and is that place cheap or expensive to be wrong in? - How will I measure quality — what's the eval? - What are the guardrails on input and output? - What does it cost per call, and how slow is it?
- An agent loop for what one prompt would answer. - RAG over three documents you could just paste into the prompt. - Tuning prompts by vibes, with no eval to tell you if it improved. - Sending the whole knowledge base into the context on every call. - Trusting the model's output straight into a database or an email, unvalidated.
- The output is structured and validated before your code uses it. - Answers are grounded and citable, not pulled from thin air. - You have an eval that tells you when a change helps or hurts. - Guardrails sit on both the input and the output. - You used the simplest rung that works — and the model sits behind an interface you could swap.
Building with LLMs isn't prompt magic. It's ordinary engineering — context, retrieval, validation, evals — around an extraordinary, unreliable component.