Express course · No. 12

Retrieval-Augmented Generation (RAG) is how you hand a language model the facts to answer from, instead of trusting its fuzzy memory. It's the single most important pattern in serious LLM apps, and when it goes wrong, the model is almost never the culprit. The work, and the quality, live in retrieval: how you chunk, index, search, and rank the text you put in front of the model.

Essence only · One picture per idea · Retrieval is the work

§ 01

A model doesn't know your data, or anything after its training cutoff. RAG is how you give it the facts to answer from — so understand the problem it solves before the mechanics.

The model's memory is the wrong place for your facts

A brilliant graduate who read the whole library years ago — but not your company's files, and nothing printed since they graduated.

An LLM only knows what was in its training data, frozen at a cutoff date. It doesn't know your documents, your product, your customer's record, or yesterday's news. Ask it anyway and it will answer from fuzzy memory or invent something plausible. RAG exists to fix exactly this: stop relying on what the model memorised, and hand it the relevant facts at question time.

Retrieve first, then answer

An open-book exam — instead of trusting memory, you look up the relevant pages first, then write your answer from what's in front of you.

Retrieval-Augmented Generation is three steps: fetch the chunks of your documents most relevant to the user's question, put them in the context, and ask the model to answer from them. Now the answer comes from your actual data, not the model's recollection. This is how an assistant answers questions about your docs, your policies, and your knowledge base accurately and with current information.
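
The retrieve-then-answer flow can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `answer_with_rag`, `toy_retrieve`, and `toy_generate` are hypothetical names, the retriever is a crude word-overlap stand-in for real search, and the generator is a placeholder for a real model call.

```python
from typing import Callable


def answer_with_rag(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Fetch relevant chunks, put them in the prompt, ask the model."""
    chunks = retrieve(question, top_k)      # step 1: fetch relevant chunks
    context = "\n\n".join(chunks)           # step 2: put them in the context
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                 # step 3: answer from the context


# Toy stand-ins (assumptions) so the sketch runs without any external service.
def toy_retrieve(question: str, top_k: int) -> list[str]:
    corpus = ["Refunds are issued within 14 days.", "Shipping takes 3 days."]
    words = set(question.lower().split())
    # Rank by number of shared words; a real system would use proper search.
    ranked = sorted(corpus, key=lambda c: -len(words & set(c.lower().split())))
    return ranked[:top_k]


def toy_generate(prompt: str) -> str:
    return f"(model output for a prompt of {len(prompt)} characters)"
```

Passing `retrieve` and `generate` in as functions keeps both halves swappable, which makes it easier to inspect what was retrieved separately from what the model said.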

Grounding is the real prize

A journalist who must cite a source for every claim writes far fewer fabrications than one writing from memory.

The deep reason RAG matters is grounding: because the answer is drawn from retrieved text, you can demand citations and check that each claim is actually supported by a source. It doesn't fully erase hallucination, but it slashes it and makes what slips through catchable. An answer you can trace to a document is one you can trust — and defend.

It's a retrieval system with a model on the end

A research assistant is only as good as the filing system they search — give them a great library and a bad index, and they still hand you the wrong folder.

The mental model that saves you months: RAG is mostly a search problem, with a language model on the end to phrase the answer. The model is the easy, mostly-solved part. The hard, quality-determining part is everything that decides which text reaches the model — and that's the rest of this course.

Don't ask the model what it knows. Retrieve the relevant facts and ask it to answer from those — grounded, current, and citable.

§ 02

Before you can retrieve, you split your documents into pieces. How you cut them quietly decides retrieval quality — it's the least glamorous step and one of the most important.

You retrieve chunks, so chunks are the unit of quality

A reference book is only useful cut into findable entries — one giant unindexed scroll, or a thousand torn scraps, are both useless to look things up in.

You don't retrieve whole documents; you retrieve chunks: passages a few hundred words long. That chunk is the unit the search sees and the model reads, so its boundaries matter enormously. Good chunks are self-contained and on one topic. Get chunking wrong and no clever search or smart model will recover it.

Too big buries the answer; too small loses the context

Hand someone an entire chapter to answer a one-line question and they wade through noise; hand them a single sentence with no surrounding context and they can't tell what it means.

Chunk too large and each one mixes the relevant fact with paragraphs of unrelated text — search gets fuzzy and the model's attention dilutes. Chunk too small and you sever the context that made the fact meaningful. The sweet spot holds one coherent idea with enough surroundings to stand alone. There's no universal number; it depends on your documents, and you tune it by measuring.

Cut on meaning, not on character count

Splitting a book by counting characters can cut a sentence — or a recipe — clean in half. Splitting on chapters and sections keeps each piece whole.

The naive approach chops every N characters, which slices through sentences, tables, and ideas. Better is to split on the document's natural structure — headings, sections, paragraphs — so each chunk is a complete thought. A little overlap between adjacent chunks keeps a fact from falling into the crack between two of them. Respect the shape of the document and retrieval gets easier for free.
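
As a rough sketch of structure-aware splitting, the function below groups whole paragraphs into chunks and repeats the last paragraph of one chunk at the start of the next. The name `chunk_by_paragraph`, the blank-line paragraph separator, and the default sizes are assumptions for illustration; real documents may call for splitting on headings or sections instead.

```python
def chunk_by_paragraph(text: str, max_chars: int = 500, overlap: int = 1) -> list[str]:
    """Group whole paragraphs into chunks of roughly max_chars.

    The last `overlap` paragraphs of each chunk are repeated at the start
    of the next one, so a fact near a boundary appears in both.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs are never cut, so a single paragraph longer than `max_chars` becomes an oversized chunk rather than a broken one. Whether that trade is right depends on the documents, and is worth measuring.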

Attach metadata to every chunk

A library card doesn't just hold the text — it records the author, date, and section, so you can filter to exactly the right shelf before you even search.

Store each chunk with metadata: source document, title, date, section, permissions. This lets you filter before or alongside the semantic search — only this product's docs, only current policies, only what this user is allowed to see — and it gives you the source for citations. Metadata is cheap to keep and turns a flat pile of text into something you can target.
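
One way to picture this is a small record type plus a filter that runs before any semantic search. The field names and the `visible_chunks` helper are hypothetical; the point is only that permissions and source are stored next to the text.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str            # source document, reused later for citations
    title: str
    section: str
    updated: date
    allowed_groups: frozenset[str] = field(default_factory=frozenset)


def visible_chunks(
    chunks: list[Chunk], user_groups: set[str], source: Optional[str] = None
) -> list[Chunk]:
    """Keep only chunks the user may see, optionally limited to one source."""
    return [
        c
        for c in chunks
        if c.allowed_groups & user_groups
        and (source is None or c.source == source)
    ]
```

In this sketch a chunk with no `allowed_groups` is visible to nobody, a deliberately conservative default.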

You retrieve chunks, not documents. Cut on meaning, keep each one self-contained, and tag it — chunking quality caps everything downstream.

§ 03

To find the chunks that match a question, you search by meaning, not by matching words. Embeddings are the trick that makes "find me things like this" possible.

Embeddings turn meaning into coordinates

A map where every idea has a location, and things that mean similar things sit close together — "dog" near "puppy," both far from "tax return."

An embedding is a list of numbers that represents the meaning of a piece of text, placed in a high-dimensional space so that similar meanings land near each other. You embed every chunk once and store the vectors. Now "meaning" has coordinates, and "find related text" becomes "find nearby points" — a problem a computer can solve fast.

Vector search finds by similarity, not keywords

A librarian who finds books by what they're about — "things like this" — instead of only matching the exact title you said.

At question time you embed the question with the same embedding model, then search the store for the chunks whose vectors are closest to it. This is semantic search: it finds a chunk about "refund policy" even when the user asked about "getting my money back," because the meanings sit close. Keyword search would miss that; vector search is built for it.
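
To make "find nearby points" concrete, here is a toy nearest-neighbour search using cosine similarity, one common closeness measure. The three-dimensional vectors are hand-made assumptions; real embeddings come from an embedding model and have hundreds or thousands of dimensions. This version scans every vector, which is fine for a demo but not for a large store.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for the same direction, 0.0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(
    query_vec: list[float], store: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Return the k chunk ids whose vectors are most similar to the query."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hand-made vectors standing in for real embeddings (assumption).
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "tax-return": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # pretend embedding of "getting my money back"

top = nearest(query, store, k=1)  # the closest chunk is "refund-policy"
```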

The vector database is the engine

A warehouse designed so that, given any item, it can instantly hand you the hundred most similar items — not by reading every shelf, but by how it's organised.

A vector database stores your embeddings and answers "nearest to this" queries in milliseconds, even over millions of chunks, by using an approximate nearest-neighbour index instead of scanning everything. That approximation trades a little accuracy for a lot of speed. It's the retrieval engine under RAG. You don't need to build it, but you do need to know that it's where your chunks live, and how fast and how accurate its search is.

Similarity is not the same as relevance

Two passages can be on the same topic yet answer different questions — "how to cancel" and "why people cancel" sit close, but only one is what was asked.

Vector search returns what's semantically near, which is usually relevant but not always. A chunk can be about the right subject and still not contain the answer. This gap — near in meaning versus actually useful — is why raw vector search is a strong first pass but not the final word, and why the next section exists.

Embeddings give meaning coordinates; vector search finds chunks by similarity. It's powerful — and "near in meaning" isn't quite "answers the question."

§ 04

Here's the lesson that fixes most broken RAG systems: when answers are bad, the retrieval is usually the cause. Get the right chunks in, and a decent model does the rest.

Garbage in, confident garbage out

Hand someone the wrong file and ask them to summarise it — they'll give you a flawless summary of entirely the wrong thing, sounding completely sure.

A RAG answer is only as good as the chunks it was given. Retrieve the wrong passages and the model faithfully answers from the wrong context, with all its usual confidence. The bug looks like "the model is wrong," but the real failure happened upstream, in retrieval. So when answers are bad, inspect what was retrieved before you touch the prompt or the model.

Combine keyword and semantic search

One searcher who knows exactly what the words mean, and another who matches exact names and codes — together they catch what either alone would miss.

Vector search understands meaning but can fumble exact terms — product codes, names, rare jargon. Keyword search nails those but misses paraphrases. Hybrid search runs both and merges the results, covering each other's blind spots. For most real corpora, hybrid beats either alone, because real questions mix concepts with specific names.
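
The text doesn't say how to merge the two result lists. One widely used option is reciprocal rank fusion, sketched below as an assumption rather than the course's prescribed method. Each list contributes 1 / (k + rank) for every chunk it returns, so chunks ranked well by both searches rise to the top. The constant `k = 60` is a conventional default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["sku-123", "returns", "warranty"]   # exact-term matches
vector_hits = ["returns", "refunds", "sku-123"]     # semantic matches
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
# fused == ["returns", "sku-123", "refunds", "warranty"]
```

Rank-based fusion avoids having to compare raw keyword scores with raw similarity scores, which live on different scales.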

Rerank the shortlist

A hiring process: a cheap first pass pulls fifty plausible résumés, then a careful reviewer reads them closely and ranks the real top five.

First-pass search is fast but rough. A reranker takes the top, say, fifty candidates and scores each one against the question more carefully, pushing the genuinely relevant chunks to the top. Retrieve broadly, then rerank to a precise few. This two-stage shape — wide cheap recall, then sharp ranking — is one of the highest-leverage upgrades to a mediocre RAG system.
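
The two-stage shape can be sketched as a function that takes the first-pass candidates and a more careful scoring function. Here `toy_score` is a deliberately crude placeholder (word overlap); in practice the scorer would usually be a dedicated reranking model, which this sketch does not include.

```python
from typing import Callable


def retrieve_then_rerank(
    question: str,
    candidates: list[str],
    score: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Re-score first-pass candidates against the question, keep the best few."""
    ranked = sorted(candidates, key=lambda chunk: score(question, chunk), reverse=True)
    return ranked[:keep]


def toy_score(question: str, chunk: str) -> float:
    """Placeholder scorer: fraction of question words found in the chunk."""
    q = set(question.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


candidates = ["why people cancel", "how to cancel your subscription", "billing overview"]
best = retrieve_then_rerank("how to cancel subscription", candidates, toy_score, keep=2)
# best == ["how to cancel your subscription", "why people cancel"]
```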

Precision and recall pull against each other

A net with wide holes catches only the big fish but lets many through; a fine net catches everything, including weeds. You tune the mesh to the catch you need.

Retrieve too few chunks and you might miss the one with the answer (low recall); retrieve too many and you bury the answer in irrelevant text that dilutes the model's attention and costs tokens (low precision). The right number of chunks balances the two, and it's specific to your data and question type. You don't guess it; you measure it, which is the subject of § 06.
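
A small worked example shows the pull between the two. Assume a ranked result list in which two chunks, `c1` and `c2`, are the relevant ones (the ids and the ranking are made up for illustration, and the retrieved ids are assumed unique).

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved chunks that are relevant.
    Recall: share of relevant chunks that were retrieved."""
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


ranked = ["c1", "c7", "c3", "c9", "c2"]  # search results, best first
relevant = {"c1", "c2"}                  # chunks that actually hold the answer

narrow = precision_recall(ranked[:1], relevant)  # (1.0, 0.5): clean but incomplete
wide = precision_recall(ranked[:5], relevant)    # (0.4, 1.0): complete but noisy
```

Taking only the top result gives perfect precision but misses half the relevant material; taking all five finds everything but more than half of the context is noise.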

Most bad RAG answers are bad retrieval. Combine keyword and semantic, rerank the shortlist, and check the chunks before you blame the model.

§ 05

Once the right chunks are in the context, the job is to make the model answer strictly from them — and to prove that it did. This is what turns retrieval into a trustworthy answer.

Answer only from the retrieved text

A witness instructed to testify only to what they personally saw — not what they assume, remember vaguely, or heard secondhand.

Instruct the model to answer only from the provided chunks, and to say it doesn't know when the answer isn't there. This is the core of grounding: the retrieved text is the allowed source of truth, and the model's own memory is off the table. It won't be perfect — the model can still drift — but the instruction plus the right context does most of the work.
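
A grounding instruction might look something like the prompt built below. The wording and the function name `build_grounded_prompt` are assumptions; the exact phrasing that works best varies by model and is worth testing against your own evals.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and tell the model to answer only from them."""
    numbered = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using only the numbered sources below. "
        "If they do not contain the answer, reply exactly: I don't know.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering the sources costs nothing here and gives the model something concrete to point at when it is later asked to cite.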

Citations make claims checkable

A research paper with footnotes lets any reader trace a claim back to its source and verify it — or catch where it doesn't hold.

Have the model cite which chunk each part of its answer came from. Citations do two things: they let the user (and you) verify a claim against its source, and the very act of grounding each statement in a retrieved passage discourages the model from wandering off the facts. An answer with traceable sources is one you can audit; an unsourced answer is just a confident guess.
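
If the model is asked to cite sources as bracketed numbers such as `[1]`, a cheap structural check can flag sentences with no citation and citations that point at sources that were never provided. The marker format and the `check_citations` helper are assumptions, and the sketch expects the marker to sit before the sentence's final punctuation. It only checks that citations exist and are in range, not that the cited source really supports the claim.

```python
import re


def check_citations(answer: str, num_sources: int) -> dict[str, list]:
    """Find sentences without a [n] marker and markers outside 1..num_sources."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited: list[str] = []
    invalid: list[int] = []
    for sentence in sentences:
        cited = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        if not cited:
            uncited.append(sentence)
        invalid.extend(n for n in cited if not 1 <= n <= num_sources)
    return {"uncited": uncited, "invalid": invalid}


report = check_citations(
    "Refunds are issued within 14 days [1]. Shipping is free. Returns need a receipt [4].",
    num_sources=2,
)
# report == {"uncited": ["Shipping is free."], "invalid": [4]}
```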

Teach it to say I don't know

The expert you trust most is the one who says "that's not in what I have" instead of confidently filling the gap with a guess.

The most dangerous RAG failure is answering anyway when retrieval came back empty or off-target. So make refusal a valid, expected output: if the chunks don't contain the answer, the model should say so, not improvise from memory. A system that admits the gap is far more trustworthy than one that papers over it — and it surfaces where your retrieval needs work.
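
One simple guard, sketched under assumptions, is to refuse before calling the model at all when nothing retrieved clears a relevance threshold. The threshold of 0.5, the refusal wording, and the `answer_or_refuse` name are all illustrative; a sensible cutoff depends on the scoring method and has to be found by measuring.

```python
from typing import Callable

REFUSAL = "I don't know. That isn't in the documents I have."


def answer_or_refuse(
    question: str,
    scored_chunks: list[tuple[str, float]],
    generate: Callable[[str, list[str]], str],
    min_score: float = 0.5,
) -> str:
    """Call the model only if at least one chunk clears the relevance bar."""
    usable = [chunk for chunk, score in scored_chunks if score >= min_score]
    if not usable:
        return REFUSAL  # empty or off-target retrieval: admit the gap
    return generate(question, usable)
```

Logging every refusal gives a ready-made list of questions where retrieval needs work.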

Grounding reduces hallucination — it doesn't end it

Seatbelts cut deaths dramatically, but you still drive carefully. A strong safeguard is not a reason to stop watching the road.

Even with perfect chunks, the model can misread, over-generalise, or blend a retrieved fact with its own memory. Grounding slashes hallucination and makes it catchable; it doesn't remove it. So you still verify — with citations the user can check, and with evals that measure faithfulness. Treat grounding as a powerful control, not a guarantee.

Grounding means the answer comes only from retrieved text, with citations to prove it — and an honest "I don't know" when the facts aren't there.

§ 06

RAG has two stages that fail differently, so you measure them separately. Lump them together and you'll tune the wrong half for weeks. (The Evals course goes deeper; here's the RAG-specific shape.)

Measure retrieval and generation apart

A restaurant with a bad dish has two possible culprits — the ingredients delivered, or the cook. You taste the ingredients first, or you'll retrain the wrong person.

A RAG answer can fail because retrieval brought the wrong chunks, or because generation answered badly from the right chunks. These need different fixes, so measure them separately: "Did the right chunks come back?" and "Given those chunks, was the answer faithful and complete?" Most teams who "can't improve their RAG" are scoring the final answer only and guessing which half to blame.

Retrieval: did the right chunks come back?

Before judging the essay, check the student was even handed the right reference pages — if not, nothing they wrote could have been right.

Evaluate retrieval on its own: for a set of questions with known relevant documents, did the search return them in the top results? This catches bad chunking, weak search, and missing reranking directly — the upstream causes of most bad answers. Fix retrieval first, because no improvement to the prompt or model can rescue an answer built from the wrong chunks.
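
A retrieval-only eval can be as small as the sketch below: for each question with known relevant chunk ids, check how many of them appear in the top k results. The names `EvalCase` and `recall_at_k` are hypothetical, and the sketch assumes a non-empty list of cases, each with at least one relevant id.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    relevant_ids: set[str]  # chunks that should come back for this question


def recall_at_k(
    cases: list[EvalCase],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> tuple[float, list[str]]:
    """Return mean recall@k and the questions where a relevant chunk was missed."""
    total = 0.0
    misses: list[str] = []
    for case in cases:
        found = set(retrieve(case.question, k)) & case.relevant_ids
        recall = len(found) / len(case.relevant_ids)
        total += recall
        if recall < 1.0:
            misses.append(case.question)
    return total / len(cases), misses
```

The list of missed questions is often more useful than the average, because it points directly at the chunks to inspect.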

Generation: was the answer faithful to the chunks?

A fact-checker reads the answer against the cited sources and flags any sentence the sources don't actually support.

Once retrieval is good, evaluate the answer for faithfulness: is every claim supported by the retrieved chunks, with nothing invented and nothing contradicted? Also check it actually used what was retrieved and answered the question. This is where you catch the model drifting off solid context — and it's measurable, often with another model as a grader.

Build the eval from real questions

You test a bridge with the trucks that will actually cross it, not with weights you find convenient.

Your eval set should be real questions users ask, including the messy and the out-of-scope ones, each paired with the chunks that should be retrieved and a known-good answer. Run it on every change so you can see whether a new chunking strategy or reranker actually helped. Without this, you're tuning RAG by anecdote — and anecdotes lie.

RAG fails in two places. Measure retrieval and generation separately, fix retrieval first, and build the eval from real questions.

§ 07

RAG is powerful and often over-reached for. The skill is using it when it earns its place, keeping the data fresh, and not paying for machinery a simpler approach would have skipped.

Don't RAG what you could just paste

You don't build a library catalogue for the three books on your desk — you just open them.

If the relevant facts are small and fixed — a few documents, a short policy — just put them in the prompt. RAG earns its complexity when the knowledge is too big for the context, changes often, or must be filtered per user. Building a retrieval pipeline over three pages is the over-engineering equivalent of an agent loop for a single call. Reach for it when the data forces you.

Stale data is wrong data

A phone book is only useful if it's reprinted — last decade's numbers are confidently, uselessly wrong.

RAG's promise is current facts, which means your index has to stay current. When source documents change, the chunks and embeddings must be updated, or the model will ground its answers in confident, outdated text. Plan reindexing as part of the system, not an afterthought — freshness is a feature you have to maintain, not a property you get once.
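
One possible way to keep an index fresh without rebuilding everything is to store a content hash per document and compare it on each sync. This is a sketch under assumptions: `plan_reindex` and its return shape are hypothetical, and a real system would also need to re-chunk and re-embed the documents it flags.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Compare current documents with the hashes stored at last indexing."""
    changed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or edited
    )
    deleted = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return {"reembed": changed, "delete": deleted}
```

Deletions matter as much as edits: a chunk from a withdrawn document will keep being retrieved until it is removed from the index.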

Watch cost and latency

Every extra step in the pipeline — embed, search, rerank, stuff context, generate — adds a little time and a little money to every single question.

RAG adds stages, and each one costs latency and tokens. Retrieving more chunks, reranking, larger context — all improve quality up to a point and raise the bill the whole way. So tune for the fewest chunks that answer well, cache where you can, and remember that a great answer that's too slow or too expensive isn't a great answer in production.

RAG is one rung; the model stays a component

A good kitchen has a pantry, but the pantry isn't the cook — it's one well-organised input to someone who still has to make the dish.

RAG sits on the ladder above a plain prompt and below tools and agents: reach for it when the model needs your facts, and keep it behind a clean interface like any dependency. Retrieval feeds the model; it doesn't replace the judgement around the model. The same discipline as everywhere — use the simplest rung that works, and keep the model swappable.

Before you ship a RAG feature

- Is RAG even needed, or are the facts small and stable enough to paste into the prompt?
- How are documents chunked: on meaning, self-contained, with metadata and a little overlap?
- Is retrieval hybrid and reranked, or a raw vector search you never measured?
- Does the model answer only from retrieved text, cite sources, and say "I don't know"?
- What's the eval for retrieval, separate from generation?
- How does the index stay fresh, and what does a query cost in time and tokens?

Smell tests that retrieval is the problem

- Answers are confidently wrong, and you never looked at the chunks that were retrieved.
- Raw vector search only: no keyword fallback, no reranking.
- Chunks split by character count, cutting through sentences and tables.
- You're tweaking the prompt and the model to fix what is actually a retrieval miss.
- The index hasn't been rebuilt since the source documents changed.

Signs you built it well

- Chunks are semantic and self-contained, tagged with metadata for filtering and citations.
- Retrieval is hybrid and reranked, tuned to the fewest chunks that answer well.
- Answers are grounded and cited, with an honest refusal when the facts aren't retrieved.
- You measure retrieval and generation separately, and you fixed retrieval first.
- The index has a freshness plan, and RAG sits behind an interface you could swap.

RAG is a search system with a model on the end. Get the chunks right and the answer follows — the model was rarely the problem.