Express course · No. 13
An AI feature is built on a non-deterministic component — the same input can give different output, and 'it looked good when I tried it' is not a test. An eval is a test suite for that component: known inputs, scored outputs, run on every change. It's the unglamorous discipline that separates a demo you hope works from a product you know does.
Essence only · One picture per idea · Measure, don't guess
You can unit-test ordinary code because the same input gives the same output. An LLM breaks that, so you need a different kind of test — and without it you're flying blind.
You can't unit-test a guesser
Grading an essay isn't like checking a sum — there's no single correct string to match against, and the same prompt yields a slightly different essay each time.
A normal test asserts output === expected. An LLM is non-deterministic and open-ended: the same prompt gives varying text, and "correct" is often a quality, not an exact string. So the assertion model breaks. An eval replaces it: a set of inputs, each with a way to score how good the output is, run repeatedly — testing built for a component that doesn't return the same thing twice.
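A minimal sketch of the difference, in Python. Everything here is illustrative: `flaky_model` is a hypothetical stand-in for an LLM call, and the cases and the `contains` scorer are invented for the example. The point is that each case carries a scoring function instead of one expected string, and the whole set is sampled more than once.

```python
import random
from typing import Callable

# Hypothetical stand-in for an LLM call: same input, varying wording.
def flaky_model(prompt: str, rng: random.Random) -> str:
    templates = ["The capital is {a}.", "{a} is the capital.", "It's {a}!"]
    answers = {"capital of France?": "Paris", "capital of Japan?": "Tokyo"}
    return rng.choice(templates).format(a=answers[prompt])

# A scorer judges quality; it does not demand one exact string.
def contains(expected: str) -> Callable[[str], float]:
    return lambda output: 1.0 if expected.lower() in output.lower() else 0.0

CASES = [
    ("capital of France?", contains("Paris")),
    ("capital of Japan?", contains("Tokyo")),
]

def run_eval(model, cases, runs_per_case: int = 5, seed: int = 0) -> float:
    """Average score over every case, each sampled several times."""
    rng = random.Random(seed)
    scores = [
        scorer(model(prompt, rng))
        for prompt, scorer in cases
        for _ in range(runs_per_case)
    ]
    return sum(scores) / len(scores)

score = run_eval(flaky_model, CASES)
```

An `output == expected` assertion would fail on most of these runs even though every answer is right; the scorer passes them all.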
Vibes don't survive contact with change
A cook who never tastes the dish, just tweaks the recipe and hopes — they can't tell if last night's change helped or quietly ruined it.
Without an eval, you tune prompts by vibes: try something, eyeball a few outputs, ship if it feels better. The trap is that you can't see regressions — a change that fixes one case often breaks three you didn't re-check. An eval turns "feels better" into "scored 84% versus 79%," which is the difference between engineering and guessing.
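One way to make regressions visible is to compare two runs case by case rather than by total alone. The sketch below assumes each run is stored as a mapping from a case id to pass/fail; the case ids and results are made up.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Share of cases that passed in one eval run."""
    return sum(results.values()) / len(results)

def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """List the cases whose outcome flipped between two runs."""
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [c for c in shared if not before[c] and after[c]],
        "regressed": [c for c in shared if before[c] and not after[c]],
    }

# Invented results: the new prompt fixes one case and breaks three.
before = {"greeting": True, "refund-policy": False, "date-parsing": True,
          "long-input": True, "typo-input": True}
after = {"greeting": True, "refund-policy": True, "date-parsing": False,
         "long-input": False, "typo-input": False}

changes = diff_runs(before, after)
```

Eyeballing only the `refund-policy` output would have said the change was a win; the diff shows what it cost.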
Getting to 80% is fast; 95% is the whole project
The first draft of anything comes quick. Turning a rough draft into something dependable is where the real time goes.
Here's the hard truth evals exist to manage: a demo that's right 80% of the time takes an afternoon; grinding from there to 95% is most of the work. That last stretch is found case by case — the edge inputs, the rare failures — and you can only find and fix them if you're measuring. The eval is the instrument that makes the long climb visible.
An eval is your regression net and your spec
A safety harness lets a climber try a hard move because a slip won't be fatal — it makes bold changes safe.
Once you have an eval, you can change the prompt, swap the model, or restructure the pipeline and immediately see if it helped or hurt. It catches regressions before users do, and it quietly becomes the real specification of "what good looks like" for your feature. Without it, every change is a gamble; with it, improvement becomes a loop you can actually run.
You can't unit-test a non-deterministic component. An eval scores it instead — and without one, every change is a guess.
"Is it good?" isn't measurable. The first real work of evals is turning that vague question into specific things you can actually score — chosen to match the job your feature does.
Turn 'good' into concrete checks
A driving test doesn't ask "are you a good driver?" — it scores specific things: did you signal, stay in lane, stop fully, park within the lines.
"Quality" is too vague to optimise. Break it into measurable properties: was the answer correct, was it grounded in the source, was the format valid, did it call the right tool, was it the right length and tone? Each is something you can score. The skill is choosing the few properties that actually matter for your task, not measuring everything.
Match the metric to the job
You judge a translator on faithfulness and fluency, a cashier on speed and accuracy — wrong yardstick, useless review.
Different features need different metrics. A RAG system: did it retrieve the right chunks, and is the answer faithful to them? A classifier: accuracy against labels. An agent: did it pick the right tools and reach the goal? A summary: faithful, complete, concise. Pick metrics that map to what failure actually means for your task, or you'll optimise a number that doesn't matter.
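Two of these metrics are simple enough to sketch directly. The function names and the sample ids below are assumptions for illustration; a real system would define "relevant chunk" and "label" in its own terms.

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """RAG: what share of the chunks that should have been found were retrieved?"""
    if not relevant_ids:
        raise ValueError("a case needs at least one relevant chunk")
    return len(relevant_ids.intersection(retrieved_ids)) / len(relevant_ids)

def accuracy(predicted: list[str], labels: list[str]) -> float:
    """Classifier: share of predictions that match the labels."""
    if len(predicted) != len(labels):
        raise ValueError("predictions and labels must line up one to one")
    return sum(p == l for p, l in zip(predicted, labels)) / len(labels)

recall = retrieval_recall(["c1", "c7", "c9"], {"c1", "c2"})
acc = accuracy(["spam", "ham", "spam", "ham"], ["spam", "ham", "ham", "ham"])
```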
Faithfulness and correctness are different things
A student can write a beautifully argued answer that's wrong, or a clumsy one that's right. Style and truth are separate scores.
Two of the most important properties get conflated. Correctness is whether the answer is actually right; faithfulness is whether it's supported by the provided context without invention. A grounded system can be faithful (it used the source) yet wrong (the source was wrong), or correct yet unfaithful (right by luck, not from the context). Measuring them separately tells you where the failure is.
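Scoring the two properties separately gives four outcomes, and each points somewhere different. A small sketch, assuming each graded output already has a `correct` and a `faithful` flag; the diagnosis labels are illustrative wording, not a standard taxonomy.

```python
from collections import Counter

def diagnose(correct: bool, faithful: bool) -> str:
    """Map the two separate scores to where the failure sits."""
    if correct and faithful:
        return "ok"
    if faithful:
        return "faithful but wrong: the source itself was wrong"
    if correct:
        return "correct but unfaithful: right answer, not supported by the context"
    return "wrong and unfaithful: neither property holds"

# Invented grades for four outputs: (correct, faithful)
graded = [(True, True), (False, True), (True, False), (True, True)]
tally = Counter(diagnose(c, f) for c, f in graded)
```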
One number hides too much
An average temperature of "comfortable" can hide a freezing room and a sweltering one. The aggregate erases the cases you most need to see.
A single overall score is a starting point, not the picture. Break results down by category and difficulty — easy versus hard, each question type, each failure mode — because a flat 85% can be 100% on easy cases and 40% on the ones that matter. The value of an eval is in the breakdown that shows you exactly where to aim next.
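The arithmetic is worth seeing once. In this sketch the results are invented to reproduce the numbers above: 15 easy cases all pass, and 2 of 5 hard cases pass.

```python
from collections import defaultdict

def breakdown(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per category instead of one flat number."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for category, ok in results:
        total[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / total[category] for category in total}

results = [("easy", True)] * 15 + [("hard", True)] * 2 + [("hard", False)] * 3

overall = sum(ok for _, ok in results) / len(results)   # 17 / 20
by_category = breakdown(results)
```

The overall score is 0.85, while the hard category sits at 0.4. Only the second number tells you where to work.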
"Is it good?" isn't a metric. Break quality into concrete, scoreable properties chosen for your task — and never trust a single aggregate number.
An eval is only as honest as the examples in it. The dataset — the inputs and their known-good outcomes — is the real asset, and most of the care goes here.
A golden set: inputs with known-good outcomes
A teacher's answer key — the questions paired with what a correct response looks like, so grading is consistent instead of moody.
The core of an eval is a golden set: a collection of representative inputs, each with a known-good answer or a clear definition of success. This is what you score against. Building it is real work — it's where you pin down what "correct" even means for your feature — and it's the asset that makes every future change measurable.
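A golden set can be as plain as one JSON object per line. The sketch below assumes that layout; the field names (`id`, `input`, `expected`, `tags`) and the sample cases are hypothetical, chosen for illustration.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected: str                 # the known-good outcome
    tags: tuple[str, ...] = ()    # e.g. difficulty or failure mode

def load_golden_set(jsonl_text: str) -> list[EvalCase]:
    """Parse one JSON object per non-empty line into eval cases."""
    cases = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        cases.append(EvalCase(row["id"], row["input"], row["expected"],
                              tuple(row.get("tags", []))))
    return cases

GOLDEN_JSONL = """
{"id": "refund-01", "input": "Can I return shoes after 40 days?", "expected": "no", "tags": ["policy", "edge"]}
{"id": "refund-02", "input": "Can I return unopened shoes after 5 days?", "expected": "yes"}
"""

golden = load_golden_set(GOLDEN_JSONL)
```

Keeping the set in a plain file means it can be reviewed and versioned alongside the prompt it tests.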
Cover the edges and the known failures
A bridge is stress-tested with the heaviest trucks and the worst storms, not just an average car on a sunny day.
A set of only easy, typical inputs gives you a flattering, useless score. Deliberately include the hard cases: ambiguous questions, edge inputs, adversarial prompts, and especially the failures you've already seen. Every time a bug reaches production, add it to the eval so it can never silently return. The set should concentrate on where the system is most likely to break.
Harvest real production examples
The best test questions come from the actual exam students sat, not from what the teacher imagined they'd ask.
The richest source of eval cases is real usage: what users actually asked, especially what they asked that went wrong. Mine your logs and feedback for real inputs and fold them into the set. Invented examples reflect what you think happens; production examples reflect what does, including the messy phrasing and odd requests you'd never have dreamed up.
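A sketch of the harvesting step, assuming logs are available as dictionaries with an `input` and a `feedback` field (both names are assumptions, as are the sample entries). It keeps negatively rated inputs that the set does not already cover, and leaves `expected` empty because a person still has to write the known-good outcome.

```python
def normalise(text: str) -> str:
    """Collapse case and whitespace so near-identical inputs dedupe."""
    return " ".join(text.lower().split())

def harvest_cases(logs: list[dict], existing_inputs: set[str]) -> list[dict]:
    """Turn negatively rated production inputs into candidate eval cases."""
    seen = {normalise(text) for text in existing_inputs}
    candidates = []
    for entry in logs:
        key = normalise(entry["input"])
        if entry.get("feedback") == "thumbs_down" and key not in seen:
            seen.add(key)
            candidates.append({"input": entry["input"],
                               "expected": None,   # to be filled in by a reviewer
                               "source": "production"})
    return candidates

logs = [
    {"input": "Cancel my ordr pls", "feedback": "thumbs_down"},
    {"input": "cancel  my ORDR pls", "feedback": "thumbs_down"},   # duplicate
    {"input": "Where is my parcel?", "feedback": "thumbs_up"},
    {"input": "Refund status?", "feedback": "thumbs_down"},        # already covered
]
new_cases = harvest_cases(logs, existing_inputs={"refund status?"})
```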
Small and sharp beats big and stale
Fifty well-chosen questions that probe the real weak spots teach you more than a thousand random ones you never look at closely.
You don't need thousands of cases to start. A few dozen well-chosen ones, covering the main behaviours and the known failure modes, are enough to catch most regressions and guide tuning. Start small, keep it honest, and grow it as you learn where the system breaks. An eval you actually run beats a huge one you don't.
The golden set is the asset. Pair real inputs with known-good outcomes, concentrate on the hard and the broken, and grow it as you learn.
Once you have inputs and expectations, you need a way to score each output. The methods run from cheap and exact to flexible and fallible — and you reach for the cheapest one that fits.
Use exact and rule-based checks where you can
Marking a multiple-choice test needs no judgement — the answer is right or it isn't, and a machine grades it instantly.
When the output is constrained, grade it deterministically: exact match for a classification, a regex or schema check for format, did-it-call-the-right-tool for an agent, is-the-number-correct for an extraction. These checks are cheap, fast, and unarguable. Always push as much of your eval as possible into this category — it's the most reliable grading there is.
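Three such graders, sketched with the standard library only. The function names, the date pattern, and the tolerance are illustrative assumptions; the normalisation in `exact_match` (trim and lowercase) is a choice you would adapt to your labels.

```python
import math
import re

def exact_match(output: str, expected: str) -> bool:
    """Classification: compare after trimming whitespace and case."""
    return output.strip().lower() == expected.strip().lower()

def matches_format(output: str, pattern: str) -> bool:
    """Format: the whole output must match the regex."""
    return re.fullmatch(pattern, output.strip()) is not None

def number_correct(output: str, expected: float, tol: float = 1e-6) -> bool:
    """Extraction: the first number in the output must equal the expected value."""
    found = re.search(r"-?\d+(?:\.\d+)?", output)
    return found is not None and math.isclose(float(found.group()), expected, abs_tol=tol)

ISO_DATE = r"\d{4}-\d{2}-\d{2}"
checks = {
    "label": exact_match(" Positive\n", "positive"),
    "date": matches_format("2024-03-15", ISO_DATE),
    "amount": number_correct("Total: 42.50 USD", 42.5),
}
```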
LLM-as-judge for the fuzzy stuff
For an essay you bring in a second examiner with a clear rubric — judgement, but guided by explicit criteria so it's consistent.
For open-ended outputs (is this summary faithful? is this answer helpful?) there's no exact match, so you use a model to grade. This is LLM-as-judge: give a second model the output, the context, and a clear rubric, and have it score. It scales human-like judgement across thousands of cases, and it's the standard tool for qualities you can't check with a rule, but it comes with caveats.
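The mechanics are small: build a prompt from the rubric, the context, and the output, then parse a score out of the reply. In this sketch `call_model` is a hypothetical callable that takes a prompt and returns text (no real provider API is assumed), and the rubric wording and `SCORE:` reply format are invented for the example.

```python
import re
from typing import Callable, Optional

RUBRIC = """Score the SUMMARY for faithfulness to the CONTEXT.
1 = contains claims the context does not support
3 = mostly supported, with a minor unsupported detail
5 = every claim is supported by the context
Reply with a line of the form: SCORE: <1-5>"""

def build_judge_prompt(context: str, output: str) -> str:
    return f"{RUBRIC}\n\nCONTEXT:\n{context}\n\nSUMMARY:\n{output}"

def parse_score(reply: str) -> Optional[int]:
    """Return the score, or None if the judge did not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def judge(call_model: Callable[[str], str], context: str, output: str) -> Optional[int]:
    return parse_score(call_model(build_judge_prompt(context, output)))

# Hypothetical canned judge, so the sketch runs without a model.
def fake_judge_model(prompt: str) -> str:
    return "The summary adds a date that is not in the context.\nSCORE: 3"

verdict = judge(fake_judge_model, "Sales rose in Q2.", "Sales rose in Q2 2023.")
```

Returning `None` for an unparseable reply keeps a malformed judgement from being counted as a score.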
The judge needs judging too
A second referee can be biased — favouring the longer answer, or the one that sounds confident — so you check the referee against a few human calls before you trust them.
An LLM judge has failure modes: it can prefer verbose answers, be swayed by confident tone, or drift from your intent. So validate the judge against human grades on a sample, give it a sharp rubric with examples, and keep it simple. A judge you haven't checked is just another unmeasured model — don't trust its scores until you've confirmed they track what you actually care about.
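Validating the judge can start with two numbers on a human-graded sample: raw agreement, and agreement corrected for chance. The sketch below uses Cohen's kappa for the second; that statistic is a choice made here, not something the course prescribes, and the grades are invented.

```python
from collections import Counter

def agreement(human: list[str], judge: list[str]) -> float:
    """Share of cases where the judge gave the same grade as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for what two random graders would reach by chance."""
    n = len(human)
    observed = agreement(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented sample of ten cases; the judge disagrees with the human on two.
human_grades = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass"]
judge_grades = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail", "pass"]

raw = agreement(human_grades, judge_grades)
kappa = cohens_kappa(human_grades, judge_grades)
```

Here raw agreement is 0.8 but kappa is about 0.58, a reminder that some of the agreement would have happened by chance.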
Keep humans in the loop, in small doses
Even with automated grading, a chef still tastes the food — a quick spot-check that the instruments haven't quietly gone wrong.
Automated grading scales, but periodically read the actual outputs yourself. Spot-check a sample, look hard at the failures, and sanity-check that your metrics still reflect real quality. Human review doesn't scale to every case, but a small, regular dose catches the things your automated scores miss — and stops you optimising a number that has drifted from reality.
Grade deterministically where you can, use an LLM judge for the fuzzy rest — and validate the judge, because an unchecked grader is just another guess.
Agents need more than a final-answer grade, because they can reach a clean answer through a broken process. To trust an agent, you measure the journey, not just the destination.
A clean final answer can hide a broken middle
A maths exam graded only on the final number passes the student who reached it through two cancelling errors — you learn nothing about what actually happened.
In a multi-step agent, an intermediate mistake can pass a final-output check while corrupting the process — it retrieves the right source, misattributes a fact midway, and writes a clean summary that's wrong. Grade only the final answer and you wave that through. The failures you most need to catch are the ones a destination-only check is blindest to.
Grade the trajectory
A driving examiner watches the whole drive — every signal, mirror check, and lane change — not just whether you arrived at the address.
For agents you evaluate the trajectory: at each step, did it choose the right tool, pass the right arguments, retrieve the right context, and reason soundly? This catches the broken middle a final check misses, and it tells you which step failed — far more actionable than knowing only that the end was wrong. The steps are where an agent's reliability is actually made or lost.
Check tool calls and retrievals directly
Auditing a research assistant means checking which sources they actually pulled and which calls they made — not just reading their final memo.
Much of an agent's behaviour is concrete and gradeable without judgement: did it call send_email when it should have, with the right recipient? did it retrieve the relevant document? did it stay within its tool permissions? These are deterministic checks on the steps, cheap and exact, and they pin down a large share of agent failures precisely.
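These step checks are ordinary code. A sketch, assuming the agent's trajectory is recorded as a list of tool calls with a name and arguments; `ToolCall`, `grade_steps`, and the `search_contacts` tool are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def grade_steps(calls: list[ToolCall], expected: ToolCall, allowed: set[str]) -> dict[str, bool]:
    """Deterministic checks on the steps, independent of the final answer."""
    matching = [c for c in calls if c.name == expected.name]
    return {
        "called_expected_tool": bool(matching),
        # Every expected argument must appear with the expected value.
        "args_correct": any(
            all(c.args.get(key) == value for key, value in expected.args.items())
            for c in matching
        ),
        "within_permissions": all(c.name in allowed for c in calls),
    }

trajectory = [
    ToolCall("search_contacts", {"query": "Dana"}),
    ToolCall("send_email", {"to": "dana@example.com", "subject": "Invoice"}),
]
report = grade_steps(
    trajectory,
    expected=ToolCall("send_email", {"to": "dana@example.com"}),
    allowed={"search_contacts", "send_email"},
)
```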
Measure the boring operational truths too
A delivery service is judged not only on whether the parcel arrived, but on how long it took, how many tries, and what it cost.
Beyond correctness, track an agent's operational metrics: how many steps it took, how often it looped or retried, latency, and cost per task. An agent that reaches the right answer in twenty expensive steps when three would do is a reliability and cost problem your final-answer score won't show. These numbers reveal agents that work but won't survive production.
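If each step is logged with its tool, arguments, latency, and cost, these metrics fall out of a few sums. The trace format and the figures below are assumptions for illustration; a repeated identical call is used here as a rough signal of looping.

```python
import json

def operational_metrics(steps: list[dict]) -> dict:
    """Summarise one agent run from its step log."""
    # Two steps count as the same call if tool and arguments are identical.
    signatures = [(s["tool"], json.dumps(s["args"], sort_keys=True)) for s in steps]
    return {
        "steps": len(steps),
        "repeated_calls": len(signatures) - len(set(signatures)),
        "latency_s": sum(s["latency_s"] for s in steps),
        "cost_usd": sum(s["cost_usd"] for s in steps),
    }

# Invented trace: the agent runs the same search twice before summarising.
trace = [
    {"tool": "search", "args": {"q": "invoice 881"}, "latency_s": 0.5, "cost_usd": 0.25},
    {"tool": "search", "args": {"q": "invoice 881"}, "latency_s": 0.5, "cost_usd": 0.25},
    {"tool": "summarise", "args": {"doc": "inv-881"}, "latency_s": 1.0, "cost_usd": 0.5},
]
metrics = operational_metrics(trace)
```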
An agent can reach a clean answer through a broken process. Grade the trajectory — the tools, the retrievals, the steps — not just the destination.
Evals come in two kinds that do different jobs: the test suite you run before shipping, and the measurement you keep running in production. You need both, because the world isn't your test set.
Offline evals: the gate before you ship
A test track where you put a car through its paces before it ever carries a passenger — controlled, repeatable, run before release.
The offline eval is your golden set, run against a change before it ships: did this new prompt or model score better on the cases you curated? It's controlled and repeatable, the place you catch regressions and compare options safely. This is the eval you wire into the development loop — ideally a gate in CI so nothing ships that drops the score.
Online evals: measuring the real world
No matter how good the test track, you still monitor the cars once they're on real roads — because real roads have potholes the track never had.
Production behaves in ways no offline set fully predicts: real users phrase things you didn't imagine, and the input distribution shifts. So you measure online too — sample and score live outputs, watch quality metrics, and collect user signals. The offline set proves a change is safe to ship; online measurement tells you what's actually happening once it's out there.
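Sampling live traffic for scoring can be done deterministically, so the same request is always in or out of the sample. A sketch, assuming every request has a string id; hashing the id is one possible approach, and the 10% rate is an arbitrary example.

```python
import hashlib

def should_sample(request_id: str, rate: float) -> bool:
    """Decide from the id alone whether to score this live output."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # roughly uniform in [0, 1)
    return bucket < rate

sampled = [rid for rid in (f"req-{i}" for i in range(10_000)) if should_sample(rid, 0.10)]
```

The sampled outputs can then go through the same graders the offline set uses, which keeps the two measurements comparable.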
User feedback is a free eval signal
A thumbs-up button and a complaint inbox are a continuous, free survey of whether the thing is actually working for the people using it.
Your users are grading you whether you collect it or not. Capture the signals — thumbs up/down, corrections, retries, abandonment, escalations — and feed them back. A spike in retries or thumbs-down is an early warning; the specific failures become tomorrow's eval cases. This closes the loop: production teaches the eval set what reality looks like, and the eval set protects the next release.
Watch for drift
A scale slowly creeps out of calibration — nothing dramatic, just every reading a little more off until one day the numbers are wrong.
A system that scored well can degrade silently: a model provider updates, your data shifts, usage patterns change. Without ongoing online measurement, you find out from angry users instead of a dashboard. Track quality over time so drift shows up as a trend you can act on, not a surprise. The eval is not a one-time gate; it's a vital sign you keep reading.
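A simple way to turn drift into a trend is a rolling average compared against a baseline. The window, tolerance, and scores below are illustrative assumptions, not recommended values.

```python
def rolling_mean(scores: list[float], window: int) -> float:
    """Mean of the most recent `window` scores."""
    return sum(scores[-window:]) / window

def drifted(daily_scores: list[float], baseline: float,
            window: int = 7, tolerance: float = 0.03) -> bool:
    """Flag when the recent average falls more than `tolerance` below baseline."""
    if len(daily_scores) < window:
        return False   # not enough data to call it a trend
    return rolling_mean(daily_scores, window) < baseline - tolerance

stable = [0.86] * 10
declining = [0.86, 0.85, 0.84, 0.83, 0.82, 0.81, 0.80, 0.79, 0.78, 0.77]

alerts = {"stable": drifted(stable, baseline=0.86),
          "declining": drifted(declining, baseline=0.86)}
```

No single day in the declining series looks alarming next to the day before; the rolling average is what exposes the slide.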
Offline evals gate the release; online evals watch the reality. The first proves a change is safe, the second tells you what's actually happening.
Evals are a practice, not a one-off. The skill is starting before you feel ready, wiring measurement into the loop, and not letting the number become a goal that lies to you.
Eval first, then tune
You calibrate the thermometer before you start adjusting the oven — otherwise every adjustment is guesswork dressed as progress.
The common mistake is endlessly tweaking the prompt and only later, maybe, building an eval. Flip it: write a small eval before you optimise, so every change is measured from the start. Even ten cases beat zero. Tuning without an eval feels productive and mostly isn't — you're moving numbers you can't see.
Make the eval a CI gate
Code doesn't merge if the tests fail — and the same discipline keeps a quietly-worse prompt from reaching users.
Once you have an eval, run it automatically on every change, and block the release if the score drops below a bar. This turns evals from a thing you do when you remember into a guardrail that's always on — the AI equivalent of a passing test suite. "Evals or it didn't ship" is the standard worth holding: an unmeasured change to an AI feature is an untested deploy.
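The gate itself can be a function that returns a process exit code, since a non-zero exit is what makes a CI job fail. The thresholds here (`min_score`, `max_drop`) are assumptions to be set per feature.

```python
def gate(score: float, baseline: float,
         min_score: float = 0.80, max_drop: float = 0.02) -> int:
    """Return 0 to allow the release, 1 to block it."""
    if score < min_score:
        return 1   # below the absolute bar
    if baseline - score > max_drop:
        return 1   # regressed too far from the last accepted run
    return 0

# In a CI script you would end with something like:
#   sys.exit(gate(score=current_score, baseline=last_accepted_score))
decision = gate(score=0.84, baseline=0.79)
```

Checking both an absolute bar and a drop from baseline catches a slow slide that never crosses the bar in one step.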
Don't let the metric become the target
Teaching purely to the test produces students who ace the exam and can't do the job — the score went up, the thing it measured did not.
When a measure becomes a target, it stops being a good measure — Goodhart's law. You can overfit prompts to your specific eval cases and watch the score rise while real quality stalls. Guard against it: keep some cases held out, refresh the set with new production failures, and remember the eval is a proxy for user value, not the value itself. A rising number that users don't feel is a warning, not a win.
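Holding cases out is a few lines of code. This sketch splits case ids reproducibly with a fixed seed; the 20% fraction and the seed are arbitrary choices for illustration.

```python
import random

def split_cases(case_ids: list[str], held_out_fraction: float = 0.2,
                seed: int = 13) -> tuple[list[str], list[str]]:
    """Return (tuning, held_out). Same ids and seed give the same split."""
    ids = sorted(case_ids)
    random.Random(seed).shuffle(ids)
    cut = max(1, round(len(ids) * held_out_fraction))
    return ids[cut:], ids[:cut]

all_ids = [f"case-{i:02d}" for i in range(10)]
tuning, held_out = split_cases(all_ids)
```

Tune against `tuning` only, and look at `held_out` sparingly. A tuning score that keeps rising while the held-out score stays flat suggests the prompt is fitting the cases rather than the task.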
Start small, grow with the failures
A good test suite isn't written all at once — it accumulates, one case per bug, until it quietly covers everything that ever went wrong.
Don't wait for a perfect, comprehensive eval — you'll never ship. Start with a handful of cases for the core behaviour, and add one every time something breaks. Over time the set grows precisely along the contours of where your system actually fails, which is exactly where you want your coverage. The eval matures with the product, and it's never truly finished.
- Is there an eval at all: a set of inputs with a way to score the outputs?
- Do the metrics match the job, broken down by difficulty and failure type, not one flat score?
- Does the set include the hard, weird, and previously-broken cases, drawn from real usage?
- Is grading the cheapest reliable method: deterministic where possible, a validated judge where not?
- For agents, are the steps graded, not just the final answer?
- Does it run in CI, and how will you measure quality once it's live?

- Tuning prompts by reading a few outputs and shipping when it feels better.
- A single aggregate score with no breakdown by case type or difficulty.
- An eval set of only easy, typical inputs, with no edge cases and no past failures.
- An LLM judge you never validated against human grades.
- No production monitoring, so you'd learn of a quality drop from a complaint.

- A golden set of real inputs with known-good outcomes, concentrated on the hard cases.
- Metrics that map to the task, scored by the cheapest reliable method and broken down.
- Step-level grading for agents, and a validated judge for the fuzzy parts.
- The eval is a CI gate, and a regression adds a case rather than slipping through.
- Online measurement and user feedback feed new failures back into the set.
Evals aren't a one-time test. They're the loop — measure, find the failures, fold them back in — that turns a non-deterministic component into something you can trust and improve.