METHODOLOGY · July 3, 2026

You can't unit-test a dice roll

Developers bolt an LLM into a system, write a normal pass/fail test around it, watch it flake, and then either delete the test or mock the model into meaninglessness. Both are wrong. A probabilistic component isn't broken when it varies, but "it varies" is not permission to stop testing it. You just have to test the distribution instead of the sample: score a golden set with tolerance, gate on a pass rate, assert the invariants that must hold every time, and keep a hard line between the stochastic core and the deterministic shell around it.

Here's a bug I've watched smart engineers ship over and over. They wire an LLM into a feature, then do the responsible thing and write a test: input goes in, assert the exact expected string comes out. It passes on their machine. It fails in CI. It passes again on a re-run. So they do one of two things — delete the test, or mock the model to return a canned answer — and now the one part of the system most likely to misbehave is the one part with no test at all.

The mistake isn't the flaky test. It's the frame. You're trying to unit-test a dice roll.

Assertion-on-one-run is the wrong tool

A unit test asks a yes/no question about a deterministic function: given x, do I get exactly y? That's the right question for a parser and a nonsense question for a model, which is allowed — by design — to say the same true thing five different ways. Run it enough and the same prompt gives you a distribution, not a value. Asserting on a single draw from that distribution tells you almost nothing; it just moves the coin flip into your CI pipeline.
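To put a rough number on that coin flip, here's a minimal sketch. It assumes every exact-match test passes independently with the same probability, which is a simplification, and the 0.9 figure is invented for illustration rather than measured. The function name green_build_probability is hypothetical.

```python
def green_build_probability(p_exact_match: float, num_tests: int) -> float:
    """Chance that every exact-match test passes in a single CI run.

    Assumes the tests are independent and share one pass probability,
    which real suites only approximate.
    """
    if not 0.0 <= p_exact_match <= 1.0:
        raise ValueError("p_exact_match must be between 0 and 1")
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    return p_exact_match ** num_tests


if __name__ == "__main__":
    # Illustrative only: a model that hits the exact string 90% of the time.
    for n in (1, 5, 10):
        print(n, round(green_build_probability(0.9, n), 3))
```

Under those assumptions, ten exact-match tests that each pass nine times in ten leave you with a green build only about a third of the time, with no change to the model or the code.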

This is why the productivity numbers on AI coding are so muddy. Individuals feel fast, but experienced developers were measured slower on real tasks, and teams see more pull requests but longer reviews and rising churn. A lot of that is people trying to validate non-deterministic output with deterministic habits, one brittle assertion at a time, and drowning.

"It's not deterministic" is a true statement about the model and a lazy excuse for the system. You can't pin the die. You can absolutely bound how it's allowed to land.

Test the distribution, not the sample

The move is to stop testing for one right answer and start testing the shape of the behavior:

Score a golden set, don't match a string. Keep 30–100 real input/output pairs and grade new runs by similarity, a rubric, or a model-as-judge — not exact equality. You're measuring quality, which has a range, not identity, which doesn't.
Gate on a pass rate, not a single green. Run the case N times and require, say, 95% to pass. One failure in twenty isn't a broken build; it's the die behaving like a die. A pass rate that drops below the line is a regression you can catch.
Assert the invariants that must hold every single time. The content varies; the contract must not. Always valid JSON. Never leaks another user's data. Always inside the token budget. Refuses the three prompts it must refuse. Those are deterministic, and you test them like anything else.
Split the stochastic core from the deterministic shell. The routing, parsing, validation, retries, and fallbacks around the model are ordinary code — unit-test them hard, with the model mocked. Save the probabilistic testing for the one boundary that's actually probabilistic.
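Here's roughly how the first three moves fit together. Treat it as a sketch, not a framework: make_scripted_model is a hypothetical stand-in for a real model client that just replays canned answers, the difflib score is a cheap placeholder for a rubric or a judge model, and the 0.6 cutoff, 20 runs, and 200-character budget are made-up numbers.

```python
import itertools
import json
from difflib import SequenceMatcher
from typing import Callable, Iterable

Model = Callable[[str], str]


def make_scripted_model(answers: Iterable[str]) -> Model:
    """Hypothetical stand-in for a real model client: replays canned answers in a loop."""
    replay = itertools.cycle(list(answers))
    return lambda prompt: json.dumps({"answer": next(replay)})


def similarity(a: str, b: str) -> float:
    """Cheap lexical score in [0, 1]. A rubric or judge model could replace it."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def check_invariants(raw: str, max_chars: int = 200) -> dict:
    """Contract checks that must hold on every run. Raises ValueError on a violation."""
    payload = json.loads(raw)  # invalid JSON raises json.JSONDecodeError, a ValueError
    if not isinstance(payload, dict) or not isinstance(payload.get("answer"), str):
        raise ValueError("response must be a JSON object with a string 'answer'")
    if len(raw) > max_chars:  # assumed budget, counted in characters rather than tokens
        raise ValueError("response exceeds the size budget")
    return payload


def pass_rate(model: Model, prompt: str, expected: str,
              runs: int = 20, min_score: float = 0.6) -> float:
    """Fraction of runs whose answer scores at least min_score against the golden answer."""
    passed = 0
    for _ in range(runs):
        payload = check_invariants(model(prompt))  # an invariant failure is never tolerated
        if similarity(payload["answer"], expected) >= min_score:
            passed += 1
    return passed / runs


def gate(model: Model, golden_set, required_rate: float = 0.95, runs: int = 20) -> bool:
    """True only if every golden case meets the required pass rate."""
    return all(pass_rate(model, p, e, runs) >= required_rate for p, e in golden_set)


# Illustrative data: three acceptable phrasings plus one off-target answer in twenty.
GOOD = [
    "Your refund was issued on May 2.",
    "The refund was issued on May 2.",
    "We issued your refund on May 2.",
]
ANSWERS = GOOD * 6 + [GOOD[0], "I'm not sure."]
GOLDEN_SET = [("When was my refund issued?", "Your refund was issued on May 2.")]

if __name__ == "__main__":
    print(gate(make_scripted_model(ANSWERS), GOLDEN_SET))
```

The scripted model gives one off-target answer in twenty, so the case lands on the 95% line and the gate stays green. A malformed response would raise inside check_invariants and fail the run outright. That asymmetry is the point: quality gets a tolerance, the contract doesn't. And check_invariants is shell code, so it can be unit-tested on its own with hand-written strings and no model in the loop.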

The bottom line

Non-determinism isn't a reason to skip testing — it's a reason to test differently. A model you can't pin down is exactly the part of your system that most needs a measured leash, because "it usually works" is how unreliable software feels from the inside right up until it doesn't.

Stop asserting on one lucky roll. Grade a golden set with tolerance, gate on a pass rate, lock down the invariants, and unit-test the deterministic shell like your life depends on it — because in production, that shell is what's holding the dice.
