Express course · No. 33
A normal model blurts the answer in one pass. But for hard problems, letting it work through the steps first — reason before responding — dramatically improves the result. Reasoning models and 'test-time compute' are about spending more effort at the moment of answering to get a better answer. It's powerful for hard problems and wasteful for easy ones — so the skill is knowing when to turn the thinking up.
Essence only · One picture per idea · Thinking has a price
The whole idea starts from a simple observation: a model that works through a problem step by step does better than one that jumps straight to the answer — just like a person.
Blurting the answer versus working it out
A student who shouts the first number that comes to mind versus one who works the problem out on scratch paper first — the second gets it right far more often.
By default, a model produces its answer in a single pass, generating the response directly — effectively blurting it out. For easy questions that's fine. But for anything requiring several steps of logic, jumping straight to the answer is where it stumbles, the same way a person rushing a hard problem makes careless mistakes. The fix is the same one teachers demand: don't just give the answer, work through the steps — and a model that does is markedly more accurate on hard problems.
Chain-of-thought: reason out loud, step by step
Showing your work on a maths problem — writing each step in order — so the reasoning is laid out and a slip in the middle is caught instead of buried.
Chain-of-thought is the technique of having the model generate its reasoning step by step before the final answer, instead of producing the answer alone. Asked to "think it through," the model writes out the intermediate steps — and reasoning through them visibly tends to produce a correct conclusion far more often than leaping to it. The steps aren't just for show: generating them is what lets the model build up to a right answer rather than guessing one. Thinking on the page improves the thinking.
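As a minimal sketch of the difference, the two prompt styles might be built like this. The function names, the prompt wording, and the "Answer:" marker are illustrative assumptions, not a standard API; the resulting strings could be sent to any model client.

```python
from typing import Optional


def direct_prompt(question: str) -> str:
    """Ask for the answer alone (the 'blurt' style)."""
    return f"{question}\nReply with only the final answer."


def chain_of_thought_prompt(question: str) -> str:
    """Ask for numbered steps first, then a clearly marked final answer."""
    return (
        f"{question}\n"
        "Work through the problem step by step, numbering each step.\n"
        "Then give the result on a last line that starts with 'Answer:'."
    )


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(response.splitlines()):
        if line.strip().startswith("Answer:"):
            return line.split("Answer:", 1)[1].strip()
    return None
```

Asking for a marked final line is a design convenience: the steps stay available for inspection, while downstream code reads only the answer.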
The steps are where the work happens
A long calculation done in your head is error-prone; the same calculation on paper, one line at a time, is reliable — the paper carries what the mind would drop.
Why does writing the steps help? Because each generated step becomes context the model can build on for the next one — it's effectively using its own output as working memory, breaking one hard leap into a chain of small, manageable moves. A problem too big to solve in a single jump becomes solvable as a sequence. This is the core insight behind everything in this course: giving the model room to reason, rather than demanding an instant answer, is what unlocks hard problems.
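The "own output as working memory" idea can be sketched as a loop in which every generated step is appended to the context before the next step is produced. Here `toy_next_step` is a hypothetical, hard-coded stand-in for a model, used only so the loop runs; a real model would generate each step itself.

```python
from typing import Callable, List


def reason_in_steps(
    question: str,
    next_step: Callable[[str], str],
    max_steps: int = 10,
) -> List[str]:
    """Generate steps one at a time, feeding each back in as context."""
    context = question
    steps: List[str] = []
    for _ in range(max_steps):
        step = next_step(context)   # the step depends on everything written so far
        steps.append(step)
        context += "\n" + step      # the output becomes working memory
        if step.startswith("Answer:"):
            break
    return steps


def toy_next_step(context: str) -> str:
    """Hypothetical stand-in for a model solving 12 * 15 as 12 * 10 + 12 * 5."""
    if "12 * 10" not in context:
        return "12 * 10 = 120"
    if "12 * 5" not in context:
        return "12 * 5 = 60"
    if "120 + 60" not in context:
        return "120 + 60 = 180"
    return "Answer: 180"


if __name__ == "__main__":
    for line in reason_in_steps("What is 12 * 15?", toy_next_step):
        print(line)
```

Each call sees the earlier lines, so the final addition only has to combine two numbers that are already on the page.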
A model that works through the steps beats one that blurts the answer. Chain-of-thought — reasoning out loud, step by step — turns one hard leap into a chain of manageable moves, and accuracy follows.
What began as a prompting trick became a kind of model. Reasoning models are trained to think before they answer, building the step-by-step process in by default.
Models trained to think before answering
The difference between someone who reflexively answers and someone trained to pause and reason first — the habit of thinking is built into how they work, not something you have to ask for.
A reasoning model is one specifically trained to generate an internal chain of thought before its final answer — to "think" first by default, rather than only when prompted to. Where a standard model blurts and a chain-of-thought prompt coaxes it to reason, a reasoning model has that step-by-step process built into how it operates. It produces a stretch of internal reasoning, then the answer, and is markedly stronger on problems that need real working-out.
They trade speed for depth
A careful expert who takes a minute to think before answering gives better answers than a fast one who replies instantly — but you wait for it.
A reasoning model spends more effort and time per answer, generating all those internal steps before responding. That makes it slower and more expensive than a standard model, in exchange for being better at hard problems. It's a genuine trade, not a free upgrade: you're paying — in latency and tokens — for the depth of thinking. So a reasoning model isn't simply "better"; it's a different tool, suited to problems where the extra thinking earns its cost and overkill where it doesn't.
They're a tool for hard problems, not a default
You bring in the deep-thinking specialist for the genuinely hard case, not to answer the front desk's routine questions — the wrong job wastes their gift.
A reasoning model is the right tool when the task genuinely needs careful, multi-step thought — and the wrong one for simple, fast work, where its deliberation is pure waste. Using a reasoning model to classify a message or reformat some text is like hiring a philosopher to answer the phone: slower and pricier for no benefit, since the task never needed the thinking. The reasoning model is a powerful instrument for the hard slice of work, not a replacement for fast, ordinary models on everything else.
A reasoning model is trained to think step by step before answering — stronger on hard problems, slower and pricier in exchange. It's a tool for the hard cases, not a default for everything.
Underneath reasoning models is a deeper idea that's reshaping how AI improves: you can make a model better not only by training it more, but by letting it work harder at the moment it answers.
Spend more effort at answer time
Given more time on an exam, you check your work, try another approach, and catch mistakes — the same person scores higher simply by being allowed to spend longer.
Test-time compute means spending more computation at the moment of answering — at "inference," when the model runs — to get a better result. Instead of one quick pass, the model thinks longer, generates more reasoning, perhaps tries multiple approaches and picks the best. The striking finding behind reasoning models is that letting a model do more work when it answers improves quality, much like giving a person more time on a hard problem. You can buy better answers with more thinking, not just more training.
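One simple way to "try multiple approaches and pick the best" is to sample several answers and keep the most common one. The sketch below uses that majority-vote rule as an illustration; it is one option among several, not necessarily what any particular reasoning model does internally. `make_noisy_solver` is a hypothetical stand-in for a model call.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples answers and return the one that appears most often."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_noisy_solver(correct: str, p_correct: float, seed: int) -> Callable[[], str]:
    """Hypothetical stand-in for a model: right with probability p_correct."""
    rng = random.Random(seed)
    wrong_answers = ["41", "43", "24"]

    def sample() -> str:
        return correct if rng.random() < p_correct else rng.choice(wrong_answers)

    return sample


if __name__ == "__main__":
    solver = make_noisy_solver("42", p_correct=0.6, seed=0)
    print("1 sample:  ", majority_vote(solver, 1))
    print("15 samples:", majority_vote(solver, 15))
```

More samples means more computation per question, which is the trade this section describes: the extra effort is spent at answer time, not in training.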
A dial for how hard to think
A thermostat for effort: turn it up for the hard problem and down for the easy one, spending exactly as much thinking as the task is worth.
Test-time compute is a dial, not a switch: you can spend a little thinking or a lot, and more generally yields better results on hard problems — up to a point of diminishing returns. This is powerful because it lets you tune effort to difficulty: crank the thinking up for the genuinely hard case, keep it low for the routine one. The ability to trade more compute for more accuracy, per request, is a flexible lever — and knowing it exists changes how you approach hard tasks.
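The diminishing returns can be illustrated with a deliberately simple toy model, which is an assumption for illustration and not a measurement of any real system: each sampled answer is independently correct with probability p, and the final answer is right when more than half of n samples are right.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    Toy model: each sample is right with probability p, independently.
    Intended for odd n, so that ties cannot occur.
    """
    needed = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(needed, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 7):
        print(n, round(majority_accuracy(0.6, n), 3))
```

With p = 0.6 the accuracy under this model goes from 0.600 (1 sample) to 0.648 (3), 0.683 (5), and 0.710 (7): each extra pair of samples buys less than the pair before it, while the compute bill grows linearly.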
Training-time versus test-time
The difference between a student studying harder before the exam, and the same student being given more time during it — two different ways to get a better result.
Historically, models got better mainly by training harder — more data, bigger models, done once up front. Test-time compute is a different axis: improving the answer by working harder at inference, every time the model runs. It matters because it's a second way to get more capability — not just a smarter model, but the same model thinking longer. Understanding that quality can come from training-time or test-time effort helps you reason about where a model's performance — and its cost — is actually coming from.
Test-time compute means spending more effort when the model answers, not just in training — and more thinking yields better answers on hard problems. It's a dial you turn up for difficulty, down for routine.
Reasoning is powerful exactly where the problem is hard, and pointless exactly where it's easy. Knowing which is which is most of using it well.
Hard, multi-step problems benefit most
A complex route with many turns rewards careful planning; a straight road to the next house doesn't — the harder the path, the more thinking pays off.
Reasoning helps most on problems that genuinely require several steps of logic to get right: maths and calculation, multi-step planning, complex code, logical puzzles, careful analysis where one wrong step ruins the answer. These are exactly the tasks where blurting fails and working through the steps succeeds. The harder and more multi-step the problem, the more the extra thinking improves the result — which is why reasoning models shine on benchmarks full of genuinely difficult problems.
Simple tasks gain nothing
Deliberating at length over what to have for lunch when you'd be happy with either option — all that thinking produces the same answer, slower.
For simple, direct tasks (classify this, extract that, reformat this text, answer a basic factual question), reasoning adds nothing but delay and cost. There's no multi-step logic to work through, so the thinking is wasted motion; the answer was obvious in one pass. Worse, making a model "think" about a trivial task can occasionally degrade the answer by overcomplicating something that needed no deliberation. Match the thinking to the problem: easy tasks want a fast answer, not a thoughtful one.
Match the depth of thought to the difficulty
A good worker spends a long time on the hard decision and answers the easy one instantly — calibrating effort to what each actually requires.
The governing principle is to scale how hard the model thinks to how hard the problem is. Genuinely difficult, high-stakes, multi-step work earns a reasoning model or more test-time compute; simple, routine, well-scoped work gets a fast standard model. This mirrors the routing idea from model economics: most requests are easy and want speed, a minority are hard and want thinking. Sending everything to a reasoning model is as wasteful as sending everything to the largest frontier model. Calibrate, don't default.
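A minimal routing sketch, assuming each request already carries flags saying whether it needs multi-step work and whether it is high-stakes. How those flags get set (a classifier, a rule, a human) is outside this sketch, and the model names are placeholders, not real products.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str
    needs_multi_step: bool      # assumed to be decided upstream
    high_stakes: bool = False   # assumed to be decided upstream


def choose_model(request: Request) -> str:
    """Send hard or consequential work to the reasoning path, the rest to the fast path."""
    if request.needs_multi_step or request.high_stakes:
        return "reasoning-model"   # placeholder name
    return "fast-model"            # placeholder name


if __name__ == "__main__":
    requests = [
        Request("Reformat this date as YYYY-MM-DD", needs_multi_step=False),
        Request("Plan a three-stage database migration", needs_multi_step=True),
    ]
    for r in requests:
        print(choose_model(r), "<-", r.text)
```

The fast path is the default and the reasoning path is the exception, which matches the principle of calibrating instead of defaulting to the heaviest option.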
Reasoning helps most on hard, multi-step problems and adds nothing on simple ones. Scale the depth of thought to the difficulty: a reasoning model for the hard slice, a fast model for the easy majority.
Thinking isn't free. More reasoning means more time and more tokens, and ignoring that cost is how teams end up paying a fortune to slowly answer easy questions.
More thinking means more tokens and money
A meter that runs the whole time someone is deliberating — the longer they think, the bigger the bill, whether or not the extra thought was needed.
All those reasoning steps are generated tokens, and you pay for them. A reasoning model or a high test-time-compute setting produces a lot of internal thinking before the answer, and every token of it costs money — so reasoning is meaningfully more expensive per answer than a single direct pass. The depth that makes it better on hard problems is exactly what makes it pricier. This is why "just use the reasoning model for everything" wrecks an AI budget: you pay for thinking on every request, including the ones that needed none.
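The arithmetic can be sketched as follows. The prices and token counts are made-up illustrative numbers, and the sketch assumes reasoning tokens are billed at the same rate as ordinary output tokens; check your provider's actual pricing before relying on any figure.

```python
def cost_per_answer(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Cost in USD of one answer; reasoning tokens are assumed billed as output."""
    billed_output = reasoning_tokens + answer_tokens
    total = input_tokens * usd_per_million_input + billed_output * usd_per_million_output
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical prices: $1 per million input tokens, $5 per million output tokens.
    direct = cost_per_answer(500, 0, 200, 1.0, 5.0)
    with_reasoning = cost_per_answer(500, 4000, 200, 1.0, 5.0)
    print(f"direct:         ${direct:.4f}")
    print(f"with reasoning: ${with_reasoning:.4f}")
```

Under these assumed numbers the same 200-token answer costs about fourteen times more once 4,000 tokens of thinking precede it.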
It's slower, which matters for users
The careful expert who takes a minute to answer is worth the wait for a hard problem, but maddening if you just asked the time.
Generating all those reasoning steps takes time, so reasoning models and heavy test-time compute are slower — sometimes much slower — to produce an answer. For a background task that's fine; for a user waiting on a screen, a long delay is a real cost to the experience. So the latency of thinking is part of the trade: the extra seconds are worth it for a hard problem the user expects to take a moment, and a poor fit for an interaction that should feel instant. Speed is a feature you spend when you turn the thinking up.
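A rough latency estimate follows the same logic. The generation speed and fixed overhead below are assumed values for illustration, and the sketch assumes all reasoning tokens are generated before the answer is complete.

```python
def seconds_until_answer(
    reasoning_tokens: int,
    answer_tokens: int,
    tokens_per_second: float,
    overhead_seconds: float = 0.5,
) -> float:
    """Estimated wait until the full answer is available, in seconds."""
    generated = reasoning_tokens + answer_tokens
    return overhead_seconds + generated / tokens_per_second


if __name__ == "__main__":
    # Hypothetical speed: 50 generated tokens per second.
    print("direct:        ", seconds_until_answer(0, 200, 50.0), "s")
    print("with reasoning:", seconds_until_answer(4000, 200, 50.0), "s")
```

With these assumptions a 4.5-second reply becomes an 84.5-second one, which is fine for a background job and a poor fit for an interaction that should feel instant.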
Don't pay for thinking the task doesn't need
Hiring the slow, expensive deep-thinker to answer simple questions all day — you're paying premium rates and waiting longer for answers a quick clerk would give instantly.
The waste to avoid is spending reasoning's cost and latency on tasks that don't benefit. Routing every request — easy and hard alike — through a reasoning model means paying the thinking tax on the whole stream, when only a slice needed it. The same discipline as model economics applies: use the cheaper, faster, non-reasoning path by default, and escalate to reasoning only for the genuinely hard cases that earn its cost. Pay for thinking where it pays you back, and not a token more.
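The "thinking tax on the whole stream" can be put into numbers with a small sketch. The per-request costs and the 10% hard share are assumed values chosen for illustration only.

```python
def stream_cost(
    n_requests: int,
    hard_share: float,
    fast_cost: float,
    reasoning_cost: float,
) -> float:
    """Total cost when hard_share of requests go to reasoning and the rest to the fast path."""
    hard = n_requests * hard_share
    easy = n_requests - hard
    return hard * reasoning_cost + easy * fast_cost


if __name__ == "__main__":
    # Hypothetical per-request costs in USD.
    FAST, REASONING = 0.0015, 0.0215
    everything_reasoning = stream_cost(1000, 1.0, FAST, REASONING)
    routed = stream_cost(1000, 0.1, FAST, REASONING)
    print(f"all through reasoning: ${everything_reasoning:.2f}")
    print(f"routed (10% hard):     ${routed:.2f}")
```

Under these assumptions, 1,000 requests cost $21.50 when everything goes through reasoning and $3.50 when only the hard tenth does.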
Thinking costs tokens and time — reasoning is pricier and slower per answer. Pay for it only where the hard problem earns it; routing everything through a reasoning model taxes the whole stream for a benefit only a slice needs.
It's tempting to trust a long, careful-looking chain of reasoning. But more steps don't guarantee a right answer — a reasoning model can be confidently, elaborately wrong.
A long chain can still reach a wrong answer
A detailed, confident argument that's built on a flawed premise — every step follows neatly, and the conclusion is still wrong. Polish isn't proof.
Reasoning improves the odds of a correct answer on hard problems; it doesn't guarantee one. A model can produce a long, fluent, plausible-looking chain of thought that leads to the wrong conclusion — a mistake early in the chain carried confidently to the end, or reasoning that sounds rigorous but isn't. The visible steps make the answer feel more trustworthy, which is exactly the trap: more reasoning looks more authoritative without necessarily being more correct.
The reasoning may not be the real reason
Someone who decides on a hunch and then invents a logical-sounding justification afterward — the explanation is real-sounding but isn't actually how they got there.
There's a subtler trap: the chain of thought a model shows isn't guaranteed to be the actual process that produced its answer. It can generate reasoning that looks like the path to the conclusion while the real basis was something else — a plausible-sounding rationalisation rather than a faithful account. So you can't fully trust the displayed reasoning as an explanation of why the answer is what it is. It's a useful signal and an aid to accuracy, not a guaranteed window into the model's true logic.
Verify the answer, don't trust the thinking
You check a complex calculation by confirming the result independently, not by admiring how neat the working looks — the answer's correctness is what counts.
The practical takeaway: judge a reasoning model's output by whether the answer is correct, verified against reality, not by how impressive the reasoning looks. All the disciplines from elsewhere still apply — ground it in real facts, check it against sources, run evals, keep a human on the high-stakes calls. Reasoning makes hard answers more likely to be right; it doesn't make them safe to trust unverified. A confident chain of thought is still a confident guess until you've checked where it landed.
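As a small illustration of checking the result and not the working, suppose a model is asked to factor a number. The check below ignores the chain of thought entirely and confirms the claim by multiplying the factors back together. The task and the function name are chosen for the example; most real tasks need a domain-specific check of their own.

```python
from typing import List


def is_valid_factorisation(n: int, factors: List[int]) -> bool:
    """True only if every factor is greater than 1 and their product equals n."""
    if not factors:
        return False
    product = 1
    for f in factors:
        if f <= 1:
            return False
        product *= f
    return product == n


if __name__ == "__main__":
    # Two claimed answers for "factor 391"; only the product decides which holds.
    print(is_valid_factorisation(391, [17, 23]))  # 17 * 23 = 391
    print(is_valid_factorisation(391, [17, 24]))  # 17 * 24 = 408
```

A long, fluent explanation attached to the second claim would not change the outcome: the product is 408, so the answer is wrong.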
More reasoning improves the odds, not the guarantee — a long chain can reach a wrong answer, and the shown steps may not be the real reason. Verify the answer; don't trust the thinking.
Reasoning is a powerful capability with a clear price, so using it well is the now-familiar discipline: spend the thinking where it pays, and verify what it produces.
Turn the thinking up only for the hard part
You concentrate hard on the one tricky step and breeze through the rest — focusing your effort where the difficulty actually is, not spreading it evenly.
The unifying move is to match thinking to difficulty across your system: reach for a reasoning model or more test-time compute on the genuinely hard, multi-step, high-stakes tasks, and use fast standard models for the easy majority. You can even mix them — a fast model handles the routine path and hands off the hard sub-problems to a reasoning model. Spending deliberation where the problem is hard, and only there, gives you the accuracy gain without paying the thinking tax across everything.
Reasoning raises accuracy; it doesn't replace verification
A careful expert is more likely to be right, but you still confirm the high-stakes decision — their care improves the odds, it doesn't remove the need to check.
The final discipline ties this course to the rest: reasoning improves quality on hard problems, but it's still a fallible model that can be confidently wrong, so everything else still matters. Ground it, verify the answer, run evals, keep humans on the consequential calls. Reasoning is one more way to get better outputs — alongside good context, retrieval, and the right model — not a magic upgrade that removes the need to engineer reliability around it. Better thinking is a stronger ingredient, not a finished dish.
- Is the problem genuinely hard (multi-step logic, maths, planning) in a way that thinking would untangle?
- Or is it simple, where reasoning just adds delay and cost for no gain?
- Is the cost and latency worth it here, or are you taxing easy tasks with thinking?
- Are you routing: reasoning for the hard slice, fast models for the easy majority?
- Are you verifying the answer, not trusting a thorough-looking chain of thought?
- Do the usual disciplines still apply: grounding, evals, a human on high-stakes calls?
- chain-of-thought: having the model reason step by step before the final answer.
- reasoning model: a model trained to think first by default, stronger on hard problems.
- test-time compute / inference: spending more effort when the model answers to get a better result.
- the thinking dial: tuning how much the model deliberates, from a little to a lot.
- training-time vs test-time: improving by training more, versus working harder at answer time.
- the cost of thinking: more reasoning means more tokens, money, and latency.
- reasoning isn't truth: a long chain can still be confidently wrong; verify the answer.
- You turn the thinking up only for genuinely hard problems, not by default.
- You use fast models for the easy majority and reserve reasoning for the hard slice.
- You account for reasoning's cost and latency instead of taxing every request.
- You verify the answer rather than trusting how thorough the reasoning looks.
- You still apply grounding, evals, and human oversight; reasoning sharpens, it doesn't replace them.
Reasoning lets a model think before it answers, raising accuracy on hard problems at the cost of time and tokens. Turn the thinking up only where the difficulty earns it, and verify the answer — better thinking is still a fallible guess until checked.