The biggest context window doesn't win

June 13, 2026

Every model launch brags about a bigger context window — a million tokens, two million, the whole codebase at once. But an analysis of enterprise deployments found that nearly 65% of agent failures came from context drift or memory loss during multi-step work, not from a window that was too small. The teams shipping reliable agents in 2026 aren't the ones with the biggest window. They're the ones who curate hardest what the model actually sees. Here's the difference, and why more is often worse.

Every model release leads with the same brag: a bigger context window. A million tokens. Two million. "Fit your whole codebase in one prompt." It sounds like the answer to making agents reliable — just give the model everything and let it sort it out.

It isn't, and the data says so plainly. An analysis of enterprise AI deployments found that nearly 65% of agent failures came from context drift or memory loss during multi-step reasoning — not from a window that was too small. The conclusion from the people shipping reliable long-running agents in 2026 is blunt: the teams that win aren't the ones with the biggest context window, they're the ones with the most rigorous context management.

That flips the intuition most people start with, so let me unpack it.

More context is not more understanding

The instinct is that a model is like a student who'd do better with more notes. But a model doesn't read your context the way you'd hope. Bury the one relevant fact in a million tokens of mostly-irrelevant material and the model's attention spreads thin — it weighs the noise alongside the signal, gets pulled by whatever's nearby, and loses the thread. The industry even has a name for it now: context rot. The window got bigger; the model's ability to use all of it well did not keep pace.

So "just stuff everything in" trades one problem for a worse one. You stop worrying about what to include, and you start losing to everything you shouldn't have included. A big window makes it possible to give the model too much. It doesn't make that a good idea.
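To make "attention spreads thin" concrete, here is a toy calculation, not a claim about how any production model behaves. It assumes a single softmax over the context, one relevant token with a higher score, and every other token sharing the same lower score. The function name and the default scores are made up for illustration.

```python
import math


def weight_on_signal(n_tokens: int,
                     signal_score: float = 5.0,
                     noise_score: float = 0.0) -> float:
    """Softmax weight given to one relevant token among n_tokens - 1 noise tokens.

    Toy assumption: all noise tokens share one score and the relevant
    token gets a single, higher score.
    """
    signal = math.exp(signal_score)
    noise = (n_tokens - 1) * math.exp(noise_score)
    return signal / (signal + noise)


# Same fact, same score, more surrounding noise.
for n in (1_000, 100_000, 1_000_000):
    share = weight_on_signal(n)
    print(f"{n:>9,} tokens -> weight on the relevant one: {share:.4%}")
```

In this simplified picture the relevant token's score never changes. Its share of the total shrinks only because there is more around it to compete with.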

Drift is the real killer

The 65% number points at something specific: the failures happen during multi-step work, as context drifts. An agent doing a long task accumulates state — earlier steps, tool outputs, half-finished reasoning — and across twenty steps that pile grows messy. The original goal slides out of focus. A stale fact from step three contradicts a fresh one from step fifteen, and the model can't tell which to trust. By the end it's reasoning over a polluted picture of its own making.

This is why a bigger window doesn't save you. It gives the drift more room to accumulate, not less. The fix isn't capacity — it's hygiene: deciding, at each step, what the model should still be carrying and what should be dropped.
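One way to picture that hygiene is a small sketch in which each thing the agent has learned is stored with the step it came from, and only the newest entry per subject is carried forward. The `Fact` type, the `prune_stale` helper, and the "newest wins" rule are all assumptions for illustration. A real agent might need to weigh source reliability as well as recency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str    # what the fact is about, e.g. "deploy_target"
    value: str  # what the agent currently believes
    step: int   # step at which the agent observed it


def prune_stale(facts: list[Fact]) -> list[Fact]:
    """Keep only the most recent fact per key, ordered by step.

    Assumption: a later observation always supersedes an earlier one.
    """
    latest: dict[str, Fact] = {}
    for fact in facts:
        current = latest.get(fact.key)
        if current is None or fact.step >= current.step:
            latest[fact.key] = fact
    return sorted(latest.values(), key=lambda f: f.step)


facts = [
    Fact("deploy_target", "staging", step=3),
    Fact("test_status", "failing", step=7),
    Fact("deploy_target", "production", step=15),
]

# The step-3 value is dropped, so the model never sees the contradiction.
kept = prune_stale(facts)
```

The point of the sketch is where the decision is made: the contradiction between step three and step fifteen is resolved before the prompt is built, instead of being left for the model to untangle.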

What context management actually looks like

The reliable-agent teams treat context as something to engineer, not a bucket to fill:

  • Curate, don't dump. Give the model the few things this step needs, not everything the task might touch. "Dumb RAG" — shoving every retrieved document into the prompt — is a named failure mode for a reason.
  • Compress as you go. Summarize the finished steps into a short running state instead of dragging the full transcript forward. The model carries the conclusion, not the raw history.
  • Scope the tools. Fewer, sharper tools in context beat a giant menu the model has to reason through every turn.
  • Refresh the goal. Re-anchor the original objective at each step so it doesn't erode under the weight of everything that's happened since.

None of that needs a bigger window. Most of it works better in a smaller one, because a tight context is a focused one.
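The four practices above can be sketched as a single per-step prompt builder. Everything here is hypothetical: the `Tool` type, the tag-based tool filter, the precomputed relevance scores, and the truncation that stands in for a real summarizer are assumptions chosen to keep the example self-contained, not anyone's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    tags: frozenset[str]


def curate(docs: list[tuple[float, str]], k: int = 2) -> list[str]:
    """Curate, don't dump: keep the k highest-scoring documents.

    Assumes each document arrives with a relevance score already attached.
    """
    ranked = sorted(docs, key=lambda d: d[0], reverse=True)
    return [text for _, text in ranked[:k]]


def compress(summary: str, step_result: str, max_chars: int = 400) -> str:
    """Compress as you go: fold a finished step into the running summary.

    Truncation is a stand-in; a real system would likely summarize instead.
    """
    combined = f"{summary} {step_result}".strip()
    return combined[-max_chars:]


def scope_tools(tools: list[Tool], needed: frozenset[str]) -> list[Tool]:
    """Scope the tools: expose only tools tagged for this step."""
    return [t for t in tools if t.tags & needed]


def build_context(goal: str, summary: str, docs: list[tuple[float, str]],
                  tools: list[Tool], needed: frozenset[str]) -> str:
    """Refresh the goal: the objective leads every step's prompt."""
    lines = [f"GOAL: {goal}", f"STATE SO FAR: {summary or '(nothing yet)'}"]
    lines.append("RELEVANT DOCS:")
    lines += [f"- {d}" for d in curate(docs)]
    lines.append("TOOLS:")
    lines += [f"- {t.name}: {t.description}" for t in scope_tools(tools, needed)]
    return "\n".join(lines)


tools = [
    Tool("run_tests", "Run the test suite", frozenset({"test"})),
    Tool("send_email", "Send an email", frozenset({"comms"})),
]
docs = [
    (0.9, "auth module README"),
    (0.2, "marketing plan"),
    (0.7, "auth test notes"),
]
summary = compress("", "Step 1: reproduced the failing auth test.")
ctx = build_context("Fix the failing auth test", summary, docs, tools,
                    frozenset({"test"}))
```

The prompt that comes out is short on purpose: one goal, one running summary, two documents, one tool. Nothing in it depends on how large the window is.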

The bottom line

The context window is a spec-sheet number, and like most spec-sheet numbers it measures capacity, not skill. A two-million-token window tells you what the model can ingest; it tells you nothing about whether feeding it that much will help, and the failure data points at drift, not capacity, as what usually goes wrong. The reliability of an agent is decided by what you choose to put in front of it at each step, which is work the window size will never do for you.

So the next time a launch leads with a record-breaking context window, read it as what it is: more room, not more understanding. The teams whose agents actually hold together aren't filling the window. They're guarding it — and that discipline, not capacity, is what separates an agent that works from one that quietly drifts into nonsense.
