ARCHITECTURE · July 1, 2026

Your million-token context window is lying to you

Vendors sell context length like RAM: bigger is strictly better, just stuff everything in. But attention isn't uniform. Studies keep finding the same U-shape — a model reliably uses the start and end of its window and quietly ignores the middle, with accuracy dropping 30%+ once the important thing is buried in there, sometimes after just 10k tokens. Context isn't a bucket you fill. It's a scarce, positional resource you engineer. 'Put it all in the prompt' is the new premature optimization.

The million-token context window arrived and everyone drew the obvious conclusion: retrieval is over, pipelines are over, just paste the whole codebase / all the docs / the entire history into the prompt and let the model sort it out. It's a seductive story. It's also wrong in a way that will quietly degrade your product while every token counter says you're fine.

The window is not uniform, and the middle is a graveyard

The uncomfortable finding, replicated over and over, is that a model does not attend to its context evenly. It leans hard on the start and the end, and the middle gets skimmed — the "lost in the middle" U-curve. Chroma's "context rot" work found accuracy dropping 30%+ when the relevant fact sits mid-window, with measurable degradation after only ~10k tokens — regardless of the million-token number on the box. NVIDIA's own 2026 guidance is blunt about it: keep your prompt under a third of the stated window.

A bigger context window doesn't mean the model reads more. It means it has more room to ignore the thing you needed it to see.

Think about what that does to "just stuff everything in." You put the critical instruction, or the one relevant function, or the clause that actually matters, somewhere in the vast middle of a 400k-token dump. The token counter is green. The model glides right past it. And you'll never see an error; you'll see a subtly worse answer that looks completely plausible.
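Since the failure is silent, one way to make it visible is to measure it yourself: plant a known fact at different depths of otherwise identical filler and compare the answers you get back. The sketch below only builds the probe prompts. The function names, the depth values, and the prompt layout are illustrative assumptions, and the call to an actual model is deliberately left out.

```python
from typing import Dict, List, Sequence


def place_needle(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional depth of the filler lines.

    depth=0.0 puts it first, depth=1.0 puts it last, 0.5 puts it mid-window.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(filler))
    return "\n".join(filler[:index] + [needle] + filler[index:])


def build_probes(
    filler: List[str],
    needle: str,
    question: str,
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Dict[float, str]:
    """Build one prompt per depth; only the needle's position changes."""
    return {
        depth: place_needle(filler, needle, depth) + "\n\nQuestion: " + question
        for depth in depths
    }
```

Send each prompt to the model you actually deploy, with filler sized like your real payloads, and score the answers per depth. If accuracy sags around 0.5, you have reproduced the U-curve on your own workload instead of taking a benchmark's word for it.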

Context is a resource you engineer, not a bucket you fill

This reframes the whole "context management" skill. The window isn't storage — it's attention budget, and position is part of the budget. Managing it well is the actual craft now, and it looks like this:

Budget the window. Treat "how much of the window am I using" as a real constraint, and keep the live payload well under the stated max — a third is a good default. Big windows are headroom, not a target.
Bookend what matters. Put the instructions and the highest-value context at the start and the end, where the model actually looks. Never bury the load-bearing sentence in the middle.
Less, but righter. A small, relevant, well-placed context beats a giant one where the signal is drowned. Retrieval didn't die with long context — it got more valuable, because getting the right 1% in front of the model beats dumping 100% into the graveyard.
Mind what else is renting space. Tool definitions, history, boilerplate — they eat the window before your task even starts. Every token you didn't need is pushing your real content toward the part the model skims.
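The four habits above can be sketched as a single prompt-assembly step. This is a minimal illustration under stated assumptions, not a reference implementation: the `Chunk` type, its `score` field (imagined as coming from whatever retriever you use), the four-characters-per-token estimate, and the `overhead_tokens` parameter are all hypothetical stand-ins, and a real system would count tokens with the model's own tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a retriever (hypothetical)


def estimate_tokens(text: str) -> int:
    """Crude assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def assemble_prompt(
    instructions: str,
    chunks: List[Chunk],
    question: str,
    window: int = 1_000_000,     # advertised window, illustrative value
    fraction: float = 1 / 3,     # the "a third" default from the text
    overhead_tokens: int = 0,    # tool definitions, history, boilerplate
) -> str:
    # Budget the window: spend only a fraction of the stated maximum.
    budget = int(window * fraction)
    fixed = estimate_tokens(instructions) + estimate_tokens(question)
    remaining = budget - fixed - overhead_tokens
    if remaining < 0:
        raise ValueError("instructions, question and overhead exceed the budget")

    # Less, but righter: greedily keep the highest-scoring chunks that fit.
    selected: List[Chunk] = []
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if cost <= remaining:
            selected.append(chunk)
            remaining -= cost

    # Bookend what matters: alternate chunks between the front and the back,
    # so the strongest sit at the edges and the weakest land in the middle.
    front: List[Chunk] = []
    back: List[Chunk] = []
    for i, chunk in enumerate(selected):
        (front if i % 2 == 0 else back).append(chunk)
    ordered = front + back[::-1]

    # Instructions open the prompt; the question closes it.
    return "\n\n".join([instructions] + [c.text for c in ordered] + [question])
```

The alternating placement is one simple way to keep the best material away from the middle. Whether it helps on a given model is an empirical question, so it is worth checking against a position-sensitivity test rather than assuming.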

The bottom line

The context window grew, the marketing said "just put everything in," and a lot of products quietly got worse while the dashboards stayed green. Size is headroom, not comprehension. The model reads the ends and skims the middle, and no counter will warn you when your answer has degraded.

Stop filling the window and start engineering it: budget it, bookend what matters, and get the right context in — not all of it.
