AI is brilliant at ideas and bad at being right

June 8, 2026

We expected AI to automate the boring work and leave humans the creative heights. The research from 2026 says we had it backwards. When AI agents were set loose on real research, they generated novel, clearly written ideas, and then produced fabricated or invalidated experimental results in about 80% of cases. AI turns out to be a fantastic source of ideas and a terrible judge of whether they're true. Once you see that split, how you should use it becomes obvious, and so does the mistake almost everyone is making.

When researchers built a benchmark to test AI agents on real machine-learning research — come up with an idea, design the experiment, run it, write it up — they found a lopsided result. The agents were good at the front of the process: they generated novel ideas and articulated them clearly. Then they hit the part that matters and fell apart. In roughly 80% of cases the coding agents produced fabricated or invalidated experimental results, and the overall research quality landed below an acceptable bar — not because the ideas were dull, but because the work wasn't sound.

Sit with that shape, because it's the opposite of the story we told ourselves. We assumed AI would handle the rote, mechanical parts and humans would keep the creative high ground. The data says AI is genuinely creative and genuinely unreliable. It's great at having ideas and bad at being right. That single fact, taken seriously, should reorganize how you use it.

Two different skills we kept treating as one

There's an old split in how people think: divergent thinking — generating lots of possibilities — and convergent thinking — judging which of them is actually true, valuable, or feasible. We tend to blur them together and call the whole thing "smart." AI forces them apart, because it's strong at one and weak at the other.

On divergence, AI is legitimately impressive. A study led by Yoshua Bengio's group this year found language models can match or beat average humans at generating ideas — it's the most frictionless brainstorm partner ever made. But the same research found AI lacks the evaluative side: it has no real filter for which wild idea is worth anything. It will hand you ten directions with equal confidence and no sense of which one is a dead end. The judgment — the "which of these is actually right" — is exactly what it doesn't have, and exactly what the research benchmark measured it failing at.

Why this is so easy to get wrong

Here's the trap. AI's output is fluent. The fabricated experimental result is written up as cleanly as the valid one. The dead-end idea is articulated as confidently as the brilliant one. Because it presents everything with the same polish, fluency reads as rigor — and it isn't. This is the same illusion behind the sycophancy problem and the "agent declares victory while quietly wrong" problem: the surface is convincing precisely where the substance is weakest.

So the natural mistake is to take AI's confident, well-written output as if it were verified. It isn't verified. It's generated. Those are different things, and AI has made only the generating cheap. The clean prose is not evidence the idea is sound; it's evidence the model is good at prose.

The division of labor that actually works

Once you accept "great at ideas, bad at being right," the right way to use AI falls out almost mechanically:

  • Point it at divergence, not decisions. Use AI to widen the space — twenty approaches, angles you hadn't considered, a first draft to react against. That's where it genuinely beats a blank page. Don't ask it to tell you which one is correct; that's the part it can't do.
  • Keep rigor human, and make it explicit. The "is this actually true, does this experiment hold up, will this survive scrutiny" step is yours. Treat every AI-generated claim as a hypothesis to test, not a finding to trust. The roughly 80% rate of fabricated or invalidated results is the cost of skipping that step.
  • Verify against reality, not against the model. A confident answer checked only by asking the model again is still unverified. Run it, test it, look at the source. The judgment has to touch something real.
  • Remember the divergence has a ceiling too. Everyone brainstorming with the same models drifts toward the same ideas — research this year warned AI can make thinking more uniform. Use it to get unstuck, then push past where it stops, because the genuinely original move is still yours.
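
The "hypothesis, not a finding" habit from the list above can be made concrete in a few lines of Python. This is only a sketch under stated assumptions: the `Hypothesis` class, the `triage` function, and the two sample claims are hypothetical names invented for illustration, not part of any benchmark or tool mentioned here. The point it shows is that each claim is paired with a check that runs against something real, and a claim whose check fails or cannot run stays unverified.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Hypothesis:
    """A model-generated claim paired with an independent check (hypothetical type)."""
    claim: str                  # what the model asserted, in plain words
    check: Callable[[], bool]   # a test that touches real code or data, not the model


def triage(hypotheses: List[Hypothesis]) -> Tuple[List[str], List[str]]:
    """Run every check and split the claims into verified and rejected."""
    verified, rejected = [], []
    for h in hypotheses:
        try:
            ok = bool(h.check())
        except Exception:
            ok = False  # a check that cannot run counts as unverified
        (verified if ok else rejected).append(h.claim)
    return verified, rejected


# Made-up example: two equally confident claims, only one of them true.
claims = [
    Hypothesis("sorted() returns a list in ascending order",
               lambda: sorted([3, 1, 2]) == [1, 2, 3]),
    Hypothesis("list.sort() returns the sorted list",
               lambda: [3, 1, 2].sort() == [1, 2, 3]),  # sort() returns None
]

verified, rejected = triage(claims)
print("verified:", verified)
print("rejected:", rejected)
```

Both claims read as equally plausible; only running the checks separates them. The design choice that matters is that `check` never calls the model again, so the verdict comes from reality rather than from a second confident answer.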

The bottom line

The headline expectation was that AI would take the drudgery and leave us the creative work. The reality is almost the reverse: AI is a tireless idea generator that cannot reliably tell a true idea from a false one, and presents both with the same confident polish. That makes it a superb thinking partner and a dangerous oracle, and which one it is depends entirely on whether you supply the rigor it lacks.

So use it the way you'd use a brilliant, fast, slightly unreliable colleague who's never short of suggestions and never sure which are right: take the ideas gratefully, and verify every one yourself. The creativity is real and worth having. The rigor was always your job, and the research just confirmed that handing it to the model is how you get a beautifully written wrong answer.
