METHODOLOGY · June 7, 2026

You can't run an agent you can't watch

A Cisco survey this year found most companies are running agents they can't properly monitor. That's the whole problem in one sentence. Agents fail in a way regular software doesn't — they return a tidy success while quietly doing the wrong thing, and you only see it in the full trace of what they did, not the final output. 'Agent observability' became its own discipline in 2026 for exactly that reason. The unglamorous ability to watch what your agent actually did is turning into the line between a pilot and production.

Here's a quietly alarming finding: a Cisco survey this year reported that 71% of organizations are running AI agents they can't properly monitor. Read that as what it is — most teams deploying agents have no reliable way to see what those agents are doing. They launched something autonomous into their business and then closed their eyes.

It sounds careless, but it's an easy trap, because agents break the assumptions that normal monitoring is built on. And the industry's response — a whole new category called agent observability that barely existed a year ago — tells you how real the problem is. This is worth understanding even if you'll never buy a tool for it, because the principle underneath is simple: you cannot run what you cannot see.

Agents fail in a way that looks like success

Normal software fails loudly. It throws an error, returns a 500, crashes. Your monitoring is built to catch exactly that. Agents don't do you the favor.

An agent fails in ways that look like success: a well-formed answer that's wrong, a tool call it didn't need, an action that's syntactically valid and semantically nonsense. It returns a clean HTTP 200 and a confident result while having done the wrong thing. This is the same failure I wrote about earlier, agents declaring victory against the wrong target, and from the outside it is invisible. Nothing errored. The dashboard is green. The agent quietly mishandled the case and moved on.

That's why traditional monitoring doesn't save you here. Tracking response codes and latency tells you the agent ran. It tells you nothing about whether it did the right thing — and "ran successfully while being wrong" is the agent's signature failure mode.
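To make the gap concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: the `AgentResult` shape, the refund scenario, and the rule that a refund may not exceed the order total. The point is only that the same result can pass a transport-level check and fail a task-level one.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    status_code: int   # transport-level outcome
    latency_ms: float  # how long the run took
    answer: dict       # what the agent says it did (hypothetical shape)


def transport_healthy(result: AgentResult, max_latency_ms: float = 2000.0) -> bool:
    """What traditional monitoring sees: it ran, and it ran fast enough."""
    return result.status_code == 200 and result.latency_ms <= max_latency_ms


def semantically_correct(result: AgentResult, order_total: float) -> bool:
    """An assumed task rule: a refund must be between 0 and the order total."""
    refund = result.answer.get("refund_amount", 0.0)
    return 0.0 <= refund <= order_total


# A run that looks fine on the dashboard but refunds ten times the order value.
result = AgentResult(status_code=200, latency_ms=340.0,
                     answer={"refund_amount": 950.0})

print(transport_healthy(result))                        # True
print(semantically_correct(result, order_total=95.0))   # False
```

The second check is the one that needs domain knowledge, which is probably why it is the one that gets skipped.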

The failure lives in the trace, not the output

There's a second reason agents are hard to watch. Their mistakes are usually not in any single step — they're in the sequence. An agent reads, decides, calls a tool, reads the result, decides again, calls another. Each individual call can look perfectly fine while the overall path quietly goes off the rails. As one guide puts it, multi-turn failures are invisible at the individual call level and only show up in the full causal trace.

So observing an agent doesn't mean logging its final answer. It means capturing the whole chain (every model call, every tool execution, every reasoning step) as a trace you can replay and follow. The difference is stark in an incident: teams with that trace can answer "why did it do that" in minutes; the uninstrumented majority can only shrug and re-run it, hoping it behaves differently this time. As agents move into work that touches money and customers, closing that gap stops being a nice-to-have.
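A rough idea of what "capturing the whole chain" can look like on a small project, using only the Python standard library. The `Step` and `Trace` classes, the step kinds, and the refund example are all assumptions made up for this sketch, not any particular tool's format.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Iterator, List


@dataclass
class Step:
    index: int        # position in the session, so order survives storage
    kind: str         # "model_call", "tool_call", or "decision"
    name: str         # which model or tool was involved
    inputs: Any
    output: Any
    timestamp: float


@dataclass
class Trace:
    session_id: str
    steps: List[Step] = field(default_factory=list)

    def record(self, kind: str, name: str, inputs: Any, output: Any) -> Step:
        """Append one step; the index is assigned in call order."""
        step = Step(len(self.steps), kind, name, inputs, output, time.time())
        self.steps.append(step)
        return step

    def to_jsonl(self) -> str:
        """One JSON object per step, ready to append to a log file."""
        return "\n".join(
            json.dumps({"session_id": self.session_id, **asdict(s)})
            for s in self.steps
        )

    def replay(self) -> Iterator[str]:
        """Walk the recorded path in order, one readable line per step."""
        for s in self.steps:
            yield f"[{s.index}] {s.kind}:{s.name} {s.inputs!r} -> {s.output!r}"


trace = Trace(session_id="demo-1")
trace.record("model_call", "planner", "refund order 42", "look up the order first")
trace.record("tool_call", "lookup_order", {"order_id": 42}, {"total": 95.0})
trace.record("tool_call", "issue_refund", {"order_id": 42, "amount": 950.0}, {"ok": True})

for line in trace.replay():
    print(line)
```

Each step here looks fine on its own. The mismatch between the looked-up total and the refunded amount only shows when the steps are read in sequence.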

This is the unglamorous half of "watch, don't approve"

I've argued that the job shifts from approving every step to watching the system — setting policy and intervening when something looks wrong. The uncomfortable follow-up is: you can only watch what you've instrumented. "I'll monitor it" is an empty promise if you have no trace to monitor. The whole watch-don't-approve model quietly assumes an observability layer that most teams skipped building.

That's exactly the gap the enterprise vendors are rushing to fill. Hyland's new Control Tower is pitched as a command center that tracks agents against KPIs and can pause or adjust one in real time when it crosses a guardrail — and its Agent Lifecycle Management frames an agent as something you manage from design through retirement, not something you launch and forget. Strip away the enterprise packaging and it's the same lesson: scaling agents without oversight isn't scaling, it's gambling.
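The "pause when it crosses a guardrail" idea is simple enough to sketch in a few lines of Python. To be clear, this is not Hyland's API or anyone else's; `Guardrail`, `AgentController`, and the metric names are hypothetical, and a real system would need persistence, alerting, and an audited way to resume.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Guardrail:
    metric: str       # name of the KPI being watched (hypothetical)
    max_value: float  # the agent is paused when the metric exceeds this


class AgentController:
    def __init__(self, guardrails: List[Guardrail]) -> None:
        self.guardrails = guardrails
        self.paused = False
        self.reason: Optional[str] = None

    def report(self, metrics: Dict[str, float]) -> bool:
        """Return True if the agent may keep running after this report."""
        for g in self.guardrails:
            value = metrics.get(g.metric)
            if value is not None and value > g.max_value:
                self.paused = True
                self.reason = f"{g.metric}={value} exceeded {g.max_value}"
                return False
        # Once paused, stay paused until a person explicitly resumes.
        return not self.paused

    def resume(self) -> None:
        """Called by a human after reviewing the trace."""
        self.paused = False
        self.reason = None


controller = AgentController([Guardrail("refund_total_usd", 500.0)])
print(controller.report({"refund_total_usd": 120.0}))  # True
print(controller.report({"refund_total_usd": 950.0}))  # False
print(controller.reason)
```

The design choice worth noting is that a pause is sticky: a later healthy reading does not quietly un-pause the agent.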

What to actually do

You don't need a platform to take the principle seriously. Even on a solo project:

Trace the whole session, not just the result. Log every tool call and decision in order, so when something goes wrong you can replay the path instead of guessing. The final output is the least informative thing to keep.
Watch for "succeeded but wrong," not just errors. Your alerts should catch semantic failure — the agent that returned 200 and did the wrong thing — which means evaluating outputs against criteria, not just checking that it ran.
Treat the agent as something with a lifecycle. It will drift as models update and the world changes; most of that drift happens silently after the demo. Re-check it on a schedule the way you'd review any system that can quietly rot.
If you can't see it, don't let it act unsupervised. The honest rule: an agent you can't observe is an agent you shouldn't have doing anything consequential on its own.
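The last two items can be wired together in a small way: re-run a fixed set of cases on a schedule, and only let the agent act alone if it is both traced and still passing. This is a sketch under assumptions; the case format, the `0.95` threshold, and the stub agent are placeholders chosen for illustration.

```python
from typing import Callable, List, Tuple

# Each case pairs a prompt with a predicate the agent's output must satisfy.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(agent: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases whose output satisfies its predicate."""
    if not cases:
        raise ValueError("need at least one case to evaluate")
    passed = sum(1 for prompt, check in cases if check(agent(prompt)))
    return passed / len(cases)


def may_act_unsupervised(has_trace: bool, rate: float,
                         min_rate: float = 0.95) -> bool:
    """No trace, no autonomy; a low pass rate means drift, so no autonomy either."""
    return has_trace and rate >= min_rate


def stub_agent(prompt: str) -> str:
    # Stand-in for a real agent: it handles cancellations and nothing else.
    return "cancelled order" if "cancel" in prompt else "ok"


cases: List[Case] = [
    ("cancel order 17", lambda out: "cancel" in out),
    ("refund order 42", lambda out: "refund" in out),
]

rate = pass_rate(stub_agent, cases)
print(rate)                                            # 0.5
print(may_act_unsupervised(has_trace=True, rate=rate)) # False
```

Run something like `pass_rate` from a scheduled job after every model or prompt change, and the silent drift at least becomes a number you can watch.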

The bottom line

The exciting story about agents is autonomy — they go do things without you. The unglamorous truth is that autonomy without observability isn't independence, it's just blindness. An agent that fails like normal software, you'd catch. An agent that fails by handing you a confident, green-checkmarked wrong answer, you'll only catch if you built the ability to look.

So before you let an agent run anything that matters, ask the plain question: if it did the wrong thing right now, would I even know? If the answer is no, you don't have a production agent. You have an unmonitored one — and that's just an incident that hasn't been noticed yet.
