The spec is the artifact, not the prompt

May 17, 2026

When agent behavior lands via spec PRs instead of prompt edits, the team reasons about agents the way it reasons about code. Here's what that looks like in practice and why it works.

The first time I tried to ship a non-trivial agent, the artifact under review was a giant system prompt — six paragraphs, a tool list, a few example dialogues, and a list of "don'ts" that grew every time something broke. Reviewers would comment on the prose. New behaviors landed as prompt edits, sometimes in PRs, sometimes pasted into Slack. There was no single answer to the question "what is this agent supposed to do?"

That doesn't scale past one agent.

What a spec looks like

A spec is a single source-controlled artifact per agent. It says:

  • Identity: name, role, what it's responsible for in one sentence.
  • System prompt: the actual prompt, as rendered. Includes references to canonical instructions that other specs share.
  • Tool list: every tool the agent can call, with the contract that the MCP server enforces. No magic tools.
  • Guardrails: things the agent must refuse. Concrete, testable.
  • Output schema: the shape of the structured output, validated at the boundary. Free-form text only when the user is the consumer.
  • Evaluations: which scenarios this agent must pass. Public benchmarks plus the holdout suite.

Reviewers approve the spec. The runtime is generated from it.
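The six sections above can be made concrete as typed data. A minimal sketch in Python — the class and field names here are illustrative, not a real schema, and the example agent (`triage-bot`, `ticket_lookup`) is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolContract:
    name: str
    description: str
    input_schema: dict  # JSON-Schema-style contract the tool server enforces

@dataclass(frozen=True)
class AgentSpec:
    name: str
    role: str                   # responsibility in one sentence
    system_prompt: str          # the actual prompt, as rendered
    tools: list[ToolContract] = field(default_factory=list)
    guardrails: list[str] = field(default_factory=list)  # concrete, testable refusals
    output_schema: dict = field(default_factory=dict)    # validated at the boundary
    evals: list[str] = field(default_factory=list)       # scenario suites this agent must pass

# A hypothetical spec instance — this is the artifact reviewers approve.
spec = AgentSpec(
    name="triage-bot",
    role="Routes inbound support tickets to the right queue.",
    system_prompt="You are a triage agent...",
    tools=[ToolContract("ticket_lookup", "Fetch a ticket by id",
                        {"type": "object", "required": ["ticket_id"]})],
    guardrails=["Never edit a ticket; read-only access."],
    output_schema={"type": "object", "required": ["queue", "confidence"]},
    evals=["triage/holdout-v3"],
)
```

Because the spec is just data, "generate the runtime from it" is mechanical: render the prompt, wire up exactly the listed tools, install the schema check at the output boundary.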

What changes

Three things, all of them load-bearing.

Reviews are about behavior, not phrasing. "Make the agent more helpful" stops being a useful PR comment. The conversation becomes "this guardrail is missing", or "this tool isn't in the list — should it be?", or "the output schema doesn't cover the empty case". You can actually disagree about something concrete.
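The "empty case" comment is the kind of disagreement a spec makes possible, because the schema check runs at the boundary either way. A sketch of what that boundary check might look like — `validate_output` and the `findings` key are hypothetical, not from any real library:

```python
def validate_output(output: dict) -> list[str]:
    """Return a list of schema violations; an empty list means valid.

    Hypothetical boundary check for an agent whose output schema
    requires {"findings": [...]} — including the empty case.
    """
    errors = []
    if "findings" not in output:
        errors.append("missing required key: findings")
    elif not isinstance(output["findings"], list):
        errors.append("findings must be a list (may be empty)")
    return errors

# An empty result is still valid structured output...
assert validate_output({"findings": []}) == []
# ...whereas a missing key is a violation the boundary catches.
assert validate_output({}) == ["missing required key: findings"]
```

The PR comment "the output schema doesn't cover the empty case" now maps to a one-line schema change and a test, not a prompt rewording.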

Improvements stack. Every prompt tweak that someone shipped in Slack last quarter is now either in the spec or it isn't. If it isn't, it either gets re-derived from first principles or quietly dropped — both are good outcomes.

Onboarding is a directory. A new engineer reads four specs and knows what the system does. They don't read prompt fragments scattered across twelve files.

Where it's harder than it sounds

Specs are not free. Three honest costs:

  1. You need a tool layer. If your agent is calling httpx.get(...) inline, there's no spec discipline to enforce. The spec assumes tools are typed, versioned, and discoverable. MCP fits.
  2. You need a real eval suite. Without one, the spec is just hopeful prose. The eval is what makes the spec a contract.
  3. You need to write the spec. This is the part that feels like bureaucracy until the second agent goes into production.
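The first cost — a tool layer — is the difference between code the agent happens to run and tools the spec can enumerate. A minimal sketch of that discipline, assuming a hypothetical in-process registry (MCP servers enforce the same idea over a protocol):

```python
from typing import Callable

# Hypothetical tool registry: every callable the agent can use is
# declared with a contract, so the spec's tool list has something
# real to point at. Nothing unregistered is callable.
TOOLS: dict[str, dict] = {}

def tool(name: str, input_schema: dict) -> Callable:
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "input_schema": input_schema}
        return fn
    return register

@tool("ticket_lookup", {"type": "object", "required": ["ticket_id"]})
def ticket_lookup(ticket_id: str) -> dict:
    # The HTTP call lives behind the contract, not inline in agent code.
    # (Stubbed here; a real implementation would fetch the ticket.)
    return {"ticket_id": ticket_id, "status": "open"}

# The agent resolves tools by name; an inline httpx.get(...) has no
# entry here, which is exactly why the spec can't describe it.
assert "ticket_lookup" in TOOLS
```

Typed, versioned, discoverable: the registry is what lets a reviewer check the spec's tool list against reality.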

It pays for itself the first time you have to debug a regression in production by reading something other than chat logs.
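"The eval is what makes the spec a contract" in miniature: each guardrail and behavior pairs with a scenario that must pass before a spec change merges. A hypothetical harness — `run_agent` is a stub standing in for the generated runtime, and the scenarios are invented for illustration:

```python
# Stub standing in for the runtime generated from the spec.
def run_agent(prompt: str) -> dict:
    if "delete" in prompt:
        return {"refused": True, "reason": "read-only guardrail"}
    return {"refused": False, "queue": "billing"}

# Each scenario: a name, an input, and a predicate on the output.
SCENARIOS = [
    ("route a billing question", "my invoice is wrong",
     lambda out: out.get("queue") == "billing"),
    ("refuse destructive requests", "delete ticket 42",
     lambda out: out.get("refused") is True),
]

failures = [name for name, prompt, check in SCENARIOS
            if not check(run_agent(prompt))]
assert failures == []  # a red suite blocks the spec PR
```

When a regression ships anyway, the failing scenario names the broken promise, and the spec diff names the change that broke it — no chat-log archaeology required.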
