2025-03 → present
Scientific Research Agents — AI Architecture
Designing and leading the implementation of autonomous AI systems for scientific research automation at a US-based biotech client. Custom MCP servers, RAG pipelines over scientific corpora, and a held-out evaluation suite.
- Role: AI Architect & Implementation Lead
- Stack: Python · FastAPI · Claude Opus · MCP · pgvector
- Period: 2025-03 → present
The problem
Scientific research is dominated by a small set of repeatable but cognitively expensive workflows: corpus search across paywalled literature, structured data extraction from heterogeneous PDFs, hypothesis generation from prior results, experimental-design critique. Each of those is well within reach of modern LLMs — but only if you build the surrounding architecture correctly.
A naive single-prompt approach falls apart fast: corpora are too large for context windows, the right tools are domain-specific (chemistry-aware literature search is not Google), and any answer that ends up in a research note has to be auditable.
What I architected
Multi-agent orchestration with a hierarchical planner–executor split. A planner agent decomposes a research request into typed sub-tasks; executor agents handle the actual tool calls (literature search, data extraction, hypothesis ranking). The planner stays inside a tight context budget; the executors can fan out without polluting it.
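A sketch of that split, with illustrative names (`SubTask`, `TaskKind`, the fixed decomposition) standing in for the production types: the planner emits typed sub-tasks, and executors consume them without ever writing tool output back into the planner's context.

```python
from dataclasses import dataclass
from enum import Enum

class TaskKind(Enum):
    LITERATURE_SEARCH = "literature_search"
    DATA_EXTRACTION = "data_extraction"
    HYPOTHESIS_RANKING = "hypothesis_ranking"

@dataclass(frozen=True)
class SubTask:
    kind: TaskKind       # typed, so executors can be routed statically
    instruction: str     # the only context the executor inherits

def plan(request: str) -> list[SubTask]:
    # In production this is one planner-LLM call under a strict token
    # budget; a fixed decomposition stands in for it here.
    return [
        SubTask(TaskKind.LITERATURE_SEARCH, f"find prior work: {request}"),
        SubTask(TaskKind.DATA_EXTRACTION, f"extract reported values: {request}"),
        SubTask(TaskKind.HYPOTHESIS_RANKING, f"rank hypotheses for: {request}"),
    ]

def execute(task: SubTask) -> str:
    # Executor agents own the tool calls; their transcripts never flow
    # back into the planner's context, only this summary does.
    return f"[{task.kind.value}] done: {task.instruction}"

def run(request: str) -> list[str]:
    # Planner decomposes once; executors fan out independently.
    return [execute(t) for t in plan(request)]

if __name__ == "__main__":
    for summary in run("effect of compound X on pathway Y"):
        print(summary)
```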
Custom MCP servers as the only tool surface. Three transports in production (Stdio for local dev, SSE for streaming results, Streamable HTTP for the production fleet), every server built end-to-end by me. Tools are versioned and discoverable, agents never see raw HTTP, and the same MCP contract powers internal evals and live runs.
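A minimal version of one such server, using the official MCP Python SDK's `FastMCP` helper; the tool name and signature are placeholders, not the production contract.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("literature-search")

@mcp.tool()
def search_literature(query: str, max_results: int = 10) -> list[dict]:
    """Chemistry-aware literature search over the indexed corpus."""
    # The real implementation queries the pgvector-backed index;
    # stubbed here with a placeholder hit.
    return [{"title": "placeholder", "doi": "placeholder", "score": 0.0}][:max_results]

if __name__ == "__main__":
    # Same server, different transport per environment:
    #   "stdio": local development
    #   "sse": streaming results
    #   "streamable-http": production fleet
    mcp.run(transport="stdio")
```

Because the transport is a single argument to `run`, the identical tool code serves all three environments, which is what lets the eval harness and the live fleet share one contract.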
An evaluation framework that the agent never sees during development. Public agent benchmarks for cross-system comparison plus a larger custom scenario suite drawn from real biology workflows, held out during development so improvements are measured, not declared.
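A sketch of the held-out harness, with hypothetical `Scenario` and `evaluate` names standing in for the real suite:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Scenario:
    prompt: str                    # drawn from a real biology workflow
    passes: Callable[[str], bool]  # deterministic check on the output

def evaluate(agent: Callable[[str], str], suite: list[Scenario]) -> float:
    """Pass rate on the held-out suite.

    The suite lives outside the development loop, so prompts and specs
    are never tuned against these scenarios.
    """
    results = [s.passes(agent(s.prompt)) for s in suite]
    return sum(results) / len(results)

if __name__ == "__main__":
    suite = [Scenario("extract IC50 values from study Z",
                      lambda out: "IC50" in out)]
    toy_agent = lambda prompt: f"IC50 table for: {prompt}"
    print(f"pass rate: {evaluate(toy_agent, suite):.0%}")
```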
Specification-driven development. Every agent has a written spec — system prompt, tool list, guardrails, expected output schema — checked into the repo. The spec, not the prompt, is the artifact reviewers approve. New behaviors land via spec PRs.
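Each spec reduces to a small validated model; this Pydantic sketch uses illustrative field names, not the repo's actual schema.

```python
from pydantic import BaseModel

class AgentSpec(BaseModel):
    """One reviewable file per agent, checked into the repo.

    Reviewers approve this artifact; the runtime prompt is derived
    from it, never edited directly.
    """
    name: str
    system_prompt: str
    tools: list[str]       # must match tools exposed by the MCP servers
    guardrails: list[str]  # hard constraints enforced at runtime
    output_schema: dict    # JSON Schema the agent's answer must satisfy

spec = AgentSpec(
    name="hypothesis-ranker",
    system_prompt="Rank candidate hypotheses against extracted evidence.",
    tools=["search_literature", "extract_table"],
    guardrails=["cite a source for every ranked hypothesis"],
    output_schema={"type": "array", "items": {"type": "object"}},
)
```

Because the spec is data rather than a free-form prompt, a spec PR diffs cleanly: reviewers see exactly which tool, guardrail, or schema changed.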
Day-to-day reality
I do not type the implementation. I direct coding agents (Claude Code on Opus) as the implementation layer while owning the architecture, the spec discipline, and the evaluation discipline. Twenty years of production judgment go into deciding what should be built; agent practice gives the velocity to actually ship it. Spec quality is the new bottleneck.
Outcome
A working, evaluated, production-bound multi-agent system that automates research workflows previously done by hand — with measurable improvement on the held-out benchmark set, not just demo-quality screenshots.
The architectural decisions documented here are the basis for everything else I'm building right now.