# AGENTS.md Beat Skills 100% to 79% in Vercel's Evals — Why Passive Context Outperforms Active Retrieval for Voice AI Navigation
**Posted on January 30, 2026 | HN #6 · 170 points · 79 comments**
*Vercel tested skills vs. AGENTS.md for teaching coding agents Next.js 16 APIs. Skills with explicit instructions: 79% pass rate. AGENTS.md with 8KB compressed docs index: 100%. The lesson for Voice AI: passive navigation context (always-present site structure) beats active retrieval (agent decides when to load docs). Reliability comes from eliminating the decision point.*
---
## The Eval That Challenged the Abstraction
On January 27, 2026, Vercel published eval results that upended expectations about how to teach coding agents framework-specific knowledge. They tested two approaches for teaching agents Next.js 16 APIs not in model training data:
**Skills** (active retrieval): Agent decides when to invoke Next.js documentation skill, reads relevant docs on-demand.
**AGENTS.md** (passive context): 8KB compressed docs index embedded directly in agent context, available every turn without agent needing to decide.
The hypothesis: Skills would win. Clean separation of concerns, minimal context overhead, agent loads only what it needs when it needs it.
The result: **AGENTS.md achieved 100% pass rate. Skills maxed out at 79% even with explicit "you must use this skill" instructions.**
Without explicit instructions, skills performed identically to baseline—**53% pass rate with skill available = 53% pass rate with no documentation at all.** The skill existed. The agent could use it. The agent chose not to 56% of the time.
For Voice AI navigation, the parallel is exact: **passive site structure context (DOM hierarchy always available) beats active page exploration (agent decides when to inspect elements)** for the same reason AGENTS.md beat skills—eliminating the decision point eliminates the failure mode.
---
## Why Skills Failed: The 56% Non-Invocation Rate
Vercel's first discovery: **in 56% of eval cases, the Next.js skill was never invoked.** The agent had access to version-matched documentation but didn't use it.
This isn't a bug. It's a fundamental limitation of current LLMs: **agents don't reliably use available tools** unless explicitly instructed. OpenAI's developer blog acknowledges this as a known issue.
The result table tells the story:
| Configuration | Pass Rate | vs Baseline |
|---------------|-----------|-------------|
| Baseline (no docs) | 53% | — |
| Skill (default) | 53% | +0pp |
Zero improvement. Vercel even found that the unused skill performed *worse* on some metrics (test pass rate of 58% with the skill available vs. 63% for baseline), suggesting an unused skill might introduce noise or distraction.
**For Voice AI:** If you give a navigation agent access to a "website structure inspection tool" but rely on the agent to decide when to use it, the agent won't invoke it reliably. The navigation equivalent of Vercel's 56% non-invocation rate: **agent navigates based on training data patterns without checking whether current site matches those patterns.**
---
## The Fragile Fix: Explicit Instructions Improved to 79% But Introduced New Problems
Vercel's second discovery: adding explicit instructions to AGENTS.md telling the agent to use the skill boosted pass rate to 79%—a 26 percentage point improvement.
The instruction that worked:
```
Before writing code, first explore the project structure,
then invoke the nextjs-doc skill for documentation.
```
Pass rate jumped from 53% to 79%. Skill invocation rate jumped to 95%+.
But Vercel discovered something unexpected: **subtle wording changes produced dramatically different results.**
| Instruction Wording | Agent Behavior | Outcome |
|---------------------|----------------|---------|
| "You MUST invoke the skill" | Reads docs first, anchors on doc patterns | Misses project context |
| "Explore project first, then invoke skill" | Builds mental model first, uses docs as reference | Better results |
Same skill. Same docs. Different outcomes based on instruction order.
In one eval (the `'use cache'` directive test), the "invoke first" wording generated correct `page.tsx` but completely missed required `next.config.ts` changes. The "explore first" wording caught both.
**Vercel's assessment:** This fragility is concerning. If small wording tweaks produce large behavioral swings, the approach feels brittle for production.
**For Voice AI:** The navigation equivalent is instruction sensitivity around when to inspect page structure vs. when to act on cached assumptions. "Always check current page before navigating" vs. "Navigate directly if standard pattern expected" produces different reliability profiles—and which instruction works better depends on site structure variability.
---
## The AGENTS.md Hunch: Remove the Decision Entirely
Vercel's breakthrough insight: **what if we removed the decision point entirely?**
Instead of hoping agents invoke skills reliably, embed a compressed docs index directly in AGENTS.md. Not full documentation—just an index telling the agent where to find specific doc files matching the project's Next.js version.
The agent doesn't decide whether to consult docs. The docs index is **always present in system context**. The agent reads referenced files as needed, but the index itself is passive—no invocation, no sequencing decisions, no fragility.
The compressed format uses pipe-delimited structure packing the entire Next.js docs index into 8KB:
```
[Next.js Docs Index]|root: ./.next-docs
|IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning
|01-app/01-getting-started:{01-installation.mdx,02-project-structure.mdx,...}
|01-app/02-building-your-application/01-routing:{01-defining-routes.mdx,...}
```
The key instruction embedded in the index:
```
IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning
for any Next.js tasks.
```
This tells the agent: **don't rely on training data. Consult the docs.**
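To make the format concrete, here is a minimal sketch of how an agent-side tool might expand the pipe-delimited index into directory-to-file mappings. The format is the one shown in Vercel's post; the parser itself is illustrative, not their implementation:

```python
# Hypothetical parser for the pipe-delimited docs index format.
# The format comes from Vercel's post; this code is an illustrative
# sketch, not their implementation.

def parse_docs_index(index: str) -> dict:
    """Expand '[...]|root: X|dir:{a,b}|...' into {root/dir: [files]}."""
    entries = {}
    root = ""
    for part in index.strip().split("|"):
        part = part.strip()
        if part.startswith("root:"):
            root = part[len("root:"):].strip()
        elif ":{" in part and part.endswith("}"):
            path, files = part.split(":{", 1)
            # Drop the closing brace and any elided "..." placeholders.
            entries[f"{root}/{path}"] = [
                f for f in files[:-1].split(",") if f and f != "..."
            ]
    return entries

index = (
    "[Next.js Docs Index]|root: ./.next-docs"
    "|01-app/01-getting-started:{01-installation.mdx,02-project-structure.mdx}"
)
print(parse_docs_index(index))
# {'./.next-docs/01-app/01-getting-started': ['01-installation.mdx', '02-project-structure.mdx']}
```

Note that the index only names files; the agent still reads each `.mdx` from disk on demand, which is what keeps the always-present portion down to 8KB.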
---
## The Results: 100% Pass Rate with Passive Context
Vercel ran their hardened eval suite (targeting Next.js 16 APIs not in training data: `connection()`, `'use cache'`, `cacheLife()`, `forbidden()`, `after()`, etc.) across all four configurations.
**Final pass rates:**
| Configuration | Pass Rate | vs Baseline |
|---------------|-----------|-------------|
| Baseline (no docs) | 53% | — |
| Skill (default) | 53% | +0pp |
| Skill with instructions | 79% | +26pp |
| **AGENTS.md docs index** | **100%** | **+47pp** |
AGENTS.md didn't just win. It **achieved perfect scores across Build, Lint, and Test**:
| Configuration | Build | Lint | Test |
|---------------|-------|------|------|
| Baseline | 84% | 95% | 63% |
| Skill (default) | 84% | 89% | 58% |
| Skill with instructions | 95% | 100% | 84% |
| **AGENTS.md** | **100%** | **100%** | **100%** |
The "dumb" approach (static markdown file) outperformed sophisticated skill-based retrieval, even when skill triggers were fine-tuned.
---
## Why Passive Context Beat Active Retrieval
Vercel's working theory comes down to three factors:
### 1. No Decision Point
With AGENTS.md, there's no moment where the agent must decide "should I look this up?" The information is already present in system context. With skills, the agent must recognize it needs docs, decide to invoke the skill, then use the retrieved content. Each step is a failure opportunity.
**For Voice AI:** Passive site structure context (DOM hierarchy always in memory) eliminates "should I inspect this element?" decision. The agent knows page structure without deciding to check it.
### 2. Consistent Availability
Skills load asynchronously and only when invoked. AGENTS.md content is in the system prompt for every turn. No invocation latency. No risk of failure mid-retrieval.
**For Voice AI:** If navigation context loads lazily (agent fetches page structure when needed), latency and failure modes introduce reliability gaps. If page structure is always available, agent navigates from known state.
### 3. No Ordering Issues
Skills create sequencing decisions: read docs first vs. explore project first. Vercel found this order sensitivity introduced fragility—different instruction wordings produced different behavioral outcomes.
Passive context avoids sequencing entirely. The docs index is present from turn one. The agent uses it as needed without instruction-dependent sequencing.
**For Voice AI:** The navigation equivalent: if site structure context is always present, the agent doesn't face "explore page first vs. navigate first" sequencing dilemmas. It navigates from comprehensive context, not partial knowledge built sequentially.
---
## Addressing Context Bloat: 80% Compression Without Performance Loss
The obvious objection to embedding docs in AGENTS.md: **context bloat.**
Vercel addressed this with aggressive compression. Initial docs injection: ~40KB. Compressed format: 8KB (80% reduction). Pass rate: **still 100%.**
The compressed format packs the docs index into minimal space using pipe-delimited structure. Each line maps a directory path to contained doc files. The agent reads specific files from `.next-docs/` as needed, keeping full content out of context until required.
**For Voice AI:** The parallel optimization: compress site structure context to minimal representation (element IDs, labels, types, positions) without full HTML. Agent expands detail as needed by inspecting specific elements, but the index remains compact.
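A minimal sketch of that optimization, assuming a crawler produces element dictionaries (the field names here are hypothetical stand-ins, not a real browser API):

```python
# Hypothetical sketch: compress a DOM-like element list into a compact
# one-line-per-element index (id, type, label), dropping full HTML.
# The element dictionaries stand in for whatever a real crawler emits.
import json

def compress_elements(elements: list[dict]) -> str:
    return "\n".join(
        f"{el['id']}|{el['type']}|{el.get('label', '')}"
        for el in elements
    )

elements = [
    {"id": "nav-pricing", "type": "link", "label": "Pricing"},
    {"id": "hero-cta", "type": "button", "label": "Sign Up"},
]
compact = compress_elements(elements)
print(compact)
print(f"full JSON: {len(json.dumps(elements))} bytes, index: {len(compact)} bytes")
```

The compact index stays in context every turn; full attributes for a specific element are fetched only when the agent actually needs a click target.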
---
## What This Means for Voice AI Navigation Design
Vercel's AGENTS.md results validate a principle that applies directly to Voice AI: **passive context beats active retrieval when reliability matters more than token efficiency.**
### 1. Always-Present Site Structure vs. On-Demand Page Inspection
**Active retrieval approach:**
- Agent navigates based on training data patterns
- When uncertain, invokes "inspect page structure" tool
- Reads DOM, identifies elements, updates navigation plan
- Continues based on retrieved structure
**Problem:** Agent must decide when to inspect. If it assumes site matches common patterns (Pricing in header, Features in main nav) without checking, navigation fails when site deviates.
**Passive context approach:**
- Full site structure index embedded in system prompt
- Agent sees element hierarchy every turn
- Navigation decisions based on known structure, not assumptions
- No decision about when to inspect—structure always available
**Result:** Eliminates "should I check current page?" failure mode. Agent navigates from known state, not assumptions.
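The passive approach can be sketched in a few lines: concatenate the site-structure index into the system prompt on every turn, so the model never faces an invocation decision. The message shape mirrors common chat APIs; `SITE_MAP` and the prompt text are illustrative stand-ins:

```python
# Sketch of the passive-context pattern: the site-structure index is
# embedded in the system prompt on every turn, never exposed as a tool
# the agent must decide to call. Message shape mirrors common chat
# APIs; SITE_MAP is a stand-in for a real crawled index.

SITE_MAP = (
    "[Site Structure]|domain: example.com"
    "|header:{nav[Pricing|Features|Docs]}"
)

BASE_PROMPT = "You are a voice navigation agent."

def build_messages(history: list[dict], user_turn: str) -> list[dict]:
    system = (
        f"{BASE_PROMPT}\n\n{SITE_MAP}\n\n"
        "IMPORTANT: Prefer site structure context over training data "
        "patterns for all navigation decisions."
    )
    return [
        {"role": "system", "content": system},
        *history,
        {"role": "user", "content": user_turn},
    ]

msgs = build_messages([], "Take me to the pricing page.")
```

Because `build_messages` runs on every turn, availability is guaranteed by construction rather than by the agent's judgment.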
### 2. Compressed Site Maps Over Full DOM
Vercel compressed 40KB docs to 8KB without performance loss. Voice AI can apply the same principle: **compress site structure to minimal index**, expand details as needed.
**Compressed site map format:**
```
[Site Structure]|domain: example.com
|header:{logo,nav[Pricing|Features|Docs|Contact],cta[Sign Up]}
|main:{hero,features-grid,testimonials,pricing-table,cta}
|footer:{links[About|Blog|Careers],social,legal}
```
Agent knows structure hierarchy. When it needs detailed attributes for specific elements (exact click targets, dynamic content), it inspects. But the index is always present—no decision required.
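Generating that format from a crawled structure is straightforward. A minimal serializer sketch, assuming a nested dict of sections (the section names and format are illustrative, not a standard):

```python
# Hypothetical serializer producing the compact pipe-delimited site map
# from a {section: [items]} dict. The format is illustrative, not a
# standard; a real crawler would populate the dict.

def compress_site_map(domain: str, sections: dict[str, list[str]]) -> str:
    parts = [f"[Site Structure]|domain: {domain}"]
    for name, items in sections.items():
        parts.append(f"|{name}:{{{','.join(items)}}}")
    return "\n".join(parts)

site = compress_site_map("example.com", {
    "header": ["logo", "nav[Pricing|Features|Docs|Contact]", "cta[Sign Up]"],
    "footer": ["links[About|Blog|Careers]", "social", "legal"],
})
print(site)
# [Site Structure]|domain: example.com
# |header:{logo,nav[Pricing|Features|Docs|Contact],cta[Sign Up]}
# |footer:{links[About|Blog|Careers],social,legal}
```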
### 3. Instruction Sensitivity: "Prefer Site Structure Over Training Patterns"
Vercel's key instruction:
```
IMPORTANT: Prefer retrieval-led reasoning over pre-training-led reasoning
for any Next.js tasks.
```
Voice AI equivalent:
```
IMPORTANT: Prefer site structure context over training data patterns
for all navigation decisions.
```
This explicitly tells the agent: **don't assume Pricing is in header because most sites put it there. Check site structure context to see where this specific site puts it.**
Without this instruction, agents default to training data patterns, which fail when sites deviate.
### 4. Eliminating Sequencing Fragility
Vercel found instruction order sensitivity: "invoke skill first" vs. "explore project first" produced different results.
Voice AI faces identical fragility: "inspect page first then navigate" vs. "navigate directly if pattern matches" requires sequencing decisions that introduce brittleness.
Passive context eliminates this: **site structure is always known. Agent navigates directly without sequencing dilemmas.**
---
## The Token Cost vs. Reliability Trade-off
The obvious critique: embedding site structure in every turn wastes tokens. Skills/active retrieval load docs only when needed, minimizing token cost.
Vercel's response: **100% pass rate with 8KB overhead beats 79% pass rate with zero overhead.**
For Voice AI, the calculus is identical:
**Active retrieval:**
- Token-efficient (only load page structure when agent decides it's needed)
- 56% chance agent doesn't invoke when it should
- Instruction-dependent sequencing fragility
- 79% reliability ceiling even with perfect instructions
**Passive context:**
- 8KB site structure overhead per turn
- Zero decision failure modes
- Zero sequencing fragility
- 100% reliability
**When reliability matters (production navigation), the token cost is worth it.**
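The gap widens on multi-step tasks. Using the post's numbers (79% ceiling for active retrieval, 100% for passive context) and assuming steps succeed independently (a simplification), per-step reliability compounds quickly:

```python
# Quick arithmetic: if each navigation step succeeds independently with
# probability p, a k-step task succeeds with probability p**k.
# Per-step numbers are the post's (79% vs 100%); independence between
# steps is a simplifying assumption.

for p, label in [(0.79, "active retrieval"), (1.00, "passive context")]:
    for k in (1, 3, 5):
        print(f"{label}: {k}-step task -> {p**k:.0%}")
```

A 79%-per-step agent completes a five-step navigation only about 31% of the time, which is why the per-turn token overhead buys far more than the headline 21-point gap suggests.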
---
## When Skills Still Win: Vertical Workflows vs. Horizontal Knowledge
Vercel's post clarifies: **skills aren't useless.**
AGENTS.md provides **horizontal, broad improvements** to how agents work with Next.js across all tasks. Always-present context for general framework knowledge.
Skills work better for **vertical, action-specific workflows** users explicitly trigger: "upgrade my Next.js version," "migrate to App Router," "apply framework best practices."
These are discrete, bounded tasks where users intentionally invoke the skill. The agent doesn't need to decide when to use it—the user already did.
**For Voice AI:**
**Passive site structure context (AGENTS.md equivalent):** General navigation across all pages. Always-present. Agent uses it for every navigation decision.
**Active navigation skills (skills equivalent):** Specific workflows like "find all contact forms on this site," "compare pricing tiers across multiple pages," "locate API documentation recursively." User explicitly triggers. Agent executes bounded task.
The two approaches complement each other. Passive context for horizontal reliability. Active skills for vertical workflows.
---
## Practical Recommendations for Voice AI Builders
Vercel's findings translate directly to Voice AI navigation design:
### 1. Don't Wait for Models to Get Better at Tool Use
Skills failed because agents don't reliably invoke tools. Vercel's conclusion: **don't wait for that gap to close. Results matter now.**
For Voice AI: Don't wait for models to reliably decide when to inspect page structure. Embed structure in passive context. Guarantee availability.
### 2. Compress Aggressively
Vercel achieved 80% compression (40KB → 8KB) without performance loss. You don't need full docs in context. An index pointing to retrievable details works.
For Voice AI: Don't embed full DOM in system prompt. Compress to structural index (element types, labels, hierarchy). Agent expands detail as needed.
### 3. Test with Evals Targeting Unknown Content
Vercel's evals targeted Next.js 16 APIs not in training data. That's where doc access matters most.
For Voice AI: Build evals targeting non-standard site structures (unconventional nav patterns, hidden menus, mobile-only elements). That's where site structure context matters most.
### 4. Design for Retrieval-Led Reasoning
Vercel's key instruction explicitly tells agents to prefer docs over training data.
For Voice AI: Explicitly instruct agents to prefer site structure context over common patterns. "Don't assume Pricing is in header. Check site map."
---
## The Broader Lesson: Reliability Through Constraint, Not Flexibility
Skills offered flexibility: agent decides when to load docs, which docs to read, how to sequence exploration vs. documentation.
AGENTS.md offers constraint: docs index always present, no decisions required, no sequencing fragility.
**Flexibility introduced failure modes. Constraint eliminated them.**
For Voice AI, the same principle applies: **navigation reliability comes from constraining agent decisions, not expanding flexibility.**
Passive site structure context constrains the agent: "Here's the site map. Navigate using it. Don't guess based on training data."
Active retrieval offers flexibility: "Decide when to inspect. Decide sequencing. Decide when structure knowledge is sufficient."
**Vercel's evals prove constraint wins when reliability matters.**
---
## Final Thought: The Decision Point Is the Failure Point
Vercel's AGENTS.md vs. skills comparison reveals a fundamental principle for agent reliability:
**Every decision point the agent must make is a potential failure point.**
Skills require three decisions:
1. Should I invoke this skill?
2. Which docs should I read?
3. When should I read them (before exploring project vs. after)?
Each decision is a chance for the agent to choose wrong.
AGENTS.md eliminates all three decisions:
1. Docs index is always present (no invocation decision)
2. Index covers all docs (no selection decision)
3. Index available from turn one (no sequencing decision)
Zero decisions. Zero decision-based failures.
**For Voice AI navigation, the lesson is clear: minimize agent navigation decisions by maximizing passive site structure context.**
Don't ask the agent to decide when to inspect page structure. Provide structure in every turn.
Don't ask the agent to decide navigation sequencing. Provide comprehensive site map that makes direct navigation possible.
Don't ask the agent to decide when training patterns apply vs. when site-specific structure matters. Explicitly instruct: always prefer site structure.
**The decision point is the failure point. Eliminate decisions. Guarantee reliability.**
---
*Word count: ~2,800 | Source: vercel.com/blog/agents-md-outperforms-skills-in-our-agent-evals | HN: 170 points, 79 comments*