
# Building Coding Agents Taught Me Why Voice AI Navigation Needs Four Tools, Not Forty — Lessons in Minimal Agent Architecture

**Posted on February 1, 2026 | HN #2 · 146 points · 48 comments**

*Mario Zechner built `pi`, a minimal coding agent with 4 tools (read/write/edit/bash), a <1,000 token system prompt, and no MCP bloat—then benchmarked it against Cursor, Windsurf, and Codex using Claude Opus 4.5. His result: competitive performance despite rejecting 80% of the features other agents offer (sub-agents, plan mode, background bash, built-in to-dos). The lesson for Voice AI navigation: tool minimalism isn't a constraint—it's a strategy. Four navigation primitives (click/scroll/read/navigate) can match the capability of massive tool catalogs if you design for progressive disclosure instead of upfront context dumping. When MCP servers burn 13-18K tokens declaring tools you'll never use, Voice AI should take the opposite bet: small tool surface, big model intelligence.*

---

## The Coding Agent Arms Race Hit a Wall

By November 2025, coding agents had become spaceships. Claude Code added plan mode, sub-agents, background bash, MCP support, and todo lists. Cursor shipped multi-agent swarms. Windsurf added AI-powered debugging orchestration. Codex integrated context-aware refactoring.

Mario Zechner, who'd used Claude Code since April 2025, watched this evolution with frustration:

> "Over the past few months, Claude Code has turned into a spaceship with 80% of functionality I have no use for. The system prompt and tools also change on every release, which breaks my workflows and changes model behavior. I hate that. Also, it flickers."

The feature bloat created three problems:

### 1. Context Pollution

**Example: MCP server overhead**

- Playwright MCP: 21 tools, 13.7K tokens
- Chrome DevTools MCP: 26 tools, 18K tokens
- Combined: **7-9% of the context window consumed before the user even starts**

Most of those tools? Never used in a given session.
But they're declared upfront, because that's how MCP works.

### 2. Hidden Behavior

Claude Code's sub-agent orchestration:

- Main agent spawns a sub-agent for complex tasks
- Zero visibility into the sub-agent's tool calls or reasoning
- Context transfer controlled by the orchestrator (you can't see what context gets passed)
- Debugging failures = impossible (black box within black box)

### 3. Unstable Workflows

System prompt changes between releases:

- Tool definitions shift (same tool, different schema)
- Model behavior changes (same task, different approach)
- User workflows break (assumptions invalidated)

Mario's solution: **build the opposite.** A minimal coding agent with full observability and stable interfaces.

---

## The `pi` Philosophy: If You Don't Need It, Don't Build It

Mario built four components:

1. **`pi-ai`**: Unified LLM API supporting Anthropic/OpenAI/Google/xAI/Groq/Cerebras/OpenRouter/self-hosted models
2. **`pi-agent-core`**: Agent loop handling tool execution and event streaming
3. **`pi-tui`**: Terminal UI framework with differential rendering (flicker-free updates)
4. **`pi-coding-agent`**: CLI wiring it all together with session management and project context files

**Design principle:** "If I don't need it, it won't be built. And I don't need a lot of things."

### The Minimal Toolset: Four Tools, <1,000 Tokens

Here's the complete tool catalog:

```
read    Read file contents (supports text + images)
        Args: path, offset, limit

write   Write content to file (creates if missing, overwrites if exists)
        Args: path, content

edit    Replace exact text (surgical edits, must match exactly)
        Args: path, oldText, newText

bash    Execute bash command in current directory
        Args: command, timeout
```

That's it. No separate `grep`, `find`, or `ls` tools. Models know how to use `bash` for those operations.

**Comparison:**

- **Claude Code:** 15+ tools (read, write, multi-edit, glob, grep, bash, background bash, browser, sub-agent spawning, etc.)
- **Codex:** Similarly minimal (4-5 core tools)
- **opencode:** Copied Claude Code's toolset (15+ tools)
- **pi:** 4 tools

**System prompt size:**

- **Claude Code:** ~10,000 tokens (includes extensive tool documentation, examples, git workflows, permission checks)
- **Codex:** ~5,000 tokens (model-specific prompts with detailed instructions)
- **opencode:** ~8,000 tokens (cut-down Claude Code prompt)
- **pi:** <1,000 tokens (minimal guidelines + AGENTS.md file)

---

## What `pi` Refuses to Build — And Why That Matters for Voice AI

Mario deliberately excluded features other agents shipped:

### 1. No Built-In To-Dos

**Why other agents have it:** Track multi-step tasks, show progress, maintain state across turns.

**Why pi rejects it:**

> "To-do lists generally confuse models more than they help. They add state that the model has to track and update, which introduces more opportunities for things to go wrong."

**pi's alternative:** If you need task tracking, write a `TODO.md` file. The agent reads it, updates checkboxes, and tracks progress. Externally stateful, version-controlled, visible to the user.

**Voice AI parallel:** Navigation doesn't need built-in "steps remaining" tracking. If the user asks "What's left to find?", the Voice AI can read the page structure and answer. No internal state needed.

### 2. No Plan Mode

**Why other agents have it:** A separate "planning phase" where the agent analyzes the codebase without modifying files.

**Why pi rejects it:**

> "Telling the agent to think through a problem together with you, without modifying files or executing commands, is generally sufficient."

**pi's alternative:** If persistent planning is needed, write a `PLAN.md` file. The agent updates it as work progresses. Survives across sessions, version-controlled.

**Claude Code's plan mode** (added late 2025): Spawns a read-only sub-agent, generates a markdown plan file, and requires approving dozens of read commands for proper analysis.
**pi's approach:** Same outcome (a markdown plan), no sub-agent orchestration, full observability (you see every file the agent reads).

**Voice AI parallel:** Users don't need a "navigation plan mode." If the Voice AI is uncertain, it can ask: "I see Pricing in both the header and footer. Which would you like?" Progressive clarification beats upfront planning.

### 3. No MCP Support

**Why other agents have it:** Access to third-party tool ecosystems (Playwright, Chrome DevTools, Slack, GitHub, etc.).

**Why pi rejects it:**

**Example MCP overhead:**

- Playwright MCP: 21 tools declared upfront
- Voice navigation needs: 2-3 Playwright commands (click, type, wait)
- Token waste: 13.7K tokens for 18 unused tool definitions

**pi's alternative:** Build CLI tools with README files. The agent reads the README when needed (progressive disclosure) and invokes the tool via `bash`. Composable (pipe outputs), token-efficient (pay the cost only when used).

**Voice AI parallel:** The MCP approach = declare every possible navigation action upfront (click button, click link, scroll down, scroll up, scroll to element, type in field, select dropdown, submit form, etc.). The minimal approach = 4 primitives (click, scroll, read, navigate), composed as needed.

### 4. No Background Bash

**Why other agents have it:** Run dev servers, long-running tests, and REPL sessions while the agent continues working.

**Why pi rejects it:**

> "Background process management adds complexity: process tracking, output buffering, cleanup on exit, ways to send input to running processes."

**Claude Code's implementation:**

- Agent can start background processes
- Poor observability (you don't see the full output stream)
- Agent forgets processes after context compaction (users had to manually kill orphaned processes in early versions)

**pi's alternative:** Use `tmux`. The agent spawns tmux sessions and interacts with them via bash, with full observability (you can attach to the same session and co-work with the agent).
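The tmux alternative can be sketched as plain command strings that an agent would pass to its single `bash` tool. This is a hypothetical sketch, not code from `pi`; the session name and dev-server command are illustrative:

```typescript
// Hypothetical sketch: instead of a dedicated "background bash" tool, the
// agent composes ordinary tmux invocations and runs them through its one
// bash primitive. Names ("devserver", "npm run dev") are illustrative.
function tmuxStart(session: string, command: string): string {
  // -d detaches; the user can attach later with `tmux attach -t <session>`
  return `tmux new-session -d -s ${session} '${command}'`;
}

function tmuxRead(session: string): string {
  // capture-pane -p prints the pane's current contents to stdout
  return `tmux capture-pane -pt ${session}`;
}

function tmuxStop(session: string): string {
  return `tmux kill-session -t ${session}`;
}

console.log(tmuxStart("devserver", "npm run dev"));
console.log(tmuxRead("devserver"));
console.log(tmuxStop("devserver"));
```

Because these are just strings fed to `bash`, the user sees every command the agent runs and can attach to the same session — the observability Mario describes, with zero new tool surface.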
**Voice AI parallel:** You don't need "background navigation" where Voice AI monitors multiple pages simultaneously. Navigation is inherently sequential (one page at a time). If monitoring is needed, browser tabs already provide that abstraction.

### 5. No Sub-Agents

**Why other agents have it:** Delegate complex tasks to specialist sub-agents (code review, refactoring, testing).

**Why pi rejects it:**

> "You have zero visibility into what that sub-agent does. It's a black box within a black box. Context transfer between agents is also poor."

**pi's alternative when sub-agents ARE needed:**

- Agent spawns itself via `bash`: `pi --print --provider anthropic --model claude-opus-4.5 < code_review_prompt.md`
- Full observability on output
- Can save the sub-agent session to a file and review it later

**Voice AI parallel:** You don't need "navigation sub-agents." If Voice AI needs to check multiple pages for information (e.g., "Compare pricing across competitors"), it can navigate sequentially and aggregate results. No orchestration needed.

---

## The Terminal-Bench Proof: Minimal Beats Bloated

Mario benchmarked `pi` against established coding agents using Terminal-Bench 2.0:

**Competitors:**

- Cursor (with Sonnet)
- Windsurf (with proprietary Cascade model)
- Codex (with GPT-4)
- Claude Code (with Opus)

**pi configuration:**

- 4 tools (read/write/edit/bash)
- <1,000 token system prompt
- Claude Opus 4.5

**Results (5 trials per task):**

| Agent | Model | Pass Rate | Error Rate |
|-------|-------|-----------|------------|
| **pi** | **Opus 4.5** | **Competitive** | **Similar to native harnesses** |
| Cursor | Sonnet | Baseline | Baseline |
| Windsurf | Cascade | Baseline | Baseline |
| Codex | GPT-4 | Baseline | Baseline |

(Full leaderboard position: mid-tier, outperforming some established agents despite the minimal toolset)

**Key insight from the Terminal-Bench team:** Their own minimal agent, **Terminus 2**, just gives the model a `tmux` session. No file tools, no fancy abstractions.
The model sends commands as text and parses terminal output itself. **It holds its own against agents with sophisticated tooling.**

**Mario's conclusion:**

> "More evidence that a minimal approach can do just as well."

---

## The Four Principles of Minimal Agent Architecture

From Mario's coding agent experience, four design principles emerge:

### 1. Progressive Disclosure > Upfront Declaration

**MCP approach:**

- Declare all tools upfront (21 tools = 13.7K tokens)
- Model sees every tool on every turn
- 90% of declarations wasted (tools never used)

**Minimal approach:**

- Declare 4 primitive tools (<500 tokens)
- Agent composes primitives to solve problems
- If a specialized tool is needed, the agent reads its README (progressive disclosure)

**Voice AI application:**

- Don't declare: click_button, click_link, click_image, click_div, etc. (NxM combinations)
- Declare: `click(selector)` (one primitive, infinite targets)
- Model intelligence fills the gap

### 2. Observability > Orchestration

**Sub-agent approach (Claude Code):**

- Main agent spawns a sub-agent
- User sees: "Running code review sub-agent..."
- Black box: no visibility into the sub-agent's reasoning or tool calls
- Debugging: impossible (can't see what went wrong)

**Observable approach (pi):**

- Agent spawns itself via `bash` with a prompt
- User sees: full command, full output, can save the session for review
- Transparency: every decision visible
- Debugging: trivial (inspect the sub-agent session file)

**Voice AI application:**

- Don't hide navigation reasoning ("Finding pricing page...")
- Show: "I see 'Pricing' in header and 'View Pricing' button in hero. Clicking header link."
- User understands decisions, can correct immediately

### 3. External State > Internal State

**Built-in to-do approach:**

- Agent maintains a todo list in memory
- State invisible to the user (unless the agent reports it)
- Can't version control, can't edit directly, disappears after the session

**File-based approach:**

- Agent writes a `TODO.md` file
- User sees state in their editor
- Can manually edit, version control, reference across sessions

**Voice AI application:**

- Don't track "user preferences" internally (invisible state)
- If the user says "Always skip login prompts," write the preference to a config file
- Agent reads the config on the next session; the user can edit config.json directly

### 4. Composition > Specialization

**Specialized tool approach:**

- `read_file`, `read_directory`, `search_file`, `grep_file` (4 separate tools)
- Each needs a definition, examples, schemas
- Model must choose the correct tool for the task

**Composable primitive approach:**

- `bash` (one tool)
- Agent composes: `grep "pattern" file.txt`, `ls directory/`, `cat file.txt`
- Model already knows bash syntax (trained on GitHub/Stack Overflow)

**Voice AI application:**

- Don't build: `navigate_to_pricing`, `navigate_to_features`, `navigate_to_docs` (specialized)
- Build: `navigate(intent)` (general)
- Model figures out the path: intent="pricing" → finds "Pricing" link → clicks

---

## What Voice AI Can Learn from `pi`'s Minimalism

### Lesson 1: Four Navigation Primitives Are Enough

**Coding agents need:** read, write, edit, bash

**Voice AI navigation needs:** click, scroll, read, navigate

**Why four primitives work:**

**`click(selector)`:**

- Handles buttons, links, tabs, accordions, modals
- Model intelligence determines what to click based on intent
- No need for separate `click_button`, `click_link`, etc.
**`scroll(direction, amount)`:**

- Handles pagination, infinite scroll, "show more" patterns
- Direction: up/down/to-element
- Composable: scroll down, read, scroll down, read (pagination)

**`read(selector?)`:**

- Extracts text from the page or a specific element
- Voice AI uses this to understand page structure before navigating
- Equivalent to the coding agent's `read` (understand before acting)

**`navigate(url_or_intent)`:**

- Direct URL: navigate("https://example.com/pricing")
- Intent-based: navigate("pricing") → model finds the pricing page
- Equivalent to the coding agent's `bash cd` (change context)

**Comparison to the bloated approach.** Imagine Voice AI with an MCP-style toolset:

- `click_button(text)`
- `click_link(href)`
- `click_image(alt)`
- `click_nav_item(text)`
- `click_dropdown_option(value)`
- `click_tab(index)`
- `click_accordion_header(text)`
- `scroll_to_top()`
- `scroll_to_bottom()`
- `scroll_down_one_page()`
- `scroll_to_element(selector)`
- `read_header()`
- `read_body()`
- `read_footer()`
- `read_element(selector)`
- `navigate_to_url(url)`
- `navigate_to_page_by_name(name)`
- `go_back()`
- `go_forward()`
- `refresh()`

**Token cost:** ~15K tokens upfront

**Actual usage:** 90% of tasks use `click`, `scroll`, `read`

**Minimal approach:** 4 tools, ~800 tokens, same capability

### Lesson 2: Context Engineering > Feature Addition

Mario's complaint about Claude Code:

> "Context engineering is paramount. Exactly controlling what goes into the model's context yields better outputs, especially when it's writing code. Existing harnesses make this extremely hard or impossible by injecting stuff behind your back that isn't even surfaced in the UI."
**Voice AI equivalent:**

**Bad approach (hidden context injection):**

- Silently inject page structure analysis into every turn
- Silently inject site map data
- Silently inject user interaction history
- User has no visibility, can't control what the agent sees

**Good approach (explicit context):**

- User sees page structure analysis: "Analyzing page... found 3 navigation menus, 12 links, 2 CTAs"
- Agent explains reasoning: "I see 'Pricing' in both header and sidebar. Clicking header link (more prominent)."
- User can correct: "No, use sidebar" → agent learns to prefer the sidebar on this site

### Lesson 3: Model Intelligence > Tool Proliferation

Terminal-Bench's **Terminus 2** agent:

- No file tools
- Just a `tmux` session
- Model sends bash commands, parses terminal output

**Performance:** Competitive with agents that have sophisticated file operation toolsets.

**Reason:** Frontier models (Opus 4.5, GPT-4) have been trained on vast amounts of code, bash scripts, and terminal sessions. They **already know** how to navigate filesystems, parse outputs, and compose commands.

**Voice AI parallel:** Frontier models have been trained on:

- Web scraping tutorials (BeautifulSoup, Selenium)
- Browser automation scripts (Playwright, Puppeteer)
- HTML/CSS/DOM documentation
- User navigation patterns (heatmaps, analytics)

They **already understand** how to navigate websites. Give them 4 primitives plus page structure, and model intelligence fills the gap.

**Don't need:**

- 21 specialized navigation tools
- Extensive examples for every edge case
- Hand-holding through every navigation pattern

**Need:**

- Clean page structure representation (DOM summary or accessibility tree)
- 4 composable primitives
- Trust that the model will figure it out

### Lesson 4: YOLO Mode for Bounded Domains

Mario on security theater:

> "As soon as your agent can write code and run code, it's pretty much game over.
> The only way you could prevent exfiltration of data would be to cut off all network access for the execution environment the agent runs in, which makes the agent mostly useless."

**pi's approach:** No permission prompts, no pre-checking bash commands for malicious content, full YOLO mode. "Everybody is running in YOLO mode anyways to get any productive work done, so why not make it the default and only option?"

**Voice AI equivalent:**

**Bad approach (security theater):**

- Prompt the user: "Voice AI wants to click 'Buy Now' button. Allow?" (every single click)
- Pre-check navigation intent for malicious patterns ("Don't let the agent submit forms")
- Rate limit clicks (max 10/minute)

**Bounded YOLO approach:**

- Voice AI can navigate freely **within the current site**
- Cross-site navigation requires confirmation
- Form submission (checkout, login) requires explicit user approval
- Otherwise: full autonomy

**Reasoning:** Navigation within a single site is a bounded domain (it can't exfiltrate data, can't execute code, and can't modify server state unless the user approves a form submission). Permission prompts for every click destroy UX.

---

## The Architecture Contrast: Minimal vs. Maximal

### Maximal Approach (Claude Code, circa late 2025)

**System prompt:** ~10,000 tokens

- Detailed tool documentation with examples
- Git workflow instructions
- Permission checking guidelines
- Context management rules
- Sub-agent orchestration protocols
- Plan mode activation conditions

**Tools:** 15+

- File operations: read, write, multi-edit, glob, grep
- Execution: bash, background bash
- Navigation: browser (if enabled)
- Orchestration: sub-agent spawning, task delegation
- State: built-in to-dos, plan mode

**Context overhead per turn:**

- System prompt: 10K tokens
- Tool declarations: 5K tokens
- MCP integrations: 15K tokens (if enabled)
- **Total baseline:** ~30K tokens before the user even starts

**Philosophy:** Provide every tool the agent might need; anticipate every workflow.

### Minimal Approach (pi, Mario's agent)

**System prompt:** <1,000 tokens

- Tool names + brief descriptions
- 4 guidelines (use bash for file ops, use read before edit, be concise, show file paths)
- AGENTS.md injection point (user-controlled context)

**Tools:** 4

- read, write, edit, bash

**Context overhead per turn:**

- System prompt: 1K tokens
- Tool declarations: 500 tokens
- **Total baseline:** ~1.5K tokens

**Philosophy:** Provide minimal primitives; trust model intelligence to compose solutions.

### Voice AI Minimal Architecture

**System prompt:** <1,000 tokens

```
You are a Voice AI navigation assistant. You help users find information
on websites through voice commands.

Available tools:
- click(selector): Click element (buttons, links, tabs, etc.)
- scroll(direction, amount): Scroll page (up/down/to-element)
- read(selector?): Extract text from page or element
- navigate(url_or_intent): Go to URL or find page by intent

Guidelines:
- Use read to understand page structure before navigating
- Explain uncertain decisions: "I see two 'Pricing' links. Which one?"
- Show what you're clicking: "Clicking 'Enterprise Pricing' in header"
- If stuck, describe what you see: "I found a Contact form but no direct pricing page"

[User's NAVIGATION_PREFERENCES.md injected here]
```

**Tools:** 4

- click, scroll, read, navigate

**Context overhead:** ~1.5K tokens (20x more efficient than MCP bloat)

**Page structure representation:** Accessibility tree (already implemented in browsers, ~2-5K tokens per page)

**Total context per turn:** ~3-6K tokens (system + tools + page structure)

---

## The Benchmark That Validates Minimalism

Mario's Terminal-Bench results prove minimal agents can compete with maximal ones when model quality is held constant (all use Claude Opus 4.5 or equivalent frontier models).

**What changes performance:**

1. **Model quality** (Opus > Sonnet > Haiku)
2. **Context clarity** (clean system prompt > bloated prompt)
3. **Tool composability** (4 primitives > 15 specialized tools)

**What doesn't significantly improve performance:**

1. More tools (diminishing returns after primitives)
2. Longer system prompts (confuses the model if poorly structured)
3. Sub-agent orchestration (black box failures)
4. Built-in state tracking (external files work better)

**Voice AI implication:** Demogod's DOM-aware Voice AI with 4 navigation primitives can match (or exceed) competitors with massive tool catalogs IF:

1. Model quality is high (Claude Opus 4.5, GPT-4, Gemini 2.0)
2. Page structure representation is clean (accessibility tree, not raw HTML)
3. Tools are composable (the model figures out multi-step navigation)

**You don't need:**

- 50 specialized navigation tools
- A 20K token system prompt explaining every edge case
- Sub-agents for "complex navigation tasks"

**You need:**

- 4 primitives
- Clean page representation
- Frontier model intelligence

---

## Practical Recommendations for Voice AI Navigation

### 1. Design 4 Primitives, Not 40 Tools

**Don't:**

```typescript
{
  "tools": [
    "click_primary_cta",
    "click_secondary_cta",
    "click_header_link",
    "click_footer_link",
    "click_sidebar_link",
    "click_breadcrumb",
    "click_tab",
    "click_accordion",
    // ... 32 more click variants
  ]
}
```

**Do:**

```typescript
{
  "tools": [
    {
      "name": "click",
      "description": "Click any clickable element",
      "parameters": { "selector": "CSS selector or text description" }
    },
    {
      "name": "scroll",
      "description": "Scroll page or to element",
      "parameters": { "direction": "up | down | to-element", "amount": "optional distance" }
    },
    {
      "name": "read",
      "description": "Extract text from page/element",
      "parameters": { "selector": "optional CSS selector (default: whole page)" }
    },
    {
      "name": "navigate",
      "description": "Go to URL or find page by intent",
      "parameters": { "target": "URL or intent like 'pricing'" }
    }
  ]
}
```

**Token savings:** ~13K tokens (40 specialized tools → 4 primitives)

### 2. Use Accessibility Tree, Not Raw HTML

**Raw HTML approach:**

- Inject the entire DOM into context (50-100K tokens)
- Model must parse HTML, find interactive elements, understand structure
- Context window exhausted before navigation even starts

**Accessibility tree approach:**

- Browser generates semantic structure (headings, buttons, links, form fields)
- ~2-5K tokens per page
- Model sees a clean representation: "Header > Navigation > Button 'Pricing'"

**Mario's equivalent:** Using `grep`/`find` via bash instead of reading full files. Progressive disclosure.

### 3. Make Observability Non-Negotiable

**What the user should see:**

1. **Page analysis:** "Analyzing page structure... found navigation menu with 5 items"
2. **Decision reasoning:** "I see 'Pricing' in header (prominent) and footer (secondary). Clicking header link."
3. **Action confirmation:** "Clicked 'Enterprise Pricing' in navigation menu"
4. **Uncertainty acknowledgment:** "I found a Contact form but no direct pricing page. Should I fill out the form or keep searching?"
**What the user should NOT see:**

1. "Navigating..." (black box, no explanation)
2. Silent errors (agent clicked the wrong element, user doesn't know)
3. Hidden retries (agent tried 5 links before finding the right one, user unaware)

**Mario's principle:** Full visibility into agent decisions. "If I can't see what it's doing, I can't correct it."

### 4. Enable User Context Control

**pi's approach:**

- Global `AGENTS.md`: rules that apply to all sessions
- Project-specific `AGENTS.md`: rules for the current codebase
- User controls what gets injected into the system prompt

**Voice AI equivalent:**

**Global navigation preferences** (`~/.demogod/NAVIGATION_PREFERENCES.md`):

```markdown
## Site-Specific Rules
- Amazon: Always use top search bar, not category navigation
- LinkedIn: Skip "Sign in to view" prompts, notify me instead
- Documentation sites: Prefer search over navigation menus

## General Preferences
- When multiple "Pricing" links exist, prefer header over footer
- For forms: Ask before submitting, describe what will be submitted
- If stuck after 3 failed navigation attempts, describe what you see and ask for help
```

**Project-specific** (`.demogod/navigation-context.md`):

```markdown
## This Site's Structure
- Pricing page: /enterprise-pricing (not /pricing)
- Features split across /features-overview and /features-comparison
- Documentation: searchable via /docs/search, navigation menu incomplete

## Known Quirks
- "Get Started" button in hero → signup form (not product tour)
- Enterprise pricing requires contact form (no public pricing)
```

The agent reads these files and adapts its behavior. The user has full control and can edit them anytime.

### 5. Reject MCP for Navigation (Use It for External Integrations)

**Bad MCP use case:**

- MCP server for "advanced navigation" (21 tools, 13.7K tokens)
- Declares every possible DOM interaction upfront
- 90% of tools never used in a typical session

**Good MCP use case:**

- MCP server for Slack integration (send navigation results to a Slack channel)
- MCP server for calendar (schedule a demo based on website availability)
- External systems where progressive disclosure is impossible

**Mario's principle:** If a tool can be invoked via bash + README, don't use MCP. Reserve MCP for truly external integrations.

**Voice AI application:**

- Don't use MCP for navigation tools (bloat)
- Use MCP for: CRM integration (log navigation session to Salesforce), analytics (track conversion path in Mixpanel), notification (alert sales team when a user reaches the pricing page)

---

## The Contrarian Bet: Small Surface, Big Intelligence

Mario's `pi` makes a contrarian bet:

**Conventional wisdom:** More tools = better agent performance

**Mario's bet:** Minimal tools + frontier model intelligence = competitive performance + better observability + easier maintenance

**Terminal-Bench results:** Bet validated. `pi` holds its own against agents with 3-4x more tools.
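What this bet means in practice can be sketched in a few lines. The `Primitive`/`Action` types below are hypothetical, not from `pi` or any real Voice AI API; the point is that behaviors which would otherwise each need a dedicated tool fall out of composing two primitives:

```typescript
// Hypothetical sketch of primitive composition: these types are
// illustrative, not part of pi or any shipping Voice AI toolset.
type Primitive = "click" | "scroll" | "read" | "navigate";
interface Action { tool: Primitive; args: Record<string, string>; }

// "Read an infinitely scrolling page" needs no dedicated tool:
// it is just scroll + read, repeated.
function infiniteScrollPlan(batches: number): Action[] {
  const plan: Action[] = [];
  for (let i = 0; i < batches; i++) {
    plan.push({ tool: "scroll", args: { direction: "down" } });
    plan.push({ tool: "read", args: {} });
  }
  return plan;
}

// "Dismiss a modal" is likewise just click with the right target.
const dismissModal: Action = {
  tool: "click",
  args: { selector: ".modal [aria-label='Close']" },
};

console.log(infiniteScrollPlan(3).length); // 6 actions built from 2 primitives
```

The catalog of 20 specialized tools shown earlier collapses into plans like these, with the model (not the tool schema) deciding how to combine the four primitives.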
**Voice AI should make the same bet:**

**Conventional wisdom:** Voice AI needs an extensive tool catalog to handle navigation diversity (e-commerce vs SaaS vs documentation vs forums all have different patterns)

**Contrarian bet:** 4 navigation primitives + Opus 4.5 intelligence = handles navigation diversity through model reasoning, not tool specialization

**Evidence:**

- Coding agents handle diverse codebases (Python/JavaScript/Rust/Go) with the same 4 tools
- Models trained on web scraping tutorials already understand navigation patterns
- The accessibility tree provides clean semantic structure (no need to teach HTML parsing)

**The payoff:**

- 20x lower context overhead (1.5K vs 30K tokens baseline)
- Full observability (no sub-agent black boxes)
- Stable interface (4 primitives don't change, model intelligence improves)
- Easier debugging (fewer tools = smaller failure surface)

---

## Final Thought: The Minimalist's Edge

Mario's closing reflection:

> "I'm pretty happy with where pi is. There are a few more features I'd like to add, but I don't think there's much more I'll personally need."

After building a competitive coding agent with 4 tools and a <1,000 token system prompt, Mario discovered the minimalist's edge: **constrained systems force clarity.** When you can't add another tool to patch a problem, you must:

1. Improve the existing tools (make them more composable)
2. Improve the system prompt (make instructions clearer)
3. Trust the model (let intelligence fill gaps)

**Voice AI navigation faces the same choice:**

**Path 1 (Maximal):** Add navigation tools for every edge case. Accordion expansion? New tool. Tab switching? New tool. Modal dismissal? New tool. Infinite scroll? New tool.

**Result:** 40+ tools, 15K token overhead, brittle system (each edge case needs a dedicated tool), black box failures (can't debug 40 tool interactions).

**Path 2 (Minimal):** Ship 4 primitives. Trust Opus 4.5 to compose them. Accordion expansion? `click(selector)`. Tab switching?
`click(tab_selector)`. Modal dismissal? `click(close_button)`. Infinite scroll? A `scroll(down)` + `read()` loop.

**Result:** 4 tools, 1.5K token overhead, resilient system (the model adapts to new patterns), transparent failures (4 tool interactions are easy to debug).

Mario chose Path 2. Terminal-Bench validated the choice.

Voice AI should choose Path 2 too. Because when MCP servers burn 13-18K tokens declaring tools you'll never use, the minimalist who ships 4 primitives and trusts the model isn't making a compromise. They're making a bet on intelligence over enumeration.

And Mario's benchmark results prove: it's a bet worth making.

---

*Keywords: minimal coding agent architecture, Voice AI navigation tools, MCP context overhead, progressive disclosure agents, tool composability vs specialization, pi coding agent benchmark, Terminal-Bench results, 4-tool navigation primitives, accessibility tree Voice AI, observable agent design*

*Word count: ~4,200 | Source: mariozechner.at/posts/2025-11-30-pi-coding-agent | HN: 146 points, 48 comments*