daita@system:~$ cat ./building_ai_coding_agents_for_the_terminal.md

Building AI Coding Agents for the Terminal: What OpenDev Learned the Hard Way

Created: 2026-05-04 | Size: 18503 bytes

TL;DR

A March 2026 paper from OpenDev documents the engineering of a terminal-native AI coding agent. Central finding: context is the binding constraint. Tool outputs consume 70-80% of the window in a typical session. OpenDev's response is adaptive compaction (54% peak reduction), schema-level safety that removes capabilities rather than forbidding them, and a multi-model architecture that routes each role to a different LLM. The deeper read is that an agent runtime is starting to look less like a chat wrapper and more like an operating system.

Why the Terminal?

The paper argues that CLI-based agents are fundamentally different from IDE plugins. In the terminal, agents operate where developers actually manage source control, run builds, and deploy environments. That proximity gives them unprecedented autonomy for long-horizon tasks, but it also means they can do real damage.

OpenDev's answer is a compound AI system organized into five layers: Entry & UI, Agent Core, Tool & Context, Persistence, and Safety. The interesting design decisions live in three areas: how they handle context, how they enforce safety, and how they absorb LLM imprecision.

Context is the Central Budget

This is the paper's most important insight, and it echoes what we've seen across the ecosystem: context engineering is what separates agents that work from agents that don't.

Tool outputs consume 70-80% of context in typical sessions. That's not minor overhead; it's the dominant cost. OpenDev attacks this from multiple angles:

  • Prompt composition: A priority-ordered registry of sections, filtered by runtime conditions (provider type, mode, etc.). Provider-level prompt caching splits the system prompt into a stable prefix and dynamic suffix.
  • Tool output summarization: Per-tool-type compression. File reads get line-count annotations, search results are truncated to top hits, command outputs are tail-trimmed. Large outputs are offloaded to disk with only a summary injected into context.
  • Adaptive context compaction: Five graduated reduction stages keyed to context pressure: warning at 70%, observation masking at 80%, fast pruning at 85%, aggressive masking at 90%, and full LLM-based summarization at 99%. The cheap stages reclaim space before the expensive ones fire. Reported result: ~54% reduction in peak context consumption, with sessions extending from 15-20 turns to 30-40 turns before emergency compaction.
  • Dual memory: Episodic memory (session-specific) and working memory (persistent project knowledge) with a reflection pipeline that updates a project playbook with lessons learned.

The compaction approach matters. Rather than a binary "keep or drop" decision, observations degrade gracefully, which the authors report often eliminated emergency summarization entirely.
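
To make the graduated stages concrete, here is a minimal sketch keyed to the thresholds the paper reports (70/80/85/90/99%). The message shape and stage functions are illustrative stand-ins, not OpenDev's implementation:

    from typing import Callable, List, Tuple

    Message = dict  # e.g. {"role": "tool", "content": "...", "tokens": 1234}

    def warn(history: List[Message]) -> List[Message]:
        return history  # stage 0: emit a warning only, nothing is dropped yet

    def mask_observations(history: List[Message]) -> List[Message]:
        # Replace the bodies of older tool outputs with short placeholders,
        # keeping the most recent five intact.
        masked = []
        for i, msg in enumerate(history):
            if msg["role"] == "tool" and i < len(history) - 5:
                msg = {**msg, "content": "[observation masked]", "tokens": 8}
            masked.append(msg)
        return masked

    # Ordered cheapest-first; the later entries stand in for fast pruning,
    # aggressive masking, and full LLM-based summarization.
    STAGES: List[Tuple[float, Callable[[List[Message]], List[Message]]]] = [
        (0.70, warn),
        (0.80, mask_observations),
        (0.85, mask_observations),   # placeholder for fast pruning
        (0.90, mask_observations),   # placeholder for aggressive masking
        (0.99, mask_observations),   # placeholder for LLM-based summarization
    ]

    def compact(history: List[Message], window: int) -> List[Message]:
        for threshold, stage in STAGES:
            used = sum(m["tokens"] for m in history)
            if used / window < threshold:
                break                # pressure is below this stage; stop escalating
            history = stage(history)
        return history

Because the loop stops at the first threshold the session has not yet reached, the expensive final stage only fires when everything cheaper has already run.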

Attention Decay Is Real

After 15-20 tool calls, agents reliably stop following their initial instructions. The rules are still in the context window; attention has shifted to recent messages. This aligns with research showing LLMs lose track of context in extended conversations.

OpenDev's solution: a catalog of event-driven system reminders injected as role: user messages at maximum recency. The paper notes that user-role reminders consistently outperform system-role reminders for compliance. When the agent reads files consecutively, a reminder nudges it toward search tools. When it repeats errors, a reminder suggests alternative approaches. Each reminder type is rate-capped to prevent flooding.
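
A sketch of how an event-driven, rate-capped reminder catalog might look; the event names, reminder text, and the cap of three per type are assumptions for illustration:

    from collections import Counter

    REMINDERS = {
        "consecutive_file_reads": "Consider a search tool instead of reading files one by one.",
        "repeated_error": "The last command failed the same way twice. Try a different approach.",
    }
    MAX_PER_TYPE = 3  # rate cap so reminders never flood the window

    class ReminderInjector:
        def __init__(self) -> None:
            self.fired = Counter()

        def maybe_inject(self, event: str, history: list) -> None:
            if event in REMINDERS and self.fired[event] < MAX_PER_TYPE:
                self.fired[event] += 1
                # user-role placement puts the constraint at maximum recency,
                # where attention is strongest
                history.append({"role": "user", "content": REMINDERS[event]})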

The mechanism works because attention degrades unevenly across positions. Re-anchoring constraints near the cursor of attention keeps them alive when the system prompt no longer does.

Safety Through Architecture, Not Instructions

This is where OpenDev makes its strongest design argument. Their planning agent uses a subagent compiled with a restricted tool registry: write tools are simply absent from the schema. The model cannot reason about capabilities it cannot see.

Compare this to the alternative: giving the planning agent all tools but instructing it not to use the write ones. As the paper notes, schema-level enforcement is fundamentally more robust than behavioral instructions. An LLM can argue around a permission check. It cannot call a tool that doesn't exist in its schema.
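
In code, the idea is almost anticlimactic. A sketch with hypothetical tool names and schemas:

    ALL_TOOLS = {
        "read_file":  {"description": "Read a file from disk", "parameters": {"path": "string"}},
        "search":     {"description": "Search the codebase",   "parameters": {"query": "string"}},
        "write_file": {"description": "Write a file to disk",  "parameters": {"path": "string", "content": "string"}},
        "run_shell":  {"description": "Run a shell command",   "parameters": {"command": "string"}},
    }

    PLANNING_ALLOWED = {"read_file", "search"}

    def compile_registry(allowed: set[str]) -> dict:
        # The planner's schema never mentions write_file or run_shell; they are
        # not forbidden, they simply do not exist from the model's point of view.
        return {name: spec for name, spec in ALL_TOOLS.items() if name in allowed}

    planning_schema = compile_registry(PLANNING_ALLOWED)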

This connects to a broader pattern in multi-agent delegation: the more you can encode constraints into architecture rather than prompts, the more reliable your system becomes.

Where the Schema Boundary Leaks

The schema is a hard boundary only if the attacker cannot edit it. Tool descriptions and parameter schemas ingested from external MCP servers are themselves inputs to the model. A compromised MCP server can ship a tool description that reads as a benign read-only operation but executes side effects, or smuggle instructions into the description text that the planning agent absorbs as part of its context.

OpenDev's restricted tool registry holds against an LLM that wants to do the wrong thing. It does not hold against a supply-chain attack on the registry itself. Treat MCP server provenance with the same rigor as any other dependency: pinned versions, signed manifests, per-project allowlists, and audit logs of which tools were loaded into which session.
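
One way to make that rigor concrete, assuming a simple pinned-hash allowlist; the server name, digest, and file layout are illustrative, not a prescribed format:

    import hashlib
    import json
    import logging

    # Per-project allowlist: server name -> sha256 of the exact tool manifest
    # that was reviewed and approved (digest shown here is a placeholder).
    ALLOWLIST = {
        "filesystem-mcp": "3f5a...",
    }

    def load_mcp_tools(server: str, manifest_json: str) -> list[dict]:
        digest = hashlib.sha256(manifest_json.encode("utf-8")).hexdigest()
        if ALLOWLIST.get(server) != digest:
            raise PermissionError(f"MCP server {server!r} is not pinned to this manifest")
        # Audit trail: which tools were loaded into which session, and from where.
        logging.info("loaded MCP tools from %s (sha256=%s)", server, digest)
        return json.loads(manifest_json)["tools"]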

The Extended ReAct Loop

OpenDev extends the classic ReAct pattern with four phases:

  Phase | Function           | Key Mechanism
  ------+--------------------+-------------------------------------------------
  0     | Context management | Adaptive compaction at 70% threshold
  1     | Thinking           | Explicit reasoning before action
  2     | Action             | Tool execution
  3     | Decision           | Doom-loop detection via tool-call fingerprinting

The doom-loop detector fingerprints tool calls across a recency window. When the same fingerprint reappears past a threshold, a warning is injected and execution is skipped. A second occurrence escalates to an approval-based pause.
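
A minimal sketch of such a detector; the window size, threshold, and escalation policy are assumptions rather than OpenDev's tuned values:

    import hashlib
    import json
    from collections import deque

    class DoomLoopDetector:
        def __init__(self, window: int = 10, threshold: int = 3) -> None:
            self.recent = deque(maxlen=window)   # fingerprints of recent tool calls
            self.threshold = threshold
            self.warned: set[str] = set()

        def fingerprint(self, tool: str, args: dict) -> str:
            canonical = json.dumps([tool, args], sort_keys=True)
            return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

        def check(self, tool: str, args: dict) -> str:
            fp = self.fingerprint(tool, args)
            repeats = sum(1 for f in self.recent if f == fp)
            self.recent.append(fp)
            if repeats + 1 < self.threshold:
                return "allow"
            if fp not in self.warned:
                self.warned.add(fp)
                return "warn_and_skip"       # inject a warning, skip execution
            return "pause_for_approval"      # repeat occurrence: hand control back to the user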

This is the kind of defensive mechanism you only build after watching an agent burn through API credits calling the same failing command in a loop.

Approval Fatigue

Approval persistence addresses a UX failure mode that defeats permission systems. Constant prompts wear users down until they approve reflexively, which is worse than no permission system at all. OpenDev persists approval decisions per operation type across the session, so the user makes each kind of permission call once and the system remembers.
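
A sketch of what that persistence amounts to, with the operation-type names and prompt wiring as placeholders:

    class ApprovalCache:
        def __init__(self, ask_user) -> None:
            self.ask_user = ask_user              # callable that actually prompts the user
            self.decisions: dict[str, bool] = {}  # operation type -> remembered answer

        def approve(self, op_type: str, detail: str) -> bool:
            if op_type not in self.decisions:     # ask once per operation type, then remember
                self.decisions[op_type] = self.ask_user(f"Allow {op_type}? e.g. {detail}")
            return self.decisions[op_type]

    # cache = ApprovalCache(ask_user=lambda q: input(q + " [y/N] ").strip().lower() == "y")
    # cache.approve("shell.write", "rm -rf build/")   # prompts once per session, not per call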

Multi-Model Architecture

OpenDev defines five model roles, each independently bound to a user-configured LLM:

  • Action: Primary execution model with tool access
  • Thinking: Extended reasoning without tool distraction
  • Compact: Fast summarization during context compression
  • Critique: Self-evaluation, Reflexion-inspired
  • VLM (Vision): Screenshot and image processing

The key insight is that not every agent action requires your most capable (and expensive) model. Context summarization doesn't need frontier reasoning. Quick file lookups don't need chain-of-thought. A four-level binding hierarchy (session, agent, workflow, LLM) enables per-task cost/latency optimization.
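
One plausible shape for that hierarchy, sketched below; the resolution order, role names, and model IDs are placeholders rather than OpenDev's configuration:

    # Default per-role bindings at the LLM level; more specific levels can override.
    DEFAULTS = {
        "action":   "frontier-model",
        "thinking": "frontier-model",
        "compact":  "small-fast-model",
        "critique": "mid-tier-model",
        "vlm":      "vision-model",
    }

    def resolve_model(role: str, session_cfg: dict, agent_cfg: dict, workflow_cfg: dict) -> str:
        # Most specific binding wins; fall back level by level to the defaults.
        for cfg in (session_cfg, agent_cfg, workflow_cfg):
            if role in cfg:
                return cfg[role]
        return DEFAULTS[role]

    # resolve_model("compact", {}, {"compact": "tiny-summarizer"}, {})  -> "tiny-summarizer"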

What This Costs

The five-role split is also a unit-economics decision. The Compact role runs dozens of times in a long session and never needs frontier reasoning, so a Haiku-class model produces serviceable summaries at roughly an order of magnitude less per token. Vision and critique follow similar logic. The Action and Thinking roles, where reasoning quality binds, stay on the most capable model.

Routing the cheap roles to a cheap model is what makes long-horizon agent sessions financially viable. A naive single-model architecture that ran every step on a frontier model would burn 5-10x the budget for marginal quality gain on tasks the cheap model handles well. The hierarchy is the cost story, not just an engineering taxonomy.
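
A back-of-the-envelope version of that claim, with prices and token counts invented purely to show the shape of the arithmetic:

    FRONTIER_PRICE = 15.00 / 1_000_000   # $ per output token (assumed)
    CHEAP_PRICE    =  1.25 / 1_000_000   # $ per output token (assumed)

    compaction_tokens = 40 * 2_000       # ~40 summarization calls of ~2k tokens each
    print("compaction on frontier model: $%.2f" % (compaction_tokens * FRONTIER_PRICE))
    print("compaction on cheap model:    $%.2f" % (compaction_tokens * CHEAP_PRICE))
    # ~$1.20 vs ~$0.10 per session, for a role where summary quality barely differs.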

The corollary for buyers: when a vendor charges a flat rate per session, ask which roles run on which model. Margins fall apart on any deployment that does not route.

Agents as Operating Systems

Step back from the individual mechanisms and a pattern emerges. Schema-level tool filtering is capability-based security, the same idea behind seccomp-bpf, Capsicum, and Plan 9's per-process namespaces. Doom-loop detection is livelock detection, lifted from kernel scheduling. Shadow git snapshots are copy-on-write filesystems (ZFS, btrfs) applied to agent sessions. Adaptive context compaction is page eviction under memory pressure. System reminders against attention decay are cache reloads when the working set drifts.

OpenDev did not set out to clone an OS. The OS abstractions emerged because the same engineering pressures that shaped Unix shape any system that orchestrates untrusted, long-running, resource-constrained processes against shared state. An agent that takes irreversible actions on a developer's machine sits closer to init than to a chatbot.

The framing tells you where the next pressure points appear. Operating systems eventually grew memory protection, process priorities, syscall auditing, and quota systems. Agent runtimes will too. Read OS design papers, not just LLM ones.

Shadow Git Snapshots: Undo for Everything

One of OpenDev's quieter but most practical design choices lives in the persistence layer. Every file change, including side effects from shell commands, gets a shadow git snapshot. This gives the agent per-step undo across the entire session.

Sessions are written atomically via temp-file-then-rename to prevent corruption from crashes, with auto-save after each agent turn. A self-healing session index handles fast listing. But it's the shadow snapshots that matter most: when an agent runs sed -i 's/foo/bar/g' *.py and breaks something, you can roll back to the exact state before that command ran. Not just the files the agent explicitly edited, but everything the shell touched.
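
The crash-safe save is the easier half to show. A minimal sketch of the temp-file-then-rename pattern, assuming a JSON session file; the atomicity comes from os.replace, which swaps the file in a single step on the same filesystem:

    import json
    import os
    import tempfile

    def save_session(path: str, session: dict) -> None:
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(session, f)
                f.flush()
                os.fsync(f.fileno())   # make sure the bytes hit disk before the rename
            os.replace(tmp, path)      # atomic swap: readers see old or new, never a partial write
        except BaseException:
            os.unlink(tmp)
            raise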

LSP: Structural Code Understanding

Most coding agents treat code as text. OpenDev integrates the Language Server Protocol for symbol operations across 30+ languages: find references, rename symbols, replace across files. This gives the agent structural understanding of code rather than relying purely on regex and grep.

The difference matters. When an agent needs to rename a function, text search finds string matches. LSP finds actual symbol references, distinguishing between a function call, a variable with the same name, and a comment mentioning the function. For large codebases, this is the difference between a clean refactor and a broken one.
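
For a sense of what structural querying means mechanically, here is a sketch of the request an LSP client sends in place of a grep: a textDocument/references call framed for a language server over stdio. The file URI and position are placeholders, and the initialize handshake a real client performs first is omitted:

    import json

    def frame(msg: dict) -> bytes:
        # LSP messages are JSON-RPC bodies preceded by a Content-Length header.
        body = json.dumps(msg).encode("utf-8")
        return b"Content-Length: %d\r\n\r\n" % len(body) + body

    references_request = frame({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "textDocument/references",
        "params": {
            "textDocument": {"uri": "file:///project/src/app.py"},
            "position": {"line": 41, "character": 8},      # 0-based location of the symbol
            "context": {"includeDeclaration": True},
        },
    })

The server answers with symbol-resolved locations rather than string matches, which is why a rename driven by this query skips comments and same-named locals.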

Designing Tools for Approximate Outputs

LLMs produce approximately correct outputs. OpenDev's tools are designed to absorb that imprecision rather than fail on it:

  • Edit tool: A 9-pass fuzzy matching chain to handle imprecise edit targets. When the model gives you a slightly wrong string to match against, the tool tries progressively looser matching strategies before giving up (see the sketch after this list).
  • Shell commands: Auto-detection of server-like processes via regex, promoting them to background execution. If the model runs npm start without backgrounding it, the system handles it.
  • Dependencies: Auto-installation on first use. Don't make the agent figure out that it needs to pip install something first.
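
A sketch of a progressively looser matching chain; the three passes shown (exact, whitespace-normalized, case-insensitive) are illustrative stand-ins, not the nine passes the paper describes:

    import re
    from typing import Callable, List, Optional

    def exact(haystack: str, needle: str) -> Optional[int]:
        i = haystack.find(needle)
        return i if i >= 0 else None

    def whitespace_normalized(haystack: str, needle: str) -> Optional[int]:
        # Rejoin the target with \s+ so indentation and line-wrap differences don't matter.
        pattern = r"\s+".join(re.escape(p) for p in needle.split())
        m = re.search(pattern, haystack)
        return m.start() if m else None

    def case_insensitive(haystack: str, needle: str) -> Optional[int]:
        i = haystack.lower().find(needle.lower())
        return i if i >= 0 else None

    PASSES: List[Callable[[str, str], Optional[int]]] = [exact, whitespace_normalized, case_insensitive]

    def locate_edit_target(file_text: str, target: str) -> Optional[int]:
        # Strictest pass first; fall back only when the model's target string
        # does not match the file verbatim.
        for attempt in PASSES:
            pos = attempt(file_text, target)
            if pos is not None:
                return pos
        return None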

A related lesson: deterministic operations should bypass the LLM entirely. Session management, mode switching, configuration lookups - these have known, correct answers. Routing them through the model wastes tokens and introduces unnecessary failure modes. If the answer is deterministic, write a function, not a prompt.

This design philosophy is worth internalizing: demanding exact correctness from the model wastes cycles in error-recovery loops. Build tolerance into the tooling where the model must be involved, and remove the model entirely where it doesn't need to be.

Lazy Discovery: 40% to Under 5%

Eager loading of MCP tool schemas consumed 40% of context before the first user message. That's nearly half your context budget gone before the user even types anything.

Lazy discovery, where tools are loaded only when relevant, reduced this to under 5%. This mirrors how agent skills work best: you don't front-load everything, you surface capabilities on demand.
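
A sketch of what that looks like as a data structure: only names and one-line summaries enter the prompt up front, and full schemas are materialized when a tool is actually selected. The catalog shape and loader are placeholders:

    class LazyToolRegistry:
        def __init__(self, catalog: dict, load_schema) -> None:
            self.catalog = catalog          # name -> one-line summary, always in context
            self.load_schema = load_schema  # name -> full JSON schema, fetched on demand
            self._loaded: dict = {}

        def index_for_prompt(self) -> str:
            # All the model sees up front: a short index, not every schema.
            return "\n".join(f"- {name}: {summary}" for name, summary in self.catalog.items())

        def schema(self, name: str) -> dict:
            if name not in self._loaded:
                self._loaded[name] = self.load_schema(name)
            return self._loaded[name]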

What the Benchmarks Say

The paper surveys the current state honestly. On Terminal-Bench, frontier agents resolve fewer than 65% of CLI tasks. On LongCLI-Bench, pass rates drop below 20% for long-horizon tasks. This gap between benchmark scores and real-world performance remains the central challenge.

Meta Context Engineering (MCE), a bi-level framework where a meta-agent refines skills that a base-agent executes, tells a complementary story on a different axis. Across five benchmarks, MCE delivers an 89.1% average relative improvement over the base model, versus 70.7% for the prior state-of-the-art context-engineering method. On the FiNER training task, MCE reaches the same accuracy in 1.9 hours where the prior method takes 25.8 hours, a 13.6x speedup, and converges at 95% accuracy after 450 rollouts where the prior method needs 2,169. Different benchmark, different metric, but the message rhymes: better context management improves both quality and the cost of getting there.

Five Lessons for Agent Builders

The paper distills five cross-cutting design tensions:

  1. Context pressure is the central constraint. Not model capability, not tool design, context. Optimize for it relentlessly.
  2. Steering decays over long horizons. System prompts lose influence after 15-20 tool calls. Build active countermeasures.
  3. Safety through architecture, not instructions. Remove capabilities from the schema rather than asking the model not to use them.
  4. Design for approximation. Your tools must absorb LLM imprecision. Fuzzy matching, auto-detection, auto-installation.
  5. Lazy loading is essential at scale. Front-loading tool schemas, skills, and metadata burns context before work begins.

These principles emerged from running a coding agent in production, not from a whiteboard.

Convergent Evolution, With a Caveat

OpenDev's architecture maps closely onto Claude Code despite being independently developed. Both use system reminders against attention decay. Both enforce safety through schema-level filtering. Both implement context compaction with graceful degradation. Both detect and break doom loops. Both lazy-load tools.

The convergence is real but not unconditional evidence of correctness. Independent teams arriving at the same patterns can mean the patterns are load-bearing, or it can mean every team is fine-tuning the same base models on the same public RLHF data and inheriting the same failure modes. Probably some of both. Treat the parity as strong prior, not proof.

The next pattern visible at the horizon is per-step model routing as mixture-of-experts at the agent layer. Today's five fixed roles become tomorrow's auction, where each step picks the model whose latency-cost-quality envelope fits the task. Routing decisions become a first-class artifact, logged and tunable, not a hidden hyperparameter.

Teams working with Claude Code, Symphony, and similar tools are discovering the same lessons. The question is no longer whether these patterns matter, but how fast they become baseline expectations.

If You're Shipping an Agent Next Week

Three concrete moves drawn from the paper and from independent practice:

  1. Measure your context budget before adding features. Instrument what fraction of your window is system prompt, tool schemas, conversation history, and tool output. The fix that buys you the most headroom is almost always lazy-loading the largest source.
  2. Remove tools, do not forbid them. When a planning step should not write, give it a subagent with a registry that lacks write tools. Behavioral instructions degrade across long sessions; schema does not.
  3. Pin model versions and route by role. A single frontier model behind every step is the wrong default. Start with two roles (primary plus compaction), pin specific model IDs, and instrument cost per session. Add roles when cost or latency forces the issue.

These three move an agent from "demo on Twitter" to "service that someone pays for and depends on."
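
As a starting point for the first move, a minimal sketch of the instrumentation, using a crude characters-per-token heuristic you would swap for your provider's real token counter:

    def rough_tokens(text: str) -> int:
        return max(1, len(text) // 4)   # crude heuristic; replace with a real tokenizer

    def context_breakdown(system_prompt: str, tool_schemas: str, history: list[dict]) -> dict:
        buckets = {
            "system_prompt": rough_tokens(system_prompt),
            "tool_schemas":  rough_tokens(tool_schemas),
            "conversation":  sum(rough_tokens(m["content"]) for m in history if m["role"] != "tool"),
            "tool_output":   sum(rough_tokens(m["content"]) for m in history if m["role"] == "tool"),
        }
        total = sum(buckets.values())
        return {k: round(100 * v / total, 1) for k, v in buckets.items()}

    # If tool_output lands around 70-80%, as the paper reports, that is where
    # lazy loading and summarization pay off first.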


References

  1. Building Effective AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned - Original paper by Nghi D. Q. Bui (OpenDev)
  2. Anthropic: Effective Context Engineering for AI Agents - Context engineering practices
  3. Meta Context Engineering via Agentic Skill Evolution - bi-level meta/base agent framework
  4. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces - CLI agent benchmarks
  5. LongCLI-Bench: Long-horizon Agentic Programming in CLIs - Long-horizon CLI benchmark
  6. Get Shit Done: The Context Engineering Layer That Makes Claude Code Actually Reliable - Daita blog
  7. Your LLM Forgets What You Said Two Messages Ago - Daita blog
  8. Agent Skills: The Paradigm Shift Hiding in Plain Text - Daita blog
  9. Intelligent AI Delegation: Why Multi-Agent Systems Need More Than Heuristics - Daita blog
  10. Your LLM Scores 88% on Code Benchmarks. In Production, It Hits 30%. - Daita blog
  11. Symphony: OpenAI Ships a Spec, Not a Library - Daita blog

daita@system:~$ _