Agent Harness Engineering: A Survey — ETCLOVG Taxonomy¶
Ch01.277 Agent Harness Engineering: A Survey — ETCLOVG Taxonomy¶
📊 Level ⭐⭐ | 13.3KB |
entities/agent-harness-engineering-survey-etcvlovg-taxonomy.md
Overview¶
Academic survey (2026, preprint) proposing agent harness engineering as an independent system layer, not merely a wrapper around a model. Authors from 9 institutions (CMU, Yale, Johns Hopkins, etc.) with Amazon affiliation. ^[agent-harness-engineering-survey-2026.md] Core claim: Real-world agent reliability depends more on the infrastructure harness than on the underlying model. The survey names this discipline, proposes a taxonomy, and maps 138 open-source projects onto it. ^["Agent Harness Engineering Survey 2026"]
ETCLOVG Taxonomy¶
The survey organizes the harness into 7 independent layers: ^["Agent Harness Engineering Survey 2026"] | Layer | Name | Scope | Primary Projects | |-------|------|-------|-----------------| | E | Execution environment | Sandboxes, microVMs, code runtimes, browser sandboxes, OS permissions | 20 | | T | Tool interface & protocol | MCP, A2A, tool description, discovery, invocation | 12 | | C | Context & memory | Short/session/persistent horizon, context drift mitigation | 9 | | L | Lifecycle & orchestration | Single-agent loop → multi-agent pipelines, issue-to-PR | 47 | | O | Observability & operations | Traces, costs, failures, agent-specific ops tools | 15 | | V | Verification & evaluation | Benchmarks, failure attribution, regression feedback | 21 | | G | Governance & security | Permission models, lifecycle hooks, audit infra | 14 | Key architectural insight: E, T, C, L form the structural pillars; O/V/G form the control plane. Compared to earlier 6-component frameworks, Observability and Governance are elevated to independent layers because in production they have separate tooling stacks and separate team ownership. ^["Agent Harness Engineering Survey 2026"]
Three Engineering Phases¶
| Phase | Period | Focus |
|---|---|---|
| Prompt engineering | 2022–2024 | Input text optimization — instructions, few-shot, reasoning templates |
| Context engineering | 2025 | "What should the model see at each step?" — retrieval, compaction, tool-result ranking |
| Harness engineering | 2026– | Full infrastructure wrapper — execution, tools, context, lifecycle, observability, verification, governance |
Cross-Layer Synthesis (3 Recurring Patterns)¶
- Cost–quality–speed trilemma: Stronger sandboxes, richer context, deeper evaluation improve quality but cost tokens, latency, and infrastructure. ^["Agent Harness Engineering Survey 2026"]
- Capability–control tradeoff: Larger tool menus, persistent memory, permissive sandboxes broaden coverage but enlarge blast radius. ^["Agent Harness Engineering Survey 2026"]
- Harness coupling problem: A component beneficial in isolation can degrade the whole system when combined — harness changes must be tested as system changes. ^["Agent Harness Engineering Survey 2026"] Also: Shift from agent frameworks (local abstractions) → agent platforms (durable workspaces, identity, observability, evaluation, governance, human handoff across many runs/users). ^["Agent Harness Engineering Survey 2026"]
Ecosystem Findings¶
- Densest coverage: E, T, L, V — coding/web/terminal/computer-use agents need runnable environments, tool contracts, control loops, repeatable evaluation.
- Thin coverage: C (context/memory embedded inside larger frameworks, rarely standalone) and O/G (more in commercial platforms than open source).
- Implication: Operational control has matured later than runtime and benchmark infrastructure.
5 Open Problems¶
- Hardening execution environments — security evals for prompt injection, goal misalignment; cost models for containers vs microVMs vs full desktop VMs ^["Agent Harness Engineering Survey 2026"]
- Reliable state in long-running agents — recast context management as state estimation with provenance, contradiction handling, staleness markers ^["Agent Harness Engineering Survey 2026"]
- Trace-native failure diagnosis — traces as primary object for outcome scores, trajectory quality, failure attribution, regression tests ^["Agent Harness Engineering Survey 2026"]
- Standard handoffs — not just text summary but intent, constraints, permissions, artifacts, provenance, budget state, risk level, trace history ^["Agent Harness Engineering Survey 2026"]
- Adaptive simplification — mechanisms to ablate/optimize/simplify harnesses as models improve ^["Agent Harness Engineering Survey 2026"]
Unique Contributions (vs Existing Wiki Harness Docs)¶
- First systematic taxonomy (ETCLOVG) with formal layer definitions
- 138 open-source projects mapped to taxonomy with primary/secondary layer coding
- Three-phase evolution narrative (prompt → context → harness engineering) providing historical framing
- Cross-layer synthesis patterns (trilemma, tradeoff, coupling) not found in other harness docs
Links¶
See Also¶
- — general harness architecture patterns
- — academic papers on harness evolution
- — long-running agent engineering
- Agent Harness Engineering Survey 2026 — raw source
深度分析¶
论文定位与领域意义¶
本文是首篇将 agent harness engineering 确立为独立系统学科的学术综述。区别于此前将 harness 视为 "model wrapper" 的朴素观点,论文通过 ETCLOVG 七层 taxonomy 证明了:生产级 agent 可靠性由基础设施层决定,而非底层模型能力决定。 ^["Agent Harness Engineering Survey 2026"]
ETCLOVG 的结构哲学¶
七层中 E-T-C-L 为结构支柱,O-V-G 为控制平面,这一划分有重要的工程含义: ^["Agent Harness Engineering Survey 2026"]
- E (Execution) 和 T (Tool) 解决 "在哪跑、用什么能力" 的问题,是 agent 与外界交互的物理边界
- C (Context) 和 L (Lifecycle) 解决 "看见什么、怎么协作" 的问题,是 agent 的认知与行为编排中枢
- O (Observability) 提供运行时感知,V (Verification) 提供离线评估闭环,G (Governance) 提供安全合规约束——三者共同构成生产级别的 control plane 相较于此前 AgentGym、WebArena 等单点评估框架,ETCLOVG 的核心洞察是:O/V/G 在生产环境中拥有独立技术栈和独立团队归属,这使得将它们降级为 L 的子功能是不合理的。 ]] ^["Agent Harness Engineering Survey 2026"]
三阶段演化逻辑的深层含义¶
| 阶段 | 核心问题 | 工程杠杆 |
|---|---|---|
| Prompt engineering (2022–2024) | 如何写好输入文本? | 指令模板、few-shot、CoT |
| Context engineering (2025) | 每一步给模型看什么? | 检索、压缩、工具结果排序 |
| Harness engineering (2026–) | 如何设计完整的基础设施 wrapper? | E/T/C/L/O/V/G 全栈 |
| 每一阶段的跃迁都伴随着"问题域"的重新定义——不是前一层被抛弃,而是前一层变成充分条件中的必要不充分条件。 ^["Agent Harness Engineering Survey 2026"] |
跨层合成的三个悖论¶
Cost–quality–speed trilemma:这一 triad 是所有生产系统面临的根本性约束。不同于传统软件系统可以在三者间找平衡点,LLM agent 的特性使得每一项的提升都可能非线性地影响另外两项(例如更严格的 sandbox 会同时降低 speed 并提升 cost)。 ^["Agent Harness Engineering Survey 2026"] Capability–control tradeoff:这实际上将安全和功能统一在同一个设计轴上——工具越多、能力越强,blast radius 越大。 ^["Agent Harness Engineering Survey 2026"] Harness coupling problem:Local optimization 导致 global degradation,这在 agent 系统中尤为突出,因为 agent 的各层之间通过 context 和 state 存在隐式耦合。 ^["Agent Harness Engineering Survey 2026"]
生态映射的深层洞察¶
论文对 138 个开源项目的编码揭示了一个非均匀分布: ^["Agent Harness Engineering Survey 2026"]
- E/T/L/V 高密度 → 这些是 "冷启动" 必需品,做 coding/web/terminal agent 必须首先解决执行、工具、控制流和评测
- C 低密度 → Context/memory 多被嵌入框架内部而非独立输出,说明该层标准化程度低
- O/G 低密度 → 这两个层面在商业平台内部更成熟,开源社区相对滞后,反映出 O/G 需要更多生产级反馈才能成熟
实践启示¶
对 agent 开发者的建议¶
- 以 ETCLOVG 审视现有系统:将你的 agent 项目逐层对照 ETCLOVG,检查是否有明显缺失层——大多数项目在 E/L/V 较完整,但 C/O/G 常被忽视 ^["Agent Harness Engineering Survey 2026"]
- 优先补齐控制平面:如果你在构建生产级 agent,先投入 O(可观测性)和 G(治理)往往比持续优化 E/T/C/L 回报更高 ^["Agent Harness Engineering Survey 2026"]
- 跨层变化需要端到端测试:任何单层优化(更长的 context、更大的 tool pool、更严格的 sandbox)都可能因耦合效应而整体降级,必须作为系统变更测试 ^["Agent Harness Engineering Survey 2026"]
对平台/基础设施团队的启示¶
- O/V/G 应作为独立服务而非功能特性:Observability、Verification、Governance 各自有独立技术栈(tracing、成本分析、安全审计),不应被降级为某个 agent framework 的子模块 ^["Agent Harness Engineering Survey 2026"]
- 关注 C 层标准化机会:Context/memory 是当前最缺乏独立组件的层,存在构建专用 "context middleware" 的机会 ^["Agent Harness Engineering Survey 2026"]
- 自适应简化的工程路径:随着模型能力提升,harness 应能动态削减——例如当模型 self-correct 能力足够强时,可简化 V 层部分校验循环 ^["Agent Harness Engineering Survey 2026"]
对研究者的开放问题¶
论文提出的五个 open problems 中,最值得关注的是: ^["Agent Harness Engineering Survey 2026"]
- Trace-native failure diagnosis:将 traces 从调试材料升级为一等公民的评价对象,这需要新的数据结构设计和评价方法论
- Standard handoffs:跨 agent/tool/human 的 handover protocol 目前是空白,协议层面标准化将释放大量多智能体协作场景的潜力