
nvidia extreme co design agentic systems

Ch04.040 nvidia extreme co design agentic systems

📊 Level ⭐⭐ | 25.3KB | entities/nvidia-extreme-co-design-agentic-systems.md

Building for the Rising Complexity of Agentic Systems with Extreme Co-Design | NVIDIA Technical Blog

May 05, 2026 | By Eduardo Alvarez, Benjamin Klieger, and Graham Steele

AI-Generated Summary (AI-generated content may summarize information incompletely. Verify important information.)

Agentic AI architectures feature hierarchical agents and sub-agents that manage large, variable context windows, tool calls, and memory statefulness, causing structurally probabilistic token consumption patterns that challenge traditional serving economics. Real-world agentic sessions, such as those run by Claude Code, demonstrate token volumes scaling from tens of thousands to over 150,000 tokens per context window, necessitating advanced prompt caching, context compaction, and specialized hardware like NVIDIA CMX to maintain economic and latency efficiency. The NVIDIA Vera Rubin platform employs extreme co-design across multiple specialized chips (NVL72, Vera CPU, Groq 3 LPX, NVLink 6, ConnectX-9, BlueField-4, Spectrum-X) and software optimizations (Dynamo, NVFP4, TRT-LLM WideEP, Speculative Decoding) to overcome throughput-latency tradeoffs, enabling large-scale, low-latency inference on trillion-parameter MoE models with 400k token contexts at competitive costs.

Generative AI’s explosive first chapter was defined by humans sending requests and models responding. The agentic chapter is different. Agents don’t follow a pre-determined sequence of actions. They call tools, spawn sub-agents with different tasks and models, retain information in memory, manage their own context window, and decide for themselves when they’re finished.
In doing so, these systems push token consumption, context length, and latency requirements into extremely demanding regions: exactly the pressures now shaping the NVIDIA extreme co-design stack and the NVIDIA Vera Rubin platform. This post analyzes that evolution across three parts:

- How agents consume tokens
- Why their economics break under conventional serving
- What an infrastructure stack purpose-built for agents looks like

Transition to agents from chatbots

As shown in Figure 1, below, the popularization of generative AI began with a simple interaction model: one user message, one chatbot message, repeat. The model responds from memory in the context window, the chat history grows linearly, and demands on the system are predictable.

Figure 1. Three AI interaction patterns ranked by complexity: standard chatbot (linear); chat with tools (bounded, variable); and agentic (chained, high entropy)

The introduction of tool calling fundamentally shifts the way an AI chatbot operates. Once a model can call a calculator instead of guessing at math, the entire workload changes. Since tool responses are added directly to the context window, they introduce unpredictability to the input sequence. This happens because the size of a tool’s output depends on the specific query and the tool’s design, including how it handles relevant data. Even though the process is still bounded by a prompt and a final answer, the simple predictability of a standard chat is lost.

This dynamic becomes even more complex when we introduce agents. If a model has the power to call one tool, it also has the power to decide how many tools to use and in what order to use them.
For instance, an agent tasked with drafting an email might:

- Read existing correspondence
- Check drive for context
- Confirm a recipient’s identity
- Then draft the email

This chaining is where models become agents, and where the workload shifts from linearly predictable with probabilistic spikes to structurally probabilistic, such that the shape of one agent session can differ sharply from the next.

Characteristics of agentic architectures

The modern agentic architecture is composed of a mix of agent hierarchies and optimization techniques that enable effective context management, tool usage, and task optimization:

Figure 2. Simple flow diagram of a standard agent/sub-agent architecture

- Primary agent: Responsible for the delivery of the entire task end-to-end. May orchestrate sub-agents that tackle subtasks. Typically, the primary agent is powered by the smartest model and talks directly with the user.
- Sub-agents: Spawned by the primary agent to handle narrower tasks, with the ability to self-manage their context windows like the primary agent. Often, sub-agents are architecturally identical or very similar to primary agents, except that their task scope is narrower, set by the prompt the primary agent provides.
- File system statefulness: Additional statefulness derived from agents writing memory and tool call output to files and later searching or re-reading their contents. This serves as a method of context management and memory.
- Summarization and compaction: A technique where the context window of an agent is summarized and thereby compressed to make space for new information and reduce input processing costs.

Figure 3. Simplified qualitative graph illustrating input token growth per request across an agentic session

Some of the most popular agentic tools today follow similar architectures. Primary agents in tools like Claude Code frequently delegate work to sub-agents to exploit smaller context windows and parallelize tasks.
Because the system must process input tokens during every single inference step, using smaller contexts improves efficiency and lowers input token processing costs. This architecture provides a necessary defense against a phenomenon called context rot, where an expanding context inevitably degrades output quality. When tasks grow in complexity, deliberate compaction events force sharp drops in the context window of the main agent to compensate for the inability to scale tokens infinitely.

Workload dynamics and economics of agentic systems

In their report on building a multi-agent system, Anthropic estimated that these systems consumed up to 15x more tokens than standard chat. This significant increase requires improved unit economics for tokens in order for these applications to become profitable at scale. Addressing this inference economics challenge requires a deep understanding of the system-level token throughput and latency requirements that govern agentic economics.

The cost and complexity of these workloads are best understood through the analysis of a real agentic session. Figure 4 provides a measured example of a Claude Code coding task. The lines on the chart represent the input sequence length (context + ISL) at every request made during the session by sub-agents (orange) and the main agent (grey). Even in a single session, the trace makes clear why long-context capacity, cache programmability, and predictable per-token latency matter as much as raw model quality.

Figure 4. Context growth trace from a live Claude Code agentic coding session spanning 283 requests across a main agent and sub-agents over 33 minutes

This 33-minute session tracks 58 main-agent turns coordinating 225 sub-agent invocations. Across 283 inference requests, the context window grows from 15K tokens to a peak of 156K before a context compaction event reduces it to approximately 20K.
The trace makes it clear that agent token consumption is shaped as much by agentic system behavior as by the nature of the tasks. The primary agent accumulates input context quickly when it is not delegating or compacting context, which causes cache-read input token costs to recur every turn. Across the first 40 turns, the main agent averages roughly 85K tokens of context and accumulates around 3.5 million total processed input tokens, before adding another million in the part of the session that follows a compaction. These are exactly the conditions where high-bandwidth memory (HBM) and high-throughput platforms such as Vera Rubin NVL72 become relevant, because long-context prompts need to stay economically tractable while prefill demand continues to scale.

Prompt caching is what makes this pattern workable. Without KV cache re-use, every input token would need to be fully reprocessed. Popular API providers discount cache hits by approximately 90%, so at a 95% cache hit rate, input processing cost drops by about 85%; without prompt caching, the cost here would be roughly 6x higher. Coding agents commonly sustain 95-98% cache hit rates, especially when tool output stays small. That is why prompt caching is increasingly a systems problem rather than just an API feature: sustaining high cache hit rates depends on efficient CPU-side KV cache management and purpose-built high-capacity context storage, such as NVIDIA CMX, to preserve long prefixes and restore them quickly as sessions scale.

The 225 requests in the sub-agent traces are separate inference sessions, each with its own context and specific tool definitions. Sub-agents often increase total output token volume, but they lower input cost by starting from fresh context windows and carrying forward only what is relevant to the delegated task. They can also run on smaller models, which reduces latency and cost while still preserving accuracy for narrower tasks.

Context compaction is equally important.
It provides a mechanism to avoid hitting the context window limit, reduces the effects of context rot, and helps manage cost as a side effect. Reducing the context window from 156K tokens to 20K forces an immediate reduction in cached input token spend and creates room for the next set of tasks.

In Figure 5, below, it is qualitatively evident that most processed tokens are retrieved from cache. Once that happens, network and memory-system behavior start to affect user-perceived latency directly, and low-latency fabrics such as NVLink 6, ConnectX-9, BlueField-4, and Spectrum-X help keep shared context accessible and reduce recomputation penalties as sessions fan out across multiple agents.

Figure 5. A token caching breakdown trace from a live agentic Claude Code coding session, distinguishing cached from uncached input tokens across 283 requests over 33 minutes; same session as Figure 4

From this example, it becomes clear that agent token dynamics are quite complex and that token consumption can quickly scale across primary agents and sub-agents. To understand the challenges of scaling these applications under this growing token demand, we must consider the delivered performance requirements.

Performance requirements of agentic workloads

Unlocking the value of agentic workloads requires high model intelligence, large context, and low latency. The faster these agents produce insights, the more valuable they become. This speed shortens R&D cycles, improves harness control, and enables complex multi-agent loops. Because the tokens enabling these capabilities are inherently expensive to process, delivered performance is the critical lever for making these systems both scalable and profitable.

Driving down the cost of these tokens requires producers to sustain scale in the high-interactivity region for large models across large contexts. Figure 6, below, illustrates this bottleneck through a standard inference performance Pareto curve.
The left side of the curve offers high throughput but at the lower extremes of interactivity where agentic workloads cannot function.

Figure 6. A qualitative Pareto curve illustrating the throughput-interactivity tradeoff across batch, standard coding, and agentic application workloads on a per-GPU basis

These workloads must instead shift to the high-interactivity side of the curve (right) to operate successfully. Agentic systems consume massive token volumes while demanding fast generation speeds to maintain end-user interactivity. The problem is that achieving this low latency typically causes system throughput to drop dramatically. Diminished throughput leads to prohibitive per-token costs, making agentic systems economically challenging at scale.

Figure 7. A qualitative Pareto curve illustrating the cost-per-million-token versus interactivity tradeoff across batch, standard coding, and agentic application workloads

Breaking this bottleneck requires a complete shift in infrastructure design. Modern GPUs offer enormous compute and substantial bandwidth, but sustaining scale at low latency demands more than any single architecture can provide. The answer is extreme co-design. This approach optimizes inference across hardware specialized for each phase and delegates these unique challenges to an entire platform rather than just one processor.

Why one processor isn’t enough

These unique demands won’t be resolved by simply adding more compute FLOPs and memory capacity. The demands stem from the architectural properties of how agents work, and no single processor can solve them simultaneously.

Figure 8. Diagram of part of the NVIDIA extreme co-design strategy, highlighting benefits for agentic workloads

What is needed is a platform where each bottleneck maps to specialized hardware, orchestrated as a unified system with extreme co-design (see Figure 8, above):

The Platform

- Vera Rubin NVL72 handles capacity and compute at one-tenth the cost per million tokens of Blackwell. The HBM capacity is what makes long-context pipelines tractable; the compute density absorbs prefill cost at scale.
- Vera CPU closes the tool-execution gap with lower agent latency, seamless KV cache offload, and unified CPU-GPU execution.
- Groq 3 LPX breaks the throughput-latency tradeoff. Its SRAM-first architecture delivers tightly bounded, low-jitter token generation, which is critical when variance in any single agent propagates through the entire pipeline.
- The Networking Chips (NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-X Ethernet) create a unified, low-latency serving fabric for agentic workloads, so agents can coordinate faster, keep shared context accessible, and avoid costly recomputation as sessions grow.

Software Stack Components

- Dynamo and Attention-FFN Disaggregation (AFD) create a coherent serving path by splitting work across the best-suited processors and coordinating execution to reduce resource contention and latency. Additionally, Dynamo exposes cache programmability to the agent harness.
- NVFP4 lowers precision overhead so MoE agents can run with lower latency, higher throughput, and lower memory pressure without sacrificing intelligence.
- TRT-LLM WideEP optimizes large expert parallelism for frontier MoEs, allowing agents to provide high-intelligence responses with lower latency and higher throughput.
- Speculative Decoding cuts agent response latency by generating likely tokens in parallel and verifying them quickly, accelerating low-latency inference for large models.
By combining these seven chips and a software stack through extreme co-design, the Vera Rubin platform can deliver 400+ tokens per second per user on trillion-parameter MoE models with a large 400k context. This level of performance shifts the historical tradeoff for agents: you no longer need to compromise quality with smaller models and limited context windows in order to deliver high per-user speeds and high system throughput. In this region, agentic architectures become viable products at scale rather than expensive experiments.

For more details on the Vera Rubin platform specs and LPX, explore their respective launch day blogs:

- Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer
- Inside the NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform

About the Authors

Eduardo Alvarez is a senior technical lead at NVIDIA, where he focuses on AI inference at scale, performance optimization, workload economic analysis, and application enablement. He has a deep background in AI systems engineering, workload optimization, and accelerated computing focused on translating innovations into real-world applications. Before NVIDIA, Eduardo held engineering roles at various semiconductor and energy tech companies.

Benjamin Klieger is an engineering manager at NVIDIA, where he works on applied AI agent architecture research with a focus on accelerating agent performance through a co-designed software and hardware stack. Benjamin also works on accelerating software development velocity through the design and deployment of frontier coding agent systems. Before NVIDIA, Benjamin was Head of Agents at Groq, where he led their research agent line Compound.

Graham Steele is a product marketing lead at NVIDIA, where he focuses on accelerated computing solutions for the data center, AI inference at scale, and LPX. He has a deep background in product marketing, product management, and go-to-market strategy for AI and semiconductor technologies, with a focus on bringing accelerated computing platforms to enterprise customers. Before NVIDIA, Graham held product marketing and product management roles at Groq and Intel.

Related entities

Original archive

In-depth analysis

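1. Agentic token consumption is "structurally probabilistic" rather than "linearly predictable", which is the fundamental challenge for serving economics

Moving from standard chat to tool calling to agentic systems changes the token consumption pattern qualitatively: standard chat is linearly predictable, tool calling introduces variability but remains bounded, and agentic workloads are "structurally probabilistic", meaning the shape of each session can differ sharply (the article cites Anthropic's estimate of up to 15x the tokens of standard chat). Static resource provisioning tuned for chat (fixed context lengths, batch optimization) therefore breaks down for agentic workloads, which call for a new infrastructure paradigm.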

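2. The sub-agent architecture is the core mechanism for lowering the main agent's input cost

In the measured 33-minute Claude Code session, each of the 225 sub-agent requests used its own context: a sub-agent starts from a fresh context instead of inheriting the main agent's accumulated context (which peaked at 156K tokens), and this substantially lowers overall input token cost. Sub-agents can also run on smaller models, which compresses cost further. This explains why mainstream coding agents adopt a hierarchical delegation architecture.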
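
The effect of starting from a fresh context can be illustrated with a toy token count. The sketch below is an assumption-laden model, not the accounting used in the article: every function name and every number (base context, tokens added per step, summary size) is hypothetical, and it counts raw input tokens without any caching discount.

```python
def monolithic_input_tokens(base: int, steps: int, tokens_per_step: int) -> int:
    """Total input tokens when one agent runs every step in a single growing context."""
    total = 0
    context = base
    for _ in range(steps):
        total += context            # the whole context is re-read as input on each request
        context += tokens_per_step  # tool output and reasoning are appended afterwards
    return total


def delegated_input_tokens(base: int, steps: int, tokens_per_step: int,
                           sub_base: int, summary_tokens: int) -> int:
    """Total input tokens when the steps run in a fresh sub-agent context."""
    # The sub-agent grows its own small context instead of the main agent's large one.
    sub_total = monolithic_input_tokens(sub_base, steps, tokens_per_step)
    # Toy assumption: the main agent makes one request to delegate
    # and one more request to read the returned summary.
    main_total = base + (base + summary_tokens)
    return sub_total + main_total


if __name__ == "__main__":
    mono = monolithic_input_tokens(100_000, 10, 2_000)
    deleg = delegated_input_tokens(100_000, 10, 2_000, sub_base=5_000, summary_tokens=1_000)
    print(f"monolithic: {mono:,} input tokens")
    print(f"delegated:  {deleg:,} input tokens")
```

With these made-up parameters the delegated run processes roughly a third of the input tokens of the monolithic run, because the ten steps are re-read against a 5K-token context rather than a 100K-token one.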

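3. The KV cache is the lifeline of agentic economics: a 95% cache hit rate can cut input cost by about 85%

Without prompt caching, every input token has to be fully recomputed. Popular API providers discount cache hits by about 90%, so at a 95% hit rate input processing cost falls by about 85%. Coding agents commonly sustain 95-98% cache hit rates, especially when tool output stays small. Prompt caching has therefore evolved from an API feature into a systems problem: its effectiveness depends on CPU-side KV cache management and purpose-built context storage hardware such as NVIDIA CMX.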
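
The arithmetic behind the "about 85%" figure can be reproduced with a one-line blended price. This is a minimal sketch under a simplifying assumption: cached tokens cost 10% of the normal input price and everything else costs full price, with no cache-write premium and no output tokens. The function name is illustrative only.

```python
def effective_input_cost(hit_rate: float, cache_discount: float = 0.90) -> float:
    """Relative input cost per token, where 1.0 means no caching at all."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be in [0, 1]")
    cached_price = 1.0 - cache_discount           # price of a cache-hit token
    return hit_rate * cached_price + (1.0 - hit_rate) * 1.0


if __name__ == "__main__":
    for rate in (0.95, 0.98):
        cost = effective_input_cost(rate)
        print(f"hit rate {rate:.0%}: relative cost {cost:.3f}, "
              f"saving {1 - cost:.1%}, uncached would be {1 / cost:.1f}x more")
```

At a 95% hit rate the blended cost is 0.95 × 0.10 + 0.05 × 1.00 = 0.145 of the uncached price, which is the roughly 85% reduction quoted above. Under this simplified model the uncached cost works out closer to 7x than to the article's "roughly 6x"; the article does not show its calculation, so the gap may come from factors this sketch ignores.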

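4. Context compaction is a necessary mechanism for preventing context rot and managing cost

In the measured session, the context was compacted from a peak of 156K tokens to about 20K. A compaction event like this does three things: it avoids hitting the context window limit, it mitigates context rot (the degradation in output quality as context expands), and it immediately lowers cached input token spend. It reveals a key reality: expanding context indefinitely is not a solution; a context management strategy is.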
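
A toy simulation makes the cost effect of a compaction event concrete. The sketch below assumes linear context growth, which the real trace does not follow; the start size, the compaction threshold, and the post-compaction size are taken from the figures above, while the per-turn growth of 3,500 tokens is a hypothetical value chosen only so the totals land in the same ballpark as the reported session.

```python
from typing import Optional, Tuple


def simulate_session(turns: int, start: int, growth_per_turn: int,
                     compact_at: Optional[int] = None,
                     compact_to: int = 20_000) -> Tuple[int, int]:
    """Return (cumulative input tokens, peak context) for a toy main-agent session."""
    context = start
    cumulative = 0
    peak = start
    for _ in range(turns):
        if compact_at is not None and context >= compact_at:
            context = compact_to       # compaction: history is replaced by a summary
        cumulative += context          # every request re-reads the current context
        peak = max(peak, context)
        context += growth_per_turn     # hypothetical, constant growth per turn
    return cumulative, peak


if __name__ == "__main__":
    without = simulate_session(58, 15_000, 3_500)
    with_compaction = simulate_session(58, 15_000, 3_500, compact_at=156_000)
    print(f"no compaction:   {without[0]:,} input tokens, peak {without[1]:,}")
    print(f"with compaction: {with_compaction[0]:,} input tokens, peak {with_compaction[1]:,}")
```

In this model a single compaction removes about a third of the session's cumulative input tokens and keeps the peak context well under the threshold, which is the cost-management side effect described above.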

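5. The core insight of extreme co-design: each agentic bottleneck maps to specialized hardware, and no single processor can solve all of them at once

The Vera Rubin platform divides the work across seven chips: Vera Rubin NVL72 handles HBM capacity and large-scale prefill compute; Vera CPU closes the tool-execution gap and handles KV cache offload; Groq 3 LPX uses an SRAM-first architecture to break the throughput-latency tradeoff (low-jitter token generation); and the networking chips (NVLink 6, ConnectX-9, BlueField-4, Spectrum-X) provide a unified low-latency fabric. The software stack adds further layers of optimization through Dynamo and Attention-FFN Disaggregation (AFD), NVFP4, TRT-LLM WideEP, and speculative decoding.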

Practical takeaways

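1. Consider inference efficiency at system design time, not as a post-deployment optimization

The traditional approach is to train a large model first and optimize inference afterwards; the core shift in extreme co-design is to treat efficiency as a first-class constraint on architecture design. That means choosing specialized hardware for agentic workloads (for example, Vera Rubin NVL72 for long contexts) rather than assuming a single GPU can solve every problem.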

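2. The sub-agent architecture is an essential design pattern for controlling cost

Having the main agent delegate subtasks to sub-agents yields lower input token cost (sub-agents start from fresh context), lower latency (sub-agents can run in parallel), and lower cost per request (sub-agents can use smaller models). When designing an agentic system, treat task decomposition and parallel delegation as core architectural decisions rather than after-the-fact optimizations.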
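
Parallel delegation can be sketched with Python's standard `asyncio` library. This is only an illustration of the pattern: `run_sub_agent` is a hypothetical stand-in that sleeps instead of calling a model, and the task names and delays are invented.

```python
import asyncio
from typing import List, Tuple


async def run_sub_agent(task: str, delay_s: float) -> str:
    """Hypothetical sub-agent: the sleep stands in for inference and tool calls."""
    await asyncio.sleep(delay_s)
    return f"summary of {task}"


async def delegate_in_parallel(tasks: List[Tuple[str, float]]) -> List[str]:
    """Run all sub-agents concurrently and return their summaries in task order."""
    return await asyncio.gather(*(run_sub_agent(name, delay) for name, delay in tasks))


if __name__ == "__main__":
    subtasks = [("read correspondence", 0.3), ("check drive", 0.2), ("confirm recipient", 0.1)]
    summaries = asyncio.run(delegate_in_parallel(subtasks))
    for line in summaries:
        print(line)
```

Because the sub-agents run concurrently, wall-clock time is bounded by the slowest sub-agent rather than the sum of all of them, and the main agent only receives the short summaries back into its own context.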

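3. Build a deliberate context management strategy instead of relying on unlimited context growth

The measured compaction event and the fresh contexts used by sub-agents show that a larger context window is not always better: beyond a certain size, output quality degrades (context rot) and cost grows out of control. Design proactive context compression (for example, summarization and pruning of irrelevant history) and explicit compaction triggers.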
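
An explicit compaction trigger can be as simple as a threshold on the fraction of the context window in use. The following is a minimal sketch, assuming a hypothetical `Message` record and a placeholder summarizer; the 80% trigger, the two retained messages, and the 10:1 summary ratio are illustrative defaults, not values from the article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str
    text: str
    tokens: int


def toy_summarize(old: List[Message]) -> Message:
    """Placeholder summarizer; a real system would call a model here."""
    tokens = max(1, sum(m.tokens for m in old) // 10)  # assumed 10:1 compression
    return Message("system", f"Summary of {len(old)} earlier messages", tokens)


def compact_if_needed(history: List[Message], limit: int,
                      trigger_fraction: float = 0.8, keep_recent: int = 2,
                      summarize: Callable[[List[Message]], Message] = toy_summarize
                      ) -> List[Message]:
    """Summarize older messages once usage crosses trigger_fraction of the limit."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    used = sum(m.tokens for m in history)
    if used < limit * trigger_fraction or len(history) <= keep_recent:
        return history                      # below the trigger: leave history untouched
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent        # one summary message plus the recent turns


if __name__ == "__main__":
    history = [Message("user", f"turn {i}", 20_000) for i in range(10)]
    compacted = compact_if_needed(history, limit=200_000)
    print(len(history), "->", len(compacted), "messages,",
          sum(m.tokens for m in compacted), "tokens")
```

Keeping the most recent turns verbatim while summarizing the rest is one common design choice; pruning irrelevant tool output before summarizing is another option that fits the same trigger.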

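4. Invest in KV cache infrastructure for the greatest leverage on agentic economics

At the hardware level, this requires KV cache management that supports high hit rates (such as NVIDIA CMX) and low-latency networking (NVLink 6, Spectrum-X) so that cache access does not become the bottleneck. At the architecture level, design prompt and prefix sharing to raise the cache hit rate, especially for sub-agents.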
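
Why prompt layout matters for prefix sharing can be shown with a small estimate of how much of each request repeats the previous one. This sketch is a simplification: it compares each request only with its immediate predecessor and works on toy token lists, whereas real prefix caches typically match against many stored prefixes and at block granularity. All names and token values are hypothetical.

```python
from typing import List, Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that two requests have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(requests: List[List[str]]) -> float:
    """Fraction of input tokens that repeat the previous request's prefix."""
    cached = total = 0
    prev: List[str] = []
    for tokens in requests:
        cached += shared_prefix_len(prev, tokens)
        total += len(tokens)
        prev = tokens
    return cached / total if total else 0.0


if __name__ == "__main__":
    stable = ["sys"] * 8  # stands in for a system prompt plus tool definitions
    turns = [["t1"], ["t1", "t2"], ["t1", "t2", "t3"]]
    stable_first = [stable + t for t in turns]
    volatile_first = [[f"timestamp{i}"] + stable + t for i, t in enumerate(turns)]
    print(f"stable prefix first:  {prefix_hit_rate(stable_first):.0%}")
    print(f"volatile token first: {prefix_hit_rate(volatile_first):.0%}")
```

Placing stable content (system prompt, tool definitions) ahead of anything that changes per request keeps the shared prefix long; a single volatile token at the front breaks the match for everything after it.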

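5. Use the NeMo toolchain to build observability for agentic systems

NeMo Agent Toolkit can profile agents built with LangChain, AutoGen, or AWS Strands without invasive code changes, exposing latency bottlenecks, token cost, and orchestration overhead. Validating system behavior against real session traces before launch (such as the context growth chart from the 33-minute Claude Code session) is a necessary step toward viable production economics for agentic systems.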
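
Once a session trace has been exported, the headline numbers (requests per agent, peak context, cache hit rate) are straightforward to aggregate. The sketch below does not use the NeMo Agent Toolkit API; it assumes a hypothetical record format with `agent`, `input_tokens`, and `cached_tokens` fields, and the sample values are invented.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_trace(records: List[dict]) -> Dict[str, dict]:
    """Aggregate per-agent request count, token totals, peak input, and cache hit rate."""
    stats: Dict[str, dict] = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "cached_tokens": 0, "peak_input": 0}
    )
    for r in records:
        s = stats[r["agent"]]
        s["requests"] += 1
        s["input_tokens"] += r["input_tokens"]
        s["cached_tokens"] += r["cached_tokens"]
        s["peak_input"] = max(s["peak_input"], r["input_tokens"])
    for s in stats.values():
        s["cache_hit_rate"] = s["cached_tokens"] / s["input_tokens"] if s["input_tokens"] else 0.0
    return dict(stats)


if __name__ == "__main__":
    # Hypothetical records; a real trace would have one entry per inference request.
    trace = [
        {"agent": "main", "input_tokens": 15_000, "cached_tokens": 0},
        {"agent": "main", "input_tokens": 18_000, "cached_tokens": 15_000},
        {"agent": "sub", "input_tokens": 6_000, "cached_tokens": 4_000},
    ]
    for agent, s in summarize_trace(trace).items():
        print(agent, s)
```

Tracking these aggregates per agent role makes it easier to see whether cost is driven by main-agent context growth or by sub-agent fan-out.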


Architecture diagram (nvidia-extreme-co-design-agentic-systems)

→ View the interactive HTML version