nvidia edge first llms av robotics¶
Ch01.194 nvidia edge first llms av robotics¶
📊 Level ⭐⭐ | 19.7KB |
entities/nvidia-edge-first-llms-av-robotics.md
Build Next-Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics | NVIDIA Technical Blog¶
Build Next-Gen Physical AI with Edge First LLMs for Autonomous Vehicles and Robotics | NVIDIA Technical Blog DEVELOPER Home Blog Forums Docs Downloads Training Join Technical Blog Subscribe Related Resources Developer Tools & Techniques Build Next-Gen Physical AI with Edge First LLMs for Autonomous Vehicles and Robotics NVIDIA TensorRT Edge LLM introduces support for MoEs, Cosmos Reason 2, and Qwen3-TTS/ASR on NVIDIA Jetson and NVIDIA DRIVE Mar 12, 2026 By Lin Chai , Luxiao Zheng , Fan Shi , Maximilien Breughe and Michael Ferry Like Discuss (0) L T F R E AI-Generated Summary Like Dislike The latest release of NVIDIA TensorRT Edge-LLM introduces advanced support for mixture of experts (MoE), hybrid reasoning architectures, and the NVIDIA Nemotron family on embedded platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor, enabling high-fidelity, low-latency autonomous machine intelligence within strict power constraints. Native multimodal interaction is achieved through optimized Qwen3-TTS and Qwen3-ASR models, allowing end-to-end, low-latency voice dialogue with a Thinker-Talker framework, and Cosmos Reason 2 enables advanced spatio-temporal reasoning, 3D localization, and long-context processing for humanoid robotics and embodied agents at the edge. NVIDIA Alpamayo integration supports end-to-end trajectory planning in autonomous vehicles, employing flow matching trajectory decoding, explainable decision-making with multicamera context, and FP8-accelerated Vision Transformers, marking a shift from modular stacks to production-ready, reasoning-based VLA models. AI-generated content may summarize information incompletely. Verify important information. Learn more Physical AI is rapidly evolving, from next-generation software-defined autonomous vehicles (AVs) to humanoid robots. The challenge is no longer how to run a large language model (LLM), but how to enable high-fidelity reasoning, real-time multimodal interaction, and trajectory planning within strict power and latency envelopes. NVIDIA TensorRT Edge-LLM , a high-performance C++ inference runtime for LLMs and vision language models (VLMs) on embedded platforms, is designed to overcome these challenges. As explained in this post, the latest TensorRT Edge-LLM release delivers a significant expansion in fundamental capabilities for NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor platforms. It introduces advanced edge architectures, including mixture of experts (MoE) , the NVIDIA Cosmos Reason 2 open planning model for physical AI, and Qwen3-TTS and Qwen-ASR models for embedded speech processing. Building on these foundational pillars, the release also offers optimized support for the NVIDIA Nemotron family of open models. This provides developers with the essential runtime to build the next generation of autonomous machines. Efficient reasoning at scale Running massive models on embedded hardware requires a rethink of compute efficiency. The latest release of TensorRT Edge-LLM fully enables MoE support at the edge, specifically optimizing models like Qwen3 MoE. By activating only a subset of expert parameters per token, MoE architectures enable edge devices to access the reasoning capabilities of a massive model while maintaining the inference latency and active compute footprint of a much smaller one. This architectural shift is critical for deploying high-fidelity reasoning on edge platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. As a developer, you can drastically scale up the intelligence of your autonomous systems without exceeding the strict power and latency limits required for real-time, mission-critical operations. Unlock hybrid reasoning at the edge TensorRT Edge-LLM is a specialized runtime to fully support NVIDIA Nemotron 2 Nano . This enables a new class of System 2 reasoning directly on embedded chipsets, including NVIDIA DRIVE Thor and Jetson Thor. For developers building advanced in-cabin AI assistants or robotic dialogue agents, deploying highly capable language models at the edge presents a significant memory and latency challenge. Nemotron 2 Nano addresses this challenge fundamentally by utilizing a novel Hybrid Mamba-2-Transformer architecture. This significantly reduces the memory footprint from KV cache storage with Mamba State Space architectures while maintaining high-fidelity precision from attention layers. TensorRT Edge-LLM bridges the deployment gap by providing optimized kernels that accelerate these specific hybrid layers. This enables developers to use the model’s massive context window for complex edge retrieval-augmented generation (RAG) pipelines or agentic workflows while maintaining a strict, production-viable device memory footprint. By enabling dynamic thinking at the edge with TensorRT Edge-LLM, developers can leverage a model s ability to shift seamlessly between deep reasoning and immediate conversational action. This is a critical capability for advanced in-cabin assistants and robotic agents that must reason through complex user queries one moment and provide conversational responses the next. Deep reasoning mode ( /think ) : TensorRT Edge-LLM efficiently handles the expanded token generation required for chain of thought (CoT) processing. By using the /think system prompt, the runtime enables the model to think through complex logic, achieving a remarkable 97.8% on MATH500 before outputting a decision. Conversational reflex mode ( /no_think ) : For latency-critical voice interactions where the user expects an immediate reply, developers can issue a /no_think command. TensorRT Edge-LLM optimizes this path to bypass reasoning traces, delivering immediate, intelligent responsiveness required for seamless conversational AI and agile on-device agents. By supporting this hybrid architecture, TensorRT Edge-LLM enables compact, production-ready VLMs and LLMs to serve as both reasoned assistants and low-latency conversational agents, significantly reducing the memory constraints of physical AI. Real-time multimodal interaction at the edge TensorRT Edge-LLM now offers support for Qwen3-TTS and Qwen3-ASR , a native multimodal model with Thinker-Talker architecture capable of voice interaction. Unlike traditional pipelines that cascade ASR, LLM, and TTS models, adding latency at every hop, Qwen3-TTS/ASR handles end-to-end speech processing. By optimizing both the Thinker and Talker components, TensorRT Edge-LLM enables low-latency, natural voice synthesis directly on the chip: Thinker : TensorRT Edge-LLM accelerates the reasoning core, allowing the model to process complex driver queries and environment context to generate intelligent, reasoned responses. Talker : TensorRT Edge-LLM complements the reasoning engine by delivering low latency, natural voice synthesis (TTS) directly on the chip. In the case of AVs, this allows for seamless, interruptible conversations between the driver and the vehicle. Equipping humanoid robotics with physical common sense For humanoid robots and advanced vision agents, understanding the real world requires more than just identifying objects; it requires an intuitive grasp of physics and time. To meet this need, TensorRT Edge-LLM now supports Cosmos Reason 2 , an open, customizable reasoning VLM purpose-built for physical AI and robotics. Cosmos Reason 2 empowers embodied agents to reason like humans by using prior knowledge, physical common sense, and chain-of-thought capabilities to understand world dynamics without human annotations. With TensorRT Edge-LLM optimized, low-latency runtime, robots at the edge can efficiently leverage Cosmos Reason 2 as a primary planning model to reason through their next steps. Key capabilities of Cosmos Reason 2 accelerated by TensorRT Edge-LLM include: Advanced spatio-temporal reasoning : Enhanced physical AI reasoning with improved timestamp precision and a deep understanding of space, time, and fundamental physics. 3D localization and explanation : The ability to not only detect objects but also provide 2D and 3D point localization, bounding-box coordinates, and contextual reasoning explanations for its labels. Massive context processing : Support for an improved long-context window of up to 256K input tokens, allowing edge agents to ingest extensive environmental and historical data. By supporting Cosmos Reason 2, TensorRT Edge-LLM ensures that next-generation robots can continuously evaluate complex, long-tail physical scenarios and safely plan their actions in real time. Advancing autonomous driving with end-to-end trajectory planning Among the most significant shifts in autonomous production is the move from traditional modular stacks to end-to-end VLA models. NVIDIA Alpamayo is a family of open AI models, simulation frameworks, and physical AI datasets designed to accelerate the development of safe, transparent, and reasoning-based AVs. Stay tuned for the forthcoming Alpamayo 1 workflow, a distillation recipe that brings System 2 rational thinking to the edge. Alpamayo 1 represents a leap forward from standard VLMs. It is not just describing a scene; it is planning a precise trajectory through it. The architecture utilizes a Cosmos Reason Backbone (distilled) to generate a chain of causation (reasoning trace) before outputting actions. Key features of the Alpamayo integration in TensorRT Edge-LLM include: Flow matching trajectory decoding : Moving beyond simple regression, flow matching is used to generate diverse, high-fidelity future trajectories. History and context : The model tokenizes two-second historical trajectories and multicamera inputs, processing them through a Qwen3-VL backbone to output explainable driving decisions. For example, “Nudge to the left to increase clearance. Performance : On DRIVE Thor, Alpamayo 1 achieves production-viable latencies, using FP8 acceleration for the Vision Transformer (ViT) components. Figure 1. The most significant shift in autonomous vehicle production is the transition from traditional modular stacks to end-to-end VLA models Get started with TensorRT Edge-LLM for physical AI TensorRT Edge-LLM serves as the go-to-open-source, pure C++ inference runtime designed specifically for the mission-critical needs of automotive and robotics. It eliminates Python dependencies for deployment, ensuring predictable memory footprints. From deploying the efficient expert routing of Qwen3 MoE today, to preparing for the future distilled reasoning of Alpamayo 1, NVIDIA provides the essential runtime to build the next generation of autonomous machines. To get started, explore the new features, including the Alpamayo and MoE examples, in the updated TensorRT Edge-LLM GitHub repo or through the latest NVIDIA DriveOS releases. Discuss (0) Like Tags Developer Tools & Techniques | Edge Computing | Robotics | Automotive / Transportation | Cosmos | DRIVE | Jetson | Nemotron | TensorRT | TensorRT-LLM | Intermediate Technical | Deep dive | AI Inference | autonomous vehicles | GTC 2026 | IoT | LLMs | Mixture of Experts (MoE) | Physical AI | Retrieval Augmented Generation (RAG) | Thor | VLMs About the Authors About Lin Chai Lin Chai is a senior product manager at NVIDIA, leading TensorRT and TensorRT Edge-LLM, NVIDIA s AI inference platforms for deep learning across datacenter and embedded platforms. Drawing on her background in autonomous driving and automotive OEMs, she is inspired to build production-grade inference systems that deliver best-in-class performance for deep learning workloads across data center, edge, and physical AI applications enabling systems that perceive, reason, and act in the real world. View all posts by Lin Chai About Luxiao Zheng Luxiao Zheng is a senior systems software engineer at NVIDIA. He works on the TensorRT general performance team with a specialization in Large Language Model inference workflow. He works on end-to-end LLM software development, performance measurements, analysis and improvements for x86_64 and aarch64 platforms. Luxiao holds a M.S. in Computer Science, a B.S. in Computer Science and a B.S. in Chemical Engineering from Washington University in St. Louis. View all posts by Luxiao Zheng About Fan Shi Fan Shi is a senior system software engineer on the NVIDIA TensorRT team, specializing in the efficient deployment of advanced AI models on edge platforms. His work focuses on optimizing performance and usability in deep learning inference. Fan holds an M.S. in computational data science from Carnegie Mellon University and a B.S. in statistics and computer science from the University of Illinois. View all posts by Fan Shi About Maximilien Breughe Maximilien Breughe is an engineering leader and software engineer at NVIDIA, where he works on AI inference systems and edge AI technologies. He has a background in deep learning libraries and performance engineering, and holds a PhD in Computer Architecture focused on performance simulation techniques. Maximilien is especially interested in building practical, high-performance AI systems that bridge research and real-world deployment. View all posts by Maximilien Breughe About Michael Ferry Michael Ferry is a software engineering manager on the NVIDIA TensorRT team, where he leads the TensorRT Edge-LLM, Automotive Safety, and New Platforms teams. His work centers on optimized, reliable AI inference for safety-critical robotics and automotive edge systems. Before joining NVIDIA in 2018, Michael created and led several floating-point-focused verification tools at Intel. He holds a PhD in Mathematics, specializing in numerical optimization, from the University of California, San Diego. View all posts by Michael Ferry Comments Related posts Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM NVIDIA Accelerates OpenAI gpt-oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72 NVIDIA Accelerates OpenAI gpt-oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72 Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI Getting the Best Performance on MLPerf Inference 2.0 Getting the Best Performance on MLPerf Inference 2.0 Related posts Federated Learning Without the Refactoring Overhead Using NVIDIA FLARE Federated Learning Without the Refactoring Overhead Using NVIDIA FLARE Mitigating Indirect AGENTS.md Injection Attacks in Agentic Environments Mitigating Indirect AGENTS.md Injection Attacks in Agentic Environments Build a More Secure, Always-On Local AI Agent with OpenClaw and NVIDIA NemoClaw Build a More Secure, Always-On Local AI Agent with OpenClaw and NVIDIA NemoClaw Bringing AI Closer to the Edge and On-Device with Gemma 4 Bringing AI Closer to the Edge and On-Device with Gemma 4 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere L T F R E
深度分析¶
1. MoE 成为边缘端物理 AI 的关键架构转型
TensorRT Edge-LLM 全面支持 MoE 模型(如 Qwen3 MoE)在 DRIVE AGX Thor 和 Jetson Thor 等边缘嵌入式平台运行。通过 per-token 激活子集专家参数,MoE 架构使边缘设备能以小型模型的推理延迟和计算量访问超大模型的能力。这对在严格功耗和延迟约束下部署高保真推理的自动驾驶和机器人应用至关重要。
2. Hybrid Mamba-2-Transformer 实现边缘动态推理模式切换
Nemotron 2 Nano 采用 Hybrid Mamba-2-Transformer 架构,通过 Mamba State Space 架构显著降低 KV cache 内存占用,同时通过 Attention 层保持高保真精度。配合 /think(深度推理)和 /no_think(对话反射)模式动态切换,在 MATH500 上达到 97.8% 准确率,同时支持即时语音交互,为车载助手和机器人对话代理开辟了新可能。
3. 端到端语音交互消除多跳延迟,Thinker-Talker 架构成边缘多模态新范式
Qwen3-TTS/ASR 通过 Thinker-Talker 架构实现原生多模态语音交互。与传统的级联 ASR→LLM→TTS 管道不同,Qwen3 端到端处理语音,Thinker 加速推理核心处理复杂查询,Talker 芯片级低延迟语音合成,使车内对话可中断、无缝切换。
4. Cosmos Reason 2 为具身智能赋予物理常识和时空推理能力
Cosmos Reason 2 通过物理 AI 推理、改进时间戳精度和空间理解、2D/3D 定位与解释,以及最高 256K input token 的长上下文处理,使人形机器人能够持续评估复杂长尾物理场景并实时安全规划行动——从场景描述向轨迹规划的重要飞跃。
5. Alpamayo 标志着自动驾驶从模块化栈向端到端 VLA 的生产级迁移
Alpamayo 1 使用 Flow Matching 轨迹解码(超越简单回归)、历史轨迹和摄像头输入分词(Qwen3-VL 主干)、FP8 加速 Vision Transformer,在 DRIVE Thor 上实现生产级延迟。从传统模块化栈到端到端 VLA 的转变已不是研究演示,而是生产就绪的推理基础。
实践启示¶
-
边缘物理 AI 优先选择 MoE 架构:在边缘部署 AI 时,MoE 是突破功耗和延迟约束的关键——今天就应开始评估 Qwen3 MoE 在目标嵌入式平台上的适配性和专家路由效率。
-
利用混合推理架构实现场景自适应:Nemotron 2 Nano 的
/think//no_think模式切换能力,可直接应用于需要同时支持深度推理和即时响应的车载和机器人产品设计。 -
端到端语音优先于级联管道:对于延迟敏感的车内和机器人语音交互,采用 Thinker-Talker 等端到端原生多模态架构,而非 ASR-LLM-TTS 级联,以消除每跳延迟。
-
Cosmos Reason 2 是机器人规划模型的首选基准:在具身智能项目中,以 Cosmos Reason 2 作为主要规划模型,充分利用其物理常识和长上下文能力进行复杂场景推理。
-
VLA 端到端轨迹规划已成生产现实:APlamayo 表明端到端 VLA 生产级部署已到达临界点,自动驾驶团队应加速从模块化栈向 VLA 的技术路线迁移。
相关实体¶
- Nvidia Gemma 4 Edge Ai
- Vera Arrives Nvidia S First Cpu Built For Agents Lands At Top Ai Labs
- Nvidia Gpu Kernel Translation Cute Python Julia
- Nvidia Cut Checkpoint Costs Nvcomp
- Cong 30 Fen Zhong Shou Gu Agent Dao Harness Cheng Wei Xin Hou Duan
- MOC
→ 原文存档