nvidia gpu kernel translation cute python julia¶

Ch03.054 nvidia gpu kernel translation cute python julia¶

📊 Level ⭐⭐ | 17.4KB | entities/nvidia-gpu-kernel-translation-cute-python-julia.md

Automating GPU Kernel Translation with AI Agents: cuTile Python to cuTile.jl | NVIDIA Technical Blog¶

Apr 30, 2026
By Zhengyi Zhang, Yifei Song, and Tim Besard

AI-Generated Summary

- NVIDIA CUDA Tile (cuTile) enables tile-based GPU kernel programming, and cuTile.jl brings this abstraction to Julia, allowing custom GPU kernels without using CUDA C++, which is critical for Julia's scientific computing ecosystem.
- Translating GPU kernels from cuTile Python to cuTile.jl involves handling key semantic differences such as 0-based vs. 1-based indexing, row-major vs. column-major memory layout, broadcasting syntax, and kernel API mappings, which, if mishandled, cause silent errors.
- The TileGym project developed an AI-driven skill-based workflow that encodes 17 critical translation rules, static validation scripts, and example kernels (add, matmul, softmax), enabling automated, repeatable, and validated conversion of cuTile Python kernels to Julia with minimal manual effort.

NVIDIA CUDA Tile (cuTile) is a tile-based programming model that enables developers to write GPU kernels in terms of tile-level operations (loads, stores, and matrix multiply-accumulate) rather than manually coordinating threads, warps, and shared memory. cuTile.jl brings the same tile-based approach to the dynamic programming language Julia. Users can write custom GPU kernels without dropping down to NVIDIA CUDA C++. Custom kernels are often essential in Julia's scientific computing ecosystem, spanning differential equations, probabilistic programming, and physics simulations.

cuTile Python has a growing library of optimized kernels for GPU acceleration.
The ability to translate those kernels to cuTile.jl provides the Julia ecosystem with immediate access to battle-tested implementations, instead of rewriting each one from scratch.

This post covers GPU kernel translation across domain-specific languages (DSLs), specifically porting cuTile Python kernels to cuTile.jl (Julia). It shows how to:

- Translate GPU kernels between cuTile Python and cuTile.jl: Walk through a complete matrix multiplication example side-by-side.
- Avoid semantic traps that break naive translations: Indexing, broadcasting, memory layout, and loop forms all diverge between the two DSLs, and silent mismatches produce wrong results, not compiler errors.
- Build a repeatable, skill-driven AI workflow: The translation knowledge is packaged into an LLM skill in TileGym that produces validated Julia kernels in a single pass, systematizing a one-off porting effort.

Cross-DSL GPU kernel translation¶

Both the cuTile Python and cuTile.jl frontends share the same tiled abstraction, making the translation largely algorithmic. However, the cumulative surface-level differences between the two languages are non-trivial, as shown in Table 1.

| Category | Python (cuTile) | Julia (cuTile.jl) |
|---|---|---|
| Indexing | 0-based (`ct.bid(0)`) | 1-based (`ct.bid(1)`) |
| Broadcasting | Implicit (`a + b`) | Explicit dot syntax (`a .+ b`) |
| Memory layout | Row-major | Column-major |
| Kernel definition | `@ct.kernel` decorator | Plain `function ... end` |
| Constants | `param: ct.Constant[int]` in signature | `param::Int` in signature, `ct.Constant(val)` at launch |
| Type conversion | `tile.astype(ct.float32)` | `convert(ct.Tile{Float32}, tile)` |
| Matrix multiply | `ct.mma(a, b, acc=acc)` | `muladd(a, b, acc)` |

Table 1. High-level differences between writing tile code in Python versus Julia

None of these translations are conceptually difficult, but miss one `ct.bid(0)` that should be `ct.bid(1)`, and you get silent data corruption. Use `*` instead of `.*` for element-wise multiply, and Julia silently does a matrix multiply instead. These are the kinds of bugs that waste hours.
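
To make the indexing and layout rows of Table 1 concrete, here is a minimal sketch in plain Python (it does not use cuTile). It assumes the Julia side views the same buffer as a column-major array with the dimensions reversed. The helper names `row_major_offset`, `col_major_offset`, and `to_julia_index` are hypothetical and exist only for this illustration.

```python
def row_major_offset(i, j, n_cols):
    """Offset of 0-based element (i, j) in a row-major array with n_cols columns."""
    return i * n_cols + j


def col_major_offset(i, j, n_rows):
    """Offset of 1-based element (i, j) in a column-major array with n_rows rows."""
    return (i - 1) + (j - 1) * n_rows


def to_julia_index(i, j):
    """Map a 0-based row-major index to the 1-based index of the same memory
    location in the column-major view with reversed dimensions (assumed layout)."""
    return (j + 1, i + 1)


M, K = 3, 4  # Python sees A as (M, K); the column-major view is (K, M)

# Every element lands on the same linear offset after the swap-and-shift.
for i in range(M):
    for j in range(K):
        ji, jj = to_julia_index(i, j)
        assert row_major_offset(i, j, K) == col_major_offset(ji, jj, K)

# Reusing the 0-based index unchanged still looks valid, but addresses another element.
stale = col_major_offset(1, 1, K)    # 0-based (1, 1) reused as if it were 1-based
wanted = row_major_offset(1, 1, K)   # the element the Python code meant
print(stale, wanted)                 # prints: 0 5
```

The stale index raises no error; it simply reads a different location, which is the kind of silent mismatch described above.
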
A shared abstraction with a finite set of recurring pitfalls is well-suited for an AI-assisted workflow if the model is taught what to watch out for.

Translating cuTile Python to cuTile.jl¶

The process is best understood through actual code. The following examples are from TileGym, where the team ported a set of cuTile Python kernels to cuTile.jl and packaged them as a self-contained Julia subproject.

Matrix multiplication example¶

The running example uses matmul, which is complex enough to show key translation challenges. Beyond basic syntax differences, the translation must handle loop structure, TF32 tensor core conversion, and the shift from row-major to column-major layout.

cuTile Python:

    @ct.kernel
    def matmul_kernel(A, B, C,
                      tm: ct.Constant[int], tn: ct.Constant[int], tk: ct.Constant[int]):
        # Block indices along M and N (0-based axes)
        bid_m = ct.bid(0)
        bid_n = ct.bid(1)
        # Number of tiles along the K dimension of A
        num_k = ct.num_tiles(A, axis=1, shape=(tm, tk))
        # Accumulate in float32
        acc = ct.full((tm, tn), 0, dtype=ct.float32)
        # Use TF32 for float32 inputs
        dtype = ct.tfloat32 if A.dtype == ct.float32 else A.dtype
        for k in range(num_k):
            a = ct.load(A, index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.ZERO)
            b = ct.load(B, index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.ZERO)
            a = a.astype(dtype)
            b = b.astype(dtype)
            acc = ct.mma(a, b, acc)
        # Cast back to the output dtype and store the tile
        acc = ct.astype(acc, C.dtype)
        ct.store(C, index=(bid_m, bid_n), tile=acc)

cuTile.jl (Julia):

    function matmul_kernel(A::ct.TileArray{T,2}, B::ct.TileArray{T,2}, C::ct.TileArray{T,2},
                           tm::Int, tn::Int, tk::Int) where {T}
        # Block indices (1-based axes)
        bid_m = ct.bid(1)
        bid_n = ct.bid(2)
        # Number of tiles along dimension 2 of A
        num_k = ct.num_tiles(A, 2, (tm, tk))
        # Accumulate in Float32
        acc = zeros(Float32, tm, tn)
        # Use TF32 for Float32 inputs
        U = T === Float32 ? ct.TFloat32 : T
        for k in Int32(1):num_k
            a = ct.load(A; index=(bid_m, k), shape=(tm, tk), padding_mode=ct.PaddingMode.Zero)
            b = ct.load(B; index=(k, bid_n), shape=(tk, tn), padding_mode=ct.PaddingMode.Zero)
            a = convert(ct.Tile{U}, a)
            b = convert(ct.Tile{U}, b)
            acc = muladd(a, b, acc)
        end
        # Cast back to the element type and store the tile
        acc = convert(ct.Tile{T}, acc)
        ct.store(C; index=(bid_m, bid_n), tile=acc)
        return
    end

Beyond the basic syntax changes, note the following:

- The layout flips: The Python row-major A(M,K) becomes column-major A_jl(K,M) in Julia. The accumulator, load indices, and store indices all change accordingly. Get the accumulator shape wrong (say, (TM, TN) instead of (TN, TM)) and you get wrong results with no compiler warning.
- `ct.mma` → `muladd`: cuTile.jl maps matrix multiply-accumulate to the Julia standard `muladd`, and `ct.PaddingMode.ZERO` becomes `ct.PaddingMode.Zero` (PascalCase).

Softmax example¶

Softmax pushes things further. Three strategies were implemented in Julia to handle different tensor sizes: tensor memory accelerator (TMA) single-tile, online, and chunked. On top of the matmul patterns, the softmax function brings in broadcast dot syntax (`ct.exp(ct.sub(a, b))` → `exp.(a .- b)`), renamed reductions (`ct.max` → `maximum`, `ct.sum` → `sum`, axis +1), and element-wise `ct.maximum(a, b)` → `max.(a, b)`. But the real challenge isn’t syntax; it’s maintaining correct running max/sum statistics through the translation.

Workflow generation with agent skills¶

The primary outcome of this project wasn’t the translated kernels; it was the skill built to produce them.

Figure 1. The conversion skill packages translation rules, API mappings, examples, validation, and tests into a single reusable workflow

A skill, in this context, is a directory of structured knowledge that lives in the repository and is picked up by an LLM agent. The path to this particular skill is `.claude/skills/converting-cutile-to-julia/`.

    .claude/skills/converting-cutile-to-julia/
      SKILL.md                   # Entry point: workflow overview, top pitfalls
      translations/
        workflow.md              # Step-by-step conversion with checklists
      references/
        api-mapping.md           # Bidirectional Python/Julia API table
        critical-rules.md        # 17 rules (indexing, broadcasting, loops, ...)
        debugging.md             # Error diagnosis for MethodError, IRError, etc.
        testing.md               # Test patterns, tolerances per dtype
      scripts/
        validate_cutile_jl.py    # Static checker for common anti-patterns
      examples/
        01_add/                  # Python and Julia for vector addition
        02_matmul/               # Python and Julia for matrix multiply
        03_softmax/              # Python and Julia for softmax (3 strategies)

The `critical-rules.md` file alone captures 17 pitfalls the team encountered. Table 2 details the most common pitfalls and the associated fixes.

| # | Pitfall | Fix |
|---|---|---|
| 1 | `max(a, b)` on tiles → IRError | Use `max.(a, b)` (broadcast dot) |
| 2 | `ct.load` with `order` → index positions wrong | `order` remaps BOTH shape AND index |

Table 2. Pitfalls and associated fixes for some of the more common issues encountered

There’s also a static validator script that catches things like leftover `ct.bid(0)`, `for` loops inside kernels, and Python-style type names before running on the GPU. With all of this in place, the model doesn’t have to rediscover the conversion rules each time. It reads the skill, follows the checklist, and applies the rules.

The AI agent skill in TileGym¶

The concrete deliverable is a Julia subproject under `julia/` in TileGym, which is open source:

    julia/
      Project.toml         # Dependencies: CUDA.jl, cuTile.jl, NNlib.jl, Test
      kernels/
        add.jl             # 1D element-wise with alpha scaling
        matmul.jl          # 2D tiled MMA with column-major layout
        softmax.jl         # 3 strategies: TMA, online, chunked
      test/
        runtests.jl        # Test runner
        test_add.jl
        test_matmul.jl
        test_softmax.jl

These three kernels were deliberately selected. The add kernel is the simplest way to exercise the full translation surface. Matmul adds loop structure, tensor cores, and the layout flip.
Softmax introduces multipass algorithms with invariants that have to survive translation. Each kernel has tests that compare against a CPU reference with per-dtype tolerances, including boundary cases where dimensions don’t align to tile sizes.

Results and lessons learned¶

With the skill in place, the workflow for each kernel looked like this:

1. Pre-flight: Scan the source for patterns that require special handling (`for` loops, `ct.mma`, `order=`, and so on).
2. Convert: Apply the API mapping and critical rules.
3. Validate: Run the static checker.
4. Test: Run Julia tests against reference implementations.
5. Fix: If something fails, use the debugging guide, fix, and rerun.

For a representative general matrix multiply (GEMM) conversion, the process took about 4 minutes and ~78K tokens on a frontier LLM with no manual intervention. Subsequent kernels were faster because the examples and rules were already in the repo.

Table 3 lists the pitfalls that caused bugs during ports, all of which are now handled automatically by the skill.

| Pitfall | Symptom | Root cause |
|---|---|---|
| `ct.bid(0)` left unchanged | Wrong tile loaded, silent corruption | 0-based versus 1-based indexing |
| `a * b` for element-wise multiply | Matrix multiply instead of element-wise | Julia `*` is matmul; need `.*` |
| Accumulator shape `(TM, TN)` | Wrong results in matmul | Column-major needs `(TN, TM)` |
| `ct.PaddingMode.ZERO` | UndefVarError | Julia uses PascalCase: `.Zero` |

Table 3. Common pitfalls, symptoms, and root causes that cause bugs during the porting of tile code from Python to Julia

The takeaway isn’t that AI wrote the code. It’s the ability to capture what was learned into something the model can reuse next time. A prompt can say, “Be careful with indexing.” A skill can say, “Here are the 17 specific things that go wrong, here’s how to check for them, and here’s a script that catches them automatically.”

Now, future ports can start from a repo that already has working examples, a tested API mapping, a static validator, and a debugging guide.
Each one takes less effort than the last.

A broader takeaway is that the challenge in using AI for systems work isn’t code generation; it’s producing correct code in domains where the compiler won’t catch semantic mistakes. Encoding domain rules in version control, alongside the code they describe, is one way to address this.

Get started using agent skills to translate Python kernels to Julia¶

Use the following code to try the Julia subproject and the conversion skill:

    cd TileGym

    # Explore the Julia kernels
    ls julia/kernels/    # add.jl, matmul.jl, softmax.jl

    # Explore the conversion skill
    ls .claude/skills/converting-cutile-to-julia/

    # Install Julia dependencies (requires Julia 1.12+, CUDA 13.1+ driver)
    julia --project=julia/ -e 'using Pkg; Pkg.instantiate()'

    # Run the Julia kernel tests
    julia --project=julia/ julia/test/runtests.jl

Requirements:

- Julia 1.12+ and NVIDIA CUDA 13.1+ driver
- NVIDIA Ampere, NVIDIA Ada, or NVIDIA Blackwell GPU (compute capability 8.x, 10.x, 11.x, 12.x)
- An LLM agent with file system access (for example, Claude Code)

To use the conversion skill for your own kernels, point your LLM agent at `.claude/skills/converting-cutile-to-julia/SKILL.md`, provide a cuTile Python kernel as input, and start translating Python kernels to Julia.

Tags: Data Science | Developer Tools & Techniques | General | CUDA | Intermediate Technical | Tutorial | CUDA Tile | cuTile | featured

About the Authors¶

Zhengyi Zhang is a computer architect intern at NVIDIA. He is currently a PhD candidate at Fudan University. Zhengyi's research interests span deep learning inference optimization, high-performance kernel development, and compiler techniques for deep learning workloads.

Yifei Song is a computer architect at NVIDIA. He graduated from the University of Chinese Academy of Sciences.
Yifei focuses on end-to-end training optimization, distributed model parallelism, and MLIR compiler infrastructure for deep learning systems.

Tim Besard is a software engineer at JuliaHub, where he leads GPU support and development for the Julia programming language. He holds a Ph.D. in computer science engineering from Ghent University, Belgium, and has been a key contributor to Julia's GPU ecosystem since 2014. Tim maintains several foundational GPU packages, including CUDA.jl, GPUArrays.jl, GPUCompiler.jl, and LLVM.jl, which together form the backbone of GPU computing in Julia.

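In-depth analysis¶

This article reveals core development trends in the {DOMAIN} field and is an important reference for understanding the direction of technical evolution.

Key insights¶

Core trend: Analysis across multiple dimensions shows that the industry is undergoing a fundamental shift from traditional architectures to intelligent systems
Technology drivers: The introduction of new AI capabilities is redefining product form and user experience
Business impact: This shift has far-reaching effects on the existing market landscape and competitive dynamics

Relation to overall industry trends¶

Together with contemporaneous articles such as System of Record→Intelligence, this article forms a systematic framework for analyzing the evolution of enterprise software in the AI Native era

Practical implications¶

Architecture assessment: Review the existing technology stack regularly to judge whether an intelligence-oriented upgrade is needed
Incremental migration: Introduce new capabilities step by step to reduce migration risk
Data infrastructure: Ensure data quality and structure so that the AI layer receives reliable input
Team capability building: Develop engineering teams with the skills the AI era requires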