Session
Research-Track Oral Presentation: R4: Compilers and Kernels
Grand Ballroom 2
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Genghan Zhang ⋅ Shaowei Zhu ⋅ Allen Nie ⋅ Zhen Jia ⋅ Nandita Vijaykumar ⋅ Yida Wang ⋅ Kunle Olukotun
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
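The self-improving loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`propose_rewrite`, `optimize`) are hypothetical, and the LLM-generated candidate is simulated by a random speedup that grows with the number of curated insights.

```python
import random

# Hypothetical sketch of an AccelOpt-style loop (illustrative names,
# not the paper's real API). An "optimization memory" stores
# (slow, fast) kernel pairs; each iteration samples insights from the
# memory to guide a candidate rewrite, keeps it if faster, and records
# the new slow-fast pair.

def propose_rewrite(kernel, insights):
    # Stand-in for an LLM-generated candidate: here we simulate a
    # rewrite whose speedup tends to grow with available insights.
    speedup = 1.0 + 0.1 * (1 + len(insights)) * random.random()
    return {"code": kernel["code"] + "+opt", "time": kernel["time"] / speedup}

def optimize(kernel, memory, iters=20):
    random.seed(0)  # deterministic for the sketch
    best = kernel
    for _ in range(iters):
        insights = memory[-4:]           # curate recent slow-fast pairs
        cand = propose_rewrite(best, insights)
        if cand["time"] < best["time"]:  # keep only improvements
            memory.append((best, cand))  # record the slow-fast pair
            best = cand
    return best

memory = []
result = optimize({"code": "matmul_kernel", "time": 10.0}, memory)
```

The key design point mirrored here is that the memory feeds back into generation, so later iterations benefit from earlier successes.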
Many compound AI systems are inherently “approximate” because the ML components (e.g., a large language model) are probabilistic models and the non-ML components (e.g., retrieval-augmented generation) are heuristic. Such systems benefit from trading off result quality for improved performance. While extensive work exists on approximating ML and non-ML components individually, the wide deployment of LLMs in compound systems presents significant opportunities for end-to-end, accuracy-aware compilation. However, tailoring approximations across these different components is challenging to implement because they rely on different software stacks for compilation and execution and are deployed on different hardware. To address these issues, we present ApproxMLIR, a reusable accuracy-aware compilation toolchain. ApproxMLIR introduces the approx MLIR dialect, which serves as a unified and centralized interface for defining approximations, and approx-opt, a reusable MLIR-based optimizer that applies approximate transformations to ML and non-ML components. We evaluate ApproxMLIR on three compound AI systems that combine LLMs with information retrieval tasks and tool calling. The evaluation shows that ApproxMLIR can effectively represent many common approximation choices, discover profitable points in the accuracy-performance space, and consistently achieve higher speedups than static approximation strategies.
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernels
Hongyi Jin ⋅ Bohan Hou ⋅ Guanjie Wang ⋅ Ruihang Lai ⋅ Jinqi Chen ⋅ Zihao Ye ⋅ Yaxing Cai ⋅ Yixin Dong ⋅ Xinhao Cheng ⋅ Zhihao Zhang ⋅ Yilong Zhao ⋅ Yingyi Huang ⋅ Lijie Yang ⋅ Jinchen Jiang ⋅ Gabriele Oliaro ⋅ Xupeng Miao ⋅ Vinod Grover ⋅ Todd Mowry ⋅ Zhihao Jia ⋅ Tianqi Chen
Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.
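The idea of encoding dependencies between tiled tasks so they can fire inside one persistent kernel, rather than at coarse kernel-launch boundaries, can be illustrated with a tiny event-driven scheduler. This is a conceptual sketch only; the function `run` and its task/dependency encoding are hypothetical, not ETC's real abstraction.

```python
from collections import deque

# Illustrative sketch (not ETC's actual API): tasks over tiles carry
# explicit dependency edges, and a persistent scheduler loop fires a
# task as soon as the events it waits on (completed predecessors)
# have all been signaled.

def run(tasks, deps):
    """tasks: {name: fn}; deps: {name: [prerequisite names]}."""
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, pres in deps.items():
        for p in pres:
            dependents[p].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                       # execute the tiled task
        order.append(t)
        for d in dependents[t]:          # signal the completion event
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)
    return order

log = []
order = run(
    {f"tile{i}": (lambda i=i: log.append(i)) for i in range(3)},
    {"tile2": ["tile0", "tile1"]},       # tile2 waits on two events
)
```

Because readiness is tracked per task rather than per kernel, independent tiles (`tile0`, `tile1`) can proceed without a global synchronization point, which is the parallelism megakernels aim to expose.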
HipKittens: Fast and Furious AMD Kernels
William Hu ⋅ Drew Wadsworth ⋅ Sean Siddens ⋅ Daniel Fu ⋅ Muhammad Osama ⋅ Christopher Ré ⋅ Simran Arora
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers — are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, the algorithms that instantiate these abstractions must be rethought for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available baselines by $1.2 − 2.4\times$ ($d = 64$ attention, GQA non-causal backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels across GPU vendors.
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Stuart H. Sul ⋅ Simran Arora ⋅ Benjamin Spector ⋅ Christopher Ré
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance—data-transfer mechanisms, resource scheduling, and design overheads. With fewer than 50 lines of device code, PK achieves up to $2.33\times$ speedup for data- and tensor-parallel workloads, $4.08\times$ for sequence-parallel workloads, and $1.22\times$ for expert-parallel workloads.
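The compute-communication overlap that PK targets can be sketched in miniature: split work into chunks and transfer a finished chunk while the next one is still being computed. This is a toy host-side illustration using threads, not PK's CUDA primitives; `compute` and `communicate` are stand-in names.

```python
import queue
import threading
import time

# Toy illustration of compute-communication overlap (not PK's real
# primitives): a producer computes chunks while a consumer thread
# "transfers" already-finished chunks concurrently, so transfer time
# hides behind compute time instead of adding to it.

def compute(chunk):
    time.sleep(0.01)          # stand-in for a GEMM tile
    return chunk * 2

def communicate(q, out):
    while True:
        item = q.get()
        if item is None:      # sentinel: no more chunks
            break
        time.sleep(0.01)      # stand-in for an interconnect transfer
        out.append(item)

q, out = queue.Queue(), []
comm = threading.Thread(target=communicate, args=(q, out))
comm.start()
for chunk in range(4):
    q.put(compute(chunk))     # enqueue finished chunk; next computes now
q.put(None)
comm.join()
```

With four chunks, roughly only the first transfer is exposed; the rest overlap with compute, which is the effect overlapped multi-GPU kernels generalize to device code.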
SchedFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
Yi Pan ⋅ Yile Gu ⋅ Luo Jinbin ⋅ Yibo Wu ⋅ Ziren Wang ⋅ Ziyi Xu ⋅ Shengkai Lin ⋅ Stephanie Wang ⋅ Baris Kasikci
Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose SchedFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. SchedFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that SchedFlow can integrate four representative parallelism strategies into three state-of-the-art ML systems (vLLM, SGLang, HuggingFace Transformers) with minimal code changes, achieving up to a $1.24\times$ throughput improvement.