Session
Research-Track Oral Presentation: R4: Compilers and Kernels
Grand Ballroom 2
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
Genghan Zhang ⋅ Shaowei Zhu ⋅ Allen Nie ⋅ Zhen Jia ⋅ Nandita Vijaykumar ⋅ Yida Wang ⋅ Kunle Olukotun
We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.
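The self-improving loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function names (`propose_rewrite`, `optimize`) are hypothetical, and the LLM-generated candidate is simulated by a random speedup that grows with the number of curated insights.

```python
import random

# Hypothetical sketch of an AccelOpt-style loop (illustrative names,
# not the paper's real API). An "optimization memory" stores
# (slow, fast) kernel pairs; each iteration samples insights from the
# memory to guide a candidate rewrite, keeps it if faster, and records
# the new slow-fast pair.

def propose_rewrite(kernel, insights):
    # Stand-in for an LLM-generated candidate: here we simulate a
    # rewrite whose speedup tends to grow with available insights.
    speedup = 1.0 + 0.1 * (1 + len(insights)) * random.random()
    return {"code": kernel["code"] + "+opt", "time": kernel["time"] / speedup}

def optimize(kernel, memory, iters=20):
    random.seed(0)  # deterministic for the sketch
    best = kernel
    for _ in range(iters):
        insights = memory[-4:]           # curate recent slow-fast pairs
        cand = propose_rewrite(best, insights)
        if cand["time"] < best["time"]:  # keep only improvements
            memory.append((best, cand))  # record the slow-fast pair
            best = cand
    return best

memory = []
result = optimize({"code": "matmul_kernel", "time": 10.0}, memory)
```

The key design point mirrored here is that the memory feeds back into generation, so later iterations benefit from earlier successes.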
Many compound AI systems are inherently “approximate” because the ML components (e.g., a large language model) are probabilistic models and the non-ML components (e.g., retrieval-augmented generation) are heuristic. Such systems benefit from trading off result quality for improved performance. While extensive work exists on approximating ML and non-ML components individually, the wide deployment of LLMs in compound systems presents significant opportunities for end-to-end, accuracy-aware compilation. However, tailoring approximations across these different components is challenging to implement because they rely on different software stacks for compilation and execution and are deployed on different hardware. To address these issues, we present ApproxMLIR, a reusable accuracy-aware compilation toolchain. ApproxMLIR introduces the approx MLIR dialect, which serves as a unified and centralized interface for defining approximations, and approx-opt, a reusable MLIR-based optimizer that applies approximate transformations to ML and non-ML components. We evaluate ApproxMLIR on three compound AI systems that combine LLMs with information retrieval tasks and tool calling. The evaluation shows that ApproxMLIR can effectively represent many common approximation choices, discover profitable points in the accuracy-performance space, and consistently achieve higher speedups than static approximation strategies.
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernels
Hongyi Jin ⋅ Bohan Hou ⋅ Guanjie Wang ⋅ Ruihang Lai ⋅ Jinqi Chen ⋅ Zihao Ye ⋅ Yaxing Cai ⋅ Yixin Dong ⋅ Xinhao Cheng ⋅ Zhihao Zhang ⋅ Yilong Zhao ⋅ Yingyi Huang ⋅ Lijie Yang ⋅ Jinchen Jiang ⋅ Gabriele Oliaro ⋅ Xupeng Miao ⋅ Vinod Grover ⋅ Todd Mowry ⋅ Zhihao Jia ⋅ Tianqi Chen
Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.
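The idea of encoding dependencies between tiled tasks so they can fire inside one persistent kernel, rather than at coarse kernel-launch boundaries, can be illustrated with a tiny event-driven scheduler. This is a conceptual sketch only; the function `run` and its task/dependency encoding are hypothetical, not ETC's real abstraction.

```python
from collections import deque

# Illustrative sketch (not ETC's actual API): tasks over tiles carry
# explicit dependency edges, and a persistent scheduler loop fires a
# task as soon as the events it waits on (completed predecessors)
# have all been signaled.

def run(tasks, deps):
    """tasks: {name: fn}; deps: {name: [prerequisite names]}."""
    remaining = {t: len(deps.get(t, [])) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, pres in deps.items():
        for p in pres:
            dependents[p].append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                       # execute the tiled task
        order.append(t)
        for d in dependents[t]:          # signal the completion event
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)
    return order

log = []
order = run(
    {f"tile{i}": (lambda i=i: log.append(i)) for i in range(3)},
    {"tile2": ["tile0", "tile1"]},       # tile2 waits on two events
)
```

Because readiness is tracked per task rather than per kernel, independent tiles (`tile0`, `tile1`) can proceed without a global synchronization point, which is the parallelism megakernels aim to expose.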
HipKittens: Fast and Furious AMD Kernels
William Hu ⋅ Drew Wadsworth ⋅ Sean Siddens ⋅ Daniel Fu ⋅ Muhammad Osama ⋅ Christopher Ré ⋅ Simran Arora
AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers — are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, the algorithms that instantiate these abstractions must be rethought for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available baselines by $1.2 − 2.4\times$ ($d = 64$ attention, GQA non-causal backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels across GPU vendors.
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Stuart H. Sul ⋅ Simran Arora ⋅ Benjamin Spector ⋅ Christopher Ré
Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance—data-transfer mechanisms, resource scheduling, and design overheads. With fewer than 50 lines of device code, PK achieves up to $2.33\times$ speedup for data- and tensor-parallel workloads, $4.08\times$ for sequence-parallel workloads, and $1.22\times$ for expert-parallel workloads.
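The compute-communication overlap that PK targets can be sketched in miniature: split work into chunks and transfer a finished chunk while the next one is still being computed. This is a toy host-side illustration using threads, not PK's CUDA primitives; `compute` and `communicate` are stand-in names.

```python
import queue
import threading
import time

# Toy illustration of compute-communication overlap (not PK's real
# primitives): a producer computes chunks while a consumer thread
# "transfers" already-finished chunks concurrently, so transfer time
# hides behind compute time instead of adding to it.

def compute(chunk):
    time.sleep(0.01)          # stand-in for a GEMM tile
    return chunk * 2

def communicate(q, out):
    while True:
        item = q.get()
        if item is None:      # sentinel: no more chunks
            break
        time.sleep(0.01)      # stand-in for an interconnect transfer
        out.append(item)

q, out = queue.Queue(), []
comm = threading.Thread(target=communicate, args=(q, out))
comm.start()
for chunk in range(4):
    q.put(compute(chunk))     # enqueue finished chunk; next computes now
q.put(None)
comm.join()
```

With four chunks, roughly only the first transfer is exposed; the rest overlap with compute, which is the effect overlapped multi-GPU kernels generalize to device code.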
SchedFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
Yi Pan ⋅ Yile Gu ⋅ Luo Jinbin ⋅ Yibo Wu ⋅ Ziren Wang ⋅ Ziyi Xu ⋅ Shengkai Lin ⋅ Stephanie Wang ⋅ Baris Kasikci
Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose SchedFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. SchedFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that SchedFlow can integrate four representative parallelism strategies into three state-of-the-art ML systems (vLLM, SGLang, HuggingFace Transformers) with minimal code changes, achieving up to a $1.24\times$ throughput improvement.