

Session

Industry Track Oral Presentation: Compilers/HW

Grand Ballroom 1

Moderator: Phitchaya Phothilimthana

Fri 22 May 8:15 a.m. PDT — 9:15 a.m. PDT

Fri 22 May 8:15 - 8:30 PDT

Dataflow Is All You Need

Darshan Gandhi ⋅ Pushkar Nandkar ⋅ David Koeplinger ⋅ Nasim Farahini ⋅ Romy Tsoupidi ⋅ Samuel Rydh ⋅ Matheen Musaddiq ⋅ Tuowen Zhao ⋅ Reid Goodbar ⋅ Nathan Sheeley ⋅ Leon Zhang ⋅ Matthew Shaffer ⋅ John Long ⋅ Han Wang ⋅ Angela Wang ⋅ Arjun Sabnis ⋅ Joshua Brot ⋅ Yun Du ⋅ Håkan Zeffer ⋅ Mingran Wang ⋅ Raghu Prabhakar

The autoregressive decode phase of token generation is often a major performance bottleneck in modern AI workflows due to its memory-bandwidth-bound characteristics, which are amplified by powerful open-source models with large context windows and techniques such as chain-of-thought reasoning. Popular GPU architectures utilize as little as 21% of the available memory bandwidth when loading weights and KV caches, as the scope of asynchronous execution is limited by CPU scheduling overheads, kernel synchronization overheads, and inadequate compute-communication overlap. While prior work attempts to address these overheads with kernel fusion and asynchronous execution on GPUs, it mostly focuses on a single GPU and does not generalize across different model architectures. We argue that to truly mitigate these overheads, *Dataflow Is All You Need*. In this paper, we chronicle a co-design approach to achieve peak decoding performance on a dataflow architecture, the SambaNova SN40 Reconfigurable Dataflow Unit (RDU), substantiated by three key dataflow-enabled optimizations (*KernelLooping*, *BatchStreaming*, and *ScheduleOffloading*) that generalize to models that are small, large, dense, MoEs, or hybrids, and that use different attention mechanisms. Collectively, these optimizations deliver more than 75% of the theoretical peak roofline performance for a wide range of popular open-source models and demonstrate a speedup of more than 6× when using popular speculative decoding techniques. Finally, we show that speculative decoding is 1.7× faster on 16 SN40 chips than on a DGX H100 despite both systems having comparable HBM bandwidth. The techniques described in this paper and the models used in the evaluation are deployed in a production AI inference cloud at cloud.sambanova.ai.
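To make the roofline framing in this abstract concrete, here is a minimal sketch of the standard memory-bound estimate for decode: each decode step has to stream the model weights and the KV caches from memory, so memory bandwidth divided by bytes moved per step bounds the step rate. The function names and every number below are illustrative assumptions, not values or code from the paper.

```python
def decode_roofline_tokens_per_s(weight_bytes, kv_cache_bytes_per_seq,
                                 hbm_bandwidth_bytes_per_s, batch_size=1):
    """Upper bound on decode throughput for a memory-bound step.

    Assumes each step reads all weights once plus one KV cache per sequence,
    and that compute is fully hidden behind memory traffic.
    """
    bytes_per_step = weight_bytes + batch_size * kv_cache_bytes_per_seq
    steps_per_s = hbm_bandwidth_bytes_per_s / bytes_per_step
    # Each step emits one token per sequence in the batch.
    return steps_per_s * batch_size


def roofline_fraction(measured_tokens_per_s, roofline_tokens_per_s):
    """Fraction of the bandwidth roofline that a measured run achieves."""
    return measured_tokens_per_s / roofline_tokens_per_s


if __name__ == "__main__":
    # Hypothetical setup: 16 GB of weights, 1 GB of KV cache, 2 TB/s of HBM.
    peak = decode_roofline_tokens_per_s(16e9, 1e9, 2e12)
    print(f"roofline: {peak:.1f} tokens/s")
    print(f"fraction at 90 tokens/s: {roofline_fraction(90.0, peak):.2f}")
```

Under this model, a claim such as "more than 75% of roofline" corresponds to `roofline_fraction(...) > 0.75` for the measured throughput.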

Fri 22 May 8:30 - 8:45 PDT

Efficient, VRAM-Constrained xLM Inference on Clients

Aditya Ukarande ⋅ Deep Shekhar ⋅ Marc Blackstein ⋅ Ram Rangan

To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. This means efficient support for: a) interactive as well as batch modes, b) high-resolution VLM inference, c) dense and mixture-of-experts (MoE) LLMs, and d) adapting to system conditions (CPU thread count, CPU-GPU interconnect bandwidth, and video memory (VRAM) budget) and inference conditions (phase of execution and context size). While recent CPU-GPU hybrid scheduling techniques show promise, to the best of our knowledge, no single product handles all of the above. In this paper, we address this problem with *pipelined sharding*, a novel, benchmark-profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and MoE LLMs. Using a combination of model sharding at the sub-layer level, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called *VLMOpt*), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products: the In-Game Inferencing software development kit (IGI SDK) and the Cosmos-Reason1 (CR1) physical AI reasoning VLM.
Highlights from our rigorous evaluation spanning multiple models and client systems include: for interactive use, TTFT improves by up to 6.7× and TPS by up to 30× for LLMs, and CR1 inference’s VRAM demand is down by 10×, while in batched mode, throughput improves by up to 8.2×, all compared to their respective aggressive baselines.
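As a rough illustration of what "prioritized tensor placement in VRAM" under a fixed budget can look like, the sketch below greedily keeps the highest-priority tensors on the GPU and leaves the rest on the CPU. This is a generic greedy heuristic written for illustration only; the `Tensor` type, the priority scores, and the tensor names are hypothetical, and the abstract does not say that pipelined sharding works this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size_bytes: int
    priority: float  # hypothetical score: higher means more benefit from VRAM


def place_tensors(tensors, vram_budget_bytes):
    """Greedy placement: visit tensors by descending priority and keep each
    one in VRAM if it still fits; everything else stays on the CPU."""
    in_vram, on_cpu = [], []
    remaining = vram_budget_bytes
    for t in sorted(tensors, key=lambda t: t.priority, reverse=True):
        if t.size_bytes <= remaining:
            in_vram.append(t.name)
            remaining -= t.size_bytes
        else:
            on_cpu.append(t.name)
    return in_vram, on_cpu


if __name__ == "__main__":
    # Made-up tensors and sizes, with a budget of 8 units.
    tensors = [
        Tensor("attention", 4, 3.0),
        Tensor("ffn", 8, 2.0),
        Tensor("embedding", 3, 1.0),
    ]
    print(place_tensors(tensors, vram_budget_bytes=8))
```

A profile-guided scheduler would presumably derive the priorities from measurements (for example, per-tensor transfer cost over the CPU-GPU interconnect) rather than from fixed constants as done here.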

Fri 22 May 8:45 - 9:00 PDT

Wave: A Symbolic Python DSL And Compiler for High-Performance Machine Learning

Harsh Menon ⋅ Oleksandr Zinenko ⋅ Gaurav Verma ⋅ Stanley Winata ⋅ Ivan Butygin ⋅ Nithin Meganathan ⋅ Sanket Pandit ⋅ William Gallard Hatch ⋅ Surya Jasper ⋅ Megan Kuo ⋅ Sahil FAIZAL ⋅ Ashay Rane ⋅ Aurore De Spirlet ⋅ Martin P. Lücke

Modern ML models demand ever-greater compute, prompting hardware vendors to add specialized matrix cores to their GPUs. While these units unlock high throughput, they impose intricate programming models and addressing schemes that are difficult to manage by hand. This paper introduces Wave, a Python-embedded DSL for kernel authoring that automates these complex address computations and lets authors focus on core computation. In experiments, it matches or surpasses the performance of state-of-the-art kernel DSLs and libraries.

Fri 22 May 9:00 - 9:15 PDT

CATWILD: Compiler Autotuning for TPU workloads in the Wild

Ignacio Cano ⋅ Yu Wang ⋅ Mike Burrows ⋅ Ziqiang Feng ⋅ Matheus Camargo ⋅ Chao Wang ⋅ David Liu ⋅ Tengyu Sun ⋅ Alexander Wertheim ⋅ Arissa Wongpanich ⋅ Christof Angermueller ⋅ Hyojun Kim ⋅ Wenqi Cao ⋅ Aleksey Orekhov ⋅ Amit Sabne ⋅ Emma Sevastian ⋅ Mehrdad Khani ⋅ Karthik Murthy ⋅ Berkin Ilbeyi ⋅ Subhankar Shah ⋅ Ryan Lefever ⋅ Arjun Khare ⋅ Ankit Sinha ⋅ Peter Ma ⋅ Matt Bierbaum ⋅ Jeremiah Wilke ⋅ Emily Donahue ⋅ Sami Abu-El-Haija ⋅ Nikhil Sarda ⋅ Vineetha Govindaraj ⋅ Shobha Vasudevan ⋅ Kirill Gugaev ⋅ Idan Nachman ⋅ Jie Sun ⋅ Jose Baiocchi Paredes ⋅ Samrat Ghosh ⋅ Domagoj Babic ⋅ Zongwei Zhou ⋅ Naveen Kumar ⋅ Phitchaya Phothilimthana

Compilers play a fundamental role in achieving peak performance for machine learning (ML) workloads. However, given the diverse nature of workloads and accelerators, compilers’ heuristics and analytical cost models can result in sub-optimal performance, and thus waste precious datacenter resources. Furthermore, the multitude of tunable parameters and their complex interplay often make it impossible for human experts to manually find optimal configurations. In this paper, we present CATWILD, a system that automatically optimizes ML jobs in Google’s TPU fleet using compiler autotuning techniques. We describe CATWILD’s design and implementation, and evaluate its performance using a handful of representative metrics. We further report experiences and lessons learned from its five-year development and operation. To the best of our knowledge, CATWILD represents the first ML compiler autotuning solution deployed in datacenters at scale. Its successful rollout yielded substantial benefits, generating tuned configurations for a large portion of Google’s TPU training workloads and achieving significant chip savings.
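For readers unfamiliar with compiler autotuning, the basic loop can be sketched as follows: enumerate candidate configurations, measure each one, and keep a candidate only if it beats the compiler's default. The search space, the flag names, and the toy cost function below are all hypothetical; the abstract does not describe CATWILD's actual search strategy or parameters, and a production system would measure real runs rather than evaluate a formula.

```python
import itertools


def autotune(search_space, measure, baseline):
    """Exhaustively try every configuration in a small search space and
    return the cheapest one, falling back to the baseline if nothing wins."""
    best_config, best_cost = dict(baseline), measure(baseline)
    names = list(search_space)
    for values in itertools.product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        cost = measure(config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def toy_cost(config):
    """Stand-in for a measured step time; purely illustrative."""
    tile_penalty = abs(config["tile_size"] - 128) / 64
    fusion_penalty = 0.0 if config["enable_fusion"] else 0.5
    return 1.0 + tile_penalty + fusion_penalty


if __name__ == "__main__":
    space = {"tile_size": [64, 128, 256], "enable_fusion": [False, True]}
    default = {"tile_size": 64, "enable_fusion": False}
    print(autotune(space, toy_cost, default))
```

Exhaustive search only works for tiny spaces like this one; with many interacting parameters, as the abstract notes, the space is too large to enumerate and a guided search is needed.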