Session
Industry-Track Oral Presentation: I4: Compilers/HW
Grand Ballroom 1
CATWILD: Compiler Autotuning for TPU Workloads in the Wild
Ignacio Cano ⋅ Yu Wang ⋅ Phitchaya Phothilimthana ⋅ Mike Burrows ⋅ Matheus Camargo ⋅ Alexander Wertheim ⋅ Chao Wang ⋅ David Liu ⋅ Tengyu Sun ⋅ Arissa Wongpanich ⋅ Christof Angermueller ⋅ Vineetha Govindaraj ⋅ Amit Sabne ⋅ Berkin Ilbeyi ⋅ Ryan Lefever ⋅ Mehrdad Khani ⋅ Subhankar Shah ⋅ Ankit Sinha ⋅ Nikhil Sarda ⋅ Emily Donahue ⋅ Sami Abu-El-Haija ⋅ Naveen Kumar
Compilers play a fundamental role in achieving peak performance for machine learning (ML) workloads. However, given the diverse nature of workloads and accelerators, compilers' heuristics and analytical cost models often result in sub-optimal performance, wasting precious datacenter resources. Furthermore, the multitude of tunable parameters and their complex interplay often make it impossible for human experts to manually find optimal configurations. In this paper, we present CATWILD, a system that automatically optimizes ML jobs in Google's TPU fleet using compiler autotuning techniques. We describe CATWILD's design and implementation, and evaluate its performance using a handful of representative metrics. We further report experiences and lessons learned from its five-year development and operation. To the best of our knowledge, CATWILD represents the first ML compiler autotuning solution deployed in datacenters at scale. Its successful rollout yielded substantial benefits, optimizing over 70% of daily TPU training jobs and achieving significant chip savings.
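The abstract does not spell out CATWILD's search procedure, so as a purely hypothetical illustration of what compiler autotuning involves, the sketch below runs a random search over a small space of compiler flag settings and keeps the best-measured configuration. The flag names and the compile_and_measure stub are invented for illustration and are not CATWILD's interface or search strategy.

```python
import random

# Hypothetical compiler-autotuning sketch: randomly sample flag settings,
# measure each compiled configuration, and keep the fastest one.
SEARCH_SPACE = {
    "tile_size": [64, 128, 256],
    "unroll_factor": [1, 2, 4, 8],
    "fusion": [True, False],
}

def compile_and_measure(config):
    """Stand-in for compiling a workload with `config` and timing one run."""
    # A real autotuner would invoke the compiler and execute the job here.
    return random.uniform(0.5, 2.0)

def random_search(trials=20):
    best_config, best_time = None, float("inf")
    for _ in range(trials):
        config = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        runtime = compile_and_measure(config)
        if runtime < best_time:
            best_config, best_time = config, runtime
    return best_config, best_time

if __name__ == "__main__":
    print(random_search())
```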
Dataflow Is All You Need
Darshan Gandhi ⋅ Pushkar Nandkar ⋅ David Koeplinger ⋅ Romy Tsoupidi ⋅ Tuowen Zhao ⋅ Reid Goodbar ⋅ Leon Zhang ⋅ John Long ⋅ Han Wang ⋅ Yun Du ⋅ Håkan Zeffer ⋅ Raghu Prabhakar
The autoregressive decode phase of token generation is often the performance bottleneck in modern AI workflows, driven by powerful open-source models with large context windows coupled with techniques like chain-of-thought reasoning. Decoding is memory-bandwidth bound: the speed of token generation is limited by the memory bandwidth used to read weights and KV-cache values. However, GPUs can use as little as 21% of the available bandwidth on weights and KV caches. Asynchronous execution is difficult on GPUs, leading to CPU scheduling overheads, kernel synchronization overheads, and inadequate compute-communication overlap. While prior work attempts to address these overheads with kernel fusion and asynchronous execution on GPUs, it mostly focuses on a single GPU and does not generalize across model architectures. We argue that to truly mitigate these overheads, Dataflow Is All You Need. Dataflow architectures execute subgraphs of operations asynchronously on one or more chips, naturally mitigating the overheads faced on GPUs. In this paper, we chronicle a co-design approach to achieving peak decoding performance on a dataflow architecture, the SambaNova SN40 Reconfigurable Dataflow Unit (RDU). We describe three key optimizations enabled by dataflow: KernelLooping, BatchStreaming, and ScheduleOffloading, which generalize across models that are small, large, dense, MoE, or hybrid, and that use different attention mechanisms. Collectively, these optimizations deliver more than 75% of the theoretical peak roofline performance for a wide range of popular open-source models. We study speculative decoding in detail and demonstrate a speed-up of more than 6×. Finally, we show that speculative decoding runs 1.7× faster on 16 SN40 RDUs than on a DGX H100 despite comparable HBM bandwidth. The techniques described in this paper and the models used in the evaluation are deployed in a production AI inference cloud at cloud.sambanova.ai.
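Speculative decoding itself is a general technique, so as a hardware-agnostic illustration (not SambaNova's implementation), the sketch below shows its greedy form: a cheap draft model proposes k tokens and the target model verifies them, accepting the longest agreeing prefix. The callables draft_next and target_next are hypothetical stand-ins for "return the next greedy token for this sequence", and a real system would verify all k positions in one batched target pass rather than one call per position.

```python
# Minimal greedy speculative decoding sketch (illustrative only).
def speculative_decode(prompt, draft_next, target_next, k=4, max_new=32):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # Target model verifies the proposal; accept the matching prefix.
        accepted, correction = [], None
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            if expected == proposal[i]:
                accepted.append(expected)
            else:
                correction = expected  # target's own token at first mismatch
                break
        tokens.extend(accepted)
        # Take the target's correction, or a bonus token when all k match.
        tokens.append(correction if correction is not None
                      else target_next(tokens))
    return tokens

# Toy usage: the draft guesses the next integer; the target sometimes disagrees.
draft = lambda seq: seq[-1] + 1
target = lambda seq: seq[-1] + 1 if len(seq) % 5 else seq[-1] + 2
print(speculative_decode([0], draft, target, k=4, max_new=10))
```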
To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. This means efficient support for: a) interactive use (i.e. batch size 1), b) high-resolution VLM inference, c) dense and mixture-of-experts (MoE) LLMs, and d) adapting to system conditions (CPU thread count, CPU-GPU interconnect bandwidth, and VRAM budget) and inference conditions (phase of execution and context size). While recent CPU-GPU hybrid scheduling techniques show promise, to the best of our knowledge, no single product handles all of the above. In this paper, we address this problem with pipelined sharding, a novel, benchmark profile-guided CPU-GPU hybrid scheduling technique that achieves efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at layer or sub-layer levels, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS), while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products: the In-Game Inferencing (IGI) software development kit (SDK) and the Cosmos-Reason-1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: TTFT improves by up to 6.7× and TPS by up to 30× for LLMs, and CR1 inference's VRAM demand drops by 10×, compared to their respective aggressive baselines.
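Copy-compute pipelining is a general idea, so as a hedged illustration (not the paper's pipelined-sharding implementation), the PyTorch sketch below overlaps the host-to-device copy of the next layer's weights with the current layer's compute using a separate CUDA stream. The function name run_layers_pipelined and its assumptions (CPU-resident layers, pinned memory for truly asynchronous copies) are invented for this example.

```python
import torch

def run_layers_pipelined(x, cpu_layers):
    """Illustrative copy-compute overlap: while layer i computes on the GPU,
    layer i+1's weights are copied host-to-device on a separate CUDA stream.
    For the copies to be truly asynchronous, the CPU-side weights should sit
    in pinned (page-locked) memory."""
    assert torch.cuda.is_available()
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    with torch.cuda.stream(copy_stream):
        next_layer = cpu_layers[0].to("cuda", non_blocking=True)  # prefetch

    for i in range(len(cpu_layers)):
        compute_stream.wait_stream(copy_stream)  # weights must have arrived
        layer = next_layer
        if i + 1 < len(cpu_layers):
            with torch.cuda.stream(copy_stream):
                next_layer = cpu_layers[i + 1].to("cuda", non_blocking=True)
        x = layer(x)  # compute overlaps the copy of the next layer
    return x

# Example: a few CPU-resident linear layers streamed through the GPU.
layers = [torch.nn.Linear(1024, 1024) for _ in range(4)]
out = run_layers_pipelined(torch.randn(8, 1024, device="cuda"), layers)
```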
ExecuTorch - A Unified PyTorch Solution to Run ML Models On-Device
Chen Lai ⋅ Cemal Bilgin ⋅ Gregory Comer ⋅ Lucy Qiu ⋅ Mengwei Liu ⋅ Songhao Jia ⋅ Digant Desai ⋅ Hansong Zhang ⋅ Manuel Candales ⋅ Scott Roy ⋅ Sicheng Jia ⋅ Mergen Nachin ⋅ Yanan Cao ⋅ Shunting Zhang ⋅ Angela Yi ⋅ Zhenrui Zhang ⋅ Andrew Or ⋅ Supriya Rao ⋅ Soumith Chintala
Local execution of AI on edge devices is critical for privacy, low latency, and offline operation. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete implementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from completely embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution "backends". These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production.
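For context, a minimal sketch of the ExecuTorch export path as documented upstream: capture the model with torch.export, lower it to the Edge dialect, and serialize a .pte program for the on-device runtime. Exact module paths and API names may vary across ExecuTorch releases.

```python
import torch
from executorch.exir import to_edge  # ExecuTorch export API (as documented)

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) + 1.0

model = TinyModel().eval()
example_inputs = (torch.randn(1, 8),)

# 1. Capture the model with the PyTorch 2 export API.
exported = torch.export.export(model, example_inputs)
# 2. Lower to the Edge dialect, where backends and quantization plug in.
edge_program = to_edge(exported)
# 3. Serialize an ExecuTorch program for the on-device runtime.
et_program = edge_program.to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```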
Wave: A Symbolic Python DSL and Compiler for High Performance Machine Learning
Harsh Menon ⋅ Gaurav Verma ⋅ Martin P. Lücke ⋅ Nithin Meganathan ⋅ Sanket Pandit ⋅ William Gallard Hatch ⋅ Sahil Faizal
Modern ML models demand ever-greater compute, prompting hardware vendors to add specialized matrix cores to their GPUs. While these units unlock high throughput, they impose intricate programming models and addressing schemes that are difficult to manage by hand. This paper introduces Wave, a Python-embedded DSL for kernel authoring that automates these complex address computations and lets authors focus on core computation. In experiments, it matches or surpasses the performance of state-of-the-art kernel DSLs and libraries.