

Session

Poster Session 3

Evergreen Ballroom
Thu 21 May 6 p.m. PDT — 8 p.m. PDT


Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Yilong Zhao ⋅ Jiaming Tang ⋅ Kan Zhu ⋅ Zihao Ye ⋅ Chi-Chih Chang ⋅ Chaofan Lin ⋅ Jongseok Park ⋅ Guangxuan Xiao ⋅ Mohamed Abdelfattah ⋅ Mingyu Gao ⋅ Baris Kasikci ⋅ Song Han ⋅ Ion Stoica

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SpecGen, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SpecGen features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens by elegantly reusing information from the verification stage. Furthermore, SpecGen co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SpecGen outperforms state-of-the-art solutions, with an up to $2.13\times$ throughput speedup.


AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

Genghan Zhang ⋅ Shaowei Zhu ⋅ Allen Nie ⋅ Zhen Jia ⋅ Nandita Vijaykumar ⋅ Yida Wang ⋅ Kunle Olukotun

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\%$ to $61\%$ on Trainium 1 and from $45\%$ to $59\%$ on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\times$ cheaper.


ADS: An Agentic Detection System for Enterprise Agentic AI Security

Chenning Li ⋅ Pan Hu ⋅ Justin Xu ⋅ Baris Ozbas ⋅ Olivia Liu ⋅ Caroline Van ⋅ Wei Zhou ⋅ Mohammad Alizadeh ⋅ Pengyu Zhang

We present ADR (Agentic AI Detection and Response), the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability, as existing telemetry fails to capture reasoning and tool-execution chains; (2) insufficient robustness, given vast, dynamic enterprise contexts and extreme class imbalance; and (3) high detection costs, as LLM-based inference is computationally expensive. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for continuous red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. On ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), ADR achieves zero false positives while detecting 67% of attacks—outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2–4×. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks. Over ten months of telemetry, ADR sustained reliable detection in production, uncovering credential exposures and enabling a shift-left prevention layer with 97.2% precision. ADR’s source code and benchmark will be publicly available.
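The two-tier detection structure described above (fast triage for clear cases, expensive context-aware reasoning only for the gray zone) can be sketched as follows; the scorers and thresholds are invented stand-ins, not ADR's components:

```python
# Hypothetical two-tier online detector: a cheap triage scorer decides the
# confident cases, and only ambiguous events are escalated to the expensive
# "deep" (e.g., LLM-based) analysis, keeping detection cost low.

def two_tier_detect(events, triage, deep, low=0.2, high=0.8):
    """Returns (verdicts, number of expensive deep-analysis calls)."""
    verdicts, deep_calls = [], 0
    for e in events:
        s = triage(e)                    # fast, cheap score in [0, 1]
        if s >= high:
            verdicts.append(True)        # confidently malicious
        elif s <= low:
            verdicts.append(False)       # confidently benign
        else:
            deep_calls += 1              # escalate the gray zone only
            verdicts.append(deep(e))
    return verdicts, deep_calls
```

With well-calibrated thresholds, the expensive tier runs on only a small fraction of traffic while the final verdicts match what the expensive analyzer alone would have produced on the ambiguous cases.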


Agentic Operator Generation for ML ASICs

Alec Hammond ⋅ Aram Markosyan ⋅ Aman Dontula ⋅ Zacharias Fisches ⋅ Dmitrii Pedchenko ⋅ Keyur Muzumdar ⋅ Mark Saroufim ⋅ Joe Isaacson ⋅ Warren Hunt ⋅ Gabriel Synnaeve ⋅ Jacob Kahn

We present TritorX, an agentic AI system designed to generate functionally correct Triton PyTorch ATen kernels at scale for emerging accelerator platforms. TritorX integrates open-source large language models with a custom linter, JIT compilation, and a PyTorch OpInfo-based test harness. This pipeline operates both on deployed Meta Training and Inference Accelerator (MTIA) silicon and in hardware simulation environments for next-generation devices. In contrast to previous kernel-generation approaches that prioritize performance for a limited set of high-usage kernels, TritorX prioritizes coverage. Our system emphasizes correctness and generality across the entire operator set, including diverse data types, shapes, and argument patterns. In our experiments, TritorX successfully generated kernels and wrappers for 481 unique ATen operators that pass all corresponding PyTorch OpInfo tests (over 20,000 in total). TritorX paves the way for overnight generation of complete PyTorch ATen backends for new accelerator platforms.


AIRS: Scaling Live Inference in Resource-Constrained Environments

Nilesh Jagnik ⋅ Harshvardhan GM

Advancements in large language models (LLMs) have made them increasingly useful for complex reasoning tasks that previously required domain experts. One such task is quality evaluation of query responses produced by a search engine. Evaluation generates metrics necessary to study the quality, impact, and usefulness of product changes and features. Typically, to compute evaluation metrics, human experts are asked to rate various attributes of search responses. This process is generally quite expensive and requires several days to complete. As an alternative, LLMs are now being used to perform rating tasks with lower cost and latency. In addition, many new metrics are being developed to evaluate Google's new AI-based offerings, which require ratings too. As a result, demand for LLM rating prediction tasks far exceeds the allocated TPU (Tensor Processing Unit) budget, since a large portion of the company's TPU resources is reserved for serving live user traffic. In this paper, we present the AI Rater Service (AIRS), an inference pipeline that employs several software engineering techniques to generate AI ratings with high reliability and low latency. AIRS maximizes LLM inference throughput by optimizing TPU resource utilization across various evaluation workflows, while minimizing latency for higher-priority tasks.
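Priority-aware use of a fixed accelerator budget, as the abstract describes, can be sketched with a simple heap-based scheduler; the `(priority, name, cost)` task shape and per-tick slot budget are assumptions for illustration, not AIRS's design:

```python
import heapq

# Toy priority scheduler: each tick has a fixed budget of TPU "slots"; the
# most urgent tasks (lowest priority number) are served first, so
# low-priority backfill never delays high-priority rating work.

def schedule(tasks, slots_per_tick):
    """tasks: iterable of (priority, name, cost_slots). Returns names in
    completion order."""
    heap = [(p, i, name, cost) for i, (p, name, cost) in enumerate(tasks)]
    heapq.heapify(heap)
    finished = []
    while heap:
        budget = slots_per_tick
        carry = []
        while heap and budget > 0:
            p, i, name, cost = heapq.heappop(heap)
            if cost <= budget:
                budget -= cost
                finished.append(name)
            else:
                # Partial progress: consume the rest of this tick's budget.
                carry.append((p, i, name, cost - budget))
                budget = 0
        for item in carry:
            heapq.heappush(heap, item)
    return finished
```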


ApproxMLIR: Accuracy-Aware Compiler for Compound ML Systems

Hao Ren ⋅ Yi Mu ⋅ Sasa Misailovic

Many compound AI systems are inherently “approximate” because the ML components (e.g., a large language model) are probabilistic and the non-ML components (e.g., retrieval-augmented generation) are heuristic. Such systems benefit from trading off result quality for improved performance. While extensive work exists on approximating ML and non-ML components individually, the wide deployment of LLMs in compound systems presents significant opportunities for end-to-end, accuracy-aware compilation. However, tailoring approximations across these different components is challenging to implement, because the components rely on different software stacks for compilation and execution and are deployed on different hardware. To address these issues, we present ApproxMLIR, a reusable accuracy-aware compilation toolchain. ApproxMLIR introduces the approx MLIR dialect, which serves as a unified and centralized interface for defining approximations, and approx-opt, a reusable MLIR-based optimizer that applies approximate transformations to both ML and non-ML components. We evaluate ApproxMLIR on three compound AI systems that combine LLMs with information retrieval and tool calling. The evaluation shows that ApproxMLIR can effectively represent many common approximation choices, discover profitable points in the accuracy-performance space, and consistently achieve higher speedups than static approximation strategies.


Attribution-based Sparse Activation in Large Language Models

Jifeng Song ⋅ Xiangyu Yin ⋅ Boyuan Yang ⋅ Kai Huang ⋅ Weichen Liu ⋅ Wei Gao

LLM inference is computationally expensive due to the large parameter sizes of LLMs. Existing techniques reduce the computing cost via model retraining, but cannot adapt well to different downstream tasks or variant input data at runtime. To avoid such retraining efforts for runtime adaptability, a better option is \emph{sparse activation}, which selectively deactivates an input-dependent set of neurons in inference; however, current methods of \emph{lossless} sparse activation only deactivate neurons with zero output magnitudes, and are ineffective on recent LLMs with higher parameter efficiency. In this paper, we present attribution-based sparse activation, a \emph{lossy} sparse activation technique that deactivates neurons with low attribution scores and aims to achieve the best tradeoff between model accuracy and computing costs. To ensure optimal sparse activation, we quantify the large errors of existing attribution metrics when used for sparse activation, which stem from the interdependency among attribution scores of different neurons, and propose a new attribution metric that provably corrects such errors. Experiments show that our technique can achieve 70\% model sparsity in difficult generative tasks such as question answering and text summarization with <5\% model accuracy loss. Such high model sparsity enables us to reduce the computing latency and memory use of LLM inference by 35\% and 40\%, respectively.
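The basic mechanism (score each neuron's attribution, deactivate the low scorers) can be illustrated on a tiny two-layer MLP. The attribution proxy below (activation magnitude times outgoing weight mass) is a deliberately simple stand-in, not the paper's corrected metric:

```python
# Toy attribution-based sparse activation on a 2-layer ReLU MLP: rank hidden
# neurons by a crude attribution proxy and keep only the top fraction.

def mlp(x, w1, w2, active=None):
    """Dense forward pass; if `active` is given, neurons outside it output 0."""
    hidden = [max(0.0, sum(xi * w1[j][i] for i, xi in enumerate(x)))
              for j in range(len(w1))]
    if active is not None:
        hidden = [h if j in active else 0.0 for j, h in enumerate(hidden)]
    return [sum(h * w2[k][j] for j, h in enumerate(hidden))
            for k in range(len(w2))]

def sparse_activate(x, w1, w2, keep_ratio=0.5):
    hidden = [max(0.0, sum(xi * w1[j][i] for i, xi in enumerate(x)))
              for j in range(len(w1))]
    # Attribution proxy: |activation| times total outgoing weight magnitude.
    scores = [abs(h) * sum(abs(w2[k][j]) for k in range(len(w2)))
              for j, h in enumerate(hidden)]
    n_keep = max(1, int(len(hidden) * keep_ratio))
    active = set(sorted(range(len(hidden)), key=lambda j: -scores[j])[:n_keep])
    return mlp(x, w1, w2, active=active), active
```

When a few neurons dominate the output, deactivating the rest changes the result only slightly, which is exactly the lossy accuracy/compute tradeoff the abstract targets.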


AXLearn: Modular, Hardware-Agnostic Large Model Training

Mark Lee ⋅ Tom Gunter ⋅ Chang Lan ⋅ Hanzhi Zhou ⋅ Sneha Bangalore ⋅ Xianzhi Du ⋅ Philipp Dufter ⋅ Ruixuan Hou ⋅ Haoshuo Huang ⋅ Xiang Kong ⋅ Jinhao Lei ⋅ Tao Lei ⋅ Meng Li ⋅ Li Li ⋅ Jiarui Lu ⋅ Zhiyun Lu ⋅ Zhucheng Tu ⋅ Chong Wang ⋅ Jianyu Wang ⋅ Zirui Wang ⋅ Sam Wiseman ⋅ Guoli Yin ⋅ Xiyou Zhou ⋅ Danyang Zhuo ⋅ Ruoming Pang

AXLearn is a production system which facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for hardware-agnostic training. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on different hardware infrastructure. AXLearn maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in state-of-the-art training systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to hundreds as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn at Apple.


BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng ⋅ Xin Ji ⋅ Taosong Fang ⋅ Fanghao Zhou ⋅ Chuanjie Liu

Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, where the key performance indicator is throughput. These tasks usually exhibit prefix sharing: different prompt inputs can partially share a common prefix. However, existing LLM inference engines are tuned for streaming requests and have limited support for large batched tasks with this prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests, but KV context that is about to be reused may be prematurely evicted by the implicit cache management. Moreover, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address these problems. BatchLLM explicitly identifies common prefixes globally, and requests sharing the same prefix are scheduled together to maximize reuse of the KV context. BatchLLM reorders the requests and schedules those with a larger ratio of decoding first to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments.
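Explicit global prefix identification, as opposed to implicit LRU caching, can be illustrated with a toy grouper: sort the batch so shared prefixes become adjacent, merge neighbours, and count how much prefill work the shared KV context saves. The greedy policy and character-level "tokens" are simplifications, not BatchLLM's algorithm:

```python
import os

# Toy global prefix grouping: requests that share a sufficiently long common
# prefix are placed in one group, so the shared-prefix KV cache is computed
# once per group instead of once per request.

def group_by_prefix(prompts, min_len=8):
    """Greedy grouping over sorted prompts; merge a prompt into the previous
    group when the common prefix is at least `min_len` characters."""
    groups = []
    for p in sorted(prompts):
        if groups:
            shared = os.path.commonprefix([groups[-1]["prefix"], p])
            if len(shared) >= min_len:
                groups[-1]["prefix"] = shared
                groups[-1]["members"].append(p)
                continue
        groups.append({"prefix": p, "members": [p]})
    return groups

def prefill_tokens_saved(groups):
    # Each extra member of a group reuses the group's shared-prefix KV cache.
    return sum(len(g["prefix"]) * (len(g["members"]) - 1) for g in groups)
```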


Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

Tiyasa Mitra ⋅ Ritika Borkar ⋅ Nidhi Bhatia ⋅ Shivam Raj ⋅ Hongkuan Zhou ⋅ Yan Ru Pei ⋅ Kyle ⋅ Ramon Matas ⋅ Dheevatsa Mudigere ⋅ Ritchie Zhao ⋅ Maximilian Golub ⋅ Arpan Dutta ⋅ Sailaja Madduri ⋅ Dharmesh Jani ⋅ Brian Pharris ⋅ Itay Neeman ⋅ Bita Darvish Rouhani

As inference scales to multi-node deployments, prefill-decode disaggregation — splitting inference into distinct phases — offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, large-scale deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. These insights, in conjunction with the deployment flexibility offered by NVIDIA Dynamo, provide a foundation to navigate the trade-off between system throughput and interactivity in efficient disaggregated deployments.


BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Jiayi Yuan ⋅ Cameron Shinn ⋅ Jingze Cui ⋅ George Klimiashvili ⋅ Perkz Zheng ⋅ Bo Li ⋅ Zhou Yuxin ⋅ Zhouhai Ye ⋅ Weijie You ⋅ Richard Cai ⋅ Julien Demouth ⋅ John D. Owens ⋅ Xia Hu ⋅ Timmy Liu ⋅ Huizi Mao

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between the optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62$\times$ speedup for prefill at 74.7\% sparsity and a 1.40$\times$ speedup for decode at 73.2\% sparsity on modern GPUs. Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.
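The core trick, skipping a Key/Value block whose scores fall far enough below the running softmax maximum that their exponentials are negligible, can be sketched in a few lines of scalar Python; the block layout and threshold value here are illustrative, not BLASST's kernel:

```python
import math

# Toy blocked online softmax with thresholding: a block whose maximum score
# is more than `tau` below the running maximum contributes at most
# exp(-tau) per element, so we skip its exp, Value load, and accumulation.

def blocked_attention(q_scores, v_blocks, tau=10.0):
    """q_scores: per-block lists of attention scores; v_blocks: matching
    scalar values. Returns (softmax-weighted average, blocks skipped)."""
    running_max, denom, numer, skipped = -math.inf, 0.0, 0.0, 0
    for scores, values in zip(q_scores, v_blocks):
        block_max = max(scores)
        if block_max < running_max - tau:
            skipped += 1                 # negligible block: prune it
            continue
        new_max = max(running_max, block_max)
        scale = math.exp(running_max - new_max) if denom else 0.0
        denom *= scale                   # standard online-softmax rescale
        numer *= scale
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            denom += w
            numer += w * v
        running_max = new_max
    return numer / denom, skipped
```

Because a skipped block's true weights are bounded by exp(-tau), the pruned result stays numerically close to the exact one.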


CATWILD: Compiler Autotuning for TPU workloads in the Wild

Ignacio Cano ⋅ Yu Wang ⋅ Phitchaya Phothilimthana ⋅ Mike Burrows ⋅ Matheus Camargo ⋅ Alexander Wertheim ⋅ Chao Wang ⋅ David Liu ⋅ Tengyu Sun ⋅ Arissa Wongpanich ⋅ Christof Angermueller ⋅ Vineetha Govindaraj ⋅ Amit Sabne ⋅ Berkin Ilbeyi ⋅ Ryan Lefever ⋅ Mehrdad Khani ⋅ Subhankar Shah ⋅ Ankit Sinha ⋅ Nikhil Sarda ⋅ Emily Donahue ⋅ Sami Abu-El-Haija ⋅ Naveen Kumar

Compilers play a fundamental role in achieving peak performance for machine learning (ML) workloads. However, given the diverse nature of workloads and accelerators, compilers' heuristics and analytical cost models often result in sub-optimal performance, and thus waste precious datacenter resources. Furthermore, the multitude of tunable parameters and their complex interplay often make it impossible for human experts to manually find optimal configurations. In this paper, we present CATWILD, a system that automatically optimizes ML jobs in Google's TPU fleet using compiler autotuning techniques. We describe CATWILD’s design and implementation, and evaluate its performance using a handful of representative metrics. We further report experiences and lessons learned from its five-year development and operation. To the best of our knowledge, CATWILD represents the first ML compiler autotuning solution deployed in datacenters at scale. Its successful rollout yielded substantial benefits, optimizing over 70% of daily TPU training jobs and achieving significant chip savings.


CDLM: Consistency Diffusion Language Models for Faster Sampling

Minseo Kim ⋅ Chenfeng Xu ⋅ Coleman Hooper ⋅ Harman Singh ⋅ Ben Athiwaratkun ⋅ Ce Zhang ⋅ Kurt Keutzer ⋅ Amir Gholami

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and an inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6×-12.8× lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://anonymous.4open.science/r/ConsistencyDLManonymous-3E88/.
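The block-wise causal attention mask that makes KV caching possible, described above, is easy to visualize: tokens attend bidirectionally within their own block and causally to all earlier blocks, so earlier blocks' KV entries never change. A minimal sketch (the boolean-matrix representation is ours, not CDLM's implementation):

```python
# Block-wise causal attention mask: entry [i][j] is True iff query token i
# may attend to key token j, i.e. j's block is no later than i's block.
# Within a block attention is bidirectional; across blocks it is causal.

def block_causal_mask(seq_len, block_size):
    def block(i):
        return i // block_size
    return [[block(j) <= block(i) for j in range(seq_len)]
            for i in range(seq_len)]
```

Because no token ever attends to a later block, a block's keys and values are final once the block is generated, which is what lets a standard KV cache be reused across refinement steps.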


Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Mengtian Yang ⋅ Zhekun Zhang ⋅ Mingheng Wu ⋅ Jianwen Yan ⋅ Hanshi Sun ⋅ Li-Wen Chang

Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating “what-if” hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with over 10,000 GPUs. In a practical inference deployment case, Charon discovered a configuration that improved system throughput by 275% over a manually-tuned baseline, demonstrating its significant real-world value.

Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Zephyr, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Zephyr accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Zephyr significantly outperforms the existing upgrade scheduler by improving upgrade window utilization by 1.25x, increasing the number of scheduled and completed upgrades by 33% and 41%, and reducing cancellation rates by 2.4x. The code and data sets will be released after paper acceptance.


Dataflow Is All You Need

Darshan Gandhi ⋅ Pushkar Nandkar ⋅ David Koeplinger ⋅ Romy Tsoupidi ⋅ Tuowen Zhao ⋅ Reid Goodbar ⋅ Leon Zhang ⋅ John Long ⋅ Han Wang ⋅ Yun Du ⋅ Håkan Zeffer ⋅ Raghu Prabhakar

The autoregressive decode phase of token generation is often the performance bottleneck in modern AI workflows, thanks to powerful open-source models with large context windows coupled with techniques like chain-of-thought reasoning. Decoding is memory bandwidth bound: the speed of token generation is limited by the memory bandwidth utilized to read weights and KV cache values. However, GPUs use as little as 21\% of the available bandwidth on weights and KV caches. Asynchronous execution is hard on GPUs, leading to CPU scheduling overheads, kernel synchronization overheads, and inadequate compute-communication overlap. While prior work attempts to address these overheads with kernel fusion and asynchronous execution on GPUs, it mostly focuses on a single GPU and does not generalize across different types of model architectures. We argue that to truly mitigate these overheads, \emph{Dataflow Is All You Need}. Dataflow architectures execute subgraphs of operations asynchronously on one or more chips, thereby naturally mitigating the overheads faced on GPUs. In this paper, we chronicle a co-design approach to achieve peak decoding performance on a dataflow architecture -- the SambaNova SN40 Reconfigurable Dataflow Unit (RDU). We describe three key optimizations enabled by dataflow -- \emph{\textbf{KernelLooping}}, \emph{\textbf{BatchStreaming}}, and \emph{\textbf{ScheduleOffloading}} -- that generalize over models that are small, large, dense, MoEs, hybrids, and with different attention mechanisms. Collectively, these optimizations deliver more than \textbf{75\%} of the theoretical peak roofline performance for a wide range of popular open-source models. We study speculative decoding in detail and demonstrate a speed-up of more than \textbf{6$\times$} with speculative decoding. Finally, we also show that speculative decoding runs \textbf{1.7$\times$} faster on 16 SN40 RDUs than on a DGX H100 despite comparable HBM bandwidth.
The techniques described in this paper and the models used in the evaluation are deployed in a production AI inference cloud at cloud.sambanova.ai.

Production LLM deployments lack systematic methods to assess output consistency risks when infrastructure changes. We present DriftBench, a measurement and prediction framework comprising 236,985 prompt-response pairs across 105 configurations spanning 5 models, 4 GPU platforms, 3 frameworks, 3 precisions. We develop the Portability Risk Index (PRI), achieving $R^2$=0.987 on held-out test data ($R^2$ ranges from 0 to 1, with higher values indicating better predictive accuracy) with held-out-dimension generalization: hardware $R^2$=0.909, precision $R^2$=0.763. We discover a fundamental dichotomy: hardware/precision changes exhibit systematic drift ($R^2 \geq 0.76$) enabling predict-once deployment, while framework/model changes show idiosyncratic drift ($R^2 < 0.48$) requiring re-measurement. Production validation blocked a +9.23pp drift upgrade affecting 1 in 5 queries, demonstrating operational value. Our contribution is measurement and risk assessment; we do not propose drift mitigation techniques, as this remains an open challenge for future work. Verification: https://anonymous.4open.science/r/reviewer-verification-5F4E/ | DriftBench CLI: https://anonymous.4open.science/r/driftbench-7FEC/

Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a \emph{distributed decision problem} between orbit and ground. EarthSight introduces three core innovations: (1) \emph{multi-task inference} on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a \emph{ground-station query scheduler} that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) \emph{dynamic filter ordering}, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. Evaluations using a prior established satellite simulator show that EarthSight reduces average compute time per image by 1.9$\times$ and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.
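The dynamic filter ordering EarthSight describes (reject low-value images as cheaply as possible) follows a classic heuristic: run filters in increasing order of cost per expected rejection. The sketch below uses that standard ranking rule with invented filter tuples; it is an illustration, not EarthSight's scheduler:

```python
# Toy cost/selectivity-aware filter ordering: each filter is
# (name, cost, selectivity), where selectivity is the probability that the
# filter REJECTS an image. Cheap, highly selective filters should run first
# so expensive models only see the survivors.

def order_filters(filters):
    # Rank by cost per expected rejection (lower is better to run earlier).
    return sorted(filters, key=lambda f: f[1] / f[2])

def expected_cost(ordered):
    """Expected per-image cost of running the pipeline in the given order,
    assuming independent filter decisions."""
    total, survive = 0.0, 1.0
    for _, cost, sel in ordered:
        total += survive * cost        # pay this filter only for survivors
        survive *= (1.0 - sel)
    return total
```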


Efficient, VRAM-Constrained xLM Inference on Clients

Aditya Ukarande ⋅ Deep Shekhar ⋅ Ram Rangan

To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. This means efficient support for: a) interactive use (i.e. batch size 1), b) high resolution VLM inference, c) dense and mixture-of-experts (MoE) LLMs, and d) adapting to system conditions (CPU thread count, CPU-GPU interconnect bandwidth, and VRAM budget) and inference conditions (phase of execution and context size). While recent CPU-GPU hybrid scheduling techniques show promise, to the best of our knowledge, no single product handles all of the above. In this paper, we address this problem with pipelined sharding, a novel, benchmark profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at layer or sub-layer levels, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing (IGI) software development kit (SDK) and the Cosmos-Reason-1 (CR1) physical AI reasoning VLM.
Highlights from our rigorous evaluation spanning multiple models and client systems include: time-to-first-token (TTFT) improves by up to 6.7× and tokens per second by up to 30× for LLMs, and CR1 inference’s VRAM demand is down by 10×, compared to their respective aggressive baselines.


Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernels

Hongyi Jin ⋅ Bohan Hou ⋅ Guanjie Wang ⋅ Ruihang Lai ⋅ Jinqi Chen ⋅ Zihao Ye ⋅ Yaxing Cai ⋅ Yixin Dong ⋅ Xinhao Cheng ⋅ Zhihao Zhang ⋅ Yilong Zhao ⋅ Yingyi Huang ⋅ Lijie Yang ⋅ Jinchen Jiang ⋅ Gabriele Oliaro ⋅ Xupeng Miao ⋅ Vinod Grover ⋅ Todd Mowry ⋅ Zhihao Jia ⋅ Tianqi Chen

Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.
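The dependency-encoding idea (tiled tasks that fire as soon as their producers signal, with no kernel launches in between) can be rendered as a small event-counter scheduler. The task-graph shape and names below are assumptions for illustration, not Event Tensor's representation:

```python
from collections import deque

# Toy event-driven task-graph execution: each tiled task waits on a counter
# of unfinished dependencies; completing a task "signals" its consumers, and
# a task becomes ready the moment its counter hits zero, mimicking how a
# persistent megakernel dispatches work without per-kernel launches.

def run_task_graph(tasks):
    """tasks: name -> list of dependency names. Returns one valid execution
    order respecting all dependencies."""
    waiting = {name: len(deps) for name, deps in tasks.items()}
    consumers = {}
    for name, deps in tasks.items():
        for d in deps:
            consumers.setdefault(d, []).append(name)
    ready = deque(sorted(n for n, c in waiting.items() if c == 0))
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)                  # "execute" the tile task
        for c in consumers.get(t, []):   # signal the task's event
            waiting[c] -= 1
            if waiting[c] == 0:
                ready.append(c)
    return order
```

Note that independent tasks (here, the two roots) interleave freely: the scheduler exposes exactly the inter-task parallelism that coarse per-kernel synchronization would hide.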


ExecuTorch - A Unified PyTorch Solution to Run ML Models On-Device

Chen Lai ⋅ Cemal Bilgin ⋅ Gregory Comer ⋅ Lucy Qiu ⋅ Mengwei Liu ⋅ Songhao Jia ⋅ Digant Desai ⋅ Hansong Zhang ⋅ Manuel Candales ⋅ Scott Roy ⋅ Sicheng Jia ⋅ Mergen Nachin ⋅ Yanan Cao ⋅ Shunting Zhang ⋅ Angela Yi ⋅ Zhenrui Zhang ⋅ Andrew Or ⋅ Supriya Rao ⋅ Soumith Chintala

Local execution of AI on edge devices is critical for privacy, low latency, and offline operation. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete implementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from completely embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution "backends". These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production.

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory to reduce shared memory traffic in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN and 2.4$\times$ over Triton on B200 GPUs with BF16, reaching up to 1605 TFLOPs/s (71\% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.


Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Bozhi You ⋅ Irene Wang ⋅ Abhinav Jangda ⋅ Angélica Moreira ⋅ Roshan Dathathri ⋅ Divya Mahajan ⋅ Keshav Pingali

Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch’s compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
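The "attention variant as plain code" idea above can be made concrete with an unfused reference: the variant is just ordinary attention plus a user-written per-score hook, which a compiler can then tile and fuse. The `score_mod(score, i, j)` hook below is a hypothetical stand-in (modeled on the kind of customization FlexAttention templates support); the NumPy loop is the semantics a fused kernel must reproduce, not an efficient implementation.

```python
import numpy as np

def attention_variant(q, k, v, score_mod):
    """Reference (unfused) attention with a per-score modifier.
    A compiler-native system would fuse the modifier into a single
    FlashAttention-style kernel; here it runs as plain Python."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    for i in range(n):
        for j in range(n):
            scores[i, j] = score_mod(scores[i, j], i, j)
    scores -= scores.max(axis=-1, keepdims=True)   # stable softmax
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

# Example variant: causal masking expressed as a score modifier.
causal = lambda s, i, j: s if j <= i else -np.inf
```

Data-dependent variants (e.g., masks computed from the inputs) fit the same shape, which is the generality the paragraph above claims beyond static templates.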


FlexScale: Flexible and High-Performance FSDP at Scale

Zezhou Wang ⋅ Youjie Li ⋅ Zhiqi Lin ⋅ Jiacheng Yang ⋅ Cong Xie ⋅ ZHENG ZHONG ⋅ Hongyu Zhu ⋅ Zhi Zhang ⋅ Xin Liu ⋅ Yanghua Peng

Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, valued for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods—e.g., block-wise quantized training—and with optimizers such as Shampoo and Muon used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with these block-structured computations. In addition, today’s implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce FlexScale, a redesigned FSDP framework that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. FlexScale natively supports the efficient data placement FSDP requires and accommodates non-element-wise optimizers and block-wise quantization. As a result, FlexScale achieves 5$\sim$66\% higher throughput and 16$\sim$30\% lower memory usage than existing FSDP systems, while scaling efficiently to 30K GPUs. FlexScale has been battle-tested in production and will be open-sourced to the MLSys community upon acceptance.
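The conflict between element-wise sharding and block-structured computation can be sketched as a partitioning problem: shard boundaries must land on block multiples, so shard sizes become unequal ("ragged"). The RaggedShard format itself is not public; the function name, signature, and rounding policy below are illustrative assumptions only.

```python
def ragged_shards(num_rows, block, world_size):
    """Split num_rows rows across world_size ranks so that every
    shard boundary falls on a multiple of `block` (e.g., a
    quantization block or an optimizer's preconditioner block).
    Shard sizes may differ, hence "ragged"."""
    nblocks = -(-num_rows // block)            # ceil division
    base, extra = divmod(nblocks, world_size)
    shards, start = [], 0
    for r in range(world_size):
        nb = base + (1 if r < extra else 0)    # blocks for this rank
        end = min(start + nb * block, num_rows)
        shards.append((start, end))
        start = end
    return shards
```

With element- or row-wise sharding, a block of 16 rows could be split across two ranks, forcing communication inside every block-wise update; block-aligned ragged shards avoid that entirely.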


FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost

Chenhao Feng ⋅ Haoli Zhang ⋅ Shakhzod Ali-zade ⋅ Yanli Zhao ⋅ Liang Luo ⋅ Jennifer Cao ⋅ Lisen Deng ⋅ Chenyu Zhao ⋅ Tiantu Xu ⋅ Yi Zhang ⋅ Evgenii Kolpakov ⋅ Siqi Yan ⋅ Chuanhao Zhuge ⋅ Min Ni ⋅ Bi Xue ⋅ Qunshu Zhang ⋅ Shen Li

Modern industrial Deep Learning Recommendation Models typically extract user preferences through the analysis of sequential interaction histories, subsequently generating predictions based on these derived interests. The inherent heterogeneity in data characteristics frequently results in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the straggler problem through meticulously load-balanced input samples, (2) minimize blocking communication by overlapping prioritized embedding communications with computations, and (3) resolve GPU resource competition during computation and communication overlapping by communicating through SM-Free techniques. Empirical evaluation demonstrates that FreeScale achieves up to 90.3% reduction in computational bubbles when applied to real-world workloads running on 256 H100 GPUs.
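Load-balancing variable-length samples to mitigate stragglers, as point (1) describes, can be sketched with a classic longest-processing-time greedy assignment: sort samples by cost and always give the next one to the least-loaded rank. FreeScale's actual balancing algorithm is not specified in the abstract, so this is a generic illustration, not the paper's method.

```python
import heapq

def balance(seq_lens, num_ranks):
    """Greedy longest-first assignment of variable-length samples to
    ranks so per-rank work is roughly even; uneven assignment is what
    creates straggler-induced bubbles at synchronization points."""
    heap = [(0, r) for r in range(num_ranks)]     # (load, rank)
    heapq.heapify(heap)
    assign = [[] for _ in range(num_ranks)]
    for i in sorted(range(len(seq_lens)), key=lambda i: -seq_lens[i]):
        load, r = heapq.heappop(heap)             # least-loaded rank
        assign[r].append(i)
        heapq.heappush(heap, (load + seq_lens[i], r))
    return assign
```

For lengths [8, 7, 6, 5, 4] on 2 ranks this yields loads of 17 and 13, versus up to 26 vs 4 for an unbalanced split.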


Hawkeye: Reproducing GPU-Level Non-Determinism

Dan Boneh ⋅ Ilan Komargodski ⋅ Megha Srivastava

We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations on CPUs. Using our framework, an auditor can re-execute a full model training or inference workflow executed on NVIDIA GPUs on a CPU, without any precision loss and without introducing any additional operations or slowdown on the GPU side. This is in stark contrast to prior approaches to verifiable machine learning that introduced significant computational overhead for the model provider. The main technical contribution underlying Hawkeye is a systematic algorithmic framework for the numerical treatment within NVIDIA's Tensor Cores: rounding, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication. Our framework consists of a sequence of carefully crafted tests that reduce the (otherwise exponential-size) search space of potential options for each operation. We test and evaluate our framework on a variety of GPU architectures (including Ampere and Hopper), as well as all available precision types (FP16, BF16). In all test cases, our framework recovers the exact implementation of operations underlying matrix multiplication, and therefore allows for the full reproduction of model training and inference workflows on a CPU.
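The non-associativity that makes accumulation order matter is easy to demonstrate in NumPy's float16: the same dot product, accumulated in two different orders, yields two different bit patterns. This is a toy illustration of the effect Hawkeye's tests pin down, not the paper's actual probing procedure.

```python
import numpy as np

def dot_fp16(a, b, order):
    """Accumulate the dot product of a and b in float16, visiting the
    products in the given index order. Because float16 addition is not
    associative, different orders (e.g., sequential vs. the pairwise
    trees hardware MMAs use) can give different results."""
    acc = np.float16(0.0)
    for i in order:
        acc = np.float16(acc + np.float16(a[i]) * np.float16(b[i]))
    return acc
```

With a = [1024, 0.5, 0.5] and b = [1, 1, 1], forward accumulation loses both 0.5s to round-to-nearest-even (1024 + 0.5 rounds back to 1024), while reverse accumulation sums them to 1.0 first and reaches 1025, so an auditor replaying on a CPU must know which order the GPU used.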


HipKittens: Fast and Furious AMD Kernels

William Hu ⋅ Drew Wadsworth ⋅ Sean Siddens ⋅ Daniel Fu ⋅ Muhammad Osama ⋅ Christopher Ré ⋅ Simran Arora

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers — are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, the algorithms that instantiate these abstractions must be rethought for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available baselines by $1.2-2.4\times$ ($d = 64$ attention, GQA non-causal backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels across GPU vendors.


IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

Wanli Zhong ⋅ Haibo Feng ⋅ Zirui Zhou ⋅ Hanyang Peng ⋅ Shiqi Yu

Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly $\mathrm{dequantize}\rightarrow\mathrm{softmax}\rightarrow\mathrm{requantize}$ detour, which can account for up to 65\% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present \emph{IntAttention}, the first fully integer, plug-and-play attention pipeline, requiring no retraining. At the core of our approach lies \emph{IndexSoftmax}, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. \emph{IntAttention} integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate \emph{IntAttention} and demonstrate consistent and substantial gains. Our method achieves up to \textbf{3.7×} speedup and \textbf{61\%} energy reduction over FP16 baselines and runs \textbf{2.0×} faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains are achieved with accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices.
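A 32-entry lookup-table softmax that never leaves the integer domain can be sketched as follows. The exact IndexSoftmax design (table contents, clipping range, fixed-point scales) is not given in the abstract, so everything below — the 8-bit fraction, the unit score scale, the clip-to-zero policy — is an illustrative assumption.

```python
import numpy as np

FRAC = 8                                        # 8-bit fixed-point fraction
# 32-entry table of round(e^-t * 2^FRAC) for integer t = 0..31
LUT = np.round(np.exp(-np.arange(32)) * (1 << FRAC)).astype(np.int64)

def index_softmax(x_q):
    """Integer-only softmax sketch: shift integer scores so the max is
    0, use the negated score directly as a table index, and clip
    anything beyond the table (its exponential rounds to 0 anyway).
    Returns fixed-point numerators and the integer denominator; no
    float ops, so no dequantize/requantize detour."""
    t = x_q.max() - x_q                         # t >= 0, integer
    e = np.where(t < 32, LUT[np.minimum(t, 31)], 0)
    return e, int(e.sum())
```

Downstream integer kernels can consume `e` and the denominator directly; dividing them in float here only serves to check accuracy against the exact softmax.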


LEANN: A Low-Storage Overhead Vector Index

Yichuan Wang ⋅ Zhifei Li ⋅ Shu Liu ⋅ Yongji Wu ⋅ Ziming Mao ⋅ Yilong Zhao ⋅ Xiao Yan ⋅ Zhiying Xu ⋅ Yang Zhou ⋅ Ion Stoica ⋅ Sewon Min ⋅ Matei Zaharia ⋅ Joseph Gonzalez

Embedding-based vector search underpins many important applications, such as recommendation and retrieval-augmented generation (RAG). It relies on vector indices to enable efficient search. However, these indices require storing high-dimensional embeddings and large index metadata, whose total size can be several times larger than the original data (e.g., text chunks). Such high storage overhead makes it difficult, or even impractical, to deploy vector search on personal devices or large-scale datasets. To tackle this problem, we propose LEANN, a storage-efficient index for vector search that recomputes embeddings on the fly instead of storing them, and compresses state-of-the-art proximity graph indices while preserving search accuracy. LEANN delivers high-quality vector search while using only a fraction of the storage (e.g., 5% of the original data) and supporting storage-efficient index construction and updates. On real-world benchmarks, LEANN reduces index size by up to 50× compared with conventional indices, while maintaining SOTA accuracy and comparable latency for RAG applications.
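The core trade LEANN makes — recomputing embeddings on the fly instead of storing them — can be sketched as a best-first search over a proximity graph where only raw chunks and adjacency are stored. The function names, the expansion budget, and the squared-distance metric below are illustrative assumptions, not LEANN's actual index layout.

```python
import heapq

def graph_search(graph, chunks, embed, query, start, k, budget):
    """Best-first search over a proximity graph that stores no
    embeddings: each node's vector is recomputed from its raw text
    chunk via `embed` when visited (compute traded for storage).
    `graph` maps node id -> neighbor ids; `budget` caps expansions."""
    def dist(u):
        e = embed(chunks[u])                    # recomputed, never stored
        return sum((a - b) ** 2 for a, b in zip(e, query))
    visited = {start}
    frontier = [(dist(start), start)]           # min-heap by distance
    best = []                                   # max-heap of k nearest
    while frontier and budget > 0:
        d, u = heapq.heappop(frontier)
        budget -= 1
        heapq.heappush(best, (-d, u))
        if len(best) > k:
            heapq.heappop(best)                 # drop current farthest
        for v in graph[u]:
            if v not in visited:
                visited.add(v)
                heapq.heappush(frontier, (dist(v), v))
    return [u for _, u in sorted((-nd, u) for nd, u in best)]
```

The index on disk is then just `graph` plus `chunks` — no high-dimensional vectors — which is where the storage savings come from; the cost is one `embed` call per node touched.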


Massive-Scale Out-Of-Core UMAP on the GPU

Jinsol Park ⋅ Corey Nolet ⋅ Edward Raff ⋅ Tim Oates ⋅ Akira Naruse

The Uniform Manifold Approximation and Projection (UMAP) algorithm has become a widely popular technique to reduce the dimensionality of a set of vectors, both for visualization and as a pre-processing step for follow-on machine learning tasks. UMAP is often an integral part of iterative and exploratory workflows, but the heavy compute and memory it requires makes scaling to tens or even hundreds of gigabytes of vectors intractable on the CPU, often taking several hours to days to complete. In this paper, we show how we improved UMAP while unlocking performance that permits interactive analysis, even at massive scale. We introduce an out-of-core strategy with optional multi-GPU support, achieving up to 74× faster performance than the CPU baseline.


Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT

Nicholas Santavas ⋅ Kareem Eissa ⋅ Piotr Florek ⋅ Matteo Nulli ⋅ Stefan Vasilev ⋅ Seyyed Hashemi ⋅ Antonios Gasteratos ⋅ Shahram Khadivi

Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2× GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.


MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas ⋅ Hanjiang Wu ⋅ Changhai Man ⋅ Jinsun Yoo ⋅ Huan Xu ⋅ William Won ⋅ Winston Liu ⋅ Andrey Balogh ⋅ Dan Mihailescu ⋅ Brad B ⋅ Vinay Ramakrishnaiah ⋅ Spandan More ⋅ Saeed Rashidi ⋅ Louis Feng ⋅ Ashwin Ramachandran ⋅ Puneet Sharma ⋅ Vijay Janapa Reddi ⋅ David Kanter ⋅ Tushar Krishna

We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra Execution Traces (ETs). These ETs represent key operations (compute, memory, and communication), data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present an analysis of Chakra traces collected on production AI clusters and demonstrate their value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.


ML Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput

Arissa Wongpanich ⋅ Tayo Oguntebi ⋅ Yu Wang ⋅ Phitchaya Phothilimthana ⋅ Ritwika Mitra ⋅ Zongwei Zhou ⋅ Naveen Kumar ⋅ Vijay Janapa Reddi

Machine learning (ML) infrastructures operating at warehouse scale present unique performance characterization challenges beyond traditional high-performance computing metrics. This paper introduces a systematic framework for analyzing ML fleet efficiency, demonstrated on Google's production TPU infrastructure comprising thousands of accelerators running diverse workloads. Our fleet-wide analysis reveals performance dependencies spanning the entire ML system stack, from hardware to model architecture, data pipelines, frameworks, compilers, and schedulers. We identify critical gaps in conventional utilization-based performance metrics and propose "ML Productivity Goodput" (MPG) to capture fleet-wide efficiency across heterogeneous ML environments. MPG decomposes efficiency into scheduling, runtime, and program components, enabling precise identification of bottlenecks at specific system layers. Applied to Google's production TPU workloads, our segmented analysis identified optimization opportunities across the stack: scheduling goodput exceeding 95% for all job sizes through careful preemption tuning, runtime improvements via framework modernization and asynchronous checkpointing, and program-level gains through compiler optimizations like communication-computation overlap. This establishes MPG as a practical methodology for managing large-scale ML computing infrastructure.
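The decomposition of MPG into scheduling, runtime, and program components can be sketched as a product of fractions, each isolating one layer's losses. The multiplicative form and the function name are assumptions for illustration; the paper's exact definitions may differ.

```python
def ml_productivity_goodput(scheduling, runtime, program):
    """Sketch of an MPG-style decomposition: the fraction of wall-clock
    time the job holds its resources (scheduling), the fraction of held
    time spent making training progress rather than, e.g., blocking on
    checkpoints (runtime), and the hardware efficiency of the program
    itself (program). Multiplying them localizes the bottleneck layer."""
    for f in (scheduling, runtime, program):
        assert 0.0 <= f <= 1.0
    return scheduling * runtime * program
```

For example, 95% scheduling goodput with 90% runtime goodput and 50% program efficiency yields an overall goodput of about 0.43, and the decomposition immediately points at the program layer as the dominant loss.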


MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs

Jiyuan Zhang ⋅ Yining Liu ⋅ Siqi Yan ⋅ Lisen Deng ⋅ Jennifer Cao ⋅ Shuqi Yang ⋅ Bi Xue ⋅ Min Ni ⋅ Shen Li

The pervasive “memory wall” bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads—driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movement that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures that eliminates intermediate buffers and activation materialization, and (ii) co-designed kernels with smart activation checkpointing that mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over $4\times$ speedups and over $50\%$ memory savings compared to existing MoE frameworks. MoEBlaze has been deployed in Meta's recommendation production systems.


NodeSweep: Practical Straggler Detection and Health Monitoring for Large-Scale Foundation Model Training

Guanliang Liu ⋅ Zoe Zeng ⋅ Cong Cheng ⋅ Alexander Zhipa ⋅ Ashvin Nihalani ⋅ Binxuan Huang

As foundation model training scales to thousands of GPUs, maintaining consistent node performance becomes increasingly critical. Traditional health-checking methods such as NCCL or burn-in tests often fail to capture subtle performance degradations that can significantly impact large-scale training efficiency. In this paper, we present a comprehensive node health monitoring framework that integrates real-time performance tracking with a novel offline node sweep mechanism. Our approach effectively identifies problematic nodes that traditional methods overlook, especially under complex communication patterns common in distributed training. Extensive evaluations on production workloads show that our method improves model FLOPs utilization (MFU) by up to 1.7×, reduces run-to-run variance from 20% to 1%, and increases the mean time to failure (MTTF) while reducing human intervention time. These improvements translate to substantial gains in training efficiency. The proposed solution is both practical and scalable, making it particularly valuable for production-scale foundation model training.


Optimizing Deployment Configurations for LLM Inference

Sungmin Cho ⋅ Jaewon Lee ⋅ Chunqiang Tang ⋅ Yejin Lee ⋅ Geonhwa Jeong ⋅ Scott Batura ⋅ Sijia Chen ⋅ Bradley Davis ⋅ Summer Deng ⋅ Emad El-Haraty ⋅ Lu Fang ⋅ Joshua Fromm ⋅ Liangpeng Guo ⋅ Jianyu Huang ⋅ Aya Ibrahim ⋅ Hongyi Jia ⋅ Changkyu Kim ⋅ Xiaozhu Meng ⋅ Vlad Tiberiu Mihailescu ⋅ Maxim Naumov ⋅ Michal Ostrowski ⋅ Sarunya Pumma ⋅ Jeremy Francis Reizenstein ⋅ Rajasi Saha ⋅ Ruan Silva ⋅ Jon Swenson ⋅ Chris Thi ⋅ Yunfan Wang ⋅ Pengchao Wang ⋅ Wenchen Wang ⋅ Bram Wasti ⋅ Jingyi Yang ⋅ Jing Zhang ⋅ Yi Zhen

Meta's Large Language Models (LLMs)---the Llama model family---serve nearly one billion monthly active users. Deploying these models for inference involved navigating a complex design space that spanned diverse hardware options (e.g., H100, H200, MI300X), multiple parallelism strategies (tensor, pipeline, expert, context, and data parallelism), and nuanced runtime choices (e.g., continuous batching versus prefill-decode disaggregation)---all while leveraging workload-specific characteristics and meeting stringent service level objectives (SLOs). This paper presents insights we gained from developing and applying a systematic approach to analyze millions of deployment configurations and identify those that maximize throughput while meeting latency SLOs. We share lessons learned from our experience operating Llama inference at scale, including trade-offs among runtime designs, the phase-specific nature of parallelism strategies, opportunities for leveraging hardware heterogeneity, platform scaling behaviors, and system-level implications of model architectures such as Mixture-of-Experts (MoE). We hope our production experience offers practical insights for the broader LLM inference community.


ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels

Stuart H. Sul ⋅ Simran Arora ⋅ Benjamin Spector ⋅ Christopher Ré

Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance—data-transfer mechanisms, resource scheduling, and design overheads. With fewer than 50 lines of device code, PK achieves up to $2.33\times$ speedup for data- and tensor-parallel workloads, $4.08\times$ for sequence-parallel workloads, and $1.22\times$ for expert-parallel workloads.


Parrot: Persuasion and Agreement Robustness Rating of Output Truth

Yusuf Çelebi ⋅ Mahmoud ElHussieni ⋅ Özay Ezerceli

This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs in large language models (LLMs) under social pressure exerted through authority and persuasion, i.e., the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low “follow rates” ($\leq11\%$, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to answer changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge exhibit high fragility at the domain level, elementary mathematics is relatively resilient. Consequently, we argue that the goal of “resistance to overfitting pressure” should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
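The behavioral taxonomy can be sketched as a mapping from a (pre-pressure, post-pressure) answer pair to a state label. The abstract names five of the eight states; the decision rules and the fallback case below are my assumptions for illustration, not PARROT's published definitions.

```python
def classify(initial_correct, final_answer, correct, imposed):
    """Map one question's before/after answers to a behavioral state.
    `correct` is the ground-truth option, `imposed` the authoritatively
    asserted false option; covers the five states the abstract names,
    with a catch-all for drift to a third answer."""
    final_correct = final_answer == correct
    if initial_correct:
        if final_correct:
            return "robust correct"
        if final_answer == imposed:
            return "sycophantic agreement"
        return "other"                 # drifted to a third answer
    if final_correct:
        return "self-correction"
    if final_answer == imposed:
        return "reinforced error"
    return "stubborn error"            # kept its own wrong answer
```

A model's "follow rate" is then just the fraction of items landing in the two imposed-answer states (sycophantic agreement and reinforced error).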


PRISM: Parametrically Restructured Inference for Speculative Sampling Draft Models

Xuliang Wang ⋅ Yuetao Chen ⋅ Maochan Zhen ⋅ Fang LIU ⋅ Xinzhou Zheng ⋅ Xingwu Liu ⋅ Hong Xu ⋅ Ming Li

Large Language Models (LLMs), constrained by their auto-regressive nature, have long suffered from expensive and slow decoding. Speculative sampling methods, capable of alleviating the memory bandwidth bottleneck, have attracted attention from both the systems and AI research communities. The demand for high predictive performance has created a growing trend of training parametrically larger and more powerful draft models, which also introduces growing computation overhead. While existing works balance trade-offs to find a sweet spot, in this paper we dive further into this effectiveness–efficiency dilemma, addressing the issue with architectural innovation. By disaggregating the computation of each predictive step across different parameter sets, we restructure the computational paths of the draft models, successfully decoupling representation capacity from inference cost and making the model scalable and fast at the same time. We conduct extensive experiments showing that our PRISM drafter outperforms SoTA draft architectures on acceptance length and end-to-end throughput when trained on the same dataset. We also show that PRISM scales exceptionally well on large datasets where some other architectures fail. On average, PRISM speculative decoding achieves more than 2.6x end-to-end speedup when integrated with an already highly optimized inference engine.


ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

Bohua Zou ⋅ Weihao Xu ⋅ Binqi Sun

As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today’s LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions—is this workload memory-bound or compute-bound?—often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers—without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.


PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving

Yuran Ding ⋅ Ruobing Han ⋅ Xiaofan Zhang ⋅ Xinwei Chen

Optimizing large language model (LLM) training and serving on large-scale distributed systems is a significant challenge. This difficulty stems from the rapidly evolving LLM landscape, the requirement for deep domain expertise, and the need for workload-specific optimization strategies. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce \textbf{PROMPTS}, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning to deliver system-level optimization in far fewer shots. Key components of the proposed framework include an \textit{Analyzer Agent} that diagnoses performance bottlenecks by synthesizing profiler data and a \textit{Proposal Agent} that leverages a knowledge base to generate optimized sharding configurations with detailed justifications through retrieval-augmented generation (RAG). Experimental results across eight real-world LLM workloads demonstrate that PROMPTS provides valid reasoning and accurate recommendations by considering LLM workload characteristics and backend hardware features, delivering performance improvements of up to \textbf{434\%}. These workloads spanned Mixture-of-Experts (MoE) and dense LLMs, system configurations from 2-chip to 512-chip TPU systems with 2D/3D Torus interconnects, and the full LLM lifecycle including pre-training, post-training, and serving. To validate our agent's system optimization proposals, we benchmarked them against production configurations that were previously optimized by experts, either through extensive manual analysis or automated black-box searches. In every case, our agent independently identified the expert-validated solution within its top three recommendations from a \textbf{single invocation}.
Furthermore, the agent's top-ranked recommendation matched the production solution in \textbf{87.5\%} of cases, demonstrating its ability to not only find optimized configurations but also to correctly prioritize the optimization candidates.


Reparo: Loss-Resilient Generative Codec for Video Conferencing

Tianhong Li ⋅ Vibhaalakshmi Sivaraman ⋅ Pantea Karimi ⋅ Lijie Fan ⋅ Mohammad Alizadeh ⋅ Dina Katabi

Packet loss during video conferencing often results in poor quality and video freezing. Retransmitting lost packets is often impractical due to the need for real-time playback, and using Forward Error Correction (FEC) for packet recovery is challenging due to the unpredictable and bursty nature of Internet losses. Excessive redundancy leads to inefficiency and wasted bandwidth, while insufficient redundancy results in undecodable frames, causing video freezes and quality degradation in subsequent frames. We introduce Reparo — a loss-resilient video conferencing framework based on generative deep learning models to address these issues. Our approach generates missing information when a frame or part of a frame is lost. This generation is conditioned on the data received thus far, considering the model's understanding of how people and objects appear and interact within the visual realm. Experimental results, using publicly available video conferencing datasets, show that Reparo outperforms state-of-the-art FEC-based video conferencing solutions in terms of both video quality (measured through PSNR, SSIM, and LPIPS) and the occurrence of video freezes.


Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE

Zongpu Zhang ⋅ Y. Charlie Hu ⋅ Qiang Xu ⋅ Jian Li ⋅ Haibing Guan

Despite the rapid adoption of large language models (LLMs) in mobile applications, deploying them efficiently on resource-constrained devices remains challenging due to tight compute, memory, and energy constraints. In this paper, we first evaluate the energy efficiency of state-of-the-art mobile LLM frameworks across multiple models and uncover a key inefficiency: the default governors make independent decisions, which can result in 23.0–40.4% longer latency or 5.0–16.6% higher energy use compared to optimal frequency combinations. We then conduct an in-depth analysis to reveal the root cause: the lack of cross-resource coordination among these governors during prefilling and decoding. Building on these findings, we present CORE, a unified, energy-aware governor that jointly coordinates CPU, GPU, and memory frequencies for mobile LLM inference. Experiments across diverse LLMs show that CORE reduces time-to-first-token by 7.0–16.9% and time-per-token by 25.4–36.8% on average, without increasing energy per token.
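The joint, cross-resource decision that per-resource governors cannot make can be sketched as a constrained selection over (CPU, GPU, memory) frequency tuples: among combinations predicted to meet the latency target, pick the one with the lowest energy. The table-driven formulation, function name, and example numbers below are illustrative assumptions, not CORE's actual policy.

```python
def best_combo(combos, latency_budget):
    """Pick the (cpu, gpu, mem) frequency combination with the lowest
    predicted energy among those meeting the latency budget.
    `combos` maps a frequency tuple to (latency_s, energy_j); returns
    None if no combination is feasible."""
    feasible = {c: (l, e) for c, (l, e) in combos.items()
                if l <= latency_budget}
    if not feasible:
        return None
    return min(feasible, key=lambda c: feasible[c][1])
```

An independent governor tunes each axis of the tuple in isolation and can land on a combination that is dominated on both latency and energy, which is the inefficiency the measurements above quantify.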

SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the \emph{KOKARYOKU PHY} bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In ISC 2025 TOP500, SAKURAONE is ranked \textbf{49th} by HPL and is the only top 100 system that uses a fully open networking stack—\textbf{800~GbE} with \textbf{SONiC}—demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95~PFLOP/s (HPL~Rmax), 396.295~TFLOP/s (HPCG), and 339.86~PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs and a 2~PB all-flash Lustre file system, interconnected via a rail-optimized 800~GbE leaf–spine fabric with RoCEv2. Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.


Scaling Up Large Language Models Serving Systems for Semantic Job Search

Kayhan Behdin ⋅ Qingquan Song ⋅ Sriram Vasudevan ⋅ Jian Sheng ⋅ Xiaojing Ma ⋅ Zhengze Zhou ⋅ Chuanrui Zhu ⋅ Guoyao Li ⋅ Chanh Nguyen ⋅ ⋅ Hejian Sang ⋅ Ata Fatahi ⋅ ⋅ Xiaoqing Wang ⋅ Qing Lan ⋅ ⋅ Qi Guo ⋅ Caleb Johnson ⋅ Zhipeng Wang ⋅

Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by more than 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, these optimizations increase our system’s throughput by 10x in a real-world deployment while meeting our quality bar.


SchedFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

Yi Pan ⋅ Yile Gu ⋅ Luo Jinbin ⋅ Yibo Wu ⋅ Ziren Wang ⋅ ⋅ Ziyi Xu ⋅ Shengkai Lin ⋅ Stephanie Wang ⋅ Baris Kasikci

Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose SchedFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. SchedFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that SchedFlow can integrate four representative parallelism strategies into three state-of-the-art ML systems (vLLM, SGLang, Hugging Face Transformers) with minimal code changes, achieving up to a 1.24x throughput improvement.


SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving

⋅ ⋅ ⋅ ⋅ ⋅ Sahil Parmar ⋅ ⋅ ⋅ ⋅ ⋅

The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment—Groq’s public cloud—serving hundreds of billions of tokens daily. This paper reviews Groq’s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios—key to real-world LLM serving.


Sparing Strategies to Minimize Reliability Impact On Large Training Jobs

⋅ ⋅ Ehsan K. Ardestani ⋅ ⋅ ⋅ ⋅ Zhaodong Wang ⋅ ⋅ Xu Zhang ⋅ ⋅ Ying Zhang

Training large language models (LLMs) on Meta’s AI clusters requires running long, distributed jobs that are vulnerable to hardware failures. To maintain high availability and efficiency, production systems use a sparing strategy, i.e., pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing strategy—including compute block size, number of spare blocks, and spare GPU trays—is complex and directly impacts cluster performance and reliability. We present an analytical framework with closed-form expressions to guide sparing strategy decisions, making practical, first-order recommendations for production environments. We also develop a simulation component to cross-validate the analytical model. Applied in Meta’s hyperscale infrastructure, this model helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training. Our real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to Meta’s AI operations.
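The trade-off such a framework formalizes can be approximated with a toy binomial model: a job survives a window if component failures do not exceed the provisioned spares. The function below is a first-order sketch under an independence assumption; it is not Meta's closed-form framework, and the parameter names are hypothetical.

```python
from math import comb

def p_job_survives(n_blocks, n_spares, p_fail):
    """First-order estimate: a job running on n_blocks compute blocks
    (with n_spares spare blocks provisioned) survives a window if at
    most n_spares of the n_blocks + n_spares blocks fail, assuming
    independent per-block failure probability p_fail.

    Toy binomial model for illustration only.
    """
    total = n_blocks + n_spares
    return sum(comb(total, k) * p_fail**k * (1 - p_fail)**(total - k)
               for k in range(n_spares + 1))
```

Sweeping `n_spares` against a target survival probability gives the kind of first-order sparing recommendation the abstract describes, before cross-validating with simulation.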


SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Jameson Sandler ⋅ Jacob K Christopher ⋅ ⋅ Ferdinando Fioretto

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: \textbf{(1)} the autoregressive dependency during drafting, which limits parallelism, and \textbf{(2)} frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes \emph{SpecDiff-2}, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that \emph{SpecDiff-2} achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving average tokens per second by up to $+55\%$ over previous baselines and achieving up to a $5.5\times$ average speed-up over standard decoding, without any loss of accuracy.
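For readers unfamiliar with the draft-then-verify procedure the paper builds on, here is a minimal greedy sketch in which stand-in callables play the draft and target models. Real systems verify against model logits with probabilistic acceptance, and SpecDiff-2 additionally replaces the autoregressive drafter with a discrete diffusion one; this is only the generic skeleton.

```python
def speculative_step(draft, target, prefix, k=4):
    """One lossless draft-then-verify step (greedy variant).

    `draft` and `target` are stand-in callables mapping a token
    sequence to its next token. The draft cheaply proposes k tokens;
    the target accepts the longest matching prefix, then emits one
    corrected (or bonus) token, so output always matches what the
    target alone would have produced.
    """
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft(proposal))       # cheap drafting loop
    accepted = list(prefix)
    for tok in proposal[len(prefix):]:
        if target(accepted) == tok:            # draft token verified
            accepted.append(tok)
        else:
            accepted.append(target(accepted))  # correct and stop
            break
    else:
        accepted.append(target(accepted))      # bonus token: all accepted
    return accepted
```

When draft and target agree, one step yields k + 1 tokens for a single verification pass; when they diverge early, most drafted tokens are wasted, which is exactly the misalignment bottleneck (2) above.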


Speculative Decoding: Performance or Illusion?

Lily Liu ⋅ Jiaxiang Yu ⋅ Jongseok Park ⋅ Alvin Cheung ⋅ Ion Stoica

Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear, as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates execution time, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with these theoretical upper bounds reveals substantial gaps, and we leverage this observation to highlight new research opportunities for improving SD.
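The kind of theoretical bound such a study quantifies can be illustrated with the classic speculative-sampling expectation: if each draft token is accepted independently with probability alpha and gamma tokens are drafted per step, the expected yield per verification is a truncated geometric sum. The independence assumption and one-parameter cost model below are illustrative simplifications, not the paper's actual analysis.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model verification when each
    of gamma drafted tokens is accepted i.i.d. with probability alpha
    (plus one guaranteed token from the target itself)."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup_upper_bound(alpha, gamma, draft_cost_ratio):
    """Toy speedup bound: drafting one token costs `draft_cost_ratio`
    of a target forward pass, and verifying gamma tokens costs one
    target pass (ignores batching and memory-bandwidth effects)."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost_ratio + 1)
```

Plugging in measured acceptance rates shows why observed speedups fall short of such bounds: at large batch sizes the verification pass itself stops being cheap per token, which the cost model above deliberately ignores.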


Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks

Dionysios Adamopoulos ⋅ ⋅ Georgios Goumas ⋅ Christina Giannoula

Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and AR/VR. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous—neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes: (i) a high-performance one-shot search algorithm that builds the kernel map with no preprocessing and high memory locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. Our evaluation shows that Spira significantly outperforms prior SpC engines by 1.71× on average and up to 2.31× for end-to-end inference, and by 2.13× on average and up to 3.32× for layer-wise execution across diverse layer configurations.
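The kernel-map construction the abstract describes can be sketched as follows, packing bounded non-negative integer voxel coordinates into a single key, which is the general idea behind packed coordinate processing. This is a toy CPU version assuming output sites equal input sites (submanifold-style) and a 3x3x3 kernel; Spira's one-shot GPU algorithm is far more elaborate.

```python
from itertools import product

BITS = 10  # assumes coordinates are non-negative and fit in [0, 1024)

def pack(x, y, z):
    """Exploit integer-valued, bounded coordinates: pack (x, y, z)
    into one machine word so lookups use a single cheap key."""
    return (x << (2 * BITS)) | (y << BITS) | z

def build_kernel_map(coords):
    """Return (output_index, weight_offset, input_index) triples for a
    3x3x3 sparse convolution over the given voxel coordinates."""
    table = {pack(*c): i for i, c in enumerate(coords)}
    kmap = []
    for out_idx, (x, y, z) in enumerate(coords):
        for off, (dx, dy, dz) in enumerate(product((-1, 0, 1), repeat=3)):
            key = pack(x + dx, y + dy, z + dz)
            if key in table:                 # neighboring voxel exists
                kmap.append((out_idx, off, table[key]))
    return kmap
```

Geometric continuity is what makes this pay off: surface voxels usually have occupied neighbors at small offsets, so most probes hit, and the bounded coordinate range keeps every key within one word.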


The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Xingyao Wang ⋅ ⋅ Juan Michelini ⋅ Calvin Smith ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅

Building production-ready software engineering agents requires balancing fast research iteration with operational stability, secure deployment, and reproducible execution across diverse environments. \textbf{OpenHands V0}—an open-source agent system with 64k+ GitHub stars—validated community demand but revealed four key tensions: rigid sandboxing, scattered mutable configuration, blurred core–application boundaries, and limited extensibility. We present the \textbf{OpenHands Software Agent SDK}—the core of \textbf{OpenHands V1}—a complete architectural redesign that \emph{separates the agent core from downstream applications}. The SDK embodies four principles: (i) \emph{optional isolation} (local-first, sandbox-on-demand); (ii) \emph{stateless components} with immutable configuration and event-sourced state; (iii) \emph{strict separation of concerns} between core and applications; and (iv) \emph{two-layer composability} enabling modular deployment across four packages (SDK, Tools, Workspace, Server) and extensibility through typed, swappable components. Built on these foundations, the SDK delivers \emph{seamless local-to-remote execution portability}, integrated REST/WebSocket services, and visual workspaces (VS Code, VNC, browser) for human-agent collaboration. Compared with existing SDKs from OpenAI, Claude, and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in QA and security analysis. Empirical results on the SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. By codifying lessons from V0, the OpenHands Agent SDK provides a practical foundation for prototyping, unlocking new classes of custom applications, \emph{and} reliably deploying agents at scale.


Wave: A Symbolic Python DSL and Compiler for High Performance Machine Learning

Harsh Menon ⋅ ⋅ Gaurav Verma ⋅ Martin P. Lücke ⋅ ⋅ ⋅ Nithin Meganathan ⋅ Sanket Pandit ⋅ William Gallard Hatch ⋅ ⋅ ⋅ Sahil FAIZAL ⋅ ⋅

Modern ML models demand ever-greater compute, prompting hardware vendors to add specialized matrix cores to their GPUs. While these units unlock high throughput, they impose intricate programming models and addressing schemes that are difficult to manage by hand. This paper introduces Wave, a Python-embedded DSL for kernel authoring that automates these complex address computations and lets authors focus on core computation. In experiments, it matches or surpasses the performance of state-of-the-art kernel DSLs and libraries.


XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack

Clive Verghese ⋅ Prasanna Rengasamy ⋅ ⋅ Yin Zhang ⋅ Jiya Zhang ⋅ ⋅ Charles Alaras ⋅ Aditya Sharma ⋅ ⋅ ⋅ Rushabh Lalwani ⋅ Sannidhya Chauhan ⋅ Sai Ganesh Bandiatmakuri ⋅ ⋅ Ani Udipi ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ Naveen Kumar ⋅ ⋅ Sayce Falk ⋅ ⋅

Optimizing large models across thousands of accelerators requires deep systems expertise. To address modern machine learning (ML) optimization needs, we present XProf, the ML profiler for the OpenXLA ecosystem. XProf delivers actionable optimization suggestions and in-depth performance analysis, empowering ML researchers and framework users to improve efficiency without specialized systems knowledge. XProf provides a unified, full-stack view of both host (CPU) and device (accelerator, i.e., TPU/GPU) performance, leveraging tools like the Roofline Model for comprehensive analysis. XProf’s distributed architecture is designed to monitor thousands of chips with minimal workload overhead (<1%). This architecture is made pluggable through the open-source PJRT C API extension, which has facilitated its adoption by third-party accelerator vendors. XProf has been instrumental in achieving significant efficiency gains at Google and in winning MLPerf submissions. This paper presents the design and architecture of XProf, showcases its differentiating tools and capabilities, and highlights its impact within Google and across the industry as a state-of-the-art ML profiler. XProf is available as part of the OpenXLA project at https://github.com/openxla/xprof.
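The Roofline Model mentioned above reduces to a one-line bound: attainable throughput is the minimum of peak compute and arithmetic intensity times peak memory bandwidth. The sketch below is the generic textbook formula, not XProf's implementation; the example numbers are made up.

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bw):
    """Attainable FLOP/s for a kernel under the Roofline Model.

    flops       -- floating-point operations the kernel performs
    bytes_moved -- bytes it moves to/from memory
    peak_flops  -- hardware peak compute (FLOP/s)
    peak_bw     -- hardware peak memory bandwidth (bytes/s)
    """
    intensity = flops / bytes_moved            # FLOPs per byte
    return min(peak_flops, intensity * peak_bw)
```

A kernel whose bound equals `intensity * peak_bw` is memory-bound, so a profiler can suggest fusion or data-layout changes; one capped at `peak_flops` is compute-bound, pointing instead at algorithmic or precision changes.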