

Session

Industry-Track Oral Presentation: I3: Benchmarks

Grand Ballroom 1
Thu 21 May 4:30 p.m. PDT — 6 p.m. PDT


AIRS: Scaling Live Inference in Resource Constrained Environments

Nilesh Jagnik ⋅ Harshvardhan GM

Advancements in large language models (LLMs) have made them increasingly useful for complex reasoning tasks that previously required domain experts. One such task is quality evaluation of query responses produced by a search engine. Evaluation generates the metrics needed to study the quality, impact, and usefulness of product changes and features. Typically, human experts are asked to rate various attributes of search responses; this process is expensive and takes several days to complete. As an alternative, LLMs are now being used to perform rating tasks at lower cost and latency. In addition, many new metrics are being developed to evaluate Google's new AI-based offerings, which require ratings too. As a result, demand for LLM rating prediction far exceeds the allocated TPU (Tensor Processing Unit) budget, since the larger share of the company's TPU resources is reserved for serving live user traffic. In this paper, we present the AI Rater Service (AIRS), an inference pipeline that employs several software engineering techniques to generate AI ratings with high reliability and low latency. AIRS maximizes LLM inference throughput by optimizing TPU resource utilization across various evaluation workflows, while minimizing latency for higher-priority tasks.
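The abstract does not describe AIRS internals, but the core idea of serving many rating workflows under a fixed TPU budget while favoring higher-priority tasks can be sketched as a priority queue drained into fixed-size inference batches. All names below (`RatingTask`, `schedule`, the priority levels) are illustrative assumptions, not the paper's API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class RatingTask:
    # Lower value = higher priority; e.g. live eval before nightly or backfill.
    priority: int
    task_id: str = field(compare=False)

def schedule(tasks, batch_size):
    """Drain a priority queue into fixed-size batches for TPU inference,
    so higher-priority rating tasks are served first (hypothetical sketch)."""
    heap = list(tasks)
    heapq.heapify(heap)
    batches = []
    while heap:
        batch = [heapq.heappop(heap).task_id
                 for _ in range(min(batch_size, len(heap)))]
        batches.append(batch)
    return batches
```

A real service would additionally bound per-priority latency and track TPU utilization; the sketch only shows the ordering/batching skeleton.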


MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas ⋅ Hanjiang Wu ⋅ Changhai Man ⋅ Jinsun Yoo ⋅ Huan Xu ⋅ William Won ⋅ Winston Liu ⋅ Andrey Balogh ⋅ Dan Mihailescu ⋅ Brad B ⋅ Vinay Ramakrishnaiah ⋅ Spandan More ⋅ Saeed Rashidi ⋅ Louis Feng ⋅ Ashwin Ramachandran ⋅ Puneet Sharma ⋅ Vijay Janapa Reddi ⋅ David Kanter ⋅ Tushar Krishna

We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open, interoperable, graph-based representation of distributed AI/ML workloads, called Chakra Execution Traces (ETs). ETs represent key operations (compute, memory, and communication) along with data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities that enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra traces collected on production AI clusters and demonstrate their value through real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
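A graph of operations with data dependencies, as the abstract describes, can be replayed in any order that respects those dependencies. The sketch below (field names and node types are assumptions, not the actual Chakra ET schema) shows a minimal node structure and a Kahn-style topological replay order.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    COMPUTE = "compute"   # e.g. a matmul kernel
    COMM = "comm"         # e.g. an all-reduce
    MEM = "mem"           # e.g. a host-to-device copy

@dataclass
class ETNode:
    node_id: int
    name: str
    node_type: NodeType
    duration_us: float
    data_deps: list = field(default_factory=list)  # ids this node waits on

def replay_order(nodes):
    """Kahn's algorithm: one dependency-respecting order for trace replay."""
    indeg = {n.node_id: len(n.data_deps) for n in nodes}
    out = {n.node_id: [] for n in nodes}
    for n in nodes:
        for d in n.data_deps:
            out[d].append(n.node_id)
    ready = sorted(i for i, k in indeg.items() if k == 0)
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in out[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order
```

A simulator or replay tool consuming such a trace would schedule each node once its dependencies complete; the topological order above is the serial special case.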


ML Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput

Arissa Wongpanich ⋅ Tayo Oguntebi ⋅ Yu Wang ⋅ Phitchaya Phothilimthana ⋅ Ritwika Mitra ⋅ Zongwei Zhou ⋅ Naveen Kumar ⋅ Vijay Janapa Reddi

Machine learning (ML) infrastructures operating at warehouse scale present unique performance characterization challenges beyond traditional high-performance computing metrics. This paper introduces a systematic framework for analyzing ML fleet efficiency, demonstrated on Google's production TPU infrastructure comprising thousands of accelerators running diverse workloads. Our fleet-wide analysis reveals performance dependencies spanning the entire ML system stack, from hardware to model architecture, data pipelines, frameworks, compilers, and schedulers. We identify critical gaps in conventional utilization-based performance metrics and propose "ML Productivity Goodput" (MPG) to capture fleet-wide efficiency across heterogeneous ML environments. MPG decomposes efficiency into scheduling, runtime, and program components, enabling precise identification of bottlenecks at specific system layers. Applied to Google's production TPU workloads, our segmented analysis identified optimization opportunities across the stack: scheduling goodput exceeding 95% for all job sizes through careful preemption tuning, runtime improvements via framework modernization and asynchronous checkpointing, and program-level gains through compiler optimizations like communication-computation overlap. This establishes MPG as a practical methodology for managing large-scale ML computing infrastructure.
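The abstract says MPG decomposes efficiency into scheduling, runtime, and program components. Assuming a multiplicative decomposition of factors in [0, 1] (an assumption here, not stated in the abstract), the metric and its bottleneck identification can be sketched as:

```python
def ml_productivity_goodput(scheduling, runtime, program):
    """MPG under an assumed multiplicative decomposition:
    each factor is a goodput fraction in [0, 1]."""
    for f in (scheduling, runtime, program):
        if not 0.0 <= f <= 1.0:
            raise ValueError("goodput factors must lie in [0, 1]")
    return scheduling * runtime * program

def bottleneck(scheduling, runtime, program):
    """The smallest factor points at the system layer to optimize first."""
    factors = {"scheduling": scheduling, "runtime": runtime, "program": program}
    return min(factors, key=factors.get)
```

For example, a fleet segment with 0.95 scheduling, 0.80 runtime, and 0.90 program goodput would direct optimization effort at the runtime layer (e.g. framework modernization or asynchronous checkpointing, per the abstract).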


ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

Bohua Zou ⋅ Weihao Xu ⋅ Binqi Sun

As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today’s LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions—is this workload memory-bound or compute-bound?—often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers—without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.
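The transformation step the abstract mentions (turning collected probe traces into operator timelines) can be illustrated with a small aggregation over trace events. The event tuple shape and function name below are hypothetical, not ProfInfer's actual format.

```python
from collections import defaultdict

def operator_summary(events):
    """Aggregate (op_name, start_us, end_us) probe events into per-operator
    total time and share of overall traced time (illustrative sketch)."""
    totals = defaultdict(float)
    for op, start, end in events:
        totals[op] += end - start
    grand = sum(totals.values())
    return {op: (t, t / grand) for op, t in totals.items()}
```

Such a summary is the starting point for answering the "memory-bound or compute-bound?" question: once per-operator time is known, it can be compared against hardware-counter-derived arithmetic intensity for each operator.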

SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. On the ISC 2025 TOP500 list, SAKURAONE ranks 49th by HPL and is the only top-100 system that uses a fully open networking stack (800 GbE with SONiC), demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s (HPL Rmax), 396.295 TFLOP/s (HPCG), and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs and a 2 PB all-flash Lustre file system, interconnected via a rail-optimized 800 GbE leaf–spine fabric with RoCEv2. Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most of the GPU time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
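A quick back-of-envelope check, using only the figures stated in the abstract, gives the sustained HPL throughput per GPU:

```python
# Figures from the abstract: 100 nodes x 8 H100 GPUs, HPL Rmax 33.95 PFLOP/s.
nodes = 100
gpus_per_node = 8
hpl_rmax_pflops = 33.95

total_gpus = nodes * gpus_per_node                       # 800 GPUs
per_gpu_tflops = hpl_rmax_pflops * 1000 / total_gpus     # ~42.4 TFLOP/s per GPU
```

This works out to roughly 42.4 TFLOP/s of sustained FP64 HPL per H100 across the full 800-GPU run.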


XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack

Clive Verghese ⋅ Prasanna Rengasamy ⋅ Yin Zhang ⋅ Jiya Zhang ⋅ Charles Alaras ⋅ Aditya Sharma ⋅ Rushabh Lalwani ⋅ Sannidhya Chauhan ⋅ Sai Ganesh Bandiatmakuri ⋅ Ani Udipi ⋅ Naveen Kumar ⋅ Sayce Falk

Optimizing large models across thousands of accelerators requires deep systems expertise. To address modern machine learning (ML) optimization needs, we present XProf, the ML profiler for the OpenXLA ecosystem. XProf delivers actionable optimization suggestions and in-depth performance analysis, empowering ML researchers and framework users to improve efficiency without specialized systems knowledge. XProf provides a unified, full-stack view of both host (CPU) and device (TPU/GPU accelerator) performance, leveraging tools like the Roofline Model for comprehensive analysis. XProf's distributed architecture is designed to monitor thousands of chips with minimal workload overhead (<1%). This architecture is made pluggable through the open-source PJRT C API extension, which has facilitated its adoption by third-party accelerator vendors. XProf has been instrumental in achieving significant efficiency gains at Google and in winning MLPerf submissions. This paper presents the design and architecture of XProf, showcases its differentiating tools and capabilities, and highlights its impact within Google and across the industry as a state-of-the-art ML profiler. XProf is available as part of the OpenXLA project at https://github.com/openxla/xprof.
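The Roofline Model the abstract mentions is a standard analysis: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch (the function name and example numbers are illustrative, not XProf's API):

```python
def roofline_attainable(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Classic roofline: attainable FLOP/s is capped either by peak compute
    or by memory bandwidth * arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)
```

A kernel whose attainable throughput equals the bandwidth term is memory-bound; one capped at peak compute is compute-bound, which is exactly the distinction a profiler uses to prioritize optimizations.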