

Session

Industry-Track Oral Presentation: I3: Benchmarks

Grand Ballroom 1
Thu 21 May 4:30 p.m. PDT — 6 p.m. PDT


AIRS: Scaling Live Inference in Resource Constrained Environments

Nilesh Jagnik ⋅ Harshvardhan GM

Advancements in large language models (LLMs) have made them increasingly useful for complex reasoning tasks that previously required domain experts. One such task is quality evaluation of query responses produced by a search engine. Evaluation generates the metrics needed to study the quality, impact, and usefulness of product changes and features. Typically, human experts are asked to rate various attributes of search responses; this process is expensive and takes several days to complete. As an alternative, LLMs are now being used to perform rating tasks at lower cost and latency. In addition, many new metrics are being developed to evaluate Google's new AI-based offerings, which require ratings too. As a result, demand for LLM rating prediction far exceeds the allocated TPU (Tensor Processing Unit) budget, since the larger share of the company's TPU resources is reserved for serving live user traffic. In this paper, we present the AI Rater Service (AIRS), an inference pipeline that employs several software engineering techniques to generate AI ratings with high reliability and low latency. AIRS maximizes LLM inference throughput by optimizing TPU resource utilization across various evaluation workflows, while minimizing latency for higher-priority tasks.
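The abstract does not describe AIRS internals, but the core idea of serving many rating workflows under a fixed TPU budget while favoring higher-priority tasks can be sketched as a priority queue drained into fixed-size inference batches. All names below (`RatingTask`, `schedule`, the priority levels) are illustrative assumptions, not the paper's API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class RatingTask:
    # Lower value = higher priority; e.g. live eval before nightly or backfill.
    priority: int
    task_id: str = field(compare=False)

def schedule(tasks, batch_size):
    """Drain a priority queue into fixed-size batches for TPU inference,
    so higher-priority rating tasks are served first (hypothetical sketch)."""
    heap = list(tasks)
    heapq.heapify(heap)
    batches = []
    while heap:
        batch = [heapq.heappop(heap).task_id
                 for _ in range(min(batch_size, len(heap)))]
        batches.append(batch)
    return batches
```

A real service would additionally bound per-priority latency and track TPU utilization; the sketch only shows the ordering/batching skeleton.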


MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas ⋅ Hanjiang Wu ⋅ Changhai Man ⋅ Jinsun Yoo ⋅ Huan Xu ⋅ William Won ⋅ Winston Liu ⋅ Andrey Balogh ⋅ Dan Mihailescu ⋅ Brad B ⋅ Vinay Ramakrishnaiah ⋅ Spandan More ⋅ Saeed Rashidi ⋅ Louis Feng ⋅ Ashwin Ramachandran ⋅ Puneet Sharma ⋅ Vijay Janapa Reddi ⋅ David Kanter ⋅ Tushar Krishna

We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open, interoperable, graph-based representation of distributed AI/ML workloads, called Chakra Execution Traces (ETs). ETs represent key operations (compute, memory, and communication) along with data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities that enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present analysis of Chakra traces collected on production AI clusters and demonstrate their value through real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala.
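A graph of operations with data dependencies, as the abstract describes, can be replayed in any order that respects those dependencies. The sketch below (field names and node types are assumptions, not the actual Chakra ET schema) shows a minimal node structure and a Kahn-style topological replay order.

```python
from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    COMPUTE = "compute"   # e.g. a matmul kernel
    COMM = "comm"         # e.g. an all-reduce
    MEM = "mem"           # e.g. a host-to-device copy

@dataclass
class ETNode:
    node_id: int
    name: str
    node_type: NodeType
    duration_us: float
    data_deps: list = field(default_factory=list)  # ids this node waits on

def replay_order(nodes):
    """Kahn's algorithm: one dependency-respecting order for trace replay."""
    indeg = {n.node_id: len(n.data_deps) for n in nodes}
    out = {n.node_id: [] for n in nodes}
    for n in nodes:
        for d in n.data_deps:
            out[d].append(n.node_id)
    ready = sorted(i for i, k in indeg.items() if k == 0)
    order = []
    while ready:
        i = ready.pop(0)
        order.append(i)
        for j in out[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                ready.append(j)
    return order
```

A simulator or replay tool consuming such a trace would schedule each node once its dependencies complete; the topological order above is the serial special case.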


ML Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput

Arissa Wongpanich ⋅ Tayo Oguntebi ⋅ Yu Wang ⋅ Phitchaya Phothilimthana ⋅ Ritwika Mitra ⋅ Zongwei Zhou ⋅ Naveen Kumar ⋅ Vijay Janapa Reddi

Machine learning (ML) infrastructures operating at warehouse scale present unique performance characterization challenges beyond traditional high-performance computing metrics. This paper introduces a systematic framework for analyzing ML fleet efficiency, demonstrated on Google's production TPU infrastructure comprising thousands of accelerators running diverse workloads. Our fleet-wide analysis reveals performance dependencies spanning the entire ML system stack, from hardware to model architecture, data pipelines, frameworks, compilers, and schedulers. We identify critical gaps in conventional utilization-based performance metrics and propose "ML Productivity Goodput" (MPG) to capture fleet-wide efficiency across heterogeneous ML environments. MPG decomposes efficiency into scheduling, runtime, and program components, enabling precise identification of bottlenecks at specific system layers. Applied to Google's production TPU workloads, our segmented analysis identified optimization opportunities across the stack: scheduling goodput exceeding 95% for all job sizes through careful preemption tuning, runtime improvements via framework modernization and asynchronous checkpointing, and program-level gains through compiler optimizations like communication-computation overlap. This establishes MPG as a practical methodology for managing large-scale ML computing infrastructure.
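The abstract says MPG decomposes efficiency into scheduling, runtime, and program components. Assuming a multiplicative decomposition of factors in [0, 1] (an assumption here, not stated in the abstract), the metric and its bottleneck identification can be sketched as:

```python
def ml_productivity_goodput(scheduling, runtime, program):
    """MPG under an assumed multiplicative decomposition:
    each factor is a goodput fraction in [0, 1]."""
    for f in (scheduling, runtime, program):
        if not 0.0 <= f <= 1.0:
            raise ValueError("goodput factors must lie in [0, 1]")
    return scheduling * runtime * program

def bottleneck(scheduling, runtime, program):
    """The smallest factor points at the system layer to optimize first."""
    factors = {"scheduling": scheduling, "runtime": runtime, "program": program}
    return min(factors, key=factors.get)
```

For example, a fleet segment with 0.95 scheduling, 0.80 runtime, and 0.90 program goodput would direct optimization effort at the runtime layer (e.g. framework modernization or asynchronous checkpointing, per the abstract).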


ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler

Bohua Zou ⋅ Weihao Xu ⋅ Binqi Sun

As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today’s LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions—is this workload memory-bound or compute-bound?—often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers—without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.
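The transformation step the abstract mentions (turning collected probe traces into operator timelines) can be illustrated with a small aggregation over trace events. The event tuple shape and function name below are hypothetical, not ProfInfer's actual format.

```python
from collections import defaultdict

def operator_summary(events):
    """Aggregate (op_name, start_us, end_us) probe events into per-operator
    total time and share of overall traced time (illustrative sketch)."""
    totals = defaultdict(float)
    for op, start, end in events:
        totals[op] += end - start
    grand = sum(totals.values())
    return {op: (t, t / grand) for op, t in totals.items()}
```

Such a summary is the starting point for answering the "memory-bound or compute-bound?" question: once per-operator time is known, it can be compared against hardware-counter-derived arithmetic intensity for each operator.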

SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the KOKARYOKU PHY bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. On the ISC 2025 TOP500 list, SAKURAONE ranks 49th by HPL and is the only top-100 system that uses a fully open networking stack (800 GbE with SONiC), demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95 PFLOP/s (HPL Rmax), 396.295 TFLOP/s (HPCG), and 339.86 PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs and a 2 PB all-flash Lustre file system, interconnected via a rail-optimized 800 GbE leaf–spine fabric with RoCEv2. Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most of the GPU time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
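A quick back-of-envelope check, using only the figures stated in the abstract, gives the sustained HPL throughput per GPU:

```python
# Figures from the abstract: 100 nodes x 8 H100 GPUs, HPL Rmax 33.95 PFLOP/s.
nodes = 100
gpus_per_node = 8
hpl_rmax_pflops = 33.95

total_gpus = nodes * gpus_per_node                       # 800 GPUs
per_gpu_tflops = hpl_rmax_pflops * 1000 / total_gpus     # ~42.4 TFLOP/s per GPU
```

This works out to roughly 42.4 TFLOP/s of sustained FP64 HPL per H100 across the full 800-GPU run.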


XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack

Clive Verghese ⋅ Prasanna Rengasamy ⋅ Yin Zhang ⋅ Jiya Zhang ⋅ Charles Alaras ⋅ Aditya Sharma ⋅ Rushabh Lalwani ⋅ Sannidhya Chauhan ⋅ Sai Ganesh Bandiatmakuri ⋅ Ani Udipi ⋅ Naveen Kumar ⋅ Sayce Falk

Optimizing large models across thousands of accelerators requires deep systems expertise. To address modern machine learning (ML) optimization needs, we present XProf, the ML profiler for the OpenXLA ecosystem. XProf delivers actionable optimization suggestions and in-depth performance analysis, empowering ML researchers and framework users to improve efficiency without specialized systems knowledge. XProf provides a unified, full-stack view of both host (CPU) and device (TPU/GPU accelerator) performance, leveraging tools like the Roofline Model for comprehensive analysis. XProf's distributed architecture is designed to monitor thousands of chips with minimal workload overhead (<1%). This architecture is made pluggable through the open-source PJRT C API extension, which has facilitated its adoption by third-party accelerator vendors. XProf has been instrumental in achieving significant efficiency gains at Google and in winning MLPerf submissions. This paper presents the design and architecture of XProf, showcases its differentiating tools and capabilities, and highlights its impact within Google and across the industry as a state-of-the-art ML profiler. XProf is available as part of the OpenXLA project at https://github.com/openxla/xprof.
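The Roofline Model the abstract mentions is a standard analysis: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch (the function name and example numbers are illustrative, not XProf's API):

```python
def roofline_attainable(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Classic roofline: attainable FLOP/s is capped either by peak compute
    or by memory bandwidth * arithmetic intensity (FLOPs per byte)."""
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)
```

A kernel whose attainable throughput equals the bandwidth term is memory-bound; one capped at peak compute is compute-bound, which is exactly the distinction a profiler uses to prioritize optimizations.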