Track: Research Track Oral Presentation: ML for Systems

Wed 20 May 13:30 - 13:45 PDT

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Shanli Xing ⋅ Vivian Zhai ⋅ Alexander Jiang ⋅ Yixin Dong ⋅ Yong Wu ⋅ Zihao Ye ⋅ Charlie Ruan ⋅ Yingyi Huang ⋅ Yineng Zhang ⋅ Liangsheng Yin ⋅ Aksara Bayyapu ⋅ Luis Ceze ⋅ Tianqi Chen

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them safely into large-scale LLM inference systems.

Wed 20 May 13:45 - 14:00 PDT

Virtual Machine NUMA Placement at Scale: Learning the Norm, Shielding the Tail

Yibo Zhao ⋅ Tianyuan Wu ⋅ HUI XUE ⋅ Qi Chen ⋅ Zhenhua Han ⋅ Zikai Xu ⋅ Yuntai Chang ⋅ Rui Gao ⋅ Steve Deng ⋅ Jui-Hao Chiang ⋅ Mingxia Li ⋅ Yuqing Yang ⋅ Cheng Tan ⋅ Fan Yang ⋅ Peng Cheng ⋅ Yongqiang Xiong ⋅ Lili Qiu ⋅ Lidong Zhou

In modern data centers, servers organize memory and CPUs into Non-Uniform Memory Access (NUMA) nodes, where unequal memory-to-CPU proximity leads to varying memory latency. Hypervisors must carefully place Virtual Machines (VMs) to reduce remote memory access. Poor placements can lead to significant performance degradation—sometimes up to 30%. However, achieving optimal placement at scale is challenging due to the large number of VM configurations, diverse NUMA structures, and evolving workload patterns. We present Catur, a NUMA placement system designed for large-scale cloud environments. Catur leverages reinforcement learning to learn from production data. Moreover, to address real-world challenges, Catur integrates several techniques: robust action space design to prevent model collapse, reward shaping to address learning inefficiency, drift-aware continuous training for evolving workload patterns, and speculative shielding to mitigate VM performance anomalies. Evaluations on production traces with 100 million VMs demonstrate that Catur reduces average resource defect by 34.2%–50.0% compared to state-of-the-art hypervisor policies.

Wed 20 May 14:00 - 14:15 PDT

When Machine Learning Isn’t Sure: Building Resilient ML-Based Computer Systems by Embracing Uncertainty

Varun Gohil ⋅ Nevena Stojkovic ⋅ Noman Bashir ⋅ Sundar Dev ⋅ Gaurang Upasani ⋅ David Lo ⋅ Parthasarathy Ranganathan ⋅ Christina Delimitrou

Machine learning (ML) models are increasingly used in computer systems but often suffer from poor generalizability, leading to costly failures on out-of-distribution (OOD) data. We propose an uncertainty-aware framework that improves system resilience by quantifying prediction uncertainty at runtime and rejecting unreliable outputs before they cause harm. When a prediction is uncertain, the system gracefully degrades to a safe fallback strategy. We evaluate the framework across three case studies, server provisioning, cluster management, and storage I/O admission, and find that the best uncertainty estimator is not universal but depends on how its properties align with each task’s design and resource constraints. Similarly, the optimal fallback workflow (e.g., a lightweight and parallel vs. resource-intensive and sequential ) depends on task’s runtime latency constraints. Together, these findings offer a practical path towards building more reliable ML-driven computer systems.

Wed 20 May 14:15 - 14:30 PDT

Practical Adversarial Multi-Armed Bandits with Sublinear Runtime

Kasper Overgaard Mortensen ⋅ Ama Bembua Bainson ⋅ Mathias Tversted ⋅ Kristoffer Strube ⋅ Renata Borovica-Gajic ⋅ Andrea Paudice ⋅ Davide Mottin ⋅ Panagiotis Karras

We study the Multi-Armed Bandit problem in nonstationary adversarial environments, where the identity of the optimal arm can change over time due to shifts in the loss sequence. Motivated by applications such as physical design tuning in database systems, we focus on settings with a very large number of arms and seek practical algorithms with sublinear runtime. Our main contribution is a novel algorithm, Queuing Behind the Leader (QBL), which achieves a per-iteration complexity of O(m log k), where m is the number of arms selected at each step. QBL combines limited update operations via a priority queue, a constant sampling overhead, and a balanced exploration strategy. We evaluate QBL extensively on state-of-the-art benchmarks and demonstrate that it consistently outperforms existing methods in both time and solution quality.

Wed 20 May 14:30 - 14:45 PDT

Unified LLM Model for Power, Performance, and Area Prediction from Hardware Code

Armin Abdollahi ⋅ Mehdi Kamal ⋅ Massoud Pedram

We present RocketPPA, a unified LLM-based model that predicts power, performance, and area for Verilog designs across technology nodes and optimization styles. The approach combines a large language model backbone with mixture-of-experts regression and low-rank adaptation for parameter efficiency. To improve generalization, we introduce a contrastive learning framework that encourages semantically similar designs to cluster in embedding space, providing an inductive bias that reflects the structure of the hardware design space. Trained on 15nm and 45nm nodes with area- and delay-optimized flows, the model achieves 9.4 percentage point improvement in pass rate at ten percent tolerance over prior methods, with approximately 20$\times$ higher throughput (0.12 seconds per design). Ablations show contrastive learning contributes 2.5 points to accuracy, while leave-one-regime-out experiments demonstrate robust cross-regime generalization with minimal degradation. These results validate that combining supervised and contrastive objectives enables rapid, accurate PPA prediction across nodes and optimization styles.

Wed 20 May 14:45 - 15:00 PDT

Automated Algorithm Design for Auto-Tuning Optimizers

Floris-Jan Willemsen ⋅ Niki van Stein ⋅ Ben van Werkhoven

Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular search spaces make manual exploration infeasible. While auto-tuners traditionally rely on classical approaches such as evolutionary, annealing, or surrogate-based optimizers, designing algorithms that efficiently find near-optimal configurations robustly across diverse tasks is challenging. We propose a new paradigm: using large language models (LLMs) to automatically generate optimization algorithms tailored to auto-tuning problems. We introduce a framework that prompts LLMs with problem descriptions and search space characteristics to synthesize, test, and iteratively refine specialized optimizers. These generated algorithms are evaluated on four real-world auto-tuning applications across six hardware platforms and compared against the state-of-the-art in two contemporary auto-tuning frameworks. The evaluation demonstrates that providing additional application- and search space-specific information in the generation stage results in an average performance improvement of 30.7% and 14.6%, respectively. In addition, our results show that LLM-generated optimizers can rival, and in various cases outperform, existing human-designed algorithms, with our best-performing generated optimization algorithms achieving an average 72.4% improvement over state-of-the-art optimizers for auto-tuning.

Main Navigation

Session