Session 4: Measurement and Analysis
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Zhiqiang Xie · Hao Kang · Ying Sheng · Tushar Krishna · Kayvon Fatahalian · Christos Kozyrakis
With more advanced natural language understanding and reasoning capabilities, agents powered by large language models (LLMs) are increasingly developed in simulated environments to perform complex tasks, interact with other agents, and exhibit emergent behaviors relevant to social science research and innovative gameplay development. However, current multi-agent simulations frequently suffer from inefficiencies due to the limited parallelism caused by false dependencies, resulting in a performance bottleneck. In this paper, we introduce AI Metropolis, a simulation engine that improves the efficiency of LLM agent simulations by incorporating out-of-order execution scheduling. By dynamically tracking real dependencies between agents, AI Metropolis minimizes false dependencies, enhances parallelism, and maximizes hardware utilization. Our evaluations demonstrate that AI Metropolis achieves speedups from 1.3× to 4.15× over standard parallel simulation with global synchronization, approaching optimal performance as the number of agents increases.
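The scheduling idea at the heart of the paper, advancing agents out of order whenever their real dependencies permit instead of synchronizing every agent at a global barrier, can be illustrated with a minimal sketch. The Agent class, proximity-based dependency rule, and schedule_round helper below are illustrative assumptions, not the paper's actual engine.

```python
# Minimal sketch of dependency-aware, out-of-order agent scheduling.
# Assumption (for illustration only): an agent's next step depends only on
# agents within a fixed interaction radius, so it may advance past the global
# clock whenever none of those neighbors lag behind it.
from dataclasses import dataclass

INTERACTION_RADIUS = 5.0  # hypothetical dependency range

@dataclass
class Agent:
    agent_id: int
    x: float
    y: float
    clock: int = 0  # number of simulation steps this agent has completed

def neighbors(agent, agents):
    """Agents close enough to constitute a real dependency."""
    return [
        other for other in agents
        if other.agent_id != agent.agent_id
        and (other.x - agent.x) ** 2 + (other.y - agent.y) ** 2
            <= INTERACTION_RADIUS ** 2
    ]

def runnable(agent, agents):
    """An agent may step out of order once no dependency is behind its clock."""
    return all(n.clock >= agent.clock for n in neighbors(agent, agents))

def schedule_round(agents, step_fn):
    """Dispatch every currently runnable agent; step_fn issues the LLM call."""
    batch = [a for a in agents if runnable(a, agents)]
    for agent in batch:  # in a real engine these LLM calls run in parallel
        step_fn(agent)
        agent.clock += 1
    return len(batch)
```

Under lockstep simulation every agent would wait at each global step; here independent agents keep the hardware busy while blocked agents catch up.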
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Yinfang Chen · Manish Shetty · Gagan Somashekar · Minghua Ma · Yogesh Simmhan · Jonathan Mace · Chetan Bansal · Rujia Wang · S R
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys diverse cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
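The orchestration such a framework provides can be sketched as an evaluation loop that deploys an environment, injects a fault, and lets an agent act on telemetry. Every name below (OpsBenchmarkHarness, run_episode, and the component interfaces) is hypothetical and only illustrates the shape of the workflow, not AIOPSLAB's actual API.

```python
# Hypothetical harness illustrating how fault injection, workload generation,
# telemetry export, and agent interaction could be orchestrated per episode.
class OpsBenchmarkHarness:
    def __init__(self, environment, fault_injector, workload_generator, telemetry):
        self.env = environment
        self.faults = fault_injector
        self.workload = workload_generator
        self.telemetry = telemetry

    def run_episode(self, agent, problem, max_steps=20):
        self.env.deploy(problem.application)          # bring up the cloud service
        self.workload.start(problem.traffic_profile)  # generate background load
        self.faults.inject(problem.fault)             # trigger the incident
        for _ in range(max_steps):
            observation = self.telemetry.export()     # logs, metrics, traces
            action = agent.act(observation)           # LLM agent picks an operation
            result = self.env.execute(action)
            agent.observe(result)
            if self.env.incident_resolved():
                break
        return self.score(problem)

    def score(self, problem):
        # e.g. detection/localization correctness and time-to-mitigate
        return {"resolved": self.env.incident_resolved()}
```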
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion
Tianshu Huang · Arjun Ramesh · Emily Ruppel · Nuno Pereira · Anthony Rowe · Carlee Joe-Wong
Accurately estimating workload runtime is a longstanding goal in computer systems, and plays a key role in efficient resource provisioning, latency minimization, and various other system management tasks. Runtime prediction is particularly important for managing increasingly complex distributed systems in which more sophisticated processing is pushed to the edge in search of better latency. Previous approaches for runtime prediction in edge systems suffer from poor data efficiency or require intensive instrumentation; these challenges are compounded in heterogeneous edge computing environments, where historical runtime data may be sparsely available and instrumentation is often challenging. Moreover, edge computing environments often feature multi-tenancy due to limited resources at the network edge, potentially leading to interference between workloads and further complicating the runtime prediction problem. Drawing from insights across machine learning and computer systems, we design a matrix factorization-inspired method that generates accurate interference-aware predictions with tight provably-guaranteed uncertainty bounds. We validate our method on a novel WebAssembly runtime dataset collected from 24 unique devices, achieving a prediction error of 5.2%, 2× better than a naive application of existing methods.
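One way to realize this recipe, shown purely as a sketch, is to factorize a device-by-workload matrix of (log-)runtimes and wrap the point predictions in split-conformal intervals computed from held-out residuals; the rank, learning rate, and helper functions below are illustrative assumptions, not the paper's method.

```python
# Sketch: low-rank completion of a device x workload runtime matrix with
# split-conformal uncertainty bounds. R[i, j] holds the (log-)runtime of
# workload j on device i; unobserved entries are np.nan.
import numpy as np

def factorize(R, rank=4, iters=200, lr=0.05, reg=0.1, seed=0):
    """Fit U, V by SGD over the observed entries so that R ~ U @ V.T."""
    rng = np.random.default_rng(seed)
    n_dev, n_wl = R.shape
    U = 0.1 * rng.standard_normal((n_dev, rank))
    V = 0.1 * rng.standard_normal((n_wl, rank))
    observed = np.argwhere(~np.isnan(R))
    for _ in range(iters):
        for i, j in observed:
            err = R[i, j] - U[i] @ V[j]
            u_old = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * u_old - reg * V[j])
    return U, V

def conformal_halfwidth(y_true_cal, y_pred_cal, alpha=0.1):
    """Split-conformal: (1 - alpha) quantile of absolute calibration residuals."""
    residuals = np.sort(np.abs(y_true_cal - y_pred_cal))
    k = int(np.ceil((1 - alpha) * (len(residuals) + 1))) - 1
    return residuals[min(k, len(residuals) - 1)]

# For a new (device i, workload j) pair: center = U[i] @ V[j], and the
# prediction interval is [center - q, center + q] with q = conformal_halfwidth(...).
```

Interference-awareness would enter through additional features or factors describing co-located workloads; the sketch only covers the base factorization and the coverage-guaranteed interval.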
Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework
Neel P. Bhatt · Yunhao Yang · Rohan Siva · Daniel Milan · Ufuk Topcu · Atlas Wang
Multimodal foundation models offer a promising framework for robotic perception and planning by processing sensory inputs to generate actionable plans. However, addressing uncertainty in both perception (sensory interpretation) and decision-making (plan generation) remains a critical challenge for ensuring task reliability. This paper presents a comprehensive framework to disentangle, quantify, and mitigate these two forms of uncertainty. We first introduce a framework for uncertainty disentanglement, isolating perception uncertainty arising from limitations in visual understanding and decision uncertainty relating to the robustness of generated plans. To quantify each type of uncertainty, we propose methods tailored to the unique properties of perception and decision-making: we use conformal prediction to calibrate perception uncertainty and introduce Formal-Methods-Driven Prediction (FMDP) to quantify decision uncertainty, leveraging formal verification techniques for theoretical guarantees. Building on this quantification, we implement two targeted intervention mechanisms: an active sensing process that dynamically re-observes high-uncertainty scenes to enhance visual input quality and an automated refinement procedure that fine-tunes the model on high-certainty data, improving its capability to meet task specifications. Empirical validation in real-world and simulated robotic tasks demonstrates that our uncertainty disentanglement framework reduces variability by up to 40% and enhances task success rates by 5% compared to baselines. These improvements are attributed to the combined effect of both interventions and highlight the importance of uncertainty disentanglement, which facilitates targeted interventions that enhance the robustness and reliability of autonomous systems. Webpage, videos, demo, and code: https://uncertainty-in-planning.github.io.
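The perception-side calibration can be illustrated with a generic split-conformal sketch: pick a nonconformity score, calibrate a threshold on held-out data, and flag scenes whose prediction sets are large for re-observation. The score, threshold rule, and re-observation trigger below are assumptions for illustration, not the paper's exact calibration or FMDP procedure.

```python
# Generic split-conformal calibration for a perception classifier.
# Nonconformity score: 1 - softmax probability assigned to the true label.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (n, n_classes) softmax outputs; returns the conformal threshold."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(test_probs, threshold):
    """Class indices whose nonconformity score stays under the threshold."""
    return np.where(1.0 - test_probs <= threshold)[0]

def needs_reobservation(test_probs, threshold, max_set_size=1):
    """Active-sensing trigger: a large prediction set signals high perception
    uncertainty, so the scene should be re-observed."""
    return len(prediction_set(test_probs, threshold)) > max_set_size
```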
Software bloat refers to code and features that are not used by software at runtime. For Machine Learning (ML) systems, bloat is a major contributor to their technical debt, leading to decreased performance and resource wastage. In this work, we present Negativa-ML, a novel tool to identify and remove bloat in ML frameworks by analyzing their shared libraries. Our approach includes novel techniques to detect and locate unnecessary code within GPU code, a key area overlooked by existing research. We evaluate Negativa-ML using four popular ML frameworks across ten workloads over 300 shared libraries. Our results demonstrate that ML frameworks are highly bloated on both the GPU and CPU side, with GPU code being a primary source of bloat. On average, Negativa-ML reduces the GPU code size by up to 75% and the CPU code size by up to 72%, resulting in total file size reductions of up to 55%. Through debloating, we achieve reductions in peak CPU memory usage, peak GPU memory usage, and execution time by up to 74.6%, 69.6%, and 44.6%, respectively.
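A crude CPU-side illustration of the underlying idea is to compare the symbols a shared library defines against those actually reached at runtime; anything never reached is a candidate for removal. This is not Negativa-ML's technique (which in particular locates unused GPU code), and the nm-based listing plus the externally supplied used_symbols set are assumptions for illustration.

```python
# Sketch: estimate shared-library bloat by diffing defined symbols against a
# runtime-observed set of used symbols (supplied by a tracer or profiler).
import subprocess

def defined_symbols(lib_path):
    """Dynamic symbols defined by the library, via `nm -D --defined-only`."""
    out = subprocess.run(
        ["nm", "-D", "--defined-only", lib_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.split()[-1] for line in out.splitlines() if line.strip()}

def bloat_estimate(lib_path, used_symbols):
    """Return (fraction of defined symbols never used, the unused symbols)."""
    defined = defined_symbols(lib_path)
    unused = defined - set(used_symbols)
    return len(unused) / max(len(defined), 1), unused
```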