Track: Industry Track Oral Presentation: Agentic AI/MLSys

Thu 21 May 8:30 - 8:45 PDT

ADR: AN AGENTIC DETECTION SYSTEM FOR ENTERPRISE AGENTIC AI SECURITY

Chenning Li ⋅ Pan Hu ⋅ Justin Xu ⋅ Baris Ozbas ⋅ Olivia Liu ⋅ Caroline Van ⋅ Manxue Li ⋅ Wei Zhou ⋅ Mohammad Alizadeh ⋅ Pengyu Zhang ⋅ KK Sriramadhesikan ⋅ Ming Zhang

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability: existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness: static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs: LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production, with adoption growing to over 7,200 unique hosts and over 10,000 agent sessions processed daily. It has uncovered hundreds of credential exposures across 26 categories and enabled a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks, outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2–4× in F1-score. On AgentDojo (a public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.

Thu 21 May 8:45 - 9:00 PDT

PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving

Yuran Ding ⋅ Ruobing Han ⋅ Xiaofan Zhang ⋅ Xinwei Chen

Optimizing large language model (LLM) training and serving on large-scale distributed systems is a significant challenge. This difficulty stems from the rapidly evolving LLM landscape, the requirement for deep domain expertise, and the need for workload-specific optimization strategies. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce PROMPTS, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning to deliver system-level optimization with far fewer shots. Key components of the proposed framework include an Analyzer Agent that diagnoses performance bottlenecks by synthesizing profiler data and a Proposal Agent that leverages a knowledge base to generate optimized sharding configurations with detailed justifications through retrieval-augmented generation (RAG). Experimental results across eight real-world LLM workloads demonstrate that PROMPTS can provide valid reasoning and accurate recommendations by considering LLM workload characteristics and backend hardware features, delivering performance improvements of up to 434%. These workloads span Mixture-of-Experts (MoE) and dense models, system configurations from 2 TPU chips to 512-chip systems with 2D/3D Torus interconnects, and the full LLM lifecycle including pre-training, post-training, and serving. To validate our agent's system optimization proposals, we benchmarked them against production configurations that were previously optimized by experts, either through extensive manual analysis or automated black-box searches. In every case, our agent independently identified the expert-validated solution within its top three recommendations from a single invocation. Furthermore, the agent's top-ranked recommendation matched the production solution in 87.5% of cases, demonstrating its ability not only to find optimized configurations but also to correctly prioritize the optimization candidates.

Thu 21 May 9:00 - 9:15 PDT

Agentic Operator Generation for ML ASICs

Alec Hammond ⋅ Aram Markosyan ⋅ Aman Dontula ⋅ Simon Mahns ⋅ Zacharias Fisches ⋅ Dmitrii Pedchenko ⋅ Keyur Muzumdar ⋅ Natacha Supper ⋅ Site Cao ⋅ Haishan Zhu ⋅ Mark Saroufim ⋅ Joe Isaacson ⋅ Laura Wang ⋅ Warren Hunt ⋅ Kaustubh Gondkar ⋅ Roman Levenstein ⋅ Gabriel Synnaeve ⋅ Richard Li ⋅ Jacob Kahn ⋅ Ajit Mathews

We present TritorX, an agentic AI system designed to generate functionally correct Triton PyTorch ATen kernels at scale for emerging accelerator platforms. TritorX integrates large language models with a custom linter, JIT compilation, and a PyTorch OpInfo-based test harness. This pipeline is compatible with both real Meta Training and Inference Accelerator (MTIA) silicon and hardware simulation environments for next-generation devices. In contrast to previous kernel-generation approaches that prioritize performance for a limited set of high-usage kernels, TritorX prioritizes coverage. Our system emphasizes correctness and generality across the entire operator set, including diverse data types, shapes, and argument patterns. In our experiments, TritorX successfully generated kernels and wrappers for 481 unique ATen operators that pass all corresponding PyTorch OpInfo tests (over 20,000 in total). TritorX paves the way for overnight generation of complete PyTorch ATen backends for new accelerator platforms.

Thu 21 May 9:15 - 9:30 PDT

Cost-aware Duration Prediction for Software Upgrades in Datacenters

Yi Ding ⋅ Aijia Gao ⋅ Thibaud Ryden ⋅ Michal Sedlak ⋅ Essam Ewaisha ⋅ Igor Marnat ⋅ Henry (Hank) Hoffmann

Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Acela, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Acela accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Acela significantly outperforms the existing upgrade scheduler, improving upgrade window utilization by 1.25x, increasing the number of scheduled and completed upgrades by 33% and 41%, respectively, and reducing cancellation rates by 2.4x. The code and data sets will be released after paper acceptance.
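
As a rough illustration of what "asymmetric misprediction costs" and cost-based model selection can mean in practice, the sketch below penalizes underestimates more heavily than overestimates and picks the candidate model with the lowest total cost. It is not Acela's method: the function names, the linear cost shape, and the weights 3.0 and 1.0 are all assumptions chosen for the example, and the abstract does not say which direction of error is more expensive.

```python
from typing import Dict, Sequence


def asymmetric_cost(
    predicted: float,
    actual: float,
    under_weight: float = 3.0,  # hypothetical weight for underestimating a duration
    over_weight: float = 1.0,   # hypothetical weight for overestimating a duration
) -> float:
    """Linear cost that charges under- and overestimates at different rates."""
    error = actual - predicted
    if error > 0:
        # The upgrade ran longer than predicted (underestimate).
        return under_weight * error
    # The upgrade finished earlier than predicted (overestimate) or exactly on time.
    return over_weight * (-error)


def pick_model(predictions: Dict[str, Sequence[float]], actuals: Sequence[float]) -> str:
    """Return the name of the model with the lowest total asymmetric cost."""
    def total_cost(name: str) -> float:
        return sum(asymmetric_cost(p, a) for p, a in zip(predictions[name], actuals))

    return min(predictions, key=total_cost)


actual_minutes = [10.0, 20.0, 30.0]
candidate_predictions = {
    "optimistic": [8.0, 18.0, 28.0],     # always 2 minutes short
    "conservative": [12.0, 23.0, 33.0],  # 2 to 3 minutes long
}
best = pick_model(candidate_predictions, actual_minutes)
```

Under plain absolute error the "optimistic" model would win (6 versus 8 minutes of total error), but with the assumed weights its underestimates cost 18 against 8 for "conservative", so the selection flips.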

Thu 21 May 9:30 - 9:45 PDT

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

Xingyao Wang ⋅ Simon Rosenberg ⋅ Juan Michelini ⋅ Calvin Smith ⋅ Hoang Tran ⋅ Engel Nyst ⋅ Rohit Malhotra ⋅ Xuhui Zhou ⋅ Valerie Chen ⋅ Robert Brennan ⋅ Graham Neubig

Agents are now used widely in the process of software development, but building production-ready software engineering agents is a complex task. Deploying software agents effectively requires flexibility in implementation and experimentation, reliable and secure execution, and interfaces for users to interact with agents. In this paper, we present the OpenHands Software Agent SDK, a toolkit for implementing software development agents that satisfy these desiderata. This toolkit is a complete architectural redesign of the agent components of the popular OpenHands framework for software development agents. To achieve flexibility, we design a simple interface for implementing agents that requires only a few lines of code in the default case, but is easily extensible to more complex, full-featured agents with custom tools, memory management, and more. For security and reliability, it delivers seamless local-to-remote execution portability and integrated REST/WebSocket services. For interaction with human users, it can connect directly to a variety of interfaces, such as visual workspaces (VS Code, VNC, browser), command-line interfaces, and APIs. Compared with existing SDKs from OpenAI, Claude and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis. We validate the architecture empirically: production deployment data shows that V1 substantially reduces system-attributable failures over V0 with negligible event-sourcing overhead, and evaluations across multiple models and benchmarks demonstrate strong agent performance. Put together, these elements allow the OpenHands Software Agent SDK to provide a practical foundation for prototyping, unlocking new classes of custom applications, and reliably deploying agents at scale.

Thu 21 May 9:45 - 10:00 PDT

BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Zhen Zheng ⋅ Xin Ji ⋅ Taosong Fang ⋅ Fanghao Zhou ⋅ Chuanjie Liu ⋅ Gang Peng

Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and their key performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and are limited in supporting large batched tasks with prefix sharing. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests, but under this implicit cache management, KV context that is about to be reused may be prematurely evicted. Moreover, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address these problems. BatchLLM explicitly identifies common prefixes globally. Requests sharing the same prefix are scheduled together to maximize reuse of the KV context. BatchLLM also reorders the requests, scheduling those with a larger ratio of decoding first to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps increase GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments. Code is available at [https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064](https://github.com/microsoft/MixLLM/tree/batchllm_vllm_064).
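
To make the idea of scheduling prefix-sharing requests together more concrete, here is a deliberately simplified sketch. It is not BatchLLM's algorithm: it assumes a fixed-length leading token window as the "shared prefix" (the hypothetical `prefix_len` parameter), keeps groups in first-seen order, and does not model the decode-ratio reordering or memory-centric token batching described in the abstract. All function names are made up for the example.

```python
from typing import Dict, List, Sequence, Tuple


def group_by_shared_prefix(
    prompts: Sequence[Sequence[int]], prefix_len: int
) -> Dict[Tuple[int, ...], List[int]]:
    """Map each leading token window to the indices of the prompts that start with it."""
    groups: Dict[Tuple[int, ...], List[int]] = {}
    for idx, tokens in enumerate(prompts):
        key = tuple(tokens[:prefix_len])
        groups.setdefault(key, []).append(idx)
    return groups


def schedule_order(prompts: Sequence[Sequence[int]], prefix_len: int) -> List[int]:
    """Return request indices with prefix-sharing requests placed next to each other."""
    groups = group_by_shared_prefix(prompts, prefix_len)
    # Dicts preserve insertion order, so groups appear in first-seen order.
    return [idx for indices in groups.values() for idx in indices]


# Token-id prompts; requests 0, 2, 4 share one prefix and 1, 3 share another.
token_prompts = [
    [1, 2, 3, 9],
    [5, 6, 7, 8],
    [1, 2, 3, 4],
    [5, 6, 7, 1],
    [1, 2, 3, 5],
]
order = schedule_order(token_prompts, prefix_len=3)
```

Because the whole batch is known up front, each shared prefix only needs to be prefilled once while its group is being processed, rather than relying on a cache to still hold it when a later request arrives.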
