

Session

Research-Track Oral Presentation: R16: Agentic AI

Grand Ballroom 1
Tue 19 May 1 p.m. PDT — 2:30 p.m. PDT


Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang ⋅ Yang Li ⋅ Ansong Ni ⋅ Ching-Feng Yeh ⋅ Youssef Emad ⋅ Xinjie Lei ⋅ Liam Robbins ⋅ Karthik Padthe ⋅ Hu Xu ⋅ Xian Li ⋅ Asli Celikyilmaz ⋅ Ramya Raghavendra ⋅ Lifei Huang ⋅ Carole-Jean Wu ⋅ Shang-Wen Li

Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present Matrix, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves 2–15× higher data generation throughput under identical hardware resources, without compromising output quality.
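The core idea, control and data flow riding inside serialized messages through per-agent queues with no central orchestrator, can be sketched in a few lines. This is a hypothetical illustration using stdlib threads and `queue.Queue` in place of Ray's distributed services; the agent names (`draft`, `review`) and message fields are invented for the example, not the Matrix API.

```python
import json
import queue
import threading

# Each agent owns a queue; a task's next hop is encoded in the message itself,
# so no orchestrator is needed to route work between agents.
queues = {"draft": queue.Queue(), "review": queue.Ueue() if False else queue.Queue(), "done": queue.Queue()}

def draft_agent():
    while True:
        msg = queues["draft"].get()
        if msg is None:                      # sentinel: shut down
            break
        task = json.loads(msg)
        task["text"] = f"draft for {task['prompt']}"
        task["next"] = "review"              # control flow rides in the message
        queues["review"].put(json.dumps(task))

def review_agent():
    while True:
        msg = queues["review"].get()
        if msg is None:
            break
        task = json.loads(msg)
        task["approved"] = True
        queues["done"].put(json.dumps(task))

workers = [threading.Thread(target=draft_agent), threading.Thread(target=review_agent)]
for w in workers:
    w.start()
for i in range(3):                           # three independent tasks in flight
    queues["draft"].put(json.dumps({"prompt": f"task-{i}"}))
results = [json.loads(queues["done"].get()) for _ in range(3)]
queues["draft"].put(None)                    # stop the pipeline
queues["review"].put(None)
for w in workers:
    w.join()
print(len(results))
```

In the real system each queue would be distributed and each agent a lightweight Ray task, so tens of thousands of such pipelines can run concurrently.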


Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems

Kirill Nagaitsev ⋅ Luka Grbcic ⋅ Samuel Williams ⋅ Costin Iancu

Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88× speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.
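The exploit-heavy strategy paired with an error-fixing agent that the evaluation favors can be sketched as a simple loop. Everything here is a hypothetical stand-in: `propose`, `fix`, and `benchmark` stub out the LLM optimizer agent, the error-fixing agent, and the timing harness, which in the real system would call models and run kernels on a GPU.

```python
import random

random.seed(0)

def propose(best):
    """Stand-in for an LLM optimizer agent: mutate the current best candidate."""
    return {"code": best["code"] + "+opt", "broken": random.random() < 0.3}

def fix(candidate):
    """Stand-in for an error-fixing agent: repair a candidate that failed to compile."""
    candidate["broken"] = False
    return candidate

def benchmark(candidate):
    """Stand-in for a timing harness: here, more applied opts means 'faster'."""
    return candidate["code"].count("+opt")

best = {"code": "baseline", "broken": False}
for _ in range(5):
    cand = propose(best)          # exploit: always build on the current best
    if cand["broken"]:
        cand = fix(cand)          # the error-fixing agent keeps the search alive
    if benchmark(cand) > benchmark(best):
        best = cand
print(benchmark(best))
```

The granularity finding suggests each `propose` step should apply one small, measurable change rather than a wholesale rewrite.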


OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

Reyna Abhyankar ⋅ Qi Qi ⋅ Yiying Zhang

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future development of computer-use agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3× longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the best agents take 1.5–2.4× more steps than necessary.
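The efficiency metric implied by the abstract, comparing an agent's trajectory length against the human-annotated trajectory, reduces to a simple ratio. The function name and example numbers below are illustrative, not taken from the benchmark's code.

```python
def step_overhead(agent_steps: int, human_steps: int) -> float:
    """Ratio of agent trajectory length to the human-determined trajectory.

    A value above 1.0 means the agent used more steps than necessary;
    the abstract reports 1.5-2.4x for the best agents evaluated.
    """
    return agent_steps / human_steps

# e.g. an agent taking 12 steps on a task a human completes in 5:
print(step_overhead(12, 5))  # 2.4
```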


VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation

Heng Ping ⋅ Arijit Bhattacharjee ⋅ Peiyu Zhang ⋅ Shixuan Li ⋅ Wei Yang ⋅ Anzhe Cheng ⋅ Xiaole Zhang ⋅ Jesse Thomason ⋅ Ali Jannesari ⋅ Nesreen Ahmed ⋅ Paul Bogdan

Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained exploration of the reasoning space. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism that maintains all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy that leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into a two-stage process that exploits LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on the VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15–30% improvements in Pass@1 across diverse LLM backbones, in particular enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
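The quality-guided caching mechanism, keeping every intermediate candidate with a score and letting later layers select from the best seen so far, can be sketched with a heap. The class name, scores, and candidate labels below are invented for illustration; the paper's actual scoring and selection policy is not specified here.

```python
import heapq

class QualityCache:
    """Keeps all intermediate outputs; supports ranked selection at any layer."""

    def __init__(self):
        self._heap = []                       # max-heap via negated scores

    def add(self, score: float, candidate: str) -> None:
        heapq.heappush(self._heap, (-score, candidate))

    def top(self, k: int) -> list:
        """The k highest-scoring candidates accumulated across all layers."""
        return [c for _, c in heapq.nsmallest(k, self._heap)]

cache = QualityCache()
# Candidates from different MoA layers are all retained, not overwritten:
for score, hdl in [(0.4, "v1"), (0.9, "v3"), (0.7, "v2")]:
    cache.add(score, hdl)
print(cache.top(2))  # the next layer builds on the best two seen so far
```

Because nothing is discarded, a strong early candidate can still win over a noisy later layer, which is the noise-propagation problem the cache is meant to address.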


When Enough is Enough: Rank-Aware Early Termination for Vector Search

Jianan Lu ⋅ Asaf Cidon ⋅ Michael Freedman

Graph-based vector search underpins modern LLM applications such as retrieval-augmented generation (RAG), but its efficiency is increasingly constrained by disk I/O. Existing systems continue searching long after discovering the higher-ranked (i.e., most valuable) results for downstream applications. We present Terminus, a rank-aware early termination mechanism that dynamically aligns I/O spending with application utility. Terminus models per-I/O search utility using a rank-weighted function and terminates once recent I/Os yield negligible utility gains. By prioritizing I/O toward results that matter most to downstream tasks, Terminus achieves a better performance–accuracy trade-off. It delivers up to 1.4× higher throughput at the same accuracy target compared to existing early termination schemes, and up to 3.2× higher throughput than a baseline without early termination, with minimal impact on RAG accuracy.
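The termination rule described, weight each I/O's contribution by the rank it improves and stop once a window of recent I/Os adds negligible utility, can be sketched as follows. The linear weighting function, window size, and threshold are assumptions for illustration; the paper's actual rank-weighted utility function may differ.

```python
def rank_weight(rank: int, k: int = 10) -> float:
    """Assumed utility weight: improving a top rank is worth the most."""
    return max(0.0, (k - rank) / k)

def should_stop(recent_gains: list, window: int = 3, eps: float = 0.5) -> bool:
    """Terminate once the last `window` I/Os yielded negligible total utility."""
    if len(recent_gains) < window:
        return False
    return sum(recent_gains[-window:]) < eps

# Simulated search: early I/Os improve top ranks, later ones only the tail.
gains = []
for rank in [0, 1, 2, 8, 9, 9, 9]:        # rank improved by each successive I/O
    gains.append(rank_weight(rank))
    if should_stop(gains):
        break                              # stop paying I/O for low-value results
print(len(gains))
```

Aligning the stopping condition with rank utility rather than raw result count is what lets the mechanism cut I/O without hurting downstream RAG accuracy.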