Session
Poster Session 1 & Opening Reception
Evergreen Ballroom
AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
Hojoon Kim ⋅ Yuheng Wu ⋅ Thierry Tambe
Large language models (LLMs) have recently been integrated into embodied AI agents, yet their synchronous plan-act loop imposes severe latency and cost bottlenecks. We present AgenticCache, a cache-driven asynchronous planning framework that decouples LLM reasoning from real-time execution. By identifying strong plan transition locality in embodied tasks, AgenticCache enables agents to reuse frequently occurring plan fragments and update them asynchronously through a background LLM process. This design converts idle waiting time into productive action while preserving context-aware decision quality. Across four multi-agent embodied benchmarks, AgenticCache improves task success rates by 24.34%, reduces simulation latency by 75%, and lowers token usage by 65% on average. These results demonstrate that caching and asynchronous reasoning together offer a path toward real-time, low-cost, and cognitively inspired autonomy in LLM-based agents.
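The plan-reuse idea at the heart of this abstract can be sketched as follows (a minimal illustration under stated assumptions, not the authors' implementation; `PlanCache`, `state_key`, and `llm_plan` are hypothetical names):

```python
# Minimal sketch of a plan-fragment cache with an asynchronous refresh hook.
# Hypothetical names; in AgenticCache the refresh would run in a background
# LLM process rather than synchronously as shown here.
class PlanCache:
    def __init__(self):
        self.store = {}   # state_key -> cached plan fragment
        self.hits = 0
        self.misses = 0

    def get_plan(self, state_key, llm_plan):
        """Return a cached fragment if this state was seen before;
        otherwise fall back to the (slow) LLM planner and cache the result."""
        if state_key in self.store:
            self.hits += 1
            return self.store[state_key]
        self.misses += 1
        plan = llm_plan(state_key)   # synchronous fallback
        self.store[state_key] = plan
        return plan

    def refresh_async(self, state_key, llm_plan):
        """Overwrite a cached fragment with a fresh plan (stands in for the
        background update process described in the abstract)."""
        self.store[state_key] = llm_plan(state_key)
```

Because embodied tasks exhibit strong plan transition locality, repeated states hit the cache and the agent acts immediately instead of blocking on the LLM.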
Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing
Zhongshu Gu ⋅ Salman Ahmed ⋅ Julian James Stephen ⋅ Shixuan Zhao ⋅ Zhiqiang Lin
GPU Confidential Computing (GPU-CC), introduced with the NVIDIA Hopper architecture, extends confidential computing protections from CPUs to GPUs, enabling secure execution of AI workloads. For end users, enabling GPU-CC is seamless and requires no modifications to existing applications. However, behind this ease of adoption lies a proprietary and highly complex system whose opacity presents significant challenges for early adopters and system researchers seeking to understand its architecture and security landscape. In this work, we provide a security-focused look at GPU-CC by reconstructing a coherent view of the system. Our analysis begins from the GPU-CC’s blueprint, focusing on the specialized architectural engines that underpin its security design. We then investigate GPU-CC’s bootstrap process, which orchestrates hardware and software components to establish core security mechanisms. Finally, we conduct targeted experiments to evaluate whether, under the GPU-CC’s threat model, data transfers via different data paths remain secure when they cross the bridge between trusted CPU and GPU domains. All security findings presented in this paper have been reported responsibly to the NVIDIA Product Security Incident Response Team (PSIRT).
Reinforcement learning is a promising approach to autonomous and adaptive security management in networked systems. However, current reinforcement learning solutions for security management are mostly limited to simulation environments, and it is unclear how they generalize to operational systems. In this paper, we address this limitation by presenting CSLE: a reinforcement learning platform for autonomous security management that enables experimentation under semi-operational conditions. Conceptually, CSLE encompasses two systems. First, it includes an emulation system that replicates key components of the target system in a virtualized environment. We use this system to gather measurements and logs, based on which we identify a system model, such as a Markov decision process. Second, it includes a simulation system where security strategies are efficiently learned through simulations of the system model. The learned strategies are then evaluated and refined in the emulation system to close the gap between theoretical and operational performance. We demonstrate CSLE through four use cases: flow control, replication control, segmentation control, and recovery control. Through these use cases, we show that CSLE enables near-optimal security management in a semi-operational environment.
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Siqi Chen ⋅ Ke Hong ⋅ Tianchen Zhao ⋅ Ruiqi Xie ⋅ Zhenhua Zhu ⋅ Xudong Zhang ⋅ Yu Wang
Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a \textit{sparse imbalance ratio} to quantify the imbalance, and propose \textit{db}-SP, a sparsity-aware sequence parallelism technique that tackles this challenge. \textit{db}-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, \textit{db}-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that \textit{db}-SP delivers an end-to-end speedup of 1.25× and an attention-specific speedup of 1.40× over state-of-the-art sequence parallel methods on average.
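One plausible formalization of an imbalance ratio (an assumption for illustration; the paper's exact definition of the sparse imbalance ratio may differ) is the maximum per-worker workload divided by the mean workload:

```python
# Illustrative imbalance metric: max per-worker work over mean work, where
# "work" counts the dense blocks of the sparse attention mask assigned to
# each worker. A perfectly balanced partition yields 1.0.
def imbalance_ratio(blocks_per_worker):
    mean = sum(blocks_per_worker) / len(blocks_per_worker)
    return max(blocks_per_worker) / mean

assert imbalance_ratio([4, 4, 4, 4]) == 1.0  # balanced
assert imbalance_ratio([8, 2, 2, 4]) == 2.0  # slowest worker has 2x mean work
```

Under such a metric, the slowest worker dictates step latency, which is why dual-level (head- and block-dimension) partitioning that drives the ratio toward 1.0 translates directly into speedup.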
DisAgg: Distributed Aggregators for Efficient Secure Aggregation
Haaris Mehmood ⋅ Giorgos Tatsis ⋅ Dimitrios Alexopoulos ⋅ Karthikeyan Saravanan ⋅ Jie Xi ⋅ Mete Ozay
Federated learning enables collaborative model training across distributed clients, yet vanilla FL exposes client updates to the central server. Secure‑aggregation schemes protect privacy against an honest‑but‑curious server, but existing approaches often suffer from many communication rounds, heavy public‑key operations, or difficulty handling client dropouts. Recent methods like One‑Shot Private Aggregation (OPA) cut rounds to a single server interaction per FL iteration, yet they impose substantial cryptographic and computational overhead on both server and clients. We propose a new protocol that leverages a small committee of clients called \textit{aggregators} to perform the aggregation itself: each client secret‑shares its update vector to aggregators, which locally compute partial sums and return only aggregated shares for server‑side reconstruction. This design eliminates local masking and expensive homomorphic encryption, reducing endpoint computation while preserving privacy against a curious server and a limited fraction of colluding clients. Extensive experiments with up to 50k users and 10k‑dimensional update vectors show that, by leveraging optimal trade-offs between communication and computation costs, our protocol is at least $1.9\times$ faster than OPA, the previous best protocol.
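The committee-based aggregation described above rests on additive secret sharing, which can be sketched as follows (a minimal toy over a prime field, assuming scalar updates for brevity; not the paper's protocol):

```python
import random

# Additive secret sharing over a prime field (illustrative sketch).
P = 2**31 - 1  # prime modulus

def share(value, n):
    """Split `value` into n additive shares that sum to value mod P."""
    shares = [random.randrange(P) for _ in range(n - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Three clients each share a scalar update with two aggregators; each
# aggregator sums only the shares it received, so it learns nothing about
# any individual update. The server adds the partial sums.
clients = [10, 20, 30]
n_agg = 2
per_agg = [0] * n_agg
for v in clients:
    for i, s in enumerate(share(v, n_agg)):
        per_agg[i] = (per_agg[i] + s) % P
total = sum(per_agg) % P
assert total == sum(clients)  # server recovers only the aggregate
```

Because clients only split and send shares (no masking, no homomorphic encryption), endpoint computation stays light, matching the design goal stated in the abstract.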
Flash3DGS: Algorithm and System Co-Optimization for Fast 3D Gaussian Splatting on GPUs
Lingjun Gao ⋅ Zhican Wang ⋅ Zhiwen Mo ⋅ Hongxiang Fan
Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality and efficient novel view synthesis, demonstrating great potential in real-world applications such as robotic perception and digital-twin construction. However, 3DGS requires processing up to millions of Gaussians in parallel, imposing significant computational and memory demands that limit its deployment on resource-constrained platforms. Through systematic profiling and analysis, this paper identifies several redundancies at both the algorithmic and system implementation levels. These insights motivate us to explore several novel optimizations, including adaptive early sorting, GPU-efficient axis-shared rasterization, and dynamic thresholding. Unlike prior work that focuses only on either algorithmic improvements or systems optimization, our approach explores a joint algorithm and system co-optimization to push the performance limits of 3DGS on GPUs. Comprehensive evaluation demonstrates that our co-optimization approach, named \textit{Flash3DGS}, achieves a speed-up of up to $1.41 \times$ with negligible algorithmic performance drop in rendering image quality compared with the \textit{gsplat} baseline. Importantly, our co-optimization is orthogonal to most existing 3DGS acceleration methods, allowing for synergistic performance gains when used in combination. We plan to release our code publicly upon paper acceptance to support reproducibility and future research.
FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap
Taosong Fang ⋅ Zhen Zheng ⋅ Zhengzhao Ma ⋅ Yaojie Lu ⋅ Hongyu Lin ⋅ Xianpei Han ⋅ Le Sun
Large Language Models (LLMs) are increasingly deployed as collaborating agents in Multi-Agent Systems (MAS), where sequential agent interactions create significant latency bottlenecks. Traditional serving systems require each downstream agent to wait for complete upstream generation before starting prefill, leaving substantial idle time during inter-agent transitions. We present FlashAgents, a system that accelerates multi-agent workflows through token-level streaming and prefix-aware coordination. FlashAgents introduces Inter-agent streaming and incremental prefill, which streams tokens between agents and performs incremental prefill to overlap downstream prefill with upstream decode, reducing inter-agent latency. For concurrent workloads, an intra-turn prefix cache built on radix trees detects and eliminates redundant prefill across requests sharing common instruction templates, avoiding recomputation of shared prefixes within the same processing turn. Implemented on SGLang, FlashAgents achieves up to 46\% end-to-end latency reduction on real workflows and 3.5$\times$ speedup in controlled two-agent benchmarks, demonstrating consistent improvements across diverse models and interaction patterns.
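The prefix-reuse idea can be sketched with a simple trie (a minimal stand-in for the radix tree the abstract describes; the functions and token strings here are hypothetical):

```python
# Minimal sketch of shared-prefix detection across requests that use the
# same instruction template (illustrative; FlashAgents uses a radix tree
# inside SGLang and caches KV state, not raw tokens).
def insert(cache, tokens):
    node = cache
    for t in tokens:
        node = node.setdefault(t, {})

def longest_cached_prefix(cache, tokens):
    """Return how many leading tokens are already cached."""
    n, node = 0, cache
    for t in tokens:
        if t not in node:
            break
        node = node[t]
        n += 1
    return n

cache = {}
insert(cache, ["sys", "You are agent A.", "task", "1"])
# A second request with the same template only needs to prefill the suffix.
reused = longest_cached_prefix(cache, ["sys", "You are agent A.", "task", "2"])
assert reused == 3
```

Detecting the shared prefix lets the second request skip recomputation of the common template, which is the source of the redundant-prefill savings claimed for concurrent workloads.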
FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models
Hariharan Ramesh ⋅ Jyotikrishna Dass
Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise; require transmitting large stacked local adapters, leading to poor communication efficiency; or necessitate reconstructing a memory-dense global weight-update matrix and performing a computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.
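The thresholding step can be illustrated with a cumulative-energy rule (one common choice, assumed here for illustration; FLoRIST's exact thresholding criterion may differ):

```python
# Sketch of tunable singular-value thresholding for rank selection:
# keep the smallest rank whose cumulative squared-singular-value energy
# reaches a fraction `tau` of the total (illustrative, not the paper's
# exact rule). Singular values are assumed sorted in descending order.
def select_rank(singular_values, tau):
    total = sum(s * s for s in singular_values)
    acc = 0.0
    for r, s in enumerate(singular_values, start=1):
        acc += s * s
        if acc / total >= tau:
            return r
    return len(singular_values)

# Two components carry more than 99% of the energy here.
assert select_rank([10.0, 5.0, 0.1, 0.01], 0.99) == 2
```

A tunable `tau` trades adapter size (communication cost) against fidelity of the aggregated update, which is the balance the abstract emphasizes.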
From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill
Gunjun Lee ⋅ Younjoo Lee ⋅ Jung Ho Ahn
Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to \textbf{39\%} and inflate energy consumption. We propose \textbf{layered prefill}, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to \textbf{70\%}, end-to-end latency by \textbf{41\%}, and per-token energy by up to \textbf{22\%}. Evaluations show that layered prefill consistently improves the TTFT--TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
G-HEMP: Fast Multi-GPU Private Inference for Large-Scale GCNs with Homomorphic Encryption
Ran Ran ⋅ Zhaoting Gong ⋅ Zhaowei Li ⋅ Xianting Lu ⋅ Jiajia Li ⋅ Wujie Wen
Homomorphic Encryption (HE) offers a promising solution for privacy-preserving Graph Convolutional Networks (GCN) inference in untrusted cloud environments by enabling computation directly on encrypted data. This capability is particularly valuable in applications such as recommendation systems, financial analysis, and bioinformatics, where the data is subject to strict privacy requirements. However, applying HE to large-scale GCN inference introduces substantial computational and memory overhead, which significantly limits scalability and runtime performance. Although prior works have demonstrated promising results with CPU-based implementations, these approaches remain constrained in terms of throughput and scalability due to redundant HE operations and high memory demands. In this work, we present G-HEMP, the first framework that leverages the power of multi-GPU systems to accelerate large-scale private GCN inference. G-HEMP introduces two key innovations: (i) a block-diagonal parallel packing technique that eliminates redundant data replication for encrypted adjacency matrices, achieving up to 4.41× latency speedup over traditional feature-wise packing; and (ii) a multi-GPU workload partitioning strategy that reduces peak memory usage by 50% and improves inference latency by up to 1.98×. By combining these techniques, the number of HE operations is significantly reduced, and the encrypted computation can be partitioned and efficiently distributed across multiple GPUs to maximize throughput and hardware utilization. Our G-HEMP framework is model-agnostic and scales seamlessly with large GCN inference tasks. Together, these contributions enable scalable and efficient privacy-preserving GCN inference, advancing the practicality of HE-based GCN analytics on modern heterogeneous hardware.
HELIOS: Adaptive Model and Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar ⋅ Shashank Nag ⋅ Jason Clemons ⋅ Lizy John ⋅ Poulami Das
Early-Exit Large Language Models (EE-LLMs) enable high-throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by how much computation and memory the early exits actually save. Existing EE-LLM frameworks rely on a single model and therefore their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings prevents scaling to larger batch sizes. We propose \textit{HELIOS}, a framework that improves both token generation latency and batch sizes to enable high throughput in EE-LLMs. HELIOS exploits two insights. \textit{First}, early exits are often complementary across models: tokens that do not exit early on one model often take an early exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early, and minimize token generation latencies. \textit{Second}, even when a predicted token does not exit early due to poor confidence, it often remains unchanged even after additional layer traversal. HELIOS greedily allows such tokens to exit early and only loads the weights of the layers most likely to be used, yielding memory savings that are then re-purposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch sizes compared to existing EE-LLM frameworks.
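The basic early-exit decision can be sketched as a confidence-gated layer walk (a toy illustration with made-up confidence values; HELIOS's actual policy additionally tracks cross-model exit statistics and greedy exits):

```python
# Toy confidence-gated early exit: a token leaves at the first layer whose
# exit head clears the confidence threshold, otherwise it traverses all
# layers (hypothetical values, not HELIOS's real policy).
def exit_layer(layer_confidences, threshold):
    for layer, conf in enumerate(layer_confidences):
        if conf >= threshold:
            return layer
    return len(layer_confidences) - 1

assert exit_layer([0.2, 0.5, 0.95, 0.99], 0.9) == 2  # exits at layer 2
assert exit_layer([0.1, 0.2, 0.3, 0.4], 0.9) == 3    # traverses all layers
```

Tokens of the second kind are the latency bottleneck a single-model framework cannot avoid, which motivates switching to a model where the same token is more likely to exit early.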
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
Yi Li ⋅ Lianjie Cao ⋅ Faraz Ahmed ⋅ Puneet Sharma ⋅ Bingzhe Li
Agentic AI systems require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or hybrid), incurring high retrieval latency and poor storage scalability. We introduce \textbf{Hippocampus}, an agentic memory management system that uses compact binary signatures for semantic search and lossless token-ID streams for exact content reconstruction. Its core is a Dynamic Wavelet Matrix (DWM) that compresses and co-indexes both streams to support ultra-fast search in the compressed domain, thus avoiding costly dense-vector or graph computations. This design scales linearly with memory size, making it suitable for long-horizon agentic deployments.
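Search over compact binary signatures reduces to Hamming-distance comparison, which can be sketched as follows (an illustrative toy with 8-bit signatures; Hippocampus additionally co-indexes token-ID streams in its Dynamic Wavelet Matrix):

```python
# Nearest-neighbor search over binary signatures via Hamming distance
# (illustrative sketch; real signatures would be much wider than 8 bits).
def hamming(a, b):
    return bin(a ^ b).count("1")

def nearest(signatures, query):
    """Index of the stored signature closest to `query` in Hamming distance."""
    return min(range(len(signatures)), key=lambda i: hamming(signatures[i], query))

sigs = [0b10110010, 0b01001101, 0b10110000]
assert nearest(sigs, 0b10110011) == 0  # one bit away from the first entry
```

Because XOR and popcount are cheap bit operations, this kind of search avoids the floating-point distance computations that dominate dense-vector retrieval.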
Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Dong Wang ⋅ Yang Li ⋅ Ansong Ni ⋅ Ching-Feng Yeh ⋅ Youssef Emad ⋅ Xinjie Lei ⋅ Liam Robbins ⋅ Karthik Padthe ⋅ Hu Xu ⋅ Xian Li ⋅ Asli Celikyilmaz ⋅ Ramya Raghavendra ⋅ Lifei Huang ⋅ Carole-Jean Wu ⋅ Shang-Wen Li
Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$--$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.
Retrieval-augmented generation (RAG) enables LLMs to ground responses in external knowledge, but long-term, multi-session conversations still suffer from implicit recall failures: when current user queries lack lexical overlap with earlier facts (e.g., preferences), standard dense retrieval and long-context prompting often miss the most relevant memories. We present a dialogue-aware RAG system that jointly addresses what to store and how to retrieve under constraints. Our design extracts durable user facts into a lightweight memory graph, enriches queries with conversational cues, performs hybrid retrieval, and uses a budget-aware router to balance quality and serving cost. On our Implicit Preference Recall benchmark, the system lifts Recall@10 to 0.70 (vs. 0.58 for dense-only) and improves nDCG@10 from 0.41 to 0.51. The system also reduces cross-modality disagreement by 47% and achieves an 81% cost reduction compared to long-context methods.
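Hybrid retrieval typically fuses ranked lists from lexical and dense retrievers; reciprocal rank fusion is one common choice (an assumption here, since the abstract does not name its fusion method; memory IDs are hypothetical):

```python
# Reciprocal rank fusion (RRF): each document scores 1/(k + rank) in every
# list that contains it, and scores are summed (illustrative of hybrid
# retrieval; this system's actual fusion rule is unspecified).
def rrf(rank_lists, k=60):
    scores = {}
    for ranks in rank_lists:
        for r, doc in enumerate(ranks):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + r + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["m2", "m1", "m3"]    # dense-retriever ranking of memory items
lexical = ["m1", "m4"]        # lexical ranking
fused = rrf([dense, lexical])
assert fused[0] == "m1"       # ranks high in both lists, so wins the fusion
```

Fusion of this kind is what lets a memory with low lexical overlap but high semantic similarity (or vice versa) still surface near the top, addressing the implicit-recall failures described above.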
Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems
Kirill Nagaitsev ⋅ Luka Grbcic ⋅ Samuel Williams ⋅ Costin Iancu
Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88× speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
Reyna Abhyankar ⋅ Qi Qi ⋅ Yiying Zhang
Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3× longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the best agents take 1.5–2.4× more steps than necessary.
PLA-Serve: A Prefill-Length-Aware LLM Serving System
Jianshu She ⋅ Zonghang Li ⋅ HONGCHAO DU ⋅ Shangyu Wu ⋅ Wenhao Zheng ⋅ Eric Xing ⋅ Zhengzhong Liu ⋅ Huaxiu Yao ⋅ Chun Jason Xue ⋅ Qirong Ho
PLA-Serve identifies and disaggregates requests with different prompt lengths in LLM serving to reduce TTFT latency. While recent systems have decoupled the prefill and decode stages to improve throughput, they still rely on unified scheduling policies that fail to adapt to heterogeneous workload characteristics. We observe that prompt-length variations lead to distinct performance bottlenecks, motivating an adaptive scheduling strategy. PLA-Serve disaggregates multi-round long-prefill requests from short-prefill ones and introduces a length-aware smart batching mechanism for short-prefill workloads. It adopts a dual-queue design that supports temporal disaggregation on a single prefill instance or spatial disaggregation across multiple instances. For short-prefill batches, a batch waiting window and CUDA Graph–based clustering mitigate interference from heterogeneous computation, reducing batching delay and lowering average latency. In real multi-turn workloads, PLA-Serve reduces short-prefill latency by over 30% compared to vanilla SGLang under prefill–decode disaggregation, and decreases SLO violations by 28% in multi-instance deployments. Compared to the SGLang router with load balancing, it further lowers SLO violations by 12% in multi-GPU settings. Under high concurrency and mixed-request scenarios, PLA-Serve improves throughput by up to 35% for prefill instances, demonstrating its effectiveness in optimizing heterogeneous LLM serving workloads.
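The dual-queue idea can be sketched as length-based routing (a minimal illustration with a hypothetical `threshold`; PLA-Serve's real policy additionally batches, waits, and disaggregates across instances):

```python
# Length-aware routing into dual queues: short-prefill requests go to one
# queue, long-prefill requests to another, so short requests are not stuck
# behind long prefills (illustrative sketch, not PLA-Serve's scheduler).
def route(requests, threshold):
    """requests: list of (request_id, prompt_length) pairs."""
    short, long_ = [], []
    for req_id, prompt_len in requests:
        (long_ if prompt_len > threshold else short).append(req_id)
    return short, long_

short_q, long_q = route([("r1", 128), ("r2", 8192), ("r3", 256)], 1024)
assert short_q == ["r1", "r3"] and long_q == ["r2"]
```

Separating the queues is what enables either temporal disaggregation (alternating queues on one prefill instance) or spatial disaggregation (dedicating instances per queue).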
PLayer-FL: A Principled Approach to Personalized Layer-wise Cross-Silo Federated Learning
Ahmed Elhussein ⋅ Florent Pollet ⋅ Gamze Gursoy
Federated learning (FL) with non-IID data often degrades client performance below local training baselines. Partial FL addresses this by federating only early layers that learn transferable features, but existing methods rely on ad-hoc, architecture-specific heuristics. We first conduct a systematic analysis of layer-wise generalization dynamics in FL, revealing an early-emerging transition between generalizable (safe-to-federate) and task-specific (should-remain-local) layers. Building on this, we introduce Principled Layer-wise Federated Learning (PLayer-FL), which aims to deliver the benefits of federation more robustly. PLayer-FL computes a novel federation-sensitivity metric efficiently after a single training epoch to choose the optimal split point for a given task. Inspired by model pruning, the metric quantifies each layer’s robustness to aggregation and highlights where federation shifts from beneficial to detrimental. We show that this metric correlates strongly with established generalization measures across diverse architectures. Crucially, experiments demonstrate that PLayer-FL achieves consistently competitive performance across a wide range of tasks while distributing gains more equitably and reducing client-side regressions relative to baselines.
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
Jianming Tong ⋅ Hanshen Xiao ⋅ ⋅ Hao Kang ⋅ Ashish Sirasao ⋅ Ziqi Zhang ⋅ G. Edward Suh ⋅ Tushar Krishna
Multi-user virtual reality (VR) applications such as football and concert experiences rely on real-time avatar reconstruction to enable immersive interaction. However, rendering avatars for numerous participants on each headset incurs prohibitive computational overhead, fundamentally limiting scalability. This work introduces Privatar, a framework that offloads avatar reconstruction from the headset to untrusted devices within the same local network while safeguarding sensitive facial features against adversaries capable of intercepting offloaded data. Privatar builds on two insights. (1) **System level**. We observe that identity-bearing information in facial inputs is highly skewed across frequencies, and propose **Horizontal Partitioning (HP)**, which keeps the most identifying frequency components on-device and offloads only low-identifiability components. HP offloads local computation while preserving privacy against expression identification attacks. (2) **Privacy accounting level**. For **individually** offloaded, **multi-dimensional** signals without aggregation, worst-case local Differential Privacy requires prohibitive noise, ruining utility. We observe that users' expression distributions are **stable over time**, and hence propose Distribution-Aware Minimal Perturbation (DAMP). DAMP minimizes noise based on each user's expression distribution, significantly reducing its effect on utility and accuracy while retaining formal privacy guarantees. On a Meta Quest Pro, Privatar supports up to 2.37$\times$ more concurrent users at 5.7–6.5\% higher reconstruction loss and ~9\% energy overhead, providing a better Throughput-Loss Pareto frontier than SotA quantization, sparsity, and local-reconstruction baselines. Privatar further provides a provable privacy guarantee and stays robust against both an empirical attack and an NN-based Expression Identification Attack, demonstrating its resilience in practice. Our code is open-sourced at https://anonymous.4open.science/r/Privatar-372A.
ProToken: Token-Level Attribution for Federated Large Language Models
Waris Gill ⋅ Ahmad Humayun ⋅ Ali Anwar ⋅ Muhammad Ali Gulzar
Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98.62% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy when the number of clients is scaled, validating its practical viability for real-world deployment settings.
RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
Yinsicheng Jiang ⋅ Yeqi Huang ⋅ Liang Cheng ⋅ Cheng Deng ⋅ Xuan Sun ⋅ Luo Mai
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing inference engines (SGLang and vLLM) and improves performance by 1.5–3× over state-of-the-art methods (CacheBlend, RadixCache, LMCache, HiCache, and RAGCache), while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads.
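The de-duplication idea can be sketched as follows (a minimal illustration with hypothetical document IDs; RAGBoost's actual index operates on KV-cache state across sessions, not raw ID lists):

```python
# Sketch of accuracy-preserving context reuse: keep previously sent items
# in their original order (so the cached prefix stays valid) and append
# only genuinely new retrieved items (illustrative, not RAGBoost's index).
def build_context(prev_items, new_items):
    seen = set(prev_items)
    appended = [d for d in new_items if d not in seen]
    return prev_items + appended

turn1 = ["doc_a", "doc_b"]
turn2 = build_context(turn1, ["doc_b", "doc_c"])
assert turn2 == ["doc_a", "doc_b", "doc_c"]  # doc_b is reused, not re-prefetched
```

Preserving the order of previously sent items is the crux: reordering would invalidate the cached prefix, while stable ordering lets overlapping retrievals across turns and sessions hit the cache.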
RagInfer: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Chien-Yu Lin ⋅ Keisuke Kamahori ⋅ Yiyu Liu ⋅ Xiaoxiang Shi ⋅ Madhav Kashyap ⋅ Yile Gu ⋅ Rulin Shao ⋅ Zihao Ye ⋅ Kan Zhu ⋅ Rohan Kadekodi ⋅ Stephanie Wang ⋅ Arvind Krishnamurthy ⋅ Luis Ceze ⋅ Baris Kasikci
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, creating a significant system challenge: achieving high throughput and low latency is difficult, especially when GPU memory is limited. To address these challenges, we propose RAGInfer, an efficient inference system that reduces latency and improves throughput with minimal GPU memory requirements. The core innovation of RAGInfer is \emph{lookahead retrieval}, a prefetching mechanism that predicts required data and transfers them from CPU to GPU in parallel with LLM generation. In addition, RAGInfer adopts a prefetching scheduler and a cache-aware scheduler to support efficient multi-GPU inference with minimal overhead. Evaluations show RAGInfer achieves up to a 1.53$\times$ average end-to-end latency reduction (single-query) and 1.83$\times$ higher average throughput (batched), as well as good scalability in throughput. This confirms the practical utility of RAGInfer for faster and more memory-efficient deployments of RAG applications.
SONAR: Benchmarking Topology and Collaboration in Decentralized Learning
Joyce Yuan ⋅ Yichuan Shi ⋅ Abhishek Singh ⋅ Rishi Sharma ⋅ Ramesh Raskar ⋅ Martin Jaggi
The performance, efficiency, and reliability of decentralized machine learning hinge on systems factors such as network topology, communication budget, and device heterogeneity—yet existing frameworks treat these as fixed or opaque. Federated learning remains centrally orchestrated, while peer-to-peer (P2P) approaches lack a unified foundation for analyzing how topology and system design jointly shape learning outcomes. We present \textbf{SONAR}, a systems framework for reproducible, topology-aware decentralized learning. SONAR unifies communication, topology, and telemetry in a layered architecture supporting multiple backends (gRPC, MPI, WebRTC), static and adaptive graphs, and per-node logging of bandwidth, latency, and collaboration dynamics. Using SONAR, we make three observations: (1) topology and its graph-level statistics show no consistent or linear correlation with learning performance across accuracy, robustness, and privacy metrics, underscoring the need to study topology as an independent systems variable; (2) under realistic constraints such as limited communication rounds or bandwidth, topology governs how quickly information propagates—producing up to ≈ 20% performance differences between graph families; and (3) adaptive neighbor selection can induce collaborator collapse—a failure mode where network diversity erodes over time. By exposing topology as a first-class experimental dimension, SONAR enables systematic, reproducible evaluation of decentralized learning across performance, efficiency, and robustness regimes.
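Observation (2) — that topology governs how fast information propagates — can be illustrated with a toy gossip-averaging experiment; this is a pure-Python illustration of the general effect, not SONAR's benchmark or metrics.

```python
# Toy illustration: each node averages its value with its neighbors each
# round. A denser topology (complete graph) reaches consensus far faster
# than a sparse one (ring) over the same number of rounds.

def gossip_round(values, neighbors):
    return [
        sum([values[i]] + [values[j] for j in neighbors[i]])
        / (1 + len(neighbors[i]))
        for i in range(len(values))
    ]

def spread(values):
    return max(values) - min(values)

n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
complete = {i: [j for j in range(n) if j != i] for i in range(n)}

v_ring = [float(i) for i in range(n)]
v_full = [float(i) for i in range(n)]
for _ in range(3):
    v_ring = gossip_round(v_ring, ring)
    v_full = gossip_round(v_full, complete)
# After the same communication budget the complete graph has essentially
# converged, while the ring still shows a large spread.
```

Real decentralized learning exchanges model updates rather than scalars, but the propagation bottleneck is the same.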
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token
Rajveer Bachkaniwala ⋅ Richard So ⋅ Divya Mahajan ⋅ Kexin Rong
Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Recent work mitigates this via streaming (overlapping retrieval with inference), but prior systems focus on single-request settings and overlook challenges in multi-tenant deployments, where concurrent requests contend for GPU memory and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a system that extends vLLM to support streaming prompts with adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). Stream2LLM decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses cache invalidation based on longest-common-prefix matching to minimize redundant computation when prompts change dynamically. To evaluate Stream2LLM, we collect and characterize two large-scale, real-world streaming workloads based on web crawling and approximate nearest-neighbor search. Our evaluation demonstrates that the streaming architecture delivers up to 11× TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, while maintaining throughput parity with non-streaming baselines.
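Longest-common-prefix cache invalidation, as used in update-mode, can be sketched directly; the class below is an illustrative mock (it tracks token ids only, not real KV tensors), and the names are assumptions.

```python
# Sketch of longest-common-prefix cache invalidation: when a streamed
# prompt changes, only the KV entries past the common prefix are
# discarded and re-prefilled.

def longest_common_prefix(old_tokens, new_tokens):
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

class StreamingKVCache:
    def __init__(self):
        self.tokens = []      # tokens whose KV entries are cached

    def update_prompt(self, new_tokens):
        """Return how many tokens must be (re)prefilled after invalidation."""
        keep = longest_common_prefix(self.tokens, new_tokens)
        self.tokens = list(new_tokens)   # KV beyond `keep` is recomputed
        return len(new_tokens) - keep

cache = StreamingKVCache()
first = cache.update_prompt(["sys", "docA", "docB", "query"])   # cold start
second = cache.update_prompt(["sys", "docA", "docC", "query"])  # partial reuse
# The second update only re-prefills from the first changed token onward.
```

Append-mode is the degenerate case where the old prompt is always a full prefix of the new one, so nothing is invalidated.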
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
Tianrui Feng ⋅ Zhi Li ⋅ Shuo Yang ⋅ Haocheng Xi ⋅ Muyang Li ⋅ Xiuyu Li ⋅ Keting Yang ⋅ Kelly Peng ⋅ Song Han ⋅ Maneesh Agrawala ⋅ Kurt Keutzer ⋅ Akio Kodaira ⋅ Chenfeng Xu
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Image-based streaming diffusion models have powered efficient and creative live-streaming products, but have hit limits on temporal consistency due to their image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Moreover, scalable multi-GPU serving for real-time streams remains largely unresolved. To address this, we present \textbf{StreamDiffusionV2}, a \emph{training-free} pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token–guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. In addition, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1–4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs.
Even when increasing denoising steps to improve quality, it sustains 31.62 FPS (14B) and 61.58 FPS (1.3B), making state-of-the-art generative live streaming practical and accessible—from individual creators to enterprise-scale platforms.
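The sink-token rolling KV cache mentioned above follows a simple eviction rule that can be sketched in isolation: a few leading "sink" entries stay pinned while the rest of the window rolls forward over the stream. All parameter names below are illustrative, not the system's API.

```python
# Toy sketch of a sink-token rolling KV cache: the first `num_sink`
# entries are never evicted; beyond them only the most recent `window`
# entries are kept as the stream advances.

class RollingKVCache:
    def __init__(self, num_sink, window):
        self.num_sink = num_sink   # entries pinned at the front
        self.window = window       # rolling entries kept after the sinks
        self.entries = []

    def append(self, kv):
        self.entries.append(kv)
        overflow = len(self.entries) - (self.num_sink + self.window)
        if overflow > 0:
            # Evict the oldest non-sink entries; sinks stay pinned.
            del self.entries[self.num_sink : self.num_sink + overflow]

cache = RollingKVCache(num_sink=2, window=3)
for frame in range(8):
    cache.append(frame)
# Sinks 0 and 1 remain; only the three most recent frames survive.
```

Pinning sink entries keeps attention anchored even as old frames are dropped, which is what lets the window roll without quality collapse.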
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
Jiahuan Yu ⋅ Mingtao Hu ⋅ Minjia Zhang
Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, a high-performance rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.
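The rotation idea behind RotaSched can be sketched as a toy admission policy: when the (mock) GPU KV budget would be exceeded, resident requests rotate out to CPU memory instead of new arrivals blocking at the head of the line. The class, the FIFO victim choice, and the budget accounting are all illustrative assumptions.

```python
# Illustrative sketch of rotary scheduling: rotate resident requests'
# KV caches to CPU (fast over NVLink-C2C) so new arrivals are admitted
# immediately, avoiding head-of-line blocking.
from collections import deque

class RotaryScheduler:
    def __init__(self, gpu_kv_budget):
        self.budget = gpu_kv_budget
        self.on_gpu = deque()          # (request_id, kv_size), FIFO order
        self.on_cpu = []               # rotated-out requests

    def gpu_used(self):
        return sum(size for _, size in self.on_gpu)

    def admit(self, request_id, kv_size):
        # Rotate the oldest resident request(s) out until the new one fits.
        while self.on_gpu and self.gpu_used() + kv_size > self.budget:
            self.on_cpu.append(self.on_gpu.popleft())
        self.on_gpu.append((request_id, kv_size))

sched = RotaryScheduler(gpu_kv_budget=10)
sched.admit("r1", 4)
sched.admit("r2", 4)
sched.admit("r3", 5)   # budget exceeded: r1 rotates out to CPU
```

The real scheduler is proactive and SLO-aware (it rotates based on deadline risk, not just FIFO order), and the transfer itself is what DuplexKV makes cheap.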
TiDAR: Think in Diffusion, Talk in Autoregression
Jingyu Liu ⋅ Xin Dong ⋅ Zhifan Ye ⋅ Yonggan Fu ⋅ Vartika Singh ⋅ Ce Zhang ⋅ Pavlo Molchanov
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality because their causal structure aligns naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to effectively balance these two aspects: they either prioritize AR quality by using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or impose some form of left-to-right (AR-like) decoding logic on diffusion, which still suffers from quality degradation and forfeits diffusion's potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free compute density on GPUs, achieving a strong balance between drafting and verification capacity. Moreover, we design TiDAR to be serving-friendly as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at both 1.5B and 8B scales. Thanks to parallel drafting and sampling as well as efficient exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models such as Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71× to 5.91× more tokens per second.
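The draft-then-verify pattern TiDAR builds on can be mocked in a few lines: a parallel drafter proposes a block of tokens and an AR verifier accepts the longest prefix it agrees with. Both models are stand-in table lookups here; TiDAR's contribution is fusing the two roles into one forward pass with structured attention masks, which this sketch does not capture.

```python
# Toy sketch of block drafting + AR verification. The drafter proposes
# several tokens at once; the verifier accepts tokens while they match
# what it would have emitted, then corrects the first mismatch.

def verify(drafts, ar_next_token, context):
    accepted = []
    for tok in drafts:
        expected = ar_next_token(context + accepted)
        if tok != expected:
            break
        accepted.append(tok)
    # On a mismatch the verifier's own token is used, so at least one
    # token is always produced per step.
    if len(accepted) < len(drafts):
        accepted.append(ar_next_token(context + accepted))
    return accepted

# Mock AR model: a deterministic next-token table over short contexts.
table = {(): "the", ("the",): "cat", ("the", "cat"): "sat",
         ("the", "cat", "sat"): "down"}
ar = lambda ctx: table.get(tuple(ctx), "<eos>")

out = verify(["the", "cat", "nap"], ar, [])
# The drafter got two tokens right; the third is corrected by the verifier.
```

Throughput comes from the accepted-prefix length: the more the drafter agrees with the verifier, the more tokens land per forward pass.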
TokenBlend: Accelerating Tensor Parallelism LLM Inference Through Efficient Compute-Communication Overlap
Raja Gond ⋅ Nipun Kwatra ⋅ Ramachandran Ramjee
Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of $20\%$ even over GPUs connected via NVLink. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these computation subtasks. However, as of this writing, none of the open-source LLM serving systems (vLLM, SGLang, TensorRT-LLM) support compute-communication overlap for LLMs served using tensor parallelism. This is because the number of tokens processed per iteration is kept small to support low-latency serving, and decomposing these smaller workloads to enable communication overlap results in worse performance. We present TokenBlend, the first system to enable efficient compute-communication overlap for tensor-parallel models at token lengths as small as 1024. TokenBlend identifies RMSNorm, a previously overlooked operation, as crucial and optimizes it along with communication by implementing a novel fused \textbf{AllReduce--RMSNorm} kernel. Further, this kernel leverages the multimem feature available on modern GPUs (e.g., Hopper, Blackwell) to jointly perform communication and RMSNorm efficiently using only 2--8 SMs. Our evaluations demonstrate up to $\boldsymbol{1.28\times}$ speedup in latency and $\boldsymbol{1.19\times}$ higher throughput across multiple models and workloads. In several settings, TokenBlend delivers \textit{better performance than an equivalent model with all communication removed}. The source code of TokenBlend is available at https://anonymous.4open.science/r/tokenblend-mlsys/.
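The math that makes an AllReduce--RMSNorm fusion legal can be checked in pure Python: RMSNorm needs only the fully reduced vector, so reducing and normalizing in one pass gives the same answer as the two separate steps. The shapes, the two-rank setup, and the function names below are illustrative; the real kernel operates on GPU memory via multimem loads.

```python
# Pure-Python sketch of why AllReduce followed by RMSNorm can be fused:
# the normalization depends only on the reduced vector, so one fused
# pass over the data is mathematically identical to two passes.
import math

def rmsnorm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

def allreduce_rmsnorm(partials, weight):
    """Reduce the per-rank partial sums, then normalize (conceptually
    what the fused kernel does in one pass)."""
    reduced = [sum(col) for col in zip(*partials)]
    return rmsnorm(reduced, weight)

partials = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]   # two "ranks"
weight = [1.0, 1.0, 1.0, 1.0]
fused = allreduce_rmsnorm(partials, weight)

# Identical to running AllReduce and RMSNorm as separate steps.
reduced = [a + b for a, b in zip(*partials)]
assert fused == rmsnorm(reduced, weight)
```

The systems win is not in the math but in doing both operations with a handful of SMs while the rest of the GPU keeps computing.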
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
Shuyi Lin ⋅ Anshuman Suri ⋅ Alina Oprea ⋅ Cheng Tan
As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the \emph{jailbreak oracle problem}: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges, as the search space grows exponentially with response length. We present BOA, the first system designed for efficiently solving the jailbreak oracle problem. BOA employs a two-phase search strategy: (1) breadth-first sampling to identify easily accessible jailbreaks, followed by (2) depth-first priority search guided by fine-grained safety scores to systematically explore promising yet low-probability paths. BOA enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions.
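The two-phase search structure can be sketched generically: broad sampling first, then a best-first priority search guided by a fine-grained score. The model, the expansion function, and the scorer below are mocks over strings; only the search skeleton mirrors the description of BOA.

```python
# Schematic two-phase search: phase 1 checks easily reachable candidates;
# phase 2 runs a priority (best-first) search on the most promising
# partial paths, guided by a fine-grained score.
import heapq

def two_phase_search(expand, score, roots, threshold, max_nodes=100):
    # Phase 1: breadth-first sampling over the root candidates.
    for r in roots:
        if score(r) >= threshold:
            return r
    # Phase 2: priority search on promising yet low-probability paths.
    heap = [(-score(r), r) for r in roots]
    heapq.heapify(heap)
    seen = 0
    while heap and seen < max_nodes:
        _, node = heapq.heappop(heap)
        seen += 1
        for child in expand(node):
            if score(child) >= threshold:
                return child
            heapq.heappush(heap, (-score(child), child))
    return None

# Mock search space: strings grow one letter at a time; the scorer
# rewards positional matches against a hidden target string.
target = "abc"
expand = lambda s: [s + c for c in "abc"] if len(s) < 3 else []
score = lambda s: sum(x == y for x, y in zip(s, target)) / len(target)

found = two_phase_search(expand, score, [""], threshold=1.0)
```

In the actual oracle problem the "score" is a safety judgment over partial responses and the search space is the model's token tree, which is what makes efficiency the hard part.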
TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference
Xianzhe Dong ⋅ Tongxuan Liu ⋅ Yuting Zeng ⋅ Weizhe Huang ⋅ Siyu Wu ⋅ Liu Yang ⋅ Hailong Yang ⋅ Jing Li
Existing multimodal large language model (MLLM) inference systems are typically designed around the architecture of language models, coupling image processing and language processing. This design struggles to accommodate the heterogeneous demands of different stages in terms of computational resources, memory access patterns, and service-level objectives (SLOs), leading to low resource utilization and high request latency, ultimately failing to meet the service requirements of diverse inference scenarios. To address these challenges, we propose TriInfer, an efficient MLLM inference system that adopts a hybrid Encode-Prefill-Decode (EPD) disaggregation architecture. By scheduling the three stages — encode, prefill, and decode — onto separate heterogeneous inference instances, the system flexibly reallocates resources across stages, significantly reducing idle computation, alleviating resource bottlenecks, and improving overall system throughput and scalability. In addition, TriInfer supports a stage-level batching strategy that enhances load balancing, enables parallel execution of visual and language models, and further optimizes inference performance. Experiments under real multimodal inference workloads demonstrate that TriInfer achieves up to 3.7× higher inference throughput compared to state-of-the-art systems (e.g., vLLM, SGLang) while meeting the 90th percentile request SLO.
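The disaggregation structure can be sketched as separate per-stage instance pools with their own queues, so a long image-encode cannot block decode steps of other requests. The pool sizes, names, and the round-robin placement below are illustrative assumptions, not TriInfer's scheduler.

```python
# Minimal sketch of encode-prefill-decode (EPD) disaggregation: each
# stage owns a pool of instances and its own queues, sized to that
# stage's resource profile.
from collections import deque

class StagePool:
    def __init__(self, name, num_instances):
        self.name = name
        self.queues = [deque() for _ in range(num_instances)]
        self.next = 0

    def submit(self, request_id):
        # Round-robin placement across the stage's instances.
        self.queues[self.next].append(request_id)
        self.next = (self.next + 1) % len(self.queues)

pools = {
    "encode": StagePool("encode", 2),    # vision encoders
    "prefill": StagePool("prefill", 1),  # compute-heavy
    "decode": StagePool("decode", 3),    # memory-bound, latency-critical
}

for rid in ["r1", "r2", "r3"]:
    pools["encode"].submit(rid)          # stage 1 of each request
```

Because each pool scales independently, the encode pool can grow for image-heavy workloads without over-provisioning decode capacity.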
VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Heng Ping ⋅ Arijit Bhattacharjee ⋅ Peiyu Zhang ⋅ Shixuan Li ⋅ Wei Yang ⋅ Anzhe Cheng ⋅ Xiaole Zhang ⋅ Jesse Thomason ⋅ Ali Jannesari ⋅ Nesreen Ahmed ⋅ Paul Bogdan
Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained exploration of the reasoning space. We propose \textbf{VeriMoA}, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a \textbf{quality-guided caching mechanism} that maintains all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a \textbf{multi-path generation strategy} that leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into a two-stage process that exploits LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15--30\% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
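The quality-guided cache can be sketched as a score-ranked store of candidates from all layers, so later agents condition on the best outputs so far rather than only on the previous layer. The scores, names, and agent behavior below are mocks.

```python
# Sketch of a quality-guided cache: every intermediate candidate from
# any layer is kept with its quality score, and each new layer of agents
# draws on the top-ranked entries across the whole history.

class QualityCache:
    def __init__(self):
        self.entries = []               # (score, candidate), all layers

    def add(self, candidate, score):
        self.entries.append((score, candidate))

    def top_k(self, k):
        return [c for _, c in sorted(self.entries, key=lambda e: -e[0])[:k]]

cache = QualityCache()
# Layer-1 agents produce candidates with (mock) quality scores.
cache.add("hdl_v1", 0.4)
cache.add("hdl_v2", 0.7)
# Layer-2 agents condition on the best entries from *all* prior layers,
# so a noisy layer cannot erase earlier good candidates.
context = cache.top_k(2)
cache.add("hdl_v3", 0.9)
best = cache.top_k(1)[0]
```

This is what limits noise propagation: a bad layer adds low-scoring entries but never displaces the ranked history.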
When Enough is Enough: Rank-Aware Early Termination for Vector Search
Jianan Lu ⋅ Asaf Cidon ⋅ Michael Freedman
Graph-based vector search underpins modern LLM applications such as retrieval-augmented generation (RAG), but its efficiency is increasingly constrained by disk I/O. Existing systems continue searching long after discovering the highest-ranked (i.e., most valuable) results for downstream applications. We present Terminus, a rank-aware early-termination mechanism that dynamically aligns I/O spending with application utility. Terminus models per-I/O search utility using a rank-weighted function and terminates once recent I/Os yield negligible utility gains. By prioritizing I/O toward results that matter most to downstream tasks, Terminus achieves a better performance–accuracy trade-off. It delivers up to 1.4× higher throughput at the same accuracy target compared to existing early-termination schemes, and up to 3.2× higher throughput than a baseline without early termination, with minimal impact on RAG accuracy.
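The termination rule can be sketched directly: weight each I/O's contribution by the rank positions it improves, and stop when a recent window of I/Os yields negligible utility. The 1/rank weighting, the window size, and the threshold below are illustrative stand-ins for Terminus's actual utility model.

```python
# Sketch of rank-aware early termination: each disk I/O's utility is
# the rank-weighted value of the result positions it improved; search
# stops once recent I/Os stop paying for themselves.

def rank_weight(rank):
    return 1.0 / rank            # top results matter most downstream

def search_with_early_termination(io_events, window=2, min_utility=0.3):
    """io_events[i] lists the result ranks improved by the i-th I/O.
    Returns how many I/Os were spent before terminating."""
    recent, done = [], 0
    for ranks in io_events:
        done += 1
        recent.append(sum(rank_weight(r) for r in ranks))
        recent = recent[-window:]
        # Terminate when the recent window yields negligible utility.
        if len(recent) == window and sum(recent) < min_utility:
            break
    return done

# Early I/Os improve top ranks; later ones only shuffle the tail.
events = [[1, 2], [3], [9], [10], [8], [7]]
ios_spent = search_with_early_termination(events)
```

A rank-oblivious policy would keep issuing I/Os for the tail improvements; the rank weighting is what lets the search stop once the top of the list has stabilized.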
Zero Redundancy Distributed Learning with Differential Privacy
Zhiqi Bu ⋅ Justin Chiu ⋅ Ruixuan Liu ⋅ Sheng Zha ⋅ George Karypis
Deep learning using large models has achieved great success in a wide range of domains. However, training these models on billions of parameters is very challenging in terms of training speed, memory cost, and communication efficiency, especially under the privacy-preserving regime of differential privacy (DP). On the one hand, the efficiency of DP optimization is comparable to that of standard non-DP optimization on a single GPU, but existing DP distributed learning is significantly inefficient on multiple GPUs. On the other hand, the Zero Redundancy Optimizer (ZeRO) is a state-of-the-art solution for standard distributed learning, but it is technically complicated to make compatible with DP. In this work, we develop a new systematic solution, DP-ZeRO, (I) to scale up the trainable DP model size, e.g. to GPT-100B, (II) to obtain the same computation and communication efficiency as the standard ZeRO, and (III) to enable mixed-precision DP training. Our DP-ZeRO, like the standard ZeRO, has the potential to train models of arbitrary size and exhibits excellent training efficiency on large models. Code at \url{https://anonymous.4open.science/r/fast-differential-privacy-3B50}.
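The step that makes DP optimization awkward to distribute is per-sample gradient clipping followed by noising, sketched below in pure Python on toy "gradient vectors". All hyperparameters are illustrative, and real DP-ZeRO must perform this while gradients are sharded across workers.

```python
# Minimal sketch of DP gradient aggregation: clip each per-sample
# gradient to a norm bound, sum, add Gaussian noise scaled to the bound,
# then average. This per-sample step is what standard ZeRO lacks.
import math, random

def dp_aggregate(per_sample_grads, clip_norm, noise_multiplier, rng):
    clipped = []
    for g in per_sample_grads:
        norm = math.sqrt(sum(v * v for v in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))   # per-sample clip
        clipped.append([v * scale for v in g])
    summed = [sum(col) for col in zip(*clipped)]
    sigma = noise_multiplier * clip_norm
    noised = [v + rng.gauss(0.0, sigma) for v in summed]
    return [v / len(per_sample_grads) for v in noised]

grads = [[3.0, 4.0], [0.6, 0.8]]           # norms 5.0 and 1.0
out = dp_aggregate(grads, clip_norm=1.0, noise_multiplier=0.0,
                   rng=random.Random(0))
# With zero noise (for illustration only), both samples end up clipped
# to the same unit-norm direction, so the average is that direction.
```

With a nonzero `noise_multiplier` the output is randomized; the zero-noise call above is only to make the clipping arithmetic visible.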
ZK-APEX: Zero-Knowledge Approximate Personalized Unlearning with Executable Proofs
Mohammad M Maheri ⋅ Hamed Haddadi
Machine unlearning removes the influence of specified data from trained models to satisfy privacy, copyright, and safety requirements (e.g., the “right to be forgotten”). In practice, providers distribute a global model to edge devices, each of which locally personalizes the model on its private data. However, since clients may ignore or falsify deletion requests, providers must verify correct unlearning for these distributed models without accessing private parameters. This is particularly challenging for personalized models, which must forget designated samples without degrading local utility, while ensuring that verification remains efficient and scalable on resource-constrained edge devices. We formalize personalized unlearning and develop a zero-shot approximate unlearning algorithm that works directly on the personalized model without retraining. Our novel method, ZK-APEX, combines provider-side sparse masking for targeted removal with client-side Group-OBS compensation computed from a block-wise empirical Fisher. This technique yields a curvature-aware update designed for low-overhead execution and proof generation. Using modern Halo2 ZK-SNARKs, we prove operator compliance by showing that the unlearned model exactly matches the committed output of the prescribed transformation, without revealing personalized model parameters or data. On Vision Transformer (ViT) classification models, our approach recovers approximately 99\% Top-1 personalization accuracy while enforcing effective forgetting. We further evaluate the unlearning algorithm on a generative model, OPT-125M, trained on the CodeParrot code dataset, achieving $\sim$70\% recovery of original accuracy. ZK-SNARK proof generation for the ViT case completes in $\approx$2~hours, which is more than $10^7\times$ faster than retraining-based verification, with peak memory under 0.7~GB and proof sizes about 400~MB.
Together, these results establish the first verifiable personalized unlearning framework practical for deployment on resource-constrained edge devices.
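The compliance check being proved can be sketched in greatly simplified form: run the prescribed transformation (here, a sparse mask) on the committed model and verify that the client's unlearned model exactly matches the committed output. A real deployment replaces the bare hash comparison below with a Halo2 ZK-SNARK so that parameters stay private; the mask, weights, and helper names are all illustrative.

```python
# Greatly simplified sketch of "unlearned model exactly matches the
# committed output of the prescribed transformation". A plain hash
# commitment stands in for the ZK-SNARK (which additionally hides the
# parameters from the verifier).
import hashlib

def apply_mask(weights, mask):
    """Provider-prescribed transformation: zero out masked weights."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]

def commit(weights):
    return hashlib.sha256(repr(weights).encode()).hexdigest()

weights = [0.2, -1.3, 0.7, 0.05]
mask = [1, 0, 1, 1]                    # forget the influence behind w[1]

expected = commit(apply_mask(weights, mask))   # provider-side commitment
client_model = [0.2, 0.0, 0.7, 0.05]           # client's unlearned model
compliant = commit(client_model) == expected
# A client that skipped unlearning would fail the check, since the
# original weights hash to a different commitment.
```

The paper's system also layers on the Group-OBS compensation step (so utility survives masking), which this sketch omits.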