Session
Research-Track Oral Presentation: R9: Agentic AI
Grand Ballroom 1
AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
Hojoon Kim ⋅ Yuheng Wu ⋅ Thierry Tambe
Large language models (LLMs) have recently been integrated into embodied AI agents, yet their synchronous plan-act loop imposes severe latency and cost bottlenecks. We present AgenticCache, a cache-driven asynchronous planning framework that decouples LLM reasoning from real-time execution. By identifying strong plan transition locality in embodied tasks, AgenticCache enables agents to reuse frequently occurring plan fragments and update them asynchronously through a background LLM process. This design converts idle waiting time into productive action while preserving context-aware decision quality. Across four multi-agent embodied benchmarks, AgenticCache improves task success rates by 24.34%, reduces simulation latency by 75%, and lowers token usage by 65% on average. These results demonstrate that caching and asynchronous reasoning together offer a path toward real-time, low-cost, and cognitively inspired autonomy in LLM-based agents.
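The plan-fragment reuse described in the abstract can be illustrated with a toy sketch (all names are hypothetical illustrations, not the authors' implementation): an LRU cache of plan fragments keyed by task state, where a miss never blocks execution but instead queues the state for background LLM planning.

```python
from collections import OrderedDict

class PlanCache:
    """Toy LRU cache of plan fragments keyed by task state.

    On a hit the agent acts immediately; on a miss it returns None
    (so the agent can fall back to a default action) and queues the
    state for asynchronous background planning.
    """

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.cache = OrderedDict()   # state -> plan fragment (list of actions)
        self.pending = []            # states awaiting background planning

    def lookup(self, state):
        if state in self.cache:
            self.cache.move_to_end(state)   # mark as recently used
            return self.cache[state]
        self.pending.append(state)          # plan asynchronously later
        return None

    def install(self, state, plan):
        """Called by the background planner when a fresh plan arrives."""
        self.cache[state] = plan
        self.cache.move_to_end(state)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used

cache = PlanCache(capacity=2)
cache.install("at_door", ["open_door", "enter"])
print(cache.lookup("at_door"))   # hit: reuse cached fragment
print(cache.lookup("at_desk"))   # miss: queued for async planning
```

The key property, per the abstract, is that the plan-act loop never waits on the LLM: stale or missing fragments are refreshed by a background process while the agent keeps acting.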
FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap
Taosong Fang ⋅ Zhen Zheng ⋅ Zhengzhao Ma ⋅ Yaojie Lu ⋅ Hongyu Lin ⋅ Xianpei Han ⋅ Le Sun
Large Language Models (LLMs) are increasingly deployed as collaborating agents in Multi-Agent Systems (MAS), where sequential agent interactions create significant latency bottlenecks. Traditional serving systems require each downstream agent to wait for complete upstream generation before starting prefill, leaving substantial idle time during inter-agent transitions. We present FlashAgents, a system that accelerates multi-agent workflows through token-level streaming and prefix-aware coordination. FlashAgents introduces inter-agent streaming and incremental prefill, which streams tokens between agents and performs incremental prefill to overlap downstream prefill with upstream decode, reducing inter-agent latency. For concurrent workloads, an intra-turn prefix cache built on radix trees detects and eliminates redundant prefill across requests sharing common instruction templates, avoiding recomputation of shared prefixes within the same processing turn. Implemented on SGLang, FlashAgents achieves up to 46% end-to-end latency reduction on real workflows and a 3.5× speedup in controlled two-agent benchmarks, demonstrating consistent improvements across diverse models and interaction patterns.
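The streaming-prefill idea can be sketched in a few lines (a toy illustration with invented names, not the FlashAgents/SGLang implementation): the downstream agent extends its prefill state one token at a time as the upstream agent decodes, instead of waiting for the complete upstream output.

```python
def upstream_decode():
    # Stand-in for the upstream agent's token-by-token generation.
    for tok in ["Plan:", "search", "the", "web"]:
        yield tok

class DownstreamAgent:
    """Prefills incrementally as upstream tokens stream in, rather than
    waiting for the full upstream output before starting prefill."""

    def __init__(self):
        self.prefilled = []

    def incremental_prefill(self, token):
        # In a real engine this would extend the KV cache by one token.
        self.prefilled.append(token)

    def generate(self):
        return f"Acting on: {' '.join(self.prefilled)}"

agent = DownstreamAgent()
for tok in upstream_decode():      # overlap: prefill during upstream decode
    agent.incremental_prefill(tok)
print(agent.generate())
```

In the actual system the two loops run concurrently on the serving engine, which is where the overlap (and hence the latency reduction) comes from.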
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
Yi Li ⋅ Lianjie Cao ⋅ Faraz Ahmed ⋅ Puneet Sharma ⋅ Bingzhe Li
Agentic AI agents require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases, knowledge-graph traversal, or a hybrid of the two, incurring high retrieval latency and poor storage scalability. We introduce Hippocampus, an agentic memory management system that uses compact binary signatures for semantic search and lossless token-ID streams for exact content reconstruction. Its core is a Dynamic Wavelet Matrix (DWM) that compresses and co-indexes both streams to support ultra-fast search in the compressed domain, thus avoiding costly dense-vector or graph computations. This design scales linearly with memory size, making it suitable for long-horizon agentic deployments.
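A minimal sketch of binary-signature search (a generic sign-bit hashing illustration under assumed names, not the DWM structure the paper describes): embeddings are compressed to one bit per dimension, and retrieval becomes Hamming-distance comparison on packed integers instead of dense-vector similarity.

```python
def signature(vec):
    """Compress a float embedding into a compact binary signature:
    one sign bit per dimension, packed into a single integer."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two packed signatures."""
    return bin(a ^ b).count("1")

def search(query_sig, memory):
    """Nearest memory under Hamming distance in the compressed domain,
    avoiding any dense-vector similarity computation."""
    return min(memory, key=lambda item: hamming(query_sig, item[0]))

memory = [(signature([0.9, -0.2, 0.4, 0.1]), "likes hiking"),
          (signature([-0.5, 0.7, -0.1, -0.8]), "allergic to nuts")]
q = signature([0.8, -0.1, 0.5, 0.2])
print(search(q, memory)[1])   # -> likes hiking
```

The paper's contribution goes further by co-indexing these signatures with lossless token-ID streams in a compressed wavelet-matrix structure; the sketch only shows why bit-level signatures make search cheap.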
Retrieval-augmented generation (RAG) enables LLMs to ground responses in external knowledge, but long-term, multi-session conversations still suffer from implicit recall failures: when current user queries lack lexical overlap with earlier facts (e.g., preferences), standard dense retrieval and long-context prompting often miss the most relevant memories. We present a dialogue-aware RAG system that jointly addresses what to store and how to retrieve under constraints. Our design extracts durable user facts into a lightweight memory graph, enriches queries with conversational cues, performs hybrid retrieval, and uses a budget-aware router to balance quality and serving cost. On our Implicit Preference Recall benchmark, the system lifts Recall@10 to 0.70 (vs. 0.58 for dense-only) and improves nDCG@10 from 0.41 to 0.51. The system also reduces cross-modality disagreement by 47% and achieves an 81% cost reduction compared to long-context methods.
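The hybrid-retrieval step can be sketched as a weighted blend of lexical and dense scores (a generic illustration with invented names and toy scores, not this system's scoring function): when a query has no lexical overlap with a stored fact, the dense component can still surface it.

```python
def lexical_score(query, doc):
    """Fraction of query terms that appear in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def hybrid_retrieve(query, docs, dense_scores, alpha=0.5):
    """Blend lexical overlap with precomputed dense similarity scores
    (here supplied by hand, standing in for an embedding model) so that
    implicit queries still surface relevant memories."""
    scored = [(alpha * lexical_score(query, doc) + (1 - alpha) * dense_scores[i], doc)
              for i, doc in enumerate(docs)]
    return max(scored)[1]

docs = ["user prefers vegetarian food", "meeting moved to Friday"]
# No word of the query matches either memory, so a purely lexical
# retriever would fail; the dense component recovers the preference.
print(hybrid_retrieve("what should I cook", docs, dense_scores=[0.8, 0.1]))
```

This is exactly the implicit-recall failure mode the abstract targets: zero lexical overlap, yet the preference fact is the relevant memory.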
RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
Yinsicheng Jiang ⋅ Yeqi Huang ⋅ Liang Cheng ⋅ Cheng Deng ⋅ Xuan Sun ⋅ Luo Mai
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from degraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing inference engines (SGLang and vLLM) and improves performance by 1.5–3× over state-of-the-art methods (CacheBlend, RadixCache, LMCache, HiCache, and RAGCache), while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads.
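The ordering-and-deduplication idea can be sketched as follows (a toy illustration with invented names; RAGBoost's actual indexing is more sophisticated): if every session orders its retrieved chunks by a shared global order and drops duplicates, sessions that retrieve overlapping items produce identical context prefixes, which prefix-caching engines can then reuse.

```python
def canonical_context(chunks, seen_order):
    """Order retrieved chunks by a global first-seen order and drop
    duplicates, so concurrent sessions that retrieve overlapping items
    emit identical prefixes and the KV cache can be reused."""
    for c in chunks:
        if c not in seen_order:
            seen_order[c] = len(seen_order)
    return sorted(set(chunks), key=seen_order.get)

seen = {}
s1 = canonical_context(["doc7", "doc2", "doc9"], seen)
s2 = canonical_context(["doc2", "doc7", "doc4"], seen)
print(s1)  # ['doc7', 'doc2', 'doc9']
print(s2)  # ['doc7', 'doc2', 'doc4'] -- shares prefix ['doc7', 'doc2'] with s1
```

Without canonical ordering, s2's context would start with doc2 and share no cacheable prefix with s1; the abstract's "contextual hints" then compensate for any reasoning sensitivity to the reordering.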
RAGInfer: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
Chien-Yu Lin ⋅ Keisuke Kamahori ⋅ Yiyu Liu ⋅ Xiaoxiang Shi ⋅ Madhav Kashyap ⋅ Yile Gu ⋅ Rulin Shao ⋅ Zihao Ye ⋅ Kan Zhu ⋅ Rohan Kadekodi ⋅ Stephanie Wang ⋅ Arvind Krishnamurthy ⋅ Luis Ceze ⋅ Baris Kasikci
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, creating a significant system challenge: achieving high throughput and low latency is difficult, especially when GPU memory is limited. To address these challenges, we propose RAGInfer, an efficient inference system that reduces latency and improves throughput with minimal GPU memory requirements. The core innovation of RAGInfer is lookahead retrieval, a prefetching mechanism that predicts required data and transfers it from CPU to GPU in parallel with LLM generation. In addition, RAGInfer adopts a prefetching scheduler and a cache-aware scheduler to support efficient multi-GPU inference with minimal overhead. Evaluations show RAGInfer achieves up to a 1.53× average end-to-end latency reduction (single-query) and 1.83× higher average throughput (batched), as well as good throughput scalability. This confirms the practical utility of RAGInfer for faster and more memory-efficient deployments of RAG applications.
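The lookahead-retrieval pattern can be sketched as a simple prefetcher (a toy illustration with invented names and a trivial predictor; RAGInfer's actual predictor and scheduler are not shown here): predicted chunks are copied from host to device memory while generation proceeds, so they are resident before the next retrieval needs them.

```python
class LookaheadPrefetcher:
    """Toy sketch of lookahead retrieval: while the model generates,
    predicted chunks are copied from 'CPU' to 'GPU' memory so they are
    already resident when the next retrieval step needs them."""

    def __init__(self, cpu_store):
        self.cpu_store = cpu_store   # large host-side datastore
        self.gpu_cache = {}          # small device-side working set

    def predict_next(self, history):
        # Trivial stand-in predictor: assume the chunk after the last
        # accessed one will be needed next.
        return history[-1] + 1

    def prefetch(self, chunk_id):
        if chunk_id not in self.gpu_cache:
            self.gpu_cache[chunk_id] = self.cpu_store[chunk_id]  # simulated H2D copy

    def step(self, history):
        # Overlap point: issue the prefetch, then generation would run
        # in parallel while the copy completes.
        self.prefetch(self.predict_next(history))

store = {i: f"chunk-{i}" for i in range(4)}
pf = LookaheadPrefetcher(store)
pf.step([0])
pf.step([0, 1])
print(sorted(pf.gpu_cache))  # chunks 1 and 2 already on 'GPU' before use
```

The benefit hinges on the prediction being issued early enough that the CPU-to-GPU transfer hides behind generation time, which is what the abstract's latency and throughput numbers measure.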