

Session

Research-Track Oral Presentation: R3: LLM Serving

Grand Ballroom 1
Wed 20 May 1 p.m. PDT — 2:30 p.m. PDT


FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

Nazmul Takbir ⋅ Hamidreza Koshkak ⋅ Nikil Dutt ⋅ Sangeetha Abdu Jyothi

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially during long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads. Some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38–1.55×, and lowers online token latency by 1.6–2.1×, all while maintaining accuracy in long-context, long-generation scenarios.
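The stable/unstable head classification described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the Jaccard-overlap stability measure, the threshold, and all function names are assumptions made for this sketch.

```python
# Hedged sketch of FlexiCache's core idea: a KV head is "stable" if its
# top-K critical page set changes little across reranking intervals, in
# which case only those top-K pages need to live in GPU memory.

def topk_pages(scores, k):
    """Indices of the k highest-scoring KV pages for one head."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])

def stability(score_history, k):
    """Mean Jaccard overlap of consecutive top-k page sets (1.0 = fully stable)."""
    sets = [topk_pages(s, k) for s in score_history]
    overlaps = [len(a & b) / len(a | b) for a, b in zip(sets, sets[1:])]
    return sum(overlaps) / len(overlaps)

def place_pages(score_history, k, threshold=0.8):
    """Stable heads keep only top-k pages on GPU; unstable heads keep all."""
    n_pages = len(score_history[-1])
    if stability(score_history, k) >= threshold:
        gpu = topk_pages(score_history[-1], k)   # hot pages stay on GPU
        host = set(range(n_pages)) - gpu         # the rest are offloaded
        return "stable", gpu, host
    return "unstable", set(range(n_pages)), set()
```

A real system would recompute `place_pages` only at the periodic reranking points, fetching any newly promoted pages back from host memory.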


Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost

Haojun Xia ⋅ Xiaoxia Wu ⋅ Jisen Li ⋅ Tsai-chuan Wu ⋅ Junxiong Wang ⋅ Jue Wang ⋅ Chenxi Li ⋅ Aman Singhal ⋅ Alay Dilipbhai Shah ⋅ Donglin Zhuang ⋅ Zhongzhu Zhou ⋅ Ben Athiwaratkun ⋅ Zhen Zheng ⋅ Shuaiwen Song

The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm–system co-design for mixed-precision KV caching: Kitty. On the algorithm side, extensive experiments show that Dynamic Channel-wise Precision Boost — which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision — maintains a near-zero accuracy drop while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. Kitty addresses these issues by decomposing each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), Kitty cuts KV memory by nearly 8× with negligible accuracy loss, enabling up to 8× larger batches and 2.1–4.1× higher throughput under the same memory budget.
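The decomposition into two uniformly 2-bit tensors can be illustrated with plain integers. This is a hedged sketch of the idea, not Kitty's actual kernel layout: representing a 4-bit boosted channel as a high-2-bit tensor plus a low-2-bit tensor, so one uniform dequantization path handles every entry.

```python
# Illustrative sketch: a Key page with a few 4-bit "boosted" channels is
# split into two tensors that are both uniformly 2-bit, so a single
# dequantization routine (value = (hi << 2) | lo) covers all channels.

def split_mixed_page(page, boosted_channels):
    """page: list of channels, each a list of 4-bit ints (0..15).
    Returns (hi, lo): hi holds the top 2 bits of every entry; lo holds
    the low 2 bits of boosted channels and zeros elsewhere."""
    hi, lo = [], []
    for c, channel in enumerate(page):
        hi.append([v >> 2 for v in channel])
        lo.append([v & 3 if c in boosted_channels else 0 for v in channel])
    return hi, lo

def reconstruct(hi, lo):
    """Uniform dequantization applied identically to every entry."""
    return [[(h << 2) | l for h, l in zip(hc, lc)] for hc, lc in zip(hi, lo)]
```

Boosted channels round-trip exactly; non-boosted channels keep only their top 2 bits, which is exactly the 2-bit baseline behavior.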


MAC-Attention: A Match–Amend–Complete Scheme for Fast and Accurate Attention Computation

Jinghan Yao ⋅ Sam Jacobs ⋅ Walid Krichene ⋅ Masahiro Tanaka ⋅ Dhabaleswar Panda

Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression (lowering fidelity) or selection/eviction (restricting what remains accessible), which can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decode by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with a fresh attention computed on the KV tail, via a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of the context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3× attention-phase speedups (up to 2.6× end-to-end), while maintaining full-attention quality. By reusing computation rather than compressing or discarding tokens, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available.
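The "numerically stable merge" in the complete stage is the standard log-sum-exp combination used by chunked and flash attention: two partial softmax-attention results can be fused exactly. The sketch below illustrates that merge on toy vectors; it is our reconstruction of the standard technique, not the authors' kernel.

```python
import math

# Illustrative log-sum-exp merge of two partial attention results, as used
# to fuse reused attention with fresh attention over the KV tail.

def partial_attention(q, ks, vs):
    """Softmax attention over one chunk; returns (output, log-sum-exp)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    out = [sum(wi * v[d] for wi, v in zip(w, vs)) / z for d in range(len(vs[0]))]
    return out, m + math.log(z)

def merge(out1, lse1, out2, lse2):
    """Exactly combine two chunk results into the full-softmax result."""
    m = max(lse1, lse2)              # rescale by the running max for stability
    w1, w2 = math.exp(lse1 - m), math.exp(lse2 - m)
    return [(w1 * a + w2 * b) / (w1 + w2) for a, b in zip(out1, out2)]
```

Because each chunk carries its log-sum-exp, the merged result equals full attention over the concatenated keys, which is what lets a cached (amended) result and a fresh tail be fused without approximation.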


OPKV: A High-Throughput Plugin-Driven Framework for Recallable Sparsity in Paged KV Cache Systems

Huazheng Lao ⋅ Xiaofeng Li ⋅ Rui Xu ⋅ Long Chen ⋅ Xia Zhu

Long-context large language model (LLM) inference faces severe KV cache inflation, making GPU memory a key bottleneck. Existing recallable sparsity methods mitigate memory pressure by offloading non-critical key–value (KV) pairs to CPU memory and recalling them on demand, but they are intrusive to KV cache management in existing inference frameworks and fail to cope with the linearly increasing recall overhead at high batch sizes. To address these limitations, we propose OPKV, a high-throughput plugin-driven framework that seamlessly integrates recallable sparsity into paged KV cache systems and performs unified recall optimization. OPKV introduces a plugin interface that decouples sparsity logic from model and cache management, and applies object reaggregation and hot-page hit algorithms to reduce recall overhead, based on the observed spatial discreteness and temporal locality in critical KV selection. In addition, a local intra-iteration metadata manager performs millisecond-level page retrieval and cache eviction. Experimental results show that OPKV enables SoTA methods to attain 1.36–1.77× higher decoding throughput across different batch sizes.
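The hot-page optimization exploits the temporal locality noted above: pages recalled recently are likely to be selected again soon. A minimal sketch, assuming a small GPU-resident LRU cache in front of host memory (the class and method names are illustrative, not OPKV's API):

```python
from collections import OrderedDict

# Hedged sketch of a hot-page cache: recently recalled KV pages stay
# GPU-resident, so temporally local critical-page selections avoid
# repeated host-to-GPU transfers.

class HotPageCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # page_id -> page data, most recent last
        self.hits = self.misses = 0

    def recall(self, page_id, fetch_from_host):
        """Return the page, fetching from host memory only on a miss."""
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)      # mark as recently used
            return self.pages[page_id]
        self.misses += 1
        page = fetch_from_host(page_id)          # the expensive transfer
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        return page
```

Object reaggregation would additionally batch the misses into fewer, larger contiguous transfers, which this sketch omits.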


SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Jiayi Tian ⋅ Seyedarmin Azizi ⋅ Yequan Zhao ⋅ Erfan Potraghloo ⋅ Sean McPherson ⋅ Sharath Nittur Sridhar ⋅ Zhengyang Wang ⋅ Zheng Zhang ⋅ Massoud Pedram ⋅ Souvik Kundu

Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, since the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit efficient deployment. Toward reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates via coarse-grained sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which maintains up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generation length while improving throughput by up to 1.7×.
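The sentence-level filtering idea can be sketched with a toy similarity metric. Note the featurization (bag-of-words cosine) and the threshold here are assumptions of this sketch, not SkipKV's actual sentence-scoring metric:

```python
import math

# Illustrative sketch: score each completed CoT sentence against already-
# retained sentences and drop (i.e., skip the KV of) near-duplicates.

def bow(sentence):
    """Bag-of-words vector as a token->count dict."""
    counts = {}
    for tok in sentence.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_sentences(sentences, threshold=0.9):
    """Keep a sentence only if it is not too similar to any retained one."""
    kept, vecs = [], []
    for s in sentences:
        v = bow(s)
        if all(cosine(v, u) < threshold for u in vecs):
            kept.append(s)
            vecs.append(v)
    return kept
```

Operating at sentence granularity, rather than evicting individual tokens, is what preserves semantic coherence and avoids the repeated-revalidation loops the abstract describes.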


Using Span Queries to Optimize Cache and Attention Locality

Paul Castro ⋅ Nick Mitchell ⋅ Nathan Ordonez ⋅ Thomas Parnell ⋅ Mudhakar Srivatsa ⋅ Antoni Viros i Martin

Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, these solutions are themselves optimized for a single use case: RAG. In this paper, we introduce the span query to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show that the critical distinction assumed by prior work lies in whether the order of the inputs matters: do they commute? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10–20× reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve attention locality, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2B-parameter model vastly outperforms the accuracy of a stock inference server using an 8B model.
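The commutativity idea can be made concrete with a minimal expression-tree sketch: if a node's children commute (as RAG documents often do), they can be put into a canonical order so that equal content serializes identically and different queries share KV-cache prefixes. All names below are illustrative, not the paper's actual span-query syntax:

```python
from dataclasses import dataclass, field

# Hedged sketch of a span query: an expression tree of inference inputs
# where some nodes mark their children as reorderable (commutative).

@dataclass
class Span:
    text: str = ""
    commutes: bool = False           # may this node's children be reordered?
    children: list = field(default_factory=list)

def canonicalize(node):
    """Recursively sort commutative children into a stable order, so
    equal content always serializes the same way (improving cache hits)."""
    kids = [canonicalize(c) for c in node.children]
    if node.commutes:
        kids = sorted(kids, key=serialize)
    return Span(node.text, node.commutes, kids)

def serialize(node):
    """Flatten the tree into the prompt string sent to the server."""
    return node.text + "".join(serialize(c) for c in node.children)
```

Under this sketch, two RAG queries that supply the same documents in different orders collapse to one serialization (a shared KV prefix), while a non-commutative chat history keeps its original turn order.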