Poster 13

SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models

Jiayi Tian ⋅ Seyedarmin Azizi ⋅ Yequan Zhao ⋅ Erfan Potraghloo ⋅ Sean McPherson ⋅ Sharath Nittur Sridhar ⋅ Zhengyang Wang ⋅ Zheng Zhang ⋅ Massoud Pedram ⋅ Souvik Kundu


Abstract

Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the length of their verbose chain-of-thought (CoT) reasoning. This growth causes both memory overhead and throughput bottlenecks, limiting efficient deployment. To reduce KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that, due to unstable token-wise scoring and a reduced effective KV budget caused by padding, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in multi-batch settings. Additionally, these methods often generate longer sequences than the original model without eviction, as semantics-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method that performs selective eviction and generation, operating through coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector to update the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate that SkipKV achieves up to 26.7% higher accuracy than baseline methods at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generation length while improving throughput by up to 1.7×. Our code is released at: https://github.com/TTTTTTris/SkipKV.
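
The idea of evicting whole sentences rather than individual tokens can be illustrated with a minimal sketch. This is not the SkipKV scoring metric, which the abstract does not specify. It assumes each reasoning sentence already has an embedding vector, uses plain cosine similarity with a hypothetical threshold of 0.9 as a stand-in for the sentence score, and uses hypothetical helper names (`select_sentences`, `kept_token_positions`) throughout.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_sentences(embeddings, threshold=0.9):
    """Return indices of sentences to keep.

    A sentence is dropped when it is at least `threshold`-similar to a
    sentence that was already kept. Both the similarity measure and the
    threshold are illustrative assumptions, not the paper's metric.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def kept_token_positions(sentence_spans, kept):
    """Expand kept sentence indices into the token positions whose KV entries survive.

    `sentence_spans[i]` is a (start, end) pair of token positions, end exclusive.
    """
    positions = []
    for i in kept:
        start, end = sentence_spans[i]
        positions.extend(range(start, end))
    return positions


# Toy example: sentence 1 nearly repeats sentence 0, sentence 2 is new content.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
spans = [(0, 4), (4, 9), (9, 12)]

kept = select_sentences(embeddings)
positions = kept_token_positions(spans, kept)
print(kept)       # [0, 2]
print(positions)  # [0, 1, 2, 3, 9, 10, 11]
```

In this toy case the near-duplicate sentence is removed as a unit, so the KV entries for token positions 4 through 8 would all be evicted together instead of being scored one token at a time.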
