

Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

Zhenyu Zhang · Shiwei Liu · Runjin Chen · Bhavya Kailkhura · Beidi Chen · Atlas Wang

Tue 14 May 1:30 p.m. PDT — 1:50 p.m. PDT

Abstract: This paper focuses on addressing the substantial memory footprints and bandwidth costs associated with the deployment of Large Language Models (LLMs). LLMs, characterized by their extensive context length (e.g., $\geq$4096), inherently demand vast memory resources and traffic to store and load the attention key and value embeddings within self-attention modules, referred to as the KV cache. To alleviate these resource-intensive aspects of LLM inference, techniques such as sparsification and quantization for KV cache reduction have been investigated as separate endeavors. However, this paper highlights the critical importance of considering the compound effects of these techniques when employed together, as a simplistic combination of sparsification and quantization can yield sub-optimal performance. For instance, the "Heavy Hitter Oracle" has demonstrated that preserving just 20\% of the KV cache, attributed to pivotal tokens denoted as "Heavy Hitters", can yield substantial memory savings while upholding the model's original performance. Furthermore, the KV cache of these "Heavy Hitter" tokens, which are identified as those with the highest accumulated attention scores, can be further quantized with encouraging throughput improvements. Nevertheless, our investigation uncovers two primary deficiencies in such unrefined post-sparsification quantization in low-bit scenarios: (1) the application of low-bit KV cache quantization, specifically $\leq$ 4-bit, significantly diminishes the accuracy of Heavy Hitter selection during the generation phase, particularly in deeper layers; (2) tokens selected by the "Heavy Hitter Oracle" are not necessarily well-suited for quantization, and their quantization can lead to sub-optimal performance. To surmount these challenges, we propose a novel rule-of-thumb for token selection during LLM generation, termed Q-Hitter.
This approach combines both accumulated attention scores and "Quantization Friendliness" metrics for different layers, identifying tokens that are not only pivotal for preserving the generalization capabilities of LLMs but are also more amenable to KV cache quantization. Q-Hitter naturally offers a free lunch of KV cache quantization and can further improve the affordability of state-of-the-art LLMs. Additionally, Q-Hitter empowers LLMs to effectively handle inputs of infinite sequence length. Extensive experiments conducted across various LLMs and tasks substantiate the superiority of the proposed Q-Hitter framework over the original $\mathsf{H_2O}$ framework. Remarkably, Q-Hitter achieves full model quality preservation while delivering up to a 20$\times$ reduction in memory usage and up to 33$\times$, 33$\times$, 4$\times$, and 1.3$\times$ throughput improvements compared with Hugging Face Accelerate, DeepSpeed, FlexGen, and $\mathsf{H_2O}$, respectively. The code will be public upon acceptance.
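
As a rough illustration of the two ingredients the abstract discusses, the sketch below shows (a) Heavy Hitter selection by accumulated attention score and (b) a generic low-bit uniform quantize-dequantize step applied to a cached vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `select_heavy_hitters` and `quantize_uniform`, the list-of-rows attention layout, and the min-max quantizer are all hypothetical, and the "Quantization Friendliness" metric is not defined in the abstract, so it is deliberately left out.

```python
from typing import List


def select_heavy_hitters(attn_rows: List[List[float]], keep_ratio: float = 0.2) -> List[int]:
    """Return indices of the tokens with the highest accumulated attention.

    attn_rows[t][j] is the attention weight that generation step t assigns
    to cached token j. Rows may differ in length (causal masking).
    The 20% default mirrors the budget quoted in the abstract.
    """
    num_tokens = max(len(row) for row in attn_rows)
    accumulated = [0.0] * num_tokens
    for row in attn_rows:
        for j, weight in enumerate(row):
            accumulated[j] += weight
    # Keep at least one token so the cache is never empty.
    budget = max(1, int(round(keep_ratio * num_tokens)))
    ranked = sorted(range(num_tokens), key=lambda j: accumulated[j], reverse=True)
    return sorted(ranked[:budget])


def quantize_uniform(values: List[float], bits: int = 4) -> List[float]:
    """Min-max uniform quantize-dequantize of one vector (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]
```

Calling `select_heavy_hitters` on the attention rows of a short sequence returns the positions whose KV entries would be retained, and `quantize_uniform` can then be applied to each retained key or value vector. Comparing a vector with its dequantized copy gives one simple way to see how much error low-bit storage introduces for a given token.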
