

Poster

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Qianchao Zhu · Jiangfei Duan · Chang Chen · Siran Liu · Xiuhong Li · Guanyu Feng · Xin Lv · Xiao Chuanfu · Dahua Lin · Chao Yang


Abstract: Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing sparse attention approaches employ either a static sparse pattern or a fixed sparsity ratio to exploit the high attention sparsity, failing to capture the adaptive sparsity ratios and dynamic sparse patterns across attention heads, input content, and model architectures. To balance accuracy and performance efficiently, we introduce a robust accuracy indicator, Cumulative Residual Attention (CRA), which measures the percentage of attention recall. Leveraging this key insight, we present SampleAttention, which employs a novel two-stage query-guided key-value filtering approach to efficiently and dynamically select a minimal set of important column and slash strips that meet a desired CRA threshold, thus maximizing efficiency while preserving accuracy. Comprehensive evaluations show that SampleAttention establishes a new Pareto frontier in the accuracy-efficiency trade-off and reduces TTFT by up to 5.29× compared with FlashAttention2.
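The abstract does not spell out how the CRA threshold is applied. The sketch below is a simplified, hypothetical illustration of the idea: given an attention probability matrix, it greedily keeps the smallest set of key columns whose accumulated attention mass reaches a target recall (e.g., 95%). It covers only column strips and omits slash strips and the two-stage query-guided sampling described in the paper; the function name `select_columns_by_cra` and all parameters are our own, not the authors' API.

```python
import numpy as np

def select_columns_by_cra(attn_probs: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Return the minimal set of key-column indices whose summed attention
    mass reaches the CRA threshold (column-only, illustrative sketch).

    attn_probs: (num_queries, num_keys) softmax attention probabilities
    threshold:  desired Cumulative Residual Attention, e.g. 0.95
    """
    # Total attention mass each key column receives across all queries.
    col_mass = attn_probs.sum(axis=0)
    # Sort columns from heaviest to lightest and accumulate their mass.
    order = np.argsort(col_mass)[::-1]
    cum_frac = np.cumsum(col_mass[order]) / col_mass.sum()
    # Smallest k such that the kept fraction of attention mass >= threshold.
    k = int(np.searchsorted(cum_frac, threshold)) + 1
    return np.sort(order[:k])

# Toy usage: random scores for 8 queries over 16 keys.
rng = np.random.default_rng(0)
scores = rng.normal(size=(8, 16))
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
kept = select_columns_by_cra(probs, threshold=0.95)
print(f"kept {len(kept)}/16 key columns: {kept}")
```

In this toy setting the retained columns capture at least 95% of the total attention mass, which is the notion of "attention recall" that CRA quantifies; the paper's method additionally adapts the selection per head and input without materializing the full attention matrix.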
