Poster

LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Shang Yang ⋅ Junxian Guo ⋅ Haotian Tang ⋅ Qinghao Hu ⋅ Guangxuan Xiao ⋅ Jiaming Tang ⋅ Yujun Lin ⋅ Zhijian Liu ⋅ Yao Lu ⋅ Song Han

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via unified sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. For Llama-3-8B, LServe accelerates LLM prefilling by an average of 2.4x and decoding by up to 3.3x over TensorRT-LLM, maintaining long-context accuracy. The code will be released upon publication.
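
The page-selection idea in the abstract (keep a constant number of KV pages, chosen by how well they match the current query) can be illustrated with a small sketch. This is not LServe's implementation. It assumes one plausible scoring rule: each page stores the per-channel minimum and maximum of its keys, and the page score is an upper bound on the query-key dot product over that page. The names `page_bounds`, `page_score`, `select_pages`, `page_size`, and `budget` are hypothetical, and the hierarchical aspect of the policy is omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def page_bounds(keys: List[Vector], page_size: int) -> List[Tuple[List[float], List[float]]]:
    """Split keys into fixed-size pages and record per-channel (min, max) for each page."""
    pages = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        mins = [min(col) for col in zip(*page)]
        maxs = [max(col) for col in zip(*page)]
        pages.append((mins, maxs))
    return pages


def page_score(query: Vector, mins: Vector, maxs: Vector) -> float:
    """Upper bound on the dot product q . k over every key k stored in the page."""
    return sum(max(q * lo, q * hi) for q, lo, hi in zip(query, mins, maxs))


def select_pages(query: Vector, keys: List[Vector], page_size: int = 2, budget: int = 2) -> List[int]:
    """Return the indices of the `budget` highest-scoring pages, in sequence order.

    The budget is a constant (an assumption mirroring the abstract's claim),
    so the number of pages attended to does not grow with context length.
    """
    bounds = page_bounds(keys, page_size)
    scores = [page_score(query, lo, hi) for lo, hi in bounds]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: 8 keys of dimension 2, grouped into 4 pages of 2 keys each.
keys = [
    [0.1, 0.0], [0.2, 0.1],    # page 0
    [0.9, 0.0], [1.0, 0.3],    # page 1
    [-0.5, 0.2], [-0.4, 0.1],  # page 2
    [0.6, 0.0], [0.7, 0.5],    # page 3
]
query = [1.0, 0.0]
print(select_pages(query, keys, page_size=2, budget=2))  # prints [1, 3]
```

Attention would then be computed only over the keys and values in the selected pages, so the decoding cost per step depends on the budget rather than on the full context length.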
