

Poster

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Zihao Ye · Lequn Chen · Ruihang Lai · Wuwei Lin · Yineng Zhang · Stephanie Wang · Tianqi Chen · Baris Kasikci · Vinod Grover · Arvind Krishnamurthy · Luis Ceze


Abstract:

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for practical inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using a block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer's load-balanced scheduling adjusts to input dynamism while maintaining compatibility with CUDAGraph. We integrate FlashInfer into leading LLM serving frameworks such as SGLang, vLLM, and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer's ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token latency reduction over compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
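To make the block-sparse KV-cache idea in the abstract concrete, below is a minimal NumPy sketch, not FlashInfer's actual API, of decode attention over a paged KV cache: each request's keys and values live in fixed-size pages scattered through a shared pool, and a per-request page table tells the kernel which pages to gather. The page size, pool size, and helper names are illustrative assumptions.

```python
# Illustrative sketch (assumed names and sizes), not FlashInfer's API.
import numpy as np

PAGE_SIZE = 16      # tokens per KV page (assumed)
NUM_PAGES = 64      # pages in the shared pool (assumed)
HEAD_DIM = 8        # attention head dimension (assumed)

# Global KV pool shared by all requests: [num_pages, page_size, head_dim]
k_pool = np.random.randn(NUM_PAGES, PAGE_SIZE, HEAD_DIM).astype(np.float32)
v_pool = np.random.randn(NUM_PAGES, PAGE_SIZE, HEAD_DIM).astype(np.float32)

def decode_attention(query, page_table, seq_len):
    """Single-query ("decode") attention over a paged KV cache.

    query:      [head_dim] query vector for the newly generated token
    page_table: list of page indices holding this request's KV history
    seq_len:    number of valid tokens in that history
    """
    # Gather only the pages this request references; skipping all other
    # pages is the sparsity a block-sparse layout exploits.
    k = k_pool[page_table].reshape(-1, HEAD_DIM)[:seq_len]
    v = v_pool[page_table].reshape(-1, HEAD_DIM)[:seq_len]

    scores = k @ query / np.sqrt(HEAD_DIM)   # [seq_len]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                     # softmax over the history
    return probs @ v                         # [head_dim] attention output

# Example: a request whose 40-token history spans pages 3, 7, and 12.
out = decode_attention(np.random.randn(HEAD_DIM).astype(np.float32),
                       page_table=[3, 7, 12], seq_len=40)
print(out.shape)  # (8,)
```

In a real engine this gather-and-attend step runs as a fused GPU kernel; the sketch only shows the data layout and indexing pattern that the block-sparse format makes efficient.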
