Poster
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar · Srikant Bharadwaj · Renée St. Amant · Victor Ruehle · Saravan Rajmohan
Transformer-based large language models are memory-hungry and incur significant inference latencies even on cutting-edge AI accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. To that end, we propose LeanAttention, a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode-phase of transformer-based models. LeanAttention enables scaling the attention mechanism for the challenging case of long context lengths by re-designing the attention execution flow for the decode-phase. As a result, we achieve an average of 1.73x speedup in attention execution compared to FlashDecoding, with up to 2.18x speedup for 256k context length.
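For intuition on the decode-phase cost the abstract refers to, here is a minimal sketch (not the authors' LeanAttention algorithm) of plain single-query attention against a KV cache: each newly generated token attends over all previously cached tokens, so per-step work grows linearly with context length and the total cost over a generation grows quadratically. The function name, shapes, and sizes below are illustrative assumptions, not details from the poster.

```python
import numpy as np

def decode_step_attention(q, k_cache, v_cache):
    """Exact attention for one decode step (illustrative sketch).

    q:       (num_heads, head_dim)           query for the new token
    k_cache: (num_heads, ctx_len, head_dim)  keys of all prior tokens
    v_cache: (num_heads, ctx_len, head_dim)  values of all prior tokens
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    # One score per cached token and head -> O(ctx_len) work per decode step.
    scores = np.einsum("hd,htd->ht", q, k_cache) * scale
    # Numerically stable softmax over the context dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of cached values: (num_heads, head_dim).
    return np.einsum("ht,htd->hd", weights, v_cache)

# Hypothetical usage: 32 heads, head_dim 128, 4096 cached tokens.
h, d, ctx = 32, 128, 4096
q = np.random.randn(h, d).astype(np.float32)
k = np.random.randn(h, ctx, d).astype(np.float32)
v = np.random.randn(h, ctx, d).astype(np.float32)
out = decode_step_attention(q, k, v)  # shape (32, 128)
```

At long context lengths (e.g., the 256k case reported above), the context dimension dominates this computation, which is the regime LeanAttention targets by re-partitioning the decode-phase execution flow; the exact partitioning scheme is described in the paper, not in this sketch.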