MAC-Attention: a Match-Amend-Complete scheme for fast and accurate attention computation
Abstract
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes either via compression (lowering fidelity) or via selection/eviction (restricting what remains accessible), both of which can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. The match stage performs pre-RoPE L2 matching over a short local window; the amend stage rectifies the reused attention by recomputing a small band near the match boundary; and the complete stage fuses the rectified result with fresh attention computed on the KV tail, via a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant in the context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups (up to 2.6x end-to-end), while maintaining full-attention quality. By reusing computation rather than compressing or discarding tokens, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available.
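The "numerically stable merge" in the complete stage can be read as the standard log-sum-exp combination used by online-softmax and FlashAttention-style kernels: each partial attention over a KV block returns its softmax-normalized output together with the log-sum-exp of its scores, and two such partials combine exactly into the full result. A minimal NumPy sketch (function names are illustrative, not from the paper's code):

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over one KV block; returns block-local softmax output
    and the log-sum-exp of its scores (needed to merge blocks later)."""
    s = k @ q                      # (n,) raw scores for this block
    m = s.max()                    # subtract max for numerical stability
    p = np.exp(s - m)
    lse = m + np.log(p.sum())      # log-sum-exp of the block's scores
    o = (p / p.sum()) @ v          # output normalized within the block
    return o, lse

def merge(o1, lse1, o2, lse2):
    """Numerically stable fusion of two partial attention results:
    reweight each block output by its share of the total softmax mass."""
    lse = np.logaddexp(lse1, lse2)
    return np.exp(lse1 - lse) * o1 + np.exp(lse2 - lse) * o2

# Sanity check: merging attention over two halves of the KV cache
# reproduces full attention over the whole cache.
rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=(32, 8))
v = rng.normal(size=(32, 4))
o_full, _ = partial_attention(q, k, v)
o1, l1 = partial_attention(q, k[:16], v[:16])
o2, l2 = partial_attention(q, k[16:], v[16:])
assert np.allclose(o_full, merge(o1, l1, o2, l2))
```

Under this reading, the rectified reused result and the fresh tail attention each play the role of one partial, so the fusion is exact rather than approximate.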