SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Jameson Sandler ⋅ Jacob K Christopher ⋅ Ferdinando Fioretto
Abstract
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: \textbf{(1)} the autoregressive dependency during drafting, which limits parallelism, and \textbf{(2)} frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes \emph{SpecDiff-2}, a novel framework that jointly addresses these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that \emph{SpecDiff-2} achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by an average of up to $+55\%$ over previous baselines and obtaining up to a $5.5\times$ average speed-up over standard decoding, without any loss of accuracy.
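To make the draft-then-verify procedure referenced above concrete, here is a minimal sketch of the standard lossless speculative-sampling verification step that frameworks like SpecDiff-2 build on; the function name, tensor shapes, and drafter/verifier interfaces are illustrative assumptions, not the paper's actual implementation. The acceptance test preserves the verifier's exact output distribution, which is why drafter-verifier misalignment (bottleneck 2) directly lowers the acceptance rate, and why a non-autoregressive drafter that proposes all $k$ tokens in parallel targets bottleneck (1).

```python
import torch

def verify_draft(draft_tokens: torch.Tensor,
                 p_draft: torch.Tensor,
                 p_verify: torch.Tensor) -> list[int]:
    """Standard lossless draft-then-verify step (speculative sampling).

    Shapes are assumptions for this sketch:
      draft_tokens: (k,)      token ids proposed by the drafter
      p_draft:      (k, V)    drafter probabilities at each drafted position
      p_verify:     (k+1, V)  verifier probabilities at those positions,
                              plus one extra position for a bonus token
    Returns the accepted prefix plus one corrective or bonus token.
    """
    out: list[int] = []
    for i, tok in enumerate(draft_tokens.tolist()):
        q = p_draft[i, tok]    # drafter's probability of its own proposal
        p = p_verify[i, tok]   # verifier's probability of that proposal
        # Accept with probability min(1, p / q). Misalignment between the
        # two models lowers this ratio -- the rejection bottleneck (2).
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            out.append(tok)
            continue
        # On rejection, resample from the renormalized residual
        # max(0, p_verify - p_draft); this correction makes the whole
        # procedure exactly equivalent to sampling from the verifier.
        residual = torch.clamp(p_verify[i] - p_draft[i], min=0.0)
        out.append(int(torch.multinomial(residual / residual.sum(), 1)))
        return out
    # Every draft token was accepted: take a free bonus token from the
    # verifier's distribution at the next position.
    out.append(int(torch.multinomial(p_verify[-1], 1)))
    return out
```

Because all accepted tokens come either from passing this test or from the residual correction, speed-ups come entirely from accepting longer draft prefixes per verifier call, with no change to the verifier's output distribution.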