Session
Research Track Oral Presentation: LLM Serving 5
Grand Ballroom 2
Moderator: Xiang Song
PRISM: Parametrically Refactor Inference for Speculative Decoding Draft Models
Xuliang Wang ⋅ Yuetao Chen ⋅ Maochan Zhen ⋅ Fang LIU ⋅ Xinzhou Zheng ⋅ Xingwu Liu ⋅ Hong Xu ⋅ Ming Li
Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6×.
CDLM: Consistency Diffusion Language Models for Faster Sampling
Minseo Kim ⋅ Chenfeng Xu ⋅ Coleman Hooper ⋅ Harman Singh ⋅ Ben Athiwaratkun ⋅ Ce Zhang ⋅ Kurt Keutzer ⋅ Amir Gholami
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6×-14.5× lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
Speculative Decoding: Performance or Illusion?
Xiaoxuan Liu ⋅ Jiaxiang Yu ⋅ Jongseok Park ⋅ Ion Stoica ⋅ Alvin Cheung
Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance with theoretical bounds reveals substantial gaps between observed and theoretical upper bounds, and we leverage this observation to highlight new research opportunities that our study opens up in improving SD.
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Jameson Sandler ⋅ Jacob K Christopher ⋅ Tom Hartvigsen ⋅ Ferdinando Fioretto
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: \textbf{(1)} the autoregressive dependency during drafting which limits parallelism, and \textbf{(2)} frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes \emph{SpecDiff-2}, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that \emph{SpecDiff-2} achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of $+55\%$ over previous baselines and obtaining up to $5.5\times$ average speed-up over standard decoding, without any loss of accuracy.
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Yilong Zhao ⋅ Jiaming Tang ⋅ Kan Zhu ⋅ Zihao Ye ⋅ Chi-Chih Chang ⋅ Chaofan Lin ⋅ Jongseok Park ⋅ Guangxuan Xiao ⋅ Mohamed Abdelfattah ⋅ Mingyu Gao ⋅ Baris Kasikci ⋅ Song Han ⋅ Ion Stoica
Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SpecGen, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SpecGen features a novel sparse attention mechanism \textit{PillarAttn} as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SpecGen co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SpecGen outperforms state-of-the-art solutions, with an up to $2.13\times$ throughput speedup.