Session
Research-Track Oral Presentation: R13: LLM Serving
Grand Ballroom 2
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Yilong Zhao ⋅ Jiaming Tang ⋅ Kan Zhu ⋅ Zihao Ye ⋅ Chi-Chih Chang ⋅ Chaofan Lin ⋅ Jongseok Park ⋅ Guangxuan Xiao ⋅ Mohamed Abdelfattah ⋅ Mingyu Gao ⋅ Baris Kasikci ⋅ Song Han ⋅ Ion Stoica
Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access at every step, putting substantial pressure on memory bandwidth. To address this, we introduce SpecGen, a speculative decoding framework that reuses the same model as both the draft and target model (i.e., self-speculation). SpecGen features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens by reusing information from the verification stage. Furthermore, SpecGen co-designs self-speculation with three system innovations: (1) a unified scheduler that batches token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SpecGen outperforms state-of-the-art solutions, achieving up to a 2.13× throughput speedup.
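The self-speculative draft-then-verify loop that SpecGen builds on can be sketched as follows. This is a generic greedy-decoding illustration, not the paper's implementation: `sparse_next_token` and `full_next_token` are hypothetical stand-ins for a sparse-attention draft pass and a full-attention verify pass over the same model.

```python
def draft_then_verify(prefix, sparse_next_token, full_next_token, k=4):
    # Draft k tokens cheaply, one sparse-attention pass per step.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = sparse_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # Verify: the target's greedy choice at each drafted position must
    # match the draft to be accepted; on the first mismatch we keep the
    # target's token (a "bonus" token) and stop. A real engine scores
    # all k positions in a single batched full-attention pass.
    accepted = []
    for i, t in enumerate(draft):
        target = full_next_token(list(prefix) + draft[:i])
        if target != t:
            accepted.append(target)
            break
        accepted.append(t)
    return accepted
```

Because draft and target are the same weights, the mismatch here comes only from the sparse approximation of attention, which is why accurate critical-token selection drives the acceptance length.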
CDLM: Consistency Diffusion Language Models for Faster Sampling
Minseo Kim ⋅ Chenfeng Xu ⋅ Coleman Hooper ⋅ Harman Singh ⋅ Ben Athiwaratkun ⋅ Ce Zhang ⋅ Kurt Keutzer ⋅ Amir Gholami
Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and an inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6×-12.8× lower latency while maintaining competitive accuracy on math and coding tasks. The full training and evaluation code is available at https://anonymous.4open.science/r/ConsistencyDLManonymous-3E88/.
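The block-wise causal mask that makes a DLM compatible with KV caching can be illustrated with a small sketch. The semantics assumed here (tokens attend bidirectionally within their own block and causally to earlier blocks) follow the abstract's description; the function name is ours, not CDLM's API.

```python
import numpy as np

def blockwise_causal_mask(seq_len, block_size):
    # True = may attend. Position i attends to position j whenever j's
    # block is the same as or earlier than i's block, so attention is
    # bidirectional inside a block but strictly causal across blocks.
    # Finalized earlier blocks never see later ones, which is what
    # lets their keys/values be cached and reused.
    blk = np.arange(seq_len) // block_size
    return blk[None, :] <= blk[:, None]
```

With `block_size=1` this degenerates to the standard causal mask of an autoregressive LM; larger blocks trade caching granularity for parallel refinement within the block.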
PRISM: Parametrically Restructured Inference for Speculative Sampling Draft Models
Xuliang Wang ⋅ Yuetao Chen ⋅ Maochan Zhen ⋅ Fang Liu ⋅ Xinzhou Zheng ⋅ Xingwu Liu ⋅ Hong Xu ⋅ Ming Li
Large Language Models (LLMs), constrained by their auto-regressive nature, have long suffered from expensive and slow decoding. Speculative sampling methods, capable of alleviating the memory bandwidth bottleneck, have attracted attention from both the system and AI research communities. The demand for high predictive performance has created a growing trend of training parametrically larger and more powerful draft models, which in turn introduces growing computation overhead. While existing works balance this trade-off to find a sweet spot, in this paper we dive deeper into the effectiveness-efficiency dilemma and address it with architectural innovation. By disaggregating the computation of each predictive step across different parameter sets, we restructure the computational paths of the draft model, decoupling representation capacity from inference cost and making the model both scalable and fast. We conduct extensive experiments showing that our PRISM drafter outperforms state-of-the-art draft architectures in acceptance length and end-to-end throughput when trained on the same dataset. We also show that PRISM scales exceptionally well on large datasets where some other architectures fail. On average, PRISM speculative decoding achieves more than a 2.6× end-to-end speedup when integrated with an already highly optimized inference engine.
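One way to read "disaggregating the computation of each predictive step across different parameter sets" is a drafter in which every speculative position gets its own lightweight head over a shared hidden state, so capacity can grow with the heads rather than by re-running a deep trunk per drafted token. The sketch below assumes this multi-head reading purely for illustration; it is not PRISM's actual architecture.

```python
import numpy as np

def multihead_draft(hidden, step_heads):
    # Hypothetical per-step parameter sets: one trunk forward produces
    # `hidden`, then each drafted position i applies its own small
    # projection step_heads[i] (shape: d_model x vocab) and takes the
    # argmax. Draft cost per token is one matmul, not a trunk pass.
    return [int(np.argmax(hidden @ W)) for W in step_heads]
```

Under this reading, making the drafter "parametrically larger" means widening or adding heads, which changes memory but barely changes the per-step latency on the critical path.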
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
Jameson Sandler ⋅ Jacob K Christopher ⋅ Ferdinando Fioretto
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: (1) the autoregressive dependency during drafting, which limits parallelism, and (2) frequent rejections of draft tokens caused by misalignment between the draft and verify models. This paper proposes SpecDiff-2, a novel framework that jointly addresses these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). Experimental results across a comprehensive benchmark suite show that SpecDiff-2 achieves a new state of the art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to 55% on average over previous baselines and obtaining up to a 5.5× average speed-up over standard decoding, without any loss of accuracy.
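The "lossless" guarantee that draft-then-verify methods (diffusion drafters included) rely on comes from the standard speculative sampling acceptance rule: accept the drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, otherwise resample from the normalized residual max(p - q, 0). A minimal sketch over plain lists, not tied to any system in this session:

```python
import random

def accept_or_resample(p, q, x, rng=random):
    # Standard lossless speculative sampling rule: the output token is
    # distributed exactly according to p regardless of q, so drafting
    # never changes the target model's distribution.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual distribution max(p - q, 0),
    # renormalized over the vocabulary.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(residual)
    r = rng.random() * z
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i
    return len(p) - 1
```

Misalignment between p and q shrinks the acceptance probability min(1, p(x)/q(x)), which is exactly why calibrating the drafter to the verifier raises acceptance rates.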
Speculative Decoding: Performance or Illusion?
Lily Liu ⋅ Jiaxiang Yu ⋅ Jongseok Park ⋅ Alvin Cheung ⋅ Ion Stoica
Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear, as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants (n-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze the key factors governing SD performance and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates execution time, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance against the theoretical bound reveals substantial gaps, which we use to highlight new research opportunities for improving SD.
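A simple form of the theoretical speedup bound such a study quantifies can be written down directly: if one SD iteration runs k draft passes, each costing a fraction c of a target forward pass, plus one verify pass, and emits an average of τ accepted tokens, then the speedup over plain autoregressive decoding (one target pass per token) is τ / (1 + k·c). The exact cost model in the paper may differ; this is the textbook approximation:

```python
def sd_speedup_bound(mean_accept_len, draft_cost_ratio, k):
    # One SD iteration: k draft passes (each draft_cost_ratio of a
    # target forward pass) plus one target verify pass, emitting
    # mean_accept_len tokens on average. Plain autoregressive decoding
    # pays one target pass per token.
    return mean_accept_len / (1.0 + k * draft_cost_ratio)
```

Setting draft_cost_ratio to zero gives the idealized upper bound (speedup equals acceptance length), which makes the study's observed gaps easy to interpret: they come from draft overhead, verification cost at large batch sizes, and acceptance lengths well below k + 1.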