Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training
Abstract
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck: the long-tail distribution of rollout lengths, in which a small fraction of long generations dominates wall-clock time. We also identify a complementary opportunity: historical rollouts reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose \textbf{DAS, a Distribution-Aware Speculative decoding framework} that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: a \textbf{self-evolving, nonparametric drafter} built from recent rollouts and maintained incrementally as a suffix tree, and a \textbf{length-aware speculation policy} that allocates more aggressive draft budgets to the long trajectories that dominate the makespan. This design exploits rollout history to sustain high acceptance rates while balancing per-step base costs and per-token costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by over 30\% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can substantially accelerate RL post-training without compromising learning quality.
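To make the first idea concrete, the sketch below shows a minimal nonparametric drafter over recent rollouts. It is illustrative only: the paper maintains an incremental suffix tree, whereas this sketch approximates that structure with a hash map from fixed-length suffix n-grams to positions in stored rollouts; all names (`HistoryDrafter`, `ngram`, `budget`) are assumptions, not the paper's API.

```python
from collections import defaultdict

class HistoryDrafter:
    """Nonparametric drafter built from recent rollout tokens.

    Minimal sketch: the incrementally maintained suffix tree described in
    the abstract is approximated here by a hash map from fixed-length
    suffix n-grams to the positions where they occur in stored rollouts.
    """

    def __init__(self, ngram: int = 4):
        self.ngram = ngram
        self.rollouts: list[list[int]] = []  # recent rollout token ids
        # suffix n-gram -> list of (rollout index, continuation start)
        self.index: dict[tuple, list] = defaultdict(list)

    def add_rollout(self, tokens: list[int]) -> None:
        """Incrementally index a finished rollout (self-evolving corpus)."""
        r = len(self.rollouts)
        self.rollouts.append(tokens)
        for i in range(len(tokens) - self.ngram):
            key = tuple(tokens[i : i + self.ngram])
            self.index[key].append((r, i + self.ngram))

    def draft(self, context: list[int], budget: int) -> list[int]:
        """Propose up to `budget` tokens continuing the current context;
        the target model then verifies them, so outputs are unchanged."""
        if len(context) < self.ngram:
            return []
        key = tuple(context[-self.ngram :])
        for r, start in reversed(self.index.get(key, [])):  # prefer fresh rollouts
            cont = self.rollouts[r][start : start + budget]
            if cont:
                return cont
        return []
```

Because drafted tokens are always verified by the target model, a drafter like this can only change speed, not outputs, which is consistent with the identical-training-curves claim.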
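For the second idea, the following sketch shows one plausible length-aware budget rule: prompts whose past rollouts were long receive more aggressive draft budgets, since those trajectories dominate the batch makespan. The scaling rule, thresholds, and names (`draft_budget`, `long_quantile_len`) are assumptions for illustration, not the paper's exact policy.

```python
def draft_budget(prompt_id: str,
                 length_history: dict[str, list[int]],
                 min_k: int = 2, max_k: int = 16,
                 long_quantile_len: float = 2000.0) -> int:
    """Illustrative length-aware speculation policy.

    Allocates larger draft budgets to prompts with historically long
    rollouts (which dominate makespan) and keeps speculation conservative
    for short ones to limit wasted draft/verify work.
    """
    hist = length_history.get(prompt_id, [])
    if not hist:
        return min_k  # no history yet: be conservative
    expected_len = sum(hist) / len(hist)
    # Linearly scale the budget with expected length, clipped to [min_k, max_k].
    frac = min(expected_len / long_quantile_len, 1.0)
    return int(round(min_k + frac * (max_k - min_k)))
```

Under this kind of rule, the per-prompt budget trades off the fixed per-step verification cost against per-token drafting cost, spending the extra speculation where it shortens the longest trajectories.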