Session
Research-Track Oral Presentation: R12: LLM Training
Grand Ballroom 2
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Zelei Shao ⋅ Vikranth Srivatsa ⋅ Junxiong Wang ⋅ Chenfeng Xu ⋅ Xiaoxia Wu ⋅ Qingyang Wu ⋅ Jue Wang ⋅ Ameen Patel ⋅ Yiying Zhang ⋅ Percy Liang ⋅ Tri Dao ⋅ Ben Athiwaratkun ⋅ Ce Zhang
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck, the long-tail distribution of rollout lengths, in which a small fraction of long generations dominates wall-clock time, and a complementary opportunity: historical rollouts reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: a self-evolving, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base- and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by over 30% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
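The nonparametric drafter described above can be approximated in a few lines. The sketch below uses a plain n-gram index over recent rollouts rather than the paper's incrementally maintained suffix tree, and all class and method names are illustrative, not DAS's actual API:

```python
from collections import defaultdict

class NGramDrafter:
    """Nonparametric drafter built from recent rollouts.

    Simplified stand-in for a suffix-tree drafter: an n-gram index
    mapping recently seen contexts to their observed continuations.
    """

    def __init__(self, max_order=8):
        self.max_order = max_order
        self.index = defaultdict(list)  # context tuple -> continuation tokens

    def add_rollout(self, tokens):
        # Index every n-gram context together with the token that followed it.
        for n in range(1, self.max_order + 1):
            for i in range(len(tokens) - n):
                self.index[tuple(tokens[i:i + n])].append(tokens[i + n])

    def draft(self, context, budget):
        # Greedily extend the context by matching its longest indexed suffix;
        # `budget` plays the role of the length-aware draft budget.
        draft, ctx = [], list(context)
        for _ in range(budget):
            nxt = None
            for n in range(min(self.max_order, len(ctx)), 0, -1):
                hits = self.index.get(tuple(ctx[-n:]))
                if hits:
                    nxt = max(set(hits), key=hits.count)  # most frequent continuation
                    break
            if nxt is None:
                break  # no history matches: stop speculating
            draft.append(nxt)
            ctx.append(nxt)
        return draft
```

Because drafting is a dictionary lookup rather than a model forward pass, the drafted tokens can be verified by the target model in one batched step, which is where the rollout speedup comes from.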
FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy
Weilin Cai ⋅ Diandian Gu ⋅ Jun Wang ⋅ Jiayi Huang
Large language model (LLM) training has become a critical workload in shared GPU clusters. However, our observations reveal that these clusters suffer from significant underutilization. To address this inefficiency, various elastic training techniques have been developed to dynamically adjust GPU allocations to harness idle resources. Despite their potential, these methods have seen limited deployment in production environments due to three major challenges: accuracy inconsistency, excessive profiling overhead, and limited flexibility. In this paper, we propose FlexTrain, an elastic training system that achieves consistent model accuracy, high training efficiency, and effective resource utilization. FlexTrain prioritizes adjustments to the pipeline parallelism (PP) degree to preserve deterministic computation and maintain accuracy consistency, while also supporting data parallelism (DP) scaling to further enhance throughput under relaxed consistency requirements. It generates optimal PP schedules, predicts training performance under different configurations, and makes scaling decisions based on job submission intervals, scaling overhead, and expected throughput gains. Evaluation results show that FlexTrain can achieve up to 1.73x speedup for elastic jobs while preserving consistent accuracy, and up to 2.27x when accuracy consistency is relaxed, compared to CompanyX's current scheduling strategy.
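The scaling decision described above amounts to a cost-benefit test. The function below is a deliberately minimal sketch of that idea under assumed inputs (predicted throughput, scaling overhead, and an expected idle window); it is not FlexTrain's actual predictor, which also weighs job submission intervals and PP-schedule generation:

```python
def should_scale(current_tput, predicted_tput, scaling_overhead_s, idle_window_s):
    """Elastic scaling decision sketch: take idle GPUs only if the predicted
    throughput gain over the expected idle window outweighs the work lost
    while reconfiguring the job. Throughputs are in tokens/s (illustrative).
    """
    gain_tokens = (predicted_tput - current_tput) * idle_window_s
    lost_tokens = current_tput * scaling_overhead_s  # stall during reconfiguration
    return gain_tokens > lost_tokens
```

For example, a job at 100 tokens/s that would reach 180 tokens/s after a 60 s reconfiguration is worth scaling for a 10-minute idle window, but not for a 1-minute one.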
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
Yongjun He ⋅ Shuai Zhang ⋅ Xiyuan Zhang ⋅ Boran Han ⋅ Bernie Wang ⋅ Huzefa Rangwala ⋅ George Karypis
As large language models (LLMs) scale and new GPUs are released ever more frequently, there is increasing demand for LLM post-training in heterogeneous environments, both to fully leverage underutilized mid-range or previous-generation GPUs across regions and to alleviate the shortage of homogeneous high-end GPUs within a single region. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training on infrastructures with heterogeneous GPUs and networks. HetRL formulates RL training scheduling in heterogeneous environments as a constrained joint optimization problem and introduces a novel scheduling algorithm that (1) decomposes the complex search space with a multi-level search framework and (2) allocates the search budget via successive halving. Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL delivers up to 9.17× (3.17× on average) the throughput of state-of-the-art systems across various workloads and settings.
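Successive halving, the budget-allocation strategy the abstract names, is a standard technique that can be sketched independently of HetRL's internals. In the sketch below, `evaluate(config, budget)` is a hypothetical scoring function (e.g. estimated RL training throughput for a candidate schedule), not HetRL's actual interface:

```python
import math

def successive_halving(configs, evaluate, total_budget):
    """Allocate a search budget across candidate configurations:
    score every survivor with a small per-candidate budget, keep the
    better half, and double the per-candidate budget each round, so
    most of the budget goes to the most promising candidates.
    """
    survivors = list(configs)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    budget = total_budget // (rounds * len(survivors)) or 1  # initial slice
    while len(survivors) > 1:
        scores = {c: evaluate(c, budget) for c in survivors}
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]  # keep top half
        budget *= 2
    return survivors[0]
```

The appeal in this setting is that bad schedules are discarded after only a cheap evaluation, instead of each candidate receiving an equal share of a scarce profiling budget.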
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Justin Bauer ⋅ Thomas Walshe ⋅ Derek Pham ⋅ Harit Vishwakarma ⋅ Armin Parchami ⋅ Frederic Sala ⋅ Paroma Varma
Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or on questions with well-defined ground-truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored how scaling the data and compute used for RLVR benefits model reasoning capabilities, those results do not transfer to many real-world settings where annotated data and accessible compute are scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low-data regimes. Across three novel datasets covering number counting, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow fine-grained evaluation and the development of training datasets with controllable properties (size, diversity, and complexity), (2) RLVR enables models trained on lower-complexity tasks to generalize to higher-complexity tasks, and (3) training on mixed-complexity datasets offers the greatest benefit in low-data regimes, providing up to 5× the sample efficiency of training on easy tasks alone. These findings motivate future work on data scaling laws for RLVR and on procedural data generators for understanding effective data development for efficient LLM fine-tuning.
NexSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
Qiaoling Chen ⋅ Zijun Liu ⋅ Peng Sun ⋅ Shenggui Li ⋅ Guoteng Wang ⋅ Ziming Liu ⋅ Yonggang Wen ⋅ Siyuan Feng ⋅ Tianwei Zhang
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naïve integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present NexSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), NexSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
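The verification step that SD relies on, and that NexSpec must preserve to avoid policy degradation, is the standard lossless acceptance test: accept each drafted token x with probability min(1, p_target(x)/p_draft(x)). The sketch below shows that test with illustrative signatures (`p_draft`/`p_target` as probability lookups), not NexSpec's interface:

```python
import random

def accept_draft(draft_tokens, p_draft, p_target, rng=random.random):
    """Speculative-decoding verification: accept drafted token x at
    position i with probability min(1, p_target(i, x) / p_draft(i, x)).
    Returns the accepted prefix; the first rejection ends the run, after
    which the target model would resample that position itself.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q, p = p_draft(i, tok), p_target(i, tok)
        if rng() < min(1.0, p / max(q, 1e-12)):
            accepted.append(tok)
        else:
            break
    return accepted
```

Because acceptance depends on the ratio of target to drafter probabilities, a drafter that drifts from the continually updated actor (the "staleness" gap above) sees its acceptance rate, and therefore the speedup, collapse.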
PyLO: Towards Accessible Learned Optimizers in PyTorch
Paul Janson ⋅ Benjamin Thérien ⋅ Quentin Anthony ⋅ Xiaolong Huang ⋅ Abhinav Moudgil ⋅ Eugene Belilovsky
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances such as VeLO, which was meta-trained for 4000 TPU-months, remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for using the optimizers independently after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the remaining ≈80% of the machine learning community via the familiar torch.optim.Optimizer interface. Unlike prior work focused on limited-scale academic tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our systems contribution includes CUDA-accelerated implementations of the small_fc_lopt (Metz et al., 2022a) and VeLO (Metz et al., 2022b) learned optimizers, achieving substantial performance gains: training throughput on ViT-B/16 (batch size 32) increases from 39.36 and 49.73 to 205.59 and 191.18 samples per second, respectively. PyLO's design also makes it easy to combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay, and in doing so we discover that learned optimizers can benefit substantially from them. Our code is available at https://anonymous.4open.science/r/pylo-C91E32
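The drop-in pattern the abstract describes, constructing an optimizer over parameters and calling `zero_grad()`/`step()` each iteration, can be illustrated without a torch dependency. The toy class below mimics that calling convention in pure Python, with the meta-trained update network replaced by fixed sign descent; it is a sketch of the interface idea only, not PyLO's implementation, whose optimizers subclass torch.optim.Optimizer directly:

```python
class LearnedOptimizer:
    """Toy optimizer mirroring the torch.optim.Optimizer calling pattern.

    Each parameter is a dict {'value': float, 'grad': float}; the 'learned'
    rule here is fixed sign descent standing in for a meta-trained network.
    """

    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr

    def zero_grad(self):
        for p in self.params:
            p['grad'] = 0.0

    def step(self):
        # Sign descent: move each parameter a fixed step against its gradient.
        for p in self.params:
            if p['grad'] > 0:
                p['value'] -= self.lr
            elif p['grad'] < 0:
                p['value'] += self.lr

# Usage: minimize (x - 3)^2 with the familiar zero_grad / grad / step loop.
param = {'value': 0.0, 'grad': 0.0}
opt = LearnedOptimizer([param], lr=0.5)
for _ in range(10):
    opt.zero_grad()
    param['grad'] = 2.0 * (param['value'] - 3.0)  # hand-computed gradient
    opt.step()
```

Because the training loop never changes, swapping a hand-designed rule for a learned one is a one-line substitution at the optimizer's construction site, which is exactly what makes composing with schedules and weight decay straightforward.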