Session
Industry-Track Oral Presentation: I2: LLM Training
Grand Ballroom 1
AXLearn: Modular, Hardware-Agnostic Large Model Training
Mark Lee ⋅ Tom Gunter ⋅ Chang Lan ⋅ Hanzhi Zhou ⋅ Sneha Bangalore ⋅ Xianzhi Du ⋅ Philipp Dufter ⋅ Ruixuan Hou ⋅ Haoshuo Huang ⋅ Xiang Kong ⋅ Jinhao Lei ⋅ Tao Lei ⋅ Meng Li ⋅ Li Li ⋅ Jiarui Lu ⋅ Zhiyun Lu ⋅ Zhucheng Tu ⋅ Chong Wang ⋅ Jianyu Wang ⋅ Zirui Wang ⋅ Sam Wiseman ⋅ Guoli Yin ⋅ Xiyou Zhou ⋅ Danyang Zhuo ⋅ Ruoming Pang
AXLearn is a production system that facilitates scalable, high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for hardware-agnostic training. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on different hardware infrastructure. AXLearn maintains constant complexity as we scale the components in the system, compared to the linear or quadratic complexity of state-of-the-art training systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to the hundreds of lines required in other systems. At the same time, AXLearn maintains performance equivalent to state-of-the-art training systems. Finally, we share our experience in the development and operation of AXLearn at Apple.
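To illustrate the kind of config-driven modularity the abstract describes, here is a toy Python sketch (hypothetical names throughout; not AXLearn's actual API) in which a cross-cutting change such as enabling RoPE is a single local config edit that propagates to every attention module:

```python
from dataclasses import dataclass, field, replace

# Hypothetical sketch of config-based composition: every module is built
# from a config object, and child components are themselves configs.

@dataclass(frozen=True)
class PosEmbCfg:
    kind: str = "none"          # "none" | "rope"

@dataclass(frozen=True)
class AttnCfg:
    dim: int
    pos_emb: PosEmbCfg = field(default_factory=PosEmbCfg)

@dataclass(frozen=True)
class LayerCfg:
    attn: AttnCfg

def make_model_cfg(n_layers: int, dim: int) -> list:
    return [LayerCfg(attn=AttnCfg(dim=dim)) for _ in range(n_layers)]

def enable_rope(layers: list) -> list:
    # One local change propagates to every attention module through its
    # config, instead of editing each module's source file.
    return [replace(layer, attn=replace(layer.attn, pos_emb=PosEmbCfg("rope")))
            for layer in layers]

cfg = enable_rope(make_model_cfg(n_layers=100, dim=512))
assert all(layer.attn.pos_emb.kind == "rope" for layer in cfg)
```

Because modules only see each other through configs, a feature flip touches one function rather than every call site, which is the flavor of the constant-complexity claim above.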
FlexScale: Flexible and High-Performance FSDP at Scale
Zezhou Wang ⋅ Youjie Li ⋅ Zhiqi Lin ⋅ Jiacheng Yang ⋅ Cong Xie ⋅ Zheng Zhong ⋅ Hongyu Zhu ⋅ Zhi Zhang ⋅ Xin Liu ⋅ Yanghua Peng
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, valued for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods—e.g., block-wise quantized training—and with optimizers such as Shampoo and Muon used in cutting-edge models (e.g., Gemini, Kimi K2): FSDP's fixed element- or row-wise sharding formats conflict with these block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce FlexScale, a redesigned FSDP framework that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. FlexScale natively supports the efficient data placement FSDP requires and accommodates non-element-wise optimizers and block-wise quantization. As a result, FlexScale achieves 5–66% higher throughput and 16–30% lower memory usage than existing FSDP systems, while scaling efficiently to 30K GPUs. FlexScale has been battle-tested in production and will be open-sourced to the MLSys community upon acceptance.
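The core constraint behind a block-aware sharding format can be illustrated with a toy function (a simplified sketch under assumed semantics, not FlexScale's actual RaggedShard layout): shard boundaries are snapped to the quantization block size, so shards may be unequal but never split a block.

```python
def ragged_block_shard(n_rows: int, block: int, world: int):
    """Split n_rows across `world` ranks so that every shard boundary is a
    multiple of `block`, keeping block-wise quantization groups intact.
    Shards may be unequal ("ragged"), unlike fixed even sharding.
    Illustrative sketch only; not FlexScale's actual RaggedShard layout."""
    n_blocks = -(-n_rows // block)  # ceil division
    per_rank = [n_blocks // world + (r < n_blocks % world) for r in range(world)]
    bounds, start = [], 0
    for p in per_rank:
        end = min(start + p * block, n_rows)
        bounds.append((start, end))
        start = end
    return bounds
```

For `n_rows=1000, block=128, world=4` this yields shards of 256, 256, 256, and 248 rows, each starting on a 128-row block boundary, so per-block scales never straddle two ranks.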
FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost
Chenhao Feng ⋅ Haoli Zhang ⋅ Shakhzod Ali-zade ⋅ Yanli Zhao ⋅ Liang Luo ⋅ Jennifer Cao ⋅ Lisen Deng ⋅ Chenyu Zhao ⋅ Tiantu Xu ⋅ Yi Zhang ⋅ Evgenii Kolpakov ⋅ Siqi Yan ⋅ Chuanhao Zhuge ⋅ Min Ni ⋅ Bi Xue ⋅ Qunshu Zhang ⋅ Shen Li
Modern industrial Deep Learning Recommendation Models typically extract user preferences through the analysis of sequential interaction histories, subsequently generating predictions based on these derived interests. The inherent heterogeneity in data characteristics frequently results in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the straggler problem through meticulously load-balanced input samples, (2) minimize blocking communication by overlapping prioritized embedding communications with computations, and (3) resolve GPU resource competition during computation–communication overlap by communicating through SM-free techniques. Empirical evaluation demonstrates that FreeScale achieves up to a 90.3% reduction in computational bubbles when applied to real-world workloads running on 256 H100 GPUs.
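Point (1), balancing variable-length input samples across ranks, can be sketched with the classic greedy longest-processing-time heuristic (an illustrative stand-in, not FreeScale's actual balancing algorithm):

```python
import heapq

def balance_sequences(lengths, n_ranks):
    """Assign variable-length samples to ranks with the greedy
    longest-processing-time heuristic: largest sample first, always onto
    the currently least-loaded rank. An illustrative stand-in for
    FreeScale's load-balancing step, not its actual algorithm."""
    heap = [(0, r) for r in range(n_ranks)]  # (current load, rank)
    heapq.heapify(heap)
    buckets = [[] for _ in range(n_ranks)]
    for seq_id, length in sorted(enumerate(lengths), key=lambda t: -t[1]):
        load, rank = heapq.heappop(heap)
        buckets[rank].append(seq_id)
        heapq.heappush(heap, (load + length, rank))
    return buckets
```

Equalizing per-rank work this way shrinks the idle "bubble" each step, since synchronous training proceeds at the pace of the slowest rank.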
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
Jiyuan Zhang ⋅ Yining Liu ⋅ Siqi Yan ⋅ Lisen Deng ⋅ Jennifer Cao ⋅ Shuqi Yang ⋅ Bi Xue ⋅ Min Ni ⋅ Shen Li
The pervasive “memory wall” bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads—driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movement that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures that eliminates intermediate buffers and activation materialization, and (ii) co-designed kernels with smart activation checkpointing that mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze achieves over 4× speedups and over 50% memory savings compared to existing MoE frameworks. MoEBlaze has been deployed in Meta's recommendation production systems.
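The first idea, replacing materialized per-expert routing buffers with index-based dispatch, can be sketched as follows (a simplified NumPy illustration of the general technique, not MoEBlaze's actual kernels):

```python
import numpy as np

def dispatch_indices(expert_ids: np.ndarray, n_experts: int):
    """Group tokens by assigned expert with one stable argsort, so expert
    GEMMs can read contiguous slices of a single gathered buffer instead of
    materializing padded per-expert routing buffers. A simplified NumPy
    illustration of the idea, not MoEBlaze's actual kernels."""
    order = np.argsort(expert_ids, kind="stable")       # token permutation
    counts = np.bincount(expert_ids, minlength=n_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))  # slice boundaries
    return order, offsets
```

Expert `e` then processes rows `order[offsets[e]:offsets[e+1]]` of the token buffer, so the only extra state is one index vector and one offset vector rather than per-expert copies of the activations.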
NodeSweep: Practical Straggler Detection and Health Monitoring for Large-Scale Foundation Model Training
Guanliang Liu ⋅ Zoe Zeng ⋅ Cong Cheng ⋅ Alexander Zhipa ⋅ Ashvin Nihalani ⋅ Binxuan Huang
As foundation model training scales to thousands of GPUs, maintaining consistent node performance becomes increasingly critical. Traditional health-checking methods such as NCCL or burn-in tests often fail to capture subtle performance degradations that can significantly impact large-scale training efficiency. In this paper, we present a comprehensive node health monitoring framework that integrates real-time performance tracking with a novel offline node sweep mechanism. Our approach effectively identifies problematic nodes that traditional methods overlook, especially under complex communication patterns common in distributed training. Extensive evaluations on production workloads show that our method improves mean FLOPS utilization (MFU) by up to 1.7×, reduces run-to-run variance from 20% to 1%, and increases the mean time to failure (MTTF) while reducing human intervention time. These improvements translate to substantial gains in training efficiency. The proposed solution is both practical and scalable, making it particularly valuable for production-scale foundation model training.
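A minimal illustration of sweep-style straggler detection (a toy threshold rule with hypothetical names and units, far simpler than the paper's framework): flag nodes whose measured bandwidth falls well below the fleet median.

```python
import statistics

def find_stragglers(node_gbps, rel_threshold=0.9):
    """Flag nodes whose measured bus bandwidth falls below a fraction of
    the fleet median. A toy threshold rule standing in for the offline
    node-sweep checks described above; names and units are hypothetical."""
    median = statistics.median(node_gbps.values())
    return sorted(node for node, gbps in node_gbps.items()
                  if gbps < rel_threshold * median)
```

A median-relative test catches the "subtly slow but not failing" nodes that pass/fail burn-in checks miss, since a node can pass an absolute health check yet still drag down a synchronous collective.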
Sparing Strategies to Minimize Reliability Impact On Large Training Jobs
Ehsan K. Ardestani ⋅ Zhaodong Wang ⋅ Xu Zhang ⋅ Ying Zhang
Training large language models (LLMs) on Meta’s AI clusters requires running long, distributed jobs that are vulnerable to hardware failures. To maintain high availability and efficiency, production systems use a sparing strategy, i.e., pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing strategy, including compute block size, number of spare blocks, and spare GPU trays, is complex and directly impacts cluster performance and reliability. We present an analytical framework with closed-form expressions to guide sparing strategy decisions, making practical, first-order recommendations for production environments. We also develop a simulation component to cross-validate the analytical model. Applied in Meta’s hyperscale infrastructure, this model helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training. Our real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to Meta’s AI operations.
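The flavor of such closed-form reasoning can be sketched as follows (a first-order model under simplifying independence assumptions of my own, not the paper's actual expressions): the probability that a job stays healthy over a horizon is the probability that no more than the spare count of blocks fail.

```python
from math import comb, exp

def p_job_healthy(n_blocks: int, spares: int, lam: float, horizon: float) -> float:
    """Probability that at most `spares` of `n_blocks` compute blocks fail
    within `horizon`, assuming independent exponential failures with rate
    `lam` per block (so each block fails with probability 1 - e^(-lam*T)).
    A first-order sketch in the spirit of the abstract's closed-form
    analysis; the simplifying assumptions are illustrative."""
    p_fail = 1.0 - exp(-lam * horizon)
    return sum(comb(n_blocks, k) * p_fail ** k * (1.0 - p_fail) ** (n_blocks - k)
               for k in range(spares + 1))
```

Sweeping `spares` against a target availability in a model like this is the kind of first-order trade-off (spare capacity cost vs. expected downtime) that the framework is built to answer.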