Session
Industry Track Oral Presentation: LLM Training 4
Grand Ballroom 1
Moderator: Yanghua Peng
Sparing Strategies to Minimize Reliability Impact On Large Training Jobs
Kevin Quirk ⋅ Matthew Lennie ⋅ Ehsan K. Ardestani ⋅ Satyajeet Ahuja ⋅ Matthew Bergeron ⋅ Andrew Grier ⋅ Zhaodong Wang ⋅ Mustafa Ozdal ⋅ Xu Zhang ⋅ Abhinav Triguna ⋅ Ying Zhang ⋅ Mathew Oldham ⋅ Chunqiang Tang
Training large language models (LLMs) on Meta’s AI clusters requires running long, distributed jobs that are vulnerable to hardware failures. To maintain high availability and efficiency, production systems use sparing, i.e., pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing strategy, including compute block size, number of spare blocks, and spare GPU trays, is complex and directly impacts cluster performance. We present an analytical framework with closed-form expressions to guide sparing strategy decisions, making practical, first-order recommendations for production environments. We also develop a simulation component to cross-validate the analytical model. Applied in Meta’s hyperscale infrastructure, this model helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training. Our real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to Meta’s AI operations.
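To make the sparing trade-off concrete, here is a minimal sketch of one possible first-order calculation. It is not the closed-form model from the paper. It assumes that each active compute block fails independently with the same probability during a repair window (a binomial model), and the names `prob_spares_sufficient`, `min_spares`, `p_fail`, and `target` are hypothetical.

```python
from math import comb


def prob_spares_sufficient(active_blocks: int, spare_blocks: int, p_fail: float) -> float:
    """Probability that the number of failed active blocks is at most spare_blocks.

    Assumption (not from the paper): failures are independent and identically
    distributed, so the failure count follows Binomial(active_blocks, p_fail).
    """
    upper = min(spare_blocks, active_blocks)
    return sum(
        comb(active_blocks, k) * p_fail**k * (1 - p_fail) ** (active_blocks - k)
        for k in range(upper + 1)
    )


def min_spares(active_blocks: int, p_fail: float, target: float) -> int:
    """Smallest spare count whose coverage probability reaches the target."""
    for spares in range(active_blocks + 1):
        if prob_spares_sufficient(active_blocks, spares, p_fail) >= target:
            return spares
    return active_blocks
```

Under these assumptions, calling `min_spares(100, 0.01, 0.99)` asks how many spare blocks are needed so that 100 active blocks, each with a 1% failure chance per window, stay covered with at least 99% probability. A production model would also need to account for block size, repair time, and correlated failures.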
veScale-FSDP: Flexible and High-Performance FSDP at Scale
Zezhou Wang ⋅ Youjie Li ⋅ Zhiqi Lin ⋅ Jiacheng Yang ⋅ Cong Xie ⋅ Guanyu Feng ⋅ ZHENG ZHONG ⋅ Ziyue Huang ⋅ Hongyu Zhu ⋅ Zhi Zhang ⋅ Yanghua Peng ⋅ Xin Liu
Fully Sharded Data Parallel (FSDP), also known as Zero Redundancy Optimizer (ZeRO), is widely used for large-scale model training, because of its memory efficiency and minimal intrusion on model code. However, existing FSDP systems rely on fixed element-wise or row-wise sharding formats that conflict with block-structured computations. As a result, they struggle to support modern structure-aware training methods, including block-wise quantization and non-element-wise optimizers such as Shampoo and Muon. In addition, today's implementations incur communication and memory overheads that degrade efficiency at the scale of tens of thousands of GPUs. We introduce veScale-FSDP, a novel FSDP system that combines RaggedShard, a flexible sharding format, with a structure-aware planning algorithm to deliver both flexibility and performance. veScale-FSDP enables zero-copy FSDP communications and natively supports block-wise quantization and non-element-wise optimizers, achieving 5% to 66% higher throughput and 16% to 30% lower memory usage than existing FSDP systems, while scaling efficiently to tens of thousands of GPUs.
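The conflict between even element-wise sharding and block-structured computation can be illustrated with a small sketch. This is not the RaggedShard format or the veScale-FSDP planning algorithm. It only shows, under the assumption of a flat parameter of `numel` elements processed in blocks of `block` elements, how shard boundaries can be moved to block boundaries at the cost of uneven shard sizes. All function names are hypothetical.

```python
def even_shards(numel: int, world_size: int) -> list[tuple[int, int]]:
    """Element-wise sharding: every rank gets ceil(numel / world_size) elements."""
    per_rank = -(-numel // world_size)  # ceiling division
    return [
        (min(r * per_rank, numel), min((r + 1) * per_rank, numel))
        for r in range(world_size)
    ]


def block_aligned_shards(numel: int, block: int, world_size: int) -> list[tuple[int, int]]:
    """Uneven sharding whose boundaries always fall on block boundaries."""
    num_blocks = -(-numel // block)
    base, extra = divmod(num_blocks, world_size)
    bounds = []
    start_block = 0
    for r in range(world_size):
        # The first `extra` ranks take one additional block each.
        end_block = start_block + base + (1 if r < extra else 0)
        bounds.append((min(start_block * block, numel), min(end_block * block, numel)))
        start_block = end_block
    return bounds


def splits_a_block(bounds: list[tuple[int, int]], block: int, numel: int) -> bool:
    """True if any interior shard boundary cuts through a block."""
    return any(end % block != 0 and end != numel for _, end in bounds)
```

For example, with `numel=1000`, `block=128`, and three ranks, the even split places a boundary at element 334, inside a block, while the block-aligned split places boundaries at 384 and 768.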
AXLearn: Modular, Hardware-Agnostic Large Model Training
Mark Lee ⋅ Chang Lan ⋅ Tom Gunter ⋅ John Peebles ⋅ Hanzhi Zhou ⋅ Xuan Zou ⋅ Sneha Bangalore ⋅ Chung-Cheng Chiu ⋅ Nan Du ⋅ Xianzhi Du ⋅ Philipp Dufter ⋅ Liang He ⋅ Ruixuan Hou ⋅ Haoshuo Huang ⋅ Dongseong Hwang ⋅ Xiang Kong ⋅ Jinhao Lei ⋅ Tao Lei ⋅ Meng Li ⋅ Li Li ⋅ Jiarui Lu ⋅ Zhiyun Lu ⋅ Yiping Ma ⋅ David Qiu ⋅ Vivek Rathod ⋅ Senyu Tong ⋅ Zhucheng Tu ⋅ Chong Wang ⋅ Jianyu Wang ⋅ Yongqiang Wang ⋅ Zirui Wang ⋅ Floris Weers ⋅ Sam Wiseman ⋅ Guoli Yin ⋅ Bowen Zhang ⋅ Xiyou Zhou ⋅ Danyang Zhuo ⋅ Cheng Leong ⋅ Ruoming Pang
AXLearn is a production system which facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-the-art deep learning systems, AXLearn has a unique focus on modularity and support for hardware-agnostic training. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on different hardware infrastructure. AXLearn maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in state-of-the-art training systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundreds of modules with just 10 lines of code, compared to hundreds, as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. We also share our experience in the development and operation of AXLearn at Apple.
Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training
guanliang liu ⋅ Abhinandan Patni ⋅ congzhu lin ⋅ Zoe Zeng ⋅ Jack Wittmayer ⋅ Yinghong Liu ⋅ josh wu ⋅ Anthony Ko ⋅ Alexander Zhipa ⋅ Ashvin Nihalani ⋅ Binxuan Huang ⋅ Cong Cheng ⋅ Mi Sun ⋅ Vijay rajakumar ⋅ Rejith Joseph ⋅ Parthasarathy Govindarajen
Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7×, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.
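As a rough illustration of online fail-slow detection, the sketch below flags nodes whose step time is well above the fleet median. This is not Guard's detection logic. The function name `find_stragglers`, the per-node step-time input, and the 1.2× threshold are all assumptions chosen for the example.

```python
from statistics import median


def find_stragglers(step_times: dict[str, float], threshold: float = 1.2) -> list[str]:
    """Return nodes whose step time exceeds threshold times the fleet median.

    Assumption: step_times maps a node name to its most recent training step
    time in seconds. The median is used because it is robust to a few slow nodes.
    """
    fleet_median = median(step_times.values())
    return sorted(
        node for node, seconds in step_times.items() if seconds > threshold * fleet_median
    )


if __name__ == "__main__":
    times = {"node0": 1.00, "node1": 1.02, "node2": 0.98, "node3": 1.50}
    print(find_stragglers(times))
```

A functional health check would pass all four nodes in this example, since each one completes its step. Only a relative performance comparison exposes the slow node.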
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
Jiyuan Zhang ⋅ Yining Liu ⋅ Siqi Yan ⋅ Lisen Deng ⋅ Jennifer Cao ⋅ Shuqi Yang ⋅ Bi Xue ⋅ Min Ni ⋅ Shen Li
The pervasive “memory wall” bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads, driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movement that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materialization, and (ii) co-designed kernels with smart activation checkpointing to reduce memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4× speedups and over 50% memory savings compared to existing MoE frameworks. MoEBlaze has been deployed in Meta recommendation production.
FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost
Chenhao Feng ⋅ Haoli Zhang ⋅ Shakhzod Ali-zade ⋅ Yanli Zhao ⋅ Liang Luo ⋅ Jennifer Cao ⋅ Lisen Deng ⋅ Siqiao Chen ⋅ Chenyu Zhao ⋅ Tristan Rice ⋅ Daniel Johnson ⋅ Min Si ⋅ Tiantu Xu ⋅ Yi Zhang ⋅ Evgenii Kolpakov ⋅ Siqi Yan ⋅ Chuanhao Zhuge ⋅ Min Ni ⋅ Bi Xue ⋅ Qunshu Zhang ⋅ Shen Li
Modern industrial Deep Learning Recommendation Models typically extract user preferences through the analysis of sequential interaction histories, subsequently generating predictions based on these derived interests. The inherent heterogeneity in data characteristics frequently results in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the straggler problem through meticulously load-balanced input samples, (2) minimize blocking communication by overlapping prioritized embedding communications with computations, and (3) resolve GPU resource competition during computation and communication overlapping by communicating through SM-Free techniques. Empirical evaluation demonstrates that FreeScale achieves up to 90.3% reduction in computational bubbles when applied to real-world workloads running on 256 H100 GPUs.
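The idea of load-balancing input samples to reduce stragglers can be sketched with a standard greedy heuristic (longest-processing-time-first). This is not FreeScale's balancing method. The sketch assumes that a sample's cost is proportional to its sequence length, which is a simplification, and the name `balance_samples` is hypothetical.

```python
import heapq


def balance_samples(seq_lengths: list[int], num_ranks: int) -> list[list[int]]:
    """Assign sample indices to ranks, always giving the next-longest sample
    to the currently least-loaded rank.

    Assumption: per-sample cost equals sequence length. A real system might
    use a measured or model-specific cost instead.
    """
    heap = [(0, rank) for rank in range(num_ranks)]  # (current load, rank id)
    assignment: list[list[int]] = [[] for _ in range(num_ranks)]
    longest_first = sorted(range(len(seq_lengths)), key=lambda i: -seq_lengths[i])
    for idx in longest_first:
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + seq_lengths[idx], rank))
    return assignment


if __name__ == "__main__":
    lengths = [8, 7, 6, 5, 4]
    for rank, indices in enumerate(balance_samples(lengths, 2)):
        print(rank, indices, sum(lengths[i] for i in indices))
```

Balancing per-rank load narrows the gap between the fastest and slowest rank in a synchronous step, which is where the computational bubbles described above come from.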