NodeSweep: Practical Straggler Detection and Health Monitoring for Large-Scale Foundation Model Training
Abstract
As foundation model training scales to thousands of GPUs, maintaining consistent node performance becomes increasingly critical. Traditional health checks, such as NCCL bandwidth tests or burn-in stress tests, often fail to capture subtle performance degradations that can significantly impact large-scale training efficiency. In this paper, we present a comprehensive node health monitoring framework that integrates real-time performance tracking with a novel offline node sweep mechanism. Our approach identifies problematic nodes that traditional methods overlook, especially under the complex communication patterns common in distributed training. Extensive evaluations on production workloads show that our method improves model FLOPs utilization (MFU) by up to 1.7×, reduces run-to-run variance from 20% to 1%, increases the mean time to failure (MTTF), and reduces human intervention time. These improvements translate to substantial gains in training efficiency. The proposed solution is both practical and scalable, making it particularly valuable for production-scale foundation model training.