FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy
Abstract
Large language model (LLM) training has become a critical workload in shared GPU clusters. However, our observations reveal that these clusters suffer from significant underutilization. To address this inefficiency, various elastic training techniques have been developed that dynamically adjust GPU allocations to harness idle resources. Despite their potential, these methods have seen limited deployment in production environments due to three major challenges: accuracy inconsistency, excessive profiling overhead, and limited flexibility. In this paper, we propose FlexTrain, an elastic training system that achieves consistent model accuracy, high training efficiency, and effective resource utilization. FlexTrain prioritizes adjusting the pipeline parallelism (PP) degree to preserve deterministic computation and thus accuracy consistency, while also supporting data parallelism (DP) scaling to further improve throughput when consistency requirements are relaxed. It generates optimal PP schedules, predicts training performance under different configurations, and makes scaling decisions based on job submission intervals, scaling overhead, and expected throughput gains. Evaluation results show that FlexTrain achieves up to a 1.73x speedup for elastic jobs while preserving consistent accuracy, and up to 2.27x when accuracy consistency is relaxed, compared with CompanyX's current scheduling strategy.