Poster
Morphling: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
Xinyi Zhang · Hanyu Zhao · Wencong Xiao · Xianyan Jia · Fei Xu · Yong Li · Wei Lin · Fangming Liu
The era of large deep learning models has led to advanced training strategies such as 3D parallelism and the ZeRO series. These strategies enable a variety of (re-)configurable execution plans, each with markedly different requirements across multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they rely on users to choose execution plans statically, and then allocate resources without considering the chosen plans or their resource requirements. This approach results in mismatches between execution plans and resources, causing suboptimal training performance and cluster utilization. We introduce Morphling, a cluster scheduling system for deep learning training that exploits this reconfigurability to improve job performance and cluster efficiency. Morphling incorporates job execution planning as a new dimension in cluster scheduling by continuously reconfiguring jobs’ execution plans and jointly tuning multi-resource allocations across jobs. This co-optimization is guided by a performance model that captures the diverse resource requirements and performance characteristics of different jobs and execution plans. Morphling uses this model to make performance-aware scheduling decisions that maximize cluster throughput while providing performance guarantees to individual jobs. Evaluations on a 64-GPU high-performance training cluster show that Morphling improves average job completion time and makespan by up to 3.2× and 1.4×, respectively, compared against state-of-the-art systems.
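The idea of choosing execution plans and resource allocations jointly, guided by a performance model, can be illustrated with a toy sketch. This is not Morphling's algorithm or performance model: the `Plan` fields, the power-law `throughput` function, the plan names, and the greedy marginal-gain allocation are all assumptions made for illustration, and only GPUs are modeled rather than multiple resource types.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Plan:
    name: str        # hypothetical label for an execution plan
    min_gpus: int    # assumed: fewest GPUs on which the plan can run
    base: float      # assumed: throughput coefficient (samples/s)
    scaling: float   # assumed: scaling exponent in (0, 1]


def throughput(plan: Plan, gpus: int) -> float:
    """Toy performance model: zero if infeasible, else power-law scaling."""
    if gpus < plan.min_gpus:
        return 0.0
    return plan.base * gpus ** plan.scaling


def best_plan(plans: List[Plan], gpus: int) -> Tuple[Optional[Plan], float]:
    """Pick the plan with the highest modeled throughput on `gpus` GPUs."""
    best = max(plans, key=lambda p: throughput(p, gpus))
    rate = throughput(best, gpus)
    return (best if rate > 0 else None), rate


def schedule(jobs: Dict[str, List[Plan]],
             total_gpus: int) -> Dict[str, Tuple[int, Optional[str]]]:
    """Greedily hand out GPUs one at a time to the job whose best achievable
    throughput (over all of its plans) grows the most from one more GPU."""
    if not jobs:
        return {}
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        def gain(name: str) -> float:
            cur = best_plan(jobs[name], alloc[name])[1]
            nxt = best_plan(jobs[name], alloc[name] + 1)[1]
            return nxt - cur
        winner = max(alloc, key=gain)
        alloc[winner] += 1
    result = {}
    for name, gpus in alloc.items():
        # Re-select the plan for the final allocation (the "reconfiguration").
        plan, _ = best_plan(jobs[name], gpus)
        result[name] = (gpus, plan.name if plan else None)
    return result


if __name__ == "__main__":
    demo_jobs = {
        "job_a": [Plan("data_parallel", 1, 10.0, 0.5),
                  Plan("tensor_parallel", 4, 5.0, 1.0)],
        "job_b": [Plan("zero_stage_3", 1, 6.0, 0.8)],
    }
    print(schedule(demo_jobs, total_gpus=8))
```

In this sketch, a job's preferred plan changes with its allocation: `data_parallel` wins at one GPU, while `tensor_parallel` wins at eight. That is the kind of plan/resource mismatch a statically chosen plan cannot avoid. The one-GPU-at-a-time greedy step is a simplification; it cannot see gains that only appear after several GPUs are added at once, and it offers no per-job performance guarantee.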