HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
Ran Yan ⋅ Youhe Jiang ⋅ Xiaonan Nie ⋅ Fangcheng Fu ⋅ Bin Cui ⋅ Binhang Yuan
Abstract
Training large language models (LLMs) is a computationally intensive task that is typically conducted in data centers equipped with homogeneous high-performance GPUs. In this paper, we explore an alternative approach: deploying training computations across heterogeneous GPUs to enable greater flexibility and efficiency in heterogeneous resource utilization. To this end, we propose a novel system, HexiScale, that flexibly supports asymmetric partitioning of training computations across data, pipeline, and tensor model parallelism. We further formalize the allocation of asymmetrically partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B parameters), empirical results demonstrate that: (i) compared to state-of-the-art homogeneous baselines running on homogeneous GPUs, HexiScale achieves similar performance when running on heterogeneous GPUs with the same theoretical FLOPS; and (ii) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers 1.5× to 2.4× higher throughput.
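To make the partitioning idea concrete, the sketch below illustrates, in a hedged and simplified form, what a hierarchical (two-level) asymmetric allocation over heterogeneous GPUs might look like: GPUs are first grouped by machine, then each group is assigned a number of transformer layers proportional to its aggregate compute. This is an illustrative assumption only; the function names, the tuple-based input format, and the proportional-allocation heuristic are ours, not HexiScale's actual optimization algorithm, which the paper develops as a constrained optimization solved by hierarchical graph partitioning.

```python
# Illustrative toy heuristic for asymmetric pipeline partitioning over
# heterogeneous GPUs. Assumption-based sketch, NOT HexiScale's algorithm.
from collections import defaultdict

def partition_layers(gpus, num_layers):
    """Assign transformer layers to machines in proportion to the
    aggregate FLOPS of the GPUs on each machine.

    gpus: list of (machine_id, tflops) tuples
    num_layers: total number of transformer layers in the model
    """
    # Level 1: group GPUs by machine (fast intra-machine interconnect).
    flops_per_machine = defaultdict(float)
    for machine_id, tflops in gpus:
        flops_per_machine[machine_id] += tflops

    total = sum(flops_per_machine.values())
    # Level 2: split layers across machines proportionally to compute,
    # rounding while keeping the total layer count exact.
    allocation, assigned = {}, 0
    machines = sorted(flops_per_machine)
    for i, m in enumerate(machines):
        if i == len(machines) - 1:
            allocation[m] = num_layers - assigned  # remainder to last stage
        else:
            share = round(num_layers * flops_per_machine[m] / total)
            allocation[m] = share
            assigned += share
    return allocation

# Example: one machine with two fast GPUs and one with four weaker
# GPUs, splitting a 32-layer model asymmetrically.
gpus = [("m0", 312.0), ("m0", 312.0), ("m1", 65.0), ("m1", 65.0),
        ("m1", 65.0), ("m1", 65.0)]
print(partition_layers(gpus, 32))  # -> {'m0': 23, 'm1': 9}
```

In this toy version the faster machine receives proportionally more layers, which mirrors the abstract's goal of fully leveraging heterogeneous computational power; the real system additionally accounts for communication costs and the interplay of data, pipeline, and tensor model parallelism.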