Session
Research-Track Oral Presentation: R7: LLM Training
Grand Ballroom 1
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
Luca Colagrande ⋅ Lorenzo Leone ⋅ Chen Wu ⋅ … ⋅ Luca Benini
The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores’ computational resources, enabling high-throughput in-network reductions with a small 16.5% router area overhead. Through in-network hardware acceleration, we achieve 2.9× and 2.5× geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 2.1× and 2.1× estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture.
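The dataflow behind the in-network reductions described above can be illustrated in software; the sketch below is only an analogue of the data movement (the paper's DCA performs the combine step in router hardware using the cores' compute resources), with a hypothetical tree topology and integer payloads.

```python
# Software analogue of an in-network reduction: partial values are
# combined at each router on the way up a reduction tree, so the root
# receives a single reduced value instead of one message per core.
# (Illustrative only; DCA does this combine in router hardware.)

def in_network_reduce(tree, values, node=0):
    # tree: node -> list of child nodes; leaves hold contributions in `values`.
    children = tree.get(node, [])
    acc = values.get(node, 0)
    for c in children:
        acc += in_network_reduce(tree, values, c)  # router-side combine
    return acc

# 2-level tree: root 0 with routers 1 and 2, each aggregating two leaf cores.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
leaf_vals = {3: 1, 4: 2, 5: 3, 6: 4}
print(in_network_reduce(tree, leaf_vals))  # 10
```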
DreamDDP: Accelerating Low-Bandwidth Geo-Distributed LLM Training with Layer-wise Partial Synchronization
Zhenheng Tang ⋅ Zichen Tang ⋅ Junlin Huang ⋅ Xinglin Pan ⋅ Rudan Yan ⋅ Yuxin Wang ⋅ … ⋅ Shaohuai Shi ⋅ Xiaowen Chu
Scaling up the training of large language models (LLMs) in both compute and data motivates distributed training across geo-distributed data centers. Communication in geo-distributed data-parallel (DDP) training with synchronous stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Recent studies have successfully applied Local SGD to mitigate the communication overhead and pre-train LLMs across geo-distributed sites. However, we identify that the strict model synchronization mechanism of Local SGD prevents the system from overlapping communication with computation. To overcome this limitation, we expand the design space of Local SGD by decoupling model synchronization layer-wise: in each iteration, only a subset of layers is synchronized, rather than the entire model after a fixed number of iterations. Leveraging this methodology, we introduce DreamDDP, a training framework that accelerates low-bandwidth distributed training with three key innovations: (1) partial Local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; (3) identifying and exploiting three properties to schedule communication and computation based on fine-grained layer-wise profiling, reducing training time. Empirical evaluations conducted on 32 GPUs using prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP improves the convergence properties of Local SGD (and Adam) and achieves speedups ranging from $1.49\times$ to $3.91\times$ over leading baseline methods.
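The layer-wise partial synchronization idea can be sketched as follows. This is an illustrative toy, not DreamDDP's scheduler: the round-robin layer schedule, the scalar "layers", and the `all_reduce_avg` stub standing in for the network collective are all assumptions.

```python
# Sketch of layer-wise partial synchronization: instead of synchronizing
# the full model every H iterations (classic Local SGD), each iteration
# averages only a rotating subset of layers, so the communication for
# those layers can overlap with compute on the remaining layers.

def all_reduce_avg(replicas, layer):
    # Stand-in for the network collective: average one layer across workers.
    avg = sum(r[layer] for r in replicas) / len(replicas)
    for r in replicas:
        r[layer] = avg

def partial_local_sgd_step(replicas, step, layers_per_step):
    num_layers = len(replicas[0])
    # Round-robin schedule (an assumption): pick which layers sync this step.
    start = (step * layers_per_step) % num_layers
    synced = [(start + i) % num_layers for i in range(layers_per_step)]
    for layer in synced:
        all_reduce_avg(replicas, layer)
    return synced

# Two workers, four "layers" (scalars for brevity), one layer synced per step:
workers = [[0.0, 2.0, 4.0, 6.0], [2.0, 4.0, 6.0, 8.0]]
for step in range(4):
    partial_local_sgd_step(workers, step, layers_per_step=1)
# After four steps every layer has been averaged exactly once.
print(workers[0])  # [1.0, 3.0, 5.0, 7.0]
```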
GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading
Jaeyong Song ⋅ Seongyeon Park ⋅ Hongsun Jang ⋅ Jaewon Jung ⋅ Hunseong Lim ⋅ Junguk Hong ⋅ Jinho Lee
Full-graph training of graph neural networks (GNNs) is widely used as it enables direct validation of algorithmic improvements by preserving complete neighborhood information. However, it typically requires multiple GPUs or servers, incurring substantial hardware and inter-device communication costs. While existing single-server methods reduce infrastructure requirements, they remain constrained by GPU and host memory capacity as graph sizes increase. To address this limitation, we introduce **GriNNder**, the first work to leverage storage devices to enable full-graph training even with limited memory. Because modern NVMe SSDs offer multi-terabyte capacities and bandwidths exceeding 10 GB/s, they provide an appealing option when memory resources are scarce. Yet directly applying storage-based methods from other domains fails to address the unique access patterns and data dependencies of full-graph GNN training. GriNNder tackles these challenges with *structured storage offloading (SSO)*, a framework that manages the GPU-host-storage hierarchy through coordinated *cache*, *(re)gather*, and *bypass* mechanisms. To realize the framework, we devise (i) a partition-wise caching strategy for host memory that exploits observed cross-partition dependencies, (ii) a regathering strategy for gradient computation that eliminates redundant storage operations, and (iii) a lightweight partitioning scheme that mitigates the memory requirements of existing graph partitioners. In experiments performed over various models and datasets, GriNNder achieves up to 9.78$\times$ speedup over state-of-the-art baselines and throughput comparable to distributed systems, enabling previously infeasible large-scale full-graph training even on a single GPU.
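The cache-and-bypass part of the hierarchy can be sketched as a fixed-capacity host-memory cache over storage-resident partitions. The class below is a hypothetical illustration, not GriNNder's SSO framework: the LRU eviction policy, the `PartitionCache` name, and the dictionary standing in for NVMe storage are all assumptions.

```python
# Hypothetical sketch of partition-wise caching over a GPU-host-storage
# hierarchy: feature partitions live on storage; a fixed-capacity host
# cache keeps recently used partitions, and requests that miss fall
# through ("bypass") to storage reads.

from collections import OrderedDict

class PartitionCache:
    def __init__(self, capacity, storage):
        self.capacity = capacity          # max partitions held in host memory
        self.storage = storage            # partition_id -> feature block
        self.cache = OrderedDict()        # insertion/recency order: oldest first
        self.storage_reads = 0

    def fetch(self, pid):
        if pid in self.cache:             # host-memory hit
            self.cache.move_to_end(pid)
            return self.cache[pid]
        self.storage_reads += 1           # miss: read from storage
        block = self.storage[pid]
        self.cache[pid] = block
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return block

storage = {p: f"features[{p}]" for p in range(4)}
cache = PartitionCache(capacity=2, storage=storage)
for pid in [0, 1, 0, 2, 0, 1]:            # cross-partition access pattern
    cache.fetch(pid)
print(cache.storage_reads)  # 4 (two hits on partition 0 were served from host memory)
```

Exploiting cross-partition dependencies means choosing capacity and eviction so that partitions needed again soon (like partition 0 above) stay cached.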
Grolar: Efficient LLM Training on Heterogeneous Clusters
Runsheng Guo ⋅ Utkarsh Anand ⋅ Khuzaima Daudjee ⋅ Rathijit Sen
Large language models (LLMs) require vast amounts of GPU compute to train, but limited availability and high costs of GPUs make homogeneous clusters impractical for many organizations. Instead, assembling heterogeneous clusters by pooling together GPUs of different generations allows them to achieve higher aggregate compute and make use of all available GPUs. However, training on heterogeneous clusters presents significant challenges. The workload must be carefully partitioned so that GPUs with limited compute, memory, or network bandwidth do not bottleneck the training process. Existing heterogeneous training systems cannot do so efficiently since they integrate data, pipeline, and tensor parallelism in a way that trades off communication for memory overhead. Combining vanilla data parallelism with pipeline parallelism is communication-efficient but results in high memory overhead from replicating model parameters. Alternatively, using sharded data parallelism or tensor parallelism reduces memory overhead but increases communication overhead when combined with pipeline parallelism. To address this problem, we designed Grolar, a system that uses Pipeline-Efficient ZeRO DP, a novel integration of pipeline parallelism and data parallelism that is both communication- and memory-efficient. Grolar uses a planner to automatically find an optimized training configuration from the vast search space of possibilities on heterogeneous clusters. Our evaluation shows that Grolar achieves up to 3× higher training throughput than state-of-the-art systems across representative heterogeneous training scenarios.
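The replication-versus-sharding tradeoff described above comes down to simple memory arithmetic. The numbers below are illustrative back-of-the-envelope figures (fp16 parameters only, ignoring optimizer state, gradients, and activations), not Grolar's cost model.

```python
# Back-of-the-envelope parameter-memory arithmetic for the tradeoff the
# abstract describes: vanilla data parallelism replicates parameters on
# every data-parallel rank, while ZeRO-style sharded DP divides them
# across ranks at the cost of extra gather communication.

def param_mem_gb(params_billion, bytes_per_param=2, dp_degree=1, sharded=False):
    total_bytes = params_billion * 1e9 * bytes_per_param
    per_rank = total_bytes / dp_degree if sharded else total_bytes
    return per_rank / 1e9

# A 13B-parameter model in fp16 across 4 data-parallel ranks:
print(param_mem_gb(13, dp_degree=4, sharded=False))  # 26.0 (GB per rank, replicated)
print(param_mem_gb(13, dp_degree=4, sharded=True))   # 6.5  (GB per rank, sharded)
```

On a heterogeneous cluster, the 4x gap per rank decides whether a small-memory GPU can participate at all, which is why a pipeline-friendly sharding scheme matters.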
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
Ran Yan ⋅ Youhe Jiang ⋅ Xiaonan Nie ⋅ Fangcheng Fu ⋅ Bin Cui ⋅ Binhang Yuan
Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (i) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves *similar* performance when running over heterogeneous GPUs with the *same* theoretical FLOPS; (ii) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers $1.5\times$ to $2.4\times$ higher throughput.
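The core intuition of asymmetric partitioning can be shown with a simple proportional split: assign each GPU work in proportion to its throughput so all devices finish a stage at roughly the same time. This is only a one-dimensional illustration of the objective (HexiScale solves a richer constrained optimization with a hierarchical graph partitioner); the largest-remainder rounding and the relative FLOPS figures are assumptions.

```python
# Illustrative load-balancing arithmetic for asymmetric partitioning:
# give each GPU a share of layers proportional to its throughput, then
# round while preserving the total (largest-remainder method).

def proportional_split(total_layers, flops):
    shares = [total_layers * f / sum(flops) for f in flops]
    floors = [int(s) for s in shares]
    remainder = total_layers - sum(floors)
    # Hand leftover layers to the GPUs with the largest fractional share.
    order = sorted(range(len(flops)),
                   key=lambda i: shares[i] - floors[i], reverse=True)
    for i in order[:remainder]:
        floors[i] += 1
    return floors

# Two fast and two slow GPUs (hypothetical relative FLOPS 3:1), 32 layers:
print(proportional_split(32, [3, 3, 1, 1]))  # [12, 12, 4, 4]
```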
RDMA Point-to-Point Communication for LLM Systems
Nandor Licker ⋅ Kevin Hu ⋅ Vladimir Zaytsev ⋅ Lequn Chen
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions on the network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) an MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.
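The completion-notification semantics described above can be sketched as follows. This is a hypothetical model of the idea, not the TransferEngine API: the receiver counts received immediates per transfer and declares completion when the count reaches the expected total, regardless of arrival order across NICs or network paths.

```python
# Sketch of an ImmCounter-style completion notification: a sender issues
# one-sided writes each carrying an immediate value; the receiver does
# not rely on arrival order, only on the count of immediates observed
# for a given transfer reaching the expected total.

from collections import defaultdict

class ImmCounter:
    def __init__(self):
        self.counts = defaultdict(int)

    def on_write_imm(self, imm):
        # Invoked once per completed WriteImm; arrival order is irrelevant.
        self.counts[imm] += 1

    def is_complete(self, imm, expected):
        return self.counts[imm] >= expected

# A 4-chunk transfer tagged with immediate value 7, with chunks landing
# out of order (e.g. sprayed across multiple NICs):
rx = ImmCounter()
for chunk in [2, 0, 3, 1]:
    rx.on_write_imm(7)
print(rx.is_complete(7, expected=4))  # True
```

Counting rather than ordering is what lets the scheme run over transports without delivery-order guarantees, such as EFA's scalable reliable datagrams.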