Session 2: Parallel and Distributed Systems
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Carlo Siebenschuh · Kyle Hippe · Ozan Gokdemir · Alexander Brace · Arham Khan · Khalid Hossain · Yadu Babuji · Nicholas Chia · Venkatram Vishwanath · Arvind Ramanathan · Rick Stevens · Ian Foster · Robert Underwood
Language models for scientific tasks are trained on text from scientific publications, most of which are distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on 1) its computational cost and 2) the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates the hardware requirements and (aligned) predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, compared to state-of-the-art parsers, improves throughput by 17× while still achieving comparable accuracy (in fact, 0.2% better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/.
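The core selection logic can be pictured as a cost-aware accuracy threshold: prefer the cheapest parser whose predicted output quality is good enough, and escalate to an expensive ML parser only when needed. The sketch below is a minimal illustration of that idea, not the actual AdaParse implementation; the parser names, relative costs, accuracy predictor, and threshold are placeholder assumptions.

```python
# Minimal sketch of accuracy/cost-aware parser assignment (not the AdaParse API).
# Parser names, costs, and the accuracy predictor are illustrative placeholders.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Parser:
    name: str
    cost: float          # relative compute cost per document (assumed units)

def assign_parser(doc_features,
                  parsers: Sequence[Parser],
                  predict_accuracy: Callable[[object, str], float],
                  min_accuracy: float = 0.9) -> Parser:
    """Pick the cheapest parser whose predicted accuracy clears the threshold;
    fall back to the most accurate parser if none does."""
    for p in sorted(parsers, key=lambda p: p.cost):
        if predict_accuracy(doc_features, p.name) >= min_accuracy:
            return p
    return max(parsers, key=lambda p: predict_accuracy(doc_features, p.name))

# Toy usage with a stubbed predictor (AdaParse instead trains its predictor on
# DPO-aligned human preferences over parser outputs).
parsers = [Parser("pypdf", 1.0), Parser("nougat", 40.0)]
stub = lambda feats, name: 0.95 if (name == "pypdf" and feats["simple"]) else 0.85
print(assign_parser({"simple": True}, parsers, stub).name)   # -> pypdf
```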
Context Parallelism for Scalable Million-Token Inference
Amy Yang · Jingyi Yang · Aya Ibrahim · Xinfeng Xie · Bangsheng Tang · Grigory Sizov · Jongsoo Park · Jianyu Huang
We present context parallelism for long-context large language model inference, which achieves near-linear scaling of long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M-context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K-context prefill in 3.8s. We develop two lossless, exact ring attention variants, pass-KV and pass-Q, that cover a wide range of use cases with state-of-the-art performance: full prefill, persistent-KV prefill, and decode. Benchmarks on H100 GPU hosts interconnected with RDMA and with TCP show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
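The pass-KV idea can be approximated on a single process as a ring in which each rank keeps its query shard while key/value blocks rotate, with partial attention outputs merged by log-sum-exp rescaling so the final result matches full attention. The NumPy sketch below is a hedged simulation of that merging logic only; it omits causal masking, real inter-GPU communication, and the pass-Q variant, and is not the authors' implementation.

```python
import numpy as np

def ring_attention_pass_kv(q_shards, k_shards, v_shards):
    """One-process simulation of pass-KV ring attention: each rank keeps its
    Q shard while K/V blocks 'rotate'; partial results are merged with an
    online (log-sum-exp rescaled) softmax, so the output equals full attention."""
    n, d = len(q_shards), q_shards[0].shape[-1]
    outputs = []
    for rank in range(n):                            # one iteration per "GPU"
        q = q_shards[rank]
        acc = np.zeros_like(q)                       # running weighted sum of V
        row_max = np.full(q.shape[0], -np.inf)       # running max logit per query
        row_sum = np.zeros(q.shape[0])               # running softmax denominator
        for step in range(n):
            src = (rank - step) % n                  # K/V block "arriving" this step
            scores = q @ k_shards[src].T / np.sqrt(d)
            new_max = np.maximum(row_max, scores.max(axis=-1))
            rescale = np.exp(row_max - new_max)      # shrink previous contributions
            p = np.exp(scores - new_max[:, None])
            acc = acc * rescale[:, None] + p @ v_shards[src]
            row_sum = row_sum * rescale + p.sum(axis=-1)
            row_max = new_max
        outputs.append(acc / row_sum[:, None])
    return np.concatenate(outputs, axis=0)

# Check against single-device attention on a tiny example (no causal mask).
rng = np.random.default_rng(0)
T, d, shards = 8, 4, 4
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
ref_scores = q @ k.T / np.sqrt(d)
ref = np.exp(ref_scores - ref_scores.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
out = ring_attention_pass_kv(np.split(q, shards), np.split(k, shards), np.split(v, shards))
assert np.allclose(out, ref)
```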
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
Daiyaan Arfeen · Zhen Zhang · Xinwei Fu · Gregory R. Ganger · Yida Wang
Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which often account for 15-30% and can exceed 60% of the training job's GPU allocation. To improve the GPU utilization of PP model training, this paper describes PipeFill, which fills pipeline bubbles with the execution of other pending jobs. By leveraging bubble GPU time, PipeFill reduces the GPU-utilization sacrifice associated with scaling up large-model training. To context-switch between fill jobs and the main training job with minimal overhead to the main job, and to maximize fill-job efficiency, PipeFill carefully fits fill-job work to measured bubble durations and GPU memory availability, introduces explicit pipeline-bubble instructions, and orchestrates the placement and execution of fill jobs within pipeline bubbles. Experiments show that PipeFill can increase overall utilization by up to 63% for GPUs used in large-scale LLM training, with <2% slowdown of the training job, and by 5-15% even for low-scale LLM training. For large-scale LLM training on 8K GPUs, the 63% increase translates to up to 2.6K additional GPUs' worth of work completed.
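The fitting step can be viewed as a small packing problem: given a measured bubble and a queue of pending fill-job micro-steps, pack the steps that fit within the bubble's time and memory budget while reserving room for context switches back to the main job. The sketch below illustrates that idea under assumed profiles; the data structures, safety margin, and greedy policy are placeholders, not PipeFill's actual pipeline-bubble instructions or scheduler.

```python
# Toy greedy fitter in the spirit of filling pipeline bubbles with other work.
# Bubble/step profiles and the context-switch margin are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Bubble:
    duration_ms: float
    free_mem_gb: float

@dataclass
class FillStep:
    job: str
    time_ms: float
    mem_gb: float

def fill_bubble(bubble: Bubble, pending: list[FillStep],
                ctx_switch_ms: float = 1.0) -> list[FillStep]:
    """Pack pending fill-job steps into one bubble, largest-first, leaving
    room for switching into and out of the fill work."""
    budget = bubble.duration_ms - 2 * ctx_switch_ms   # switch in + switch out
    chosen = []
    for step in sorted(pending, key=lambda s: s.time_ms, reverse=True):
        # Steps run one after another, so each only needs to fit free memory.
        if step.time_ms <= budget and step.mem_gb <= bubble.free_mem_gb:
            chosen.append(step)
            budget -= step.time_ms
    return chosen

steps = [FillStep("eval-batch", 40.0, 6.0), FillStep("eval-batch", 40.0, 6.0),
         FillStep("data-prep", 15.0, 2.0)]
print(fill_bubble(Bubble(duration_ms=100.0, free_mem_gb=8.0), steps))
```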
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
Xinyi Zhang · Hanyu Zhao · Wencong Xiao · Xianyan Jia · Fei Xu · Yong Li · Wei Lin · Fangming Liu
The era of large deep learning models has given rise to advanced training strategies such as 3D parallelism and the ZeRO series. The combination of these strategies enables various (re-)configurable execution plans for a training job, each exhibiting remarkably different requirements across multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they rely on users to choose execution plans statically, and then allocate resources without considering the chosen plans and their resource requirements. This approach results in mismatches between execution plans and resources, leaving both training performance and cluster utilization far from optimal. We introduce Rubick, a cluster scheduling system for deep learning training that exploits job reconfigurability to improve job performance and cluster efficiency. Rubick incorporates job execution planning as a new dimension in cluster scheduling, continuously reconfiguring jobs' execution plans and tuning multi-resource allocations across jobs jointly. This co-optimization is guided by a performance model that captures the diverse resource requirements and performance characteristics of different jobs and execution plans. Rubick exploits this model to make performance-aware scheduling decisions that maximize cluster throughput while providing performance guarantees to individual jobs. Evaluations on a 64-GPU high-performance training cluster show that Rubick reduces average job completion time and makespan by up to 3.2x and 1.4x, respectively, compared with state-of-the-art systems. The source code of Rubick is publicly available at https://github.com/AlibabaPAI/reconfigurable-dl-scheduler.
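The co-optimization can be caricatured as a joint search over (execution plan, resource allocation) pairs scored by a performance model. The sketch below shows that structure with an exhaustive search over a tiny candidate space; the plan names, scaling exponent, and throughput function are illustrative assumptions, not Rubick's performance model or search procedure.

```python
# Toy joint selection of execution plans and GPU allocations, in the spirit of
# plan/resource co-optimization. The throughput model is a stand-in that a real
# system would fit from profiling data.
from itertools import product

PLANS = ["3D-parallel", "ZeRO-3", "ZeRO-offload"]

def throughput(job: str, plan: str, gpus: int) -> float:
    """Placeholder performance model: estimated samples/sec for (job, plan, gpus)."""
    base = {"3D-parallel": 1.0, "ZeRO-3": 0.8, "ZeRO-offload": 0.5}[plan]
    return base * gpus ** 0.9          # assumed sub-linear scaling

def schedule(jobs: list[str], total_gpus: int):
    """Exhaustively pick a (plan, gpus) pair per job to maximize summed throughput."""
    gpu_options = [2, 4, 8]
    best, best_tput = None, -1.0
    for combo in product(product(PLANS, gpu_options), repeat=len(jobs)):
        if sum(g for _, g in combo) > total_gpus:
            continue                    # respect the cluster's GPU budget
        tput = sum(throughput(j, p, g) for j, (p, g) in zip(jobs, combo))
        if tput > best_tput:
            best, best_tput = dict(zip(jobs, combo)), tput
    return best

print(schedule(["llama-ft", "bert-pretrain"], total_gpus=8))
```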
Spa: Scaling Graph Neural Network Training on Large Graphs via Probabilistic Splitting
Sandeep Polisetty · Juelin Liu · Yi Fung · Seung-Hwan Lim · Hui Guan · Marco Serafini
Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance on various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data-parallel approaches contain redundant work because the subgraphs sampled by different GPUs overlap significantly. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading, and training of each mini-batch across multiple GPUs. Split parallelism, however, introduces communication overheads that can exceed the savings from removing redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in Spa and show that it outperforms state-of-the-art mini-batch training systems such as DGL, Quiver, and P3.
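The redundancy elimination can be illustrated with a toy routing step: every vertex sampled by any GPU is assigned to exactly one GPU, so its features are loaded and its computation performed once instead of once per overlapping subgraph. The sketch below uses placeholder sampling-probability estimates to route vertices; it is a conceptual illustration only, not Spa's partitioning algorithm or its communication schedule.

```python
# Toy illustration of split parallelism for GNN mini-batches: instead of each
# GPU processing its full (overlapping) sampled subgraph, every sampled vertex
# is routed to exactly one GPU. Probability estimates are illustrative placeholders.
from collections import defaultdict

def split_assign(sampled_per_gpu: dict[int, set[int]],
                 hit_prob: dict[int, dict[int, float]]) -> dict[int, set[int]]:
    """Route each sampled vertex to the single GPU with the highest estimated
    probability of sampling it; break ties by lower GPU id."""
    owners = defaultdict(set)
    all_vertices = set().union(*sampled_per_gpu.values())
    for v in all_vertices:
        owner = max(sampled_per_gpu, key=lambda g: (hit_prob[g].get(v, 0.0), -g))
        owners[owner].add(v)
    return dict(owners)

# Two GPUs sample overlapping neighborhoods; vertex 7 appears in both.
sampled = {0: {1, 3, 7}, 1: {2, 7, 9}}
probs = {0: {1: 0.9, 3: 0.8, 7: 0.4}, 1: {2: 0.9, 9: 0.7, 7: 0.6}}
work = split_assign(sampled, probs)
print(work)                                               # vertex 7 lands on GPU 1 only
total = sum(len(s) for s in work.values())
print(total, "<", sum(len(s) for s in sampled.values()))  # 5 < 6: redundancy removed
```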