Skip to yearly menu bar Skip to main content


Session

Parallel and Distributed 2

Mission B4 & B8
Wed 15 May 4:30 p.m. PDT — 5:30 p.m. PDT

Abstract:

Chat is not available.

Wed 15 May 16:30 - 16:50 PDT

19
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication

Chenyu Jiang · Ye Tian · Zhen Jia · Chuan Wu · Yida Wang · Shuai Zheng

The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters, but it grapples with the challenge of prolonged all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. However, this approach often falls short of achieving sufficient overlap, thereby limiting potential performance improvements. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, an optimization system for DNN compilers designed to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times when compared to the state-of-the-art solutions.

Wed 15 May 16:50 - 17:10 PDT

36
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation

Liang Luo · Buyun Zhang · Michael Tsang · Yinbin Ma · Ching-Hsiang Chu · Yuxin Chen · Shen Li · Yuchen Hao · Yanli Zhao · Guna Lakshminarayanan · Ellie Wen · Jongsoo Park · Dheevatsa Mudigere · Maxim Naumov

We study a mismatch between the deep learning recommendation models’ flat architecture, common distributedtraining paradigm and hierarchical data center topology. To address the associated inefficiencies, we proposeDisaggregated Multi-Tower (DMT), a modeling technique that consists of (1) semantic-preserving tower transform(SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjointtowers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to eachtower to reduce model complexity and communication volume through hierarchical feature interaction; and (3)Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactionsand load balanced assignments to preserve model quality and training throughput via learned embeddings. Weshow that DMT can achieve up to 1.9× speedup compared to the state-of-the-art baselines without losing accuracyacross multiple generations of hardware at large data center scales.

Wed 15 May 17:10 - 17:30 PDT

37
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

ZHAO XUANLEI · Bin Jia · Haotian Zhou · Ziming Liu · Shenggan Cheng · Yang You

In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference but often suffer from efficiency due to I/O bottlenecks. To achieve low-latency LLMs inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by over 317\% at most.