MLSys Poster Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication

Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication

Chenyu Jiang · Ye Tian · Zhen Jia · Chuan Wu · Yida Wang · Shuai Zheng

2024 Poster

Abstract:

The Mixture-of-Experts (MoE) technique plays a crucial role in scaling up the parameter count of DNN models, but it suffers from prolonged all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. However, this approach often falls short of achieving sufficient overlap, thereby limiting potential performance improvements. In our study, we broaden the scope of the problem by considering overlap at the level of the whole training graph. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling weight gradient computations. We implement these techniques in Lancet, an optimization system for DNN compilers designed to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves an end-to-end speedup of up to 1.3x compared to state-of-the-art solutions.
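
The backward-pass idea can be illustrated with a toy two-stream timeline. The sketch below is not Lancet's algorithm or API: the op names, durations, and the `simulate` and `exposed_comm` helpers are all hypothetical, chosen only to show why the position of a weight gradient computation in the compute order changes how much all-to-all time is left exposed. It assumes one compute stream and one communication stream, each running its ops in list order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    stream: str        # "compute" or "comm"
    duration: float    # arbitrary time units (made-up values)
    deps: tuple = ()   # names of ops that must finish first


def simulate(schedule):
    """Run ops in list order, one at a time per stream.

    Returns {op name: (start, end)}. Dependencies must appear
    earlier in the list than the ops that use them.
    """
    stream_free = {"compute": 0.0, "comm": 0.0}
    times = {}
    for op in schedule:
        ready = max((times[d][1] for d in op.deps), default=0.0)
        start = max(ready, stream_free[op.stream])
        end = start + op.duration
        times[op.name] = (start, end)
        stream_free[op.stream] = end
    return times


def exposed_comm(schedule, times):
    """Total communication time during which the compute stream is idle."""
    compute = [times[o.name] for o in schedule if o.stream == "compute"]
    exposed = 0.0
    for o in schedule:
        if o.stream != "comm":
            continue
        s, e = times[o.name]
        hidden = sum(max(0.0, min(e, ce) - max(s, cs)) for cs, ce in compute)
        exposed += (e - s) - hidden
    return exposed


# Hypothetical backward pass: a dense layer, then the all-to-all of an MoE layer.
# dX = gradient w.r.t. activations (on the critical path),
# dW = gradient w.r.t. weights (nothing later in the backward chain needs it).
dW_dense = Op("dW_dense", "compute", 2.0)
dX_dense = Op("dX_dense", "compute", 2.0)
a2a = Op("a2a_bwd", "comm", 3.0, deps=("dX_dense",))
dX_expert = Op("dX_expert", "compute", 2.0, deps=("a2a_bwd",))
dW_expert = Op("dW_expert", "compute", 2.0, deps=("a2a_bwd",))

# Baseline: dW is computed before dX, so the all-to-all starts late.
baseline = [dW_dense, dX_dense, a2a, dX_expert, dW_expert]
# Reordered: dX first, so dW_dense runs while the all-to-all is in flight.
reordered = [dX_dense, a2a, dW_dense, dX_expert, dW_expert]

if __name__ == "__main__":
    for label, sched in (("baseline", baseline), ("reordered", reordered)):
        t = simulate(sched)
        total = max(end for _, end in t.values())
        print(label, "total:", total, "exposed comm:", exposed_comm(sched, t))
```

With these made-up durations, the baseline order finishes at time 11 with 3 units of exposed all-to-all, while the reordered schedule finishes at time 9 with 1 unit exposed. The only change is that the dense layer's weight gradient is deferred until the all-to-all has been launched.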
