Oral

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Shaohuai Shi ⋅ Xianhao Zhou ⋅ Shutao Song ⋅ Xingyao Wang ⋅ Zilin Zhu ⋅ Xue Huang ⋅ Xinan Jiang ⋅ Feihu Zhou ⋅ Zhenyu Guo ⋅ Liqiang Xie ⋅ Rui Lan ⋅ Xianbin Ouyang ⋅ Yan Zhang ⋅ Jieqian Wei ⋅ Jing Gong ⋅ Weiliang Lin ⋅ Ping Gao ⋅ Peng Meng ⋅ Xiaomin Xu ⋅ Chenyang Guo ⋅ Bo Yang ⋅ Zhibo Chen ⋅ Yongjian Wu ⋅ Xiaowen Chu

2021 Oral


Abstract

Distributed training techniques are widely deployed for training large-scale deep models on dense-GPU clusters. On public cloud clusters, however, the interconnect bandwidth between instances is only moderate, so traditional state-of-the-art distributed training systems do not scale well when training large-scale models. In this paper, we propose a new computation- and communication-efficient top-k sparsification communication library for distributed training. To further improve system scalability, we optimize I/O with a simple yet efficient multi-level data caching mechanism and optimize the update operation with a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. Finally, we break the DAWNBench record for training ResNet-50 to 93% top-5 accuracy on ImageNet.
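
To make the idea of top-k sparsification concrete, here is a minimal, generic sketch in plain Python. It is not the library proposed in the paper: the function names `topk_sparsify` and `densify` are hypothetical, and keeping the unsent entries as a local residual is a common companion technique assumed here for illustration. Each worker would send only the k largest-magnitude gradient entries (as index/value pairs) instead of the full gradient.

```python
import heapq


def topk_sparsify(grad, k):
    """Split a flat gradient into its k largest-magnitude entries and a residual.

    Hypothetical helper for illustration only; not the paper's API.
    Returns (indices, values, residual).
    """
    if not 0 < k <= len(grad):
        raise ValueError("k must be between 1 and len(grad)")
    # Pick the positions of the k entries with the largest absolute value.
    indices = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    indices.sort()
    values = [grad[i] for i in indices]
    # The residual holds everything that was NOT selected, so a worker could
    # add it to the next iteration's gradient (an assumption of this sketch).
    residual = list(grad)
    for i in indices:
        residual[i] = 0.0
    return indices, values, residual


def densify(indices, values, size):
    """Rebuild a dense vector from sparse (index, value) pairs."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0]
idx, vals, res = topk_sparsify(grad, k=2)
print(idx, vals)  # only these 2 of the 6 entries would be communicated
```

With k much smaller than the gradient size, the communicated payload shrinks roughly in proportion, which is why this family of methods is attractive when the bandwidth between instances is limited.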
