Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever-increasing hardware heterogeneity. To address this issue, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for hybrid, and faster, data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8× faster model synchronization (AllReduce), and reduce end-to-end DNN training time for image classification tasks by up to 40%.
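To give a flavor of the tree-based approach, the sketch below shows an AllReduce over a single spanning tree: values are reduced bottom-up toward the root, then the global result is broadcast back down. This is a hedged illustration only, not Blink's implementation — Blink packs multiple spanning trees and splits data across them to saturate heterogeneous links; the function name and tree encoding here are our own for exposition.

```python
# Illustrative sketch (not Blink's actual API): AllReduce over one
# spanning tree, given as a child-adjacency list rooted at `root`.
def tree_allreduce(values, children, root=0):
    """Sum-AllReduce over a tree.

    values:   per-node contributions (scalars here for simplicity)
    children: children[i] lists the child nodes of node i
    """
    # Phase 1: reduce -- each node sums its own value with its subtree,
    # moving partial sums up toward the root.
    def reduce_subtree(node):
        total = values[node]
        for c in children[node]:
            total += reduce_subtree(c)
        return total

    total = reduce_subtree(root)
    # Phase 2: broadcast -- the root pushes the global sum down the tree,
    # so every node ends with the same reduced value.
    return [total] * len(values)

# Example: 4 GPUs in a chain rooted at GPU 0 (0 -> 1 -> 2 -> 3).
vals = [1, 2, 3, 4]
children = [[1], [2], [3], []]
print(tree_allreduce(vals, children))  # [10, 10, 10, 10]
```

In a real multi-tree schedule, the payload would be chunked and each chunk reduced over a different spanning tree in parallel, weighted by per-tree link capacity.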
Author Information
Guanhua Wang (UC Berkeley)
Ph.D. student in the AMPLab / RISELab at UC Berkeley, advised by Prof. Ion Stoica.
Shivaram Venkataraman (University of Wisconsin, Madison)
Amar Phanishayee (Microsoft Research)
Nikhil Devanur (Microsoft)
Jorgen Thelin (Microsoft Research)
Ion Stoica (UC Berkeley)
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Poster: Blink: Fast and Generic Collectives for Distributed ML
  Tue. Mar 3rd 12:30 -- 03:00 AM, Room Ballroom A #31
More from the Same Authors
- 2022 Poster: On the Utility of Gradient Compression in Distributed Training Systems
  Saurabh Agarwal · Hongyi Wang · Shivaram Venkataraman · Dimitris Papailiopoulos
- 2022 Oral: On the Utility of Gradient Compression in Distributed Training Systems
  Saurabh Agarwal · Hongyi Wang · Shivaram Venkataraman · Dimitris Papailiopoulos
- 2021 Poster: Wavelet: Efficient DNN Training with Tick-Tock Scheduling
  Guanhua Wang · Kehan Wang · Kenan Jiang · Xiangjun Li · Ion Stoica
- 2021 Oral: Wavelet: Efficient DNN Training with Tick-Tock Scheduling
  Guanhua Wang · Kehan Wang · Kenan Jiang · Xiangjun Li · Ion Stoica
- 2021 Poster: sensAI: ConvNets Decomposition via Class Parallelism for Fast Inference on Live Data
  Guanhua Wang · Zhuang Liu · Brandon Hsieh · Siyuan Zhuang · Joseph Gonzalez · Trevor Darrell · Ion Stoica
- 2021 Poster: Adaptive Gradient Communication via Critical Learning Regime Identification
  Saurabh Agarwal · Hongyi Wang · Kangwook Lee · Shivaram Venkataraman · Dimitris Papailiopoulos
- 2021 Oral: sensAI: ConvNets Decomposition via Class Parallelism for Fast Inference on Live Data
  Guanhua Wang · Zhuang Liu · Brandon Hsieh · Siyuan Zhuang · Joseph Gonzalez · Trevor Darrell · Ion Stoica
- 2021 Oral: Adaptive Gradient Communication via Critical Learning Regime Identification
  Saurabh Agarwal · Hongyi Wang · Kangwook Lee · Shivaram Venkataraman · Dimitris Papailiopoulos