Training deep learning models has become an important workload on the public cloud. Scaling cloud-based distributed training faces unique challenges from the hierarchical network topology of the datacenter and the dynamic nature of the multi-tenant environment. Timely training of deep learning models requires effective use of the topology-induced locality in the datacenter network. This work proposes PLink, an optimized communication library that probes the physical network, generates and executes a hierarchical aggregation plan fitted to the observed locality, and evolves that plan to adapt to changing network conditions. PLink needs no support from cloud providers and operates out of the box on unmodified public clouds. PLink serves as a direct plug-in to many training frameworks, delivering up to 2.3x better end-to-end training throughput for popular DL models on Azure and EC2 compared to the state of the art.
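To make the hierarchical aggregation idea concrete, the sketch below is a minimal, illustrative Python example, not PLink's actual API; the grouping heuristic, function names, thresholds, and data structures are assumptions. It shows the two-level plan structure the abstract describes: VMs are grouped by probed pairwise latency, gradients are summed inside each low-latency group, group leaders aggregate across groups, and the result is broadcast back to every node.

# Illustrative sketch of a two-level hierarchical aggregation plan.
# Hypothetical names and grouping heuristic; not PLink's real interface.

from collections import defaultdict

def group_by_locality(nodes, probed_latency_ms, threshold_ms=0.5):
    """Greedily place each node into the first group whose leader it can
    reach under the latency threshold; otherwise start a new group."""
    groups = []
    for node in nodes:
        for group in groups:
            leader = group[0]
            if probed_latency_ms[(node, leader)] <= threshold_ms:
                group.append(node)
                break
        else:
            groups.append([node])
    return groups

def hierarchical_allreduce(gradients, groups):
    """Two-phase sum: intra-group reduce at each leader, then a global
    reduce across leaders, then broadcast the total back to every node."""
    # Phase 1: each group reduces its members' gradients at its leader.
    leader_sums = [sum(gradients[n] for n in group) for group in groups]
    # Phase 2: leaders aggregate across groups (the cross-rack hop).
    total = sum(leader_sums)
    # Phase 3: broadcast the aggregated gradient back to all nodes.
    return {n: total for group in groups for n in group}

if __name__ == "__main__":
    nodes = ["vm0", "vm1", "vm2", "vm3"]
    # Hypothetical probe results: vm0/vm1 and vm2/vm3 share a rack.
    lat = defaultdict(lambda: 2.0)
    for a, b in [("vm1", "vm0"), ("vm3", "vm2")]:
        lat[(a, b)] = lat[(b, a)] = 0.1
    groups = group_by_locality(nodes, lat)
    grads = {"vm0": 1.0, "vm1": 2.0, "vm2": 3.0, "vm3": 4.0}
    print(groups)                                 # [['vm0', 'vm1'], ['vm2', 'vm3']]
    print(hierarchical_allreduce(grads, groups))  # every node receives 10.0

In a real deployment the two phases would be collective or socket transfers between machines rather than in-process sums, and the plan would be re-fitted as probe measurements change, but the overall shape (local reduce, cross-group reduce, broadcast) is the same.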
Author Information
Liang Luo (University of Washington)
Peter West (University of Washington)
Jacob Nelson (Microsoft Research)
Arvind Krishnamurthy (University of Washington)
Luis Ceze (University of Washington and OctoML)
Related Events (a corresponding poster, oral, or spotlight)
- 2020 Poster: PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the Public Cloud
  Tue. Mar 3rd, 12:30 -- 03:00 AM, Room Ballroom A #2
More from the Same Authors
- 2022 Poster: DietCode: Automatic Optimization for Dynamic Tensor Programs
  Bojian Zheng · Ziheng Jiang · Cody Hao Yu · Haichen Shen · Joshua Fromm · Yizhi Liu · Yida Wang · Luis Ceze · Tianqi Chen · Gennady Pekhimenko
- 2022 Poster: SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
  Liang Luo · Peter West · Pratyush Patel · Arvind Krishnamurthy · Luis Ceze
- 2022 Oral: SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud
  Liang Luo · Peter West · Pratyush Patel · Arvind Krishnamurthy · Luis Ceze
- 2022 Oral: DietCode: Automatic Optimization for Dynamic Tensor Programs
  Bojian Zheng · Ziheng Jiang · Cody Hao Yu · Haichen Shen · Joshua Fromm · Yizhi Liu · Yida Wang · Luis Ceze · Tianqi Chen · Gennady Pekhimenko
- 2021: Thoughts on Research, Community and Impact
  Luis Ceze
- 2021: Panel Discussion
  Luis Ceze · Cliff Young · Chris Lattner
- 2020 Oral: Riptide: Fast End-to-End Binarized Neural Networks
  Joshua Fromm · Meghan Cowan · Matthai Philipose · Luis Ceze · Shwetak Patel
- 2020 Poster: Riptide: Fast End-to-End Binarized Neural Networks
  Joshua Fromm · Meghan Cowan · Matthai Philipose · Luis Ceze · Shwetak Patel