PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the public Cloud
Liang Luo · Peter West · Jacob Nelson · Arvind Krishnamurthy · Luis Ceze

Mon Mar 02 06:25 AM -- 06:50 AM (PST) @ Ballroom A #102

Training deep learning models has become an important workload on the public cloud. Scaling cloud-based distributed training faces unique challenges from the hierarchical network topology of the datacenter and the dynamic nature of the multi-tenant environment. Timely training of deep learning models requires effective use of topology-induced locality in the datacenter network. This work proposes PLink, an optimized communication library that probes the physical network and then generates and executes a fitted hierarchical aggregation plan to take advantage of such locality, and evolves the plan to adapt to changing network conditions. PLink needs no support from cloud providers and operates out-of-the-box on unmodified public clouds. PLink serves as a direct plug-in to many training frameworks, delivering up to 2.3x better end-to-end training throughput for popular DL models on Azure and EC2 compared to the state of the art.
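The probe-then-aggregate idea in the abstract can be illustrated with a minimal sketch. This is not PLink's actual algorithm or API: the greedy grouping rule, the latency threshold, the function names `group_by_latency` and `hierarchical_sum`, and the latency matrix are all hypothetical and chosen only to show how probed latencies could drive a two-level aggregation plan.

```python
from typing import List, Sequence


def group_by_latency(latency: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """Greedily place each host in the first group whose first member is 'close'.

    Assumption: a probed latency below `threshold` means two hosts share locality.
    """
    groups: List[List[int]] = []
    for host in range(len(latency)):
        for group in groups:
            if latency[host][group[0]] < threshold:
                group.append(host)
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([host])
    return groups


def hierarchical_sum(grads: Sequence[Sequence[float]], groups: List[List[int]]) -> List[float]:
    """Two-level reduction: sum within each group, then sum the group partials."""
    dim = len(grads[0])
    partials = []
    for group in groups:
        partial = [0.0] * dim
        for host in group:
            for i, g in enumerate(grads[host]):
                partial[i] += g
        partials.append(partial)
    total = [0.0] * dim
    for partial in partials:
        for i, p in enumerate(partial):
            total[i] += p
    return total


# Hypothetical probe results in milliseconds: hosts 0-1 are close, hosts 2-3 are close.
latency = [
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 1.0, 0.9],
    [0.9, 1.0, 0.0, 0.1],
    [1.0, 0.9, 0.1, 0.0],
]
groups = group_by_latency(latency, threshold=0.5)
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
total = hierarchical_sum(grads, groups)
```

In this toy setup only one partial sum per group would cross the slower links, which is the kind of locality the abstract refers to. Re-running the grouping on fresh probe data is one simple way to picture a plan adapting to changing network conditions.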

Author Information

Liang Luo (University of Washington)
Peter West (University of Washington)
Jacob Nelson (Microsoft Research)
Arvind Krishnamurthy (University of Washington)
Luis Ceze (University of Washington and OctoML)
