Oral
Mon Mar 02 06:25 AM -- 06:50 AM (PST) @ Ballroom A #102
PLink: Discovering and Exploiting Locality for Accelerated Distributed Training on the Public Cloud
Liang Luo · Peter West · Jacob Nelson · Arvind Krishnamurthy · Luis Ceze

Training deep learning models has become an important workload on the public cloud. Scaling cloud-based distributed training faces unique challenges from the hierarchical network topology of the datacenter and the dynamic nature of the multi-tenant environment. Timely training of deep learning models requires effective use of topology-induced locality in the datacenter network. This work proposes PLink, an optimized communication library that probes the physical network and then generates and executes a fitted hierarchical aggregation plan to take advantage of such locality, and evolves the plan to adapt to changing network conditions. PLink needs no support from cloud providers and operates out-of-the-box on unmodified public clouds. PLink serves as a direct plug-in to many training frameworks, delivering up to 2.3x better end-to-end training throughput for popular DL models on Azure and EC2 compared to the state of the art.
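The core idea of the abstract (probe the network, then fit a hierarchical aggregation plan to the discovered locality) can be illustrated with a minimal sketch. This is not PLink's actual implementation; the function names, the greedy latency-threshold clustering, and the two-level leader scheme are all simplifying assumptions for illustration.

```python
def cluster_by_latency(latency, threshold):
    """Greedily group nodes whose pairwise latency stays below `threshold`,
    approximating topology-induced locality groups (e.g. same rack).
    `latency[a][b]` is the probed latency between nodes a and b.
    (Hypothetical helper, not part of PLink.)"""
    groups = []
    for n in sorted(latency):
        for g in groups:
            if all(latency[n][m] < threshold for m in g):
                g.append(n)
                break
        else:
            groups.append([n])
    return groups


def build_plan(groups):
    """Two-phase aggregation plan: each group aggregates at a local
    leader over fast intra-group links, then the leaders aggregate at a
    global root over the slower cross-group links."""
    local = [(g[0], g[1:]) for g in groups]       # (leader, members)
    root = local[0][0]
    global_step = (root, [leader for leader, _ in local[1:]])
    return {"local": local, "global": global_step}


# Toy probe results: low latency within a "rack", high across racks.
latency = {
    "a": {"b": 0.1, "c": 2.0, "d": 2.1},
    "b": {"a": 0.1, "c": 2.2, "d": 2.0},
    "c": {"a": 2.0, "b": 2.2, "d": 0.2},
    "d": {"a": 2.1, "b": 2.0, "c": 0.2},
}
plan = build_plan(cluster_by_latency(latency, threshold=1.0))
# plan["local"]  -> [('a', ['b']), ('c', ['d'])]
# plan["global"] -> ('a', ['c'])
```

Adapting to changing network conditions, as the abstract describes, would then amount to re-probing periodically and regenerating the plan when the measured latencies drift.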