PyTorch RPC: Distributed Deep Learning Built on Tensor-Optimized Remote Procedure Calls
Pritam Damania · Shen Li · Alban Desmaison · Alisson Azzolini · Brian Vaughan · Edward Yang · Gregory Chanan · Guoqiang Jerry Chen · Hongyi Jia · Howard Huang · Joseph Spisak · Luca Wehrstedt · Lucas Hosseini · Manoj Krishnan · Omkar Salpekar · Pavel Belevich · Rohan Varma · Satendra Gera · Wanchao Liang · Shihao Xu · Soumith Chintala · Chaoyang He · Amir Ziashahabi · Salman Avestimehr · Zachary DeVito
Ballroom B - Position 42
Distributed training technologies have advanced rapidly in the past few years and have unlocked unprecedented scalability with increasingly complex solutions. These technologies have made distributed training much more efficient and accessible, but they also impose specific constraints on the training paradigm or the model structure. As a result, applications that do not meet these constraints must rely on general-purpose distributed computing frameworks to scale out. However, without access to the internal components of deep learning frameworks, these general-purpose frameworks usually fall well short in efficiency and usability. To address these problems, we propose PyTorch RPC as a generic and high-performance solution for distributed deep learning. Unlike generic distributed computing frameworks, PyTorch RPC natively provides the essential features for implementing training applications in a distributed environment, including optimized tensor communication, remote memory management, and distributed autograd. Evaluations show that PyTorch RPC achieves tensor communication up to two orders of magnitude faster than gRPC while requiring one-tenth of the user code. Case studies further demonstrate that users can easily employ PyTorch RPC to build efficient reinforcement learning applications (a video game solver), implement large language models (GPT-3), train recommendation models (DLRM), and scale federated learning tasks (FedML).