Distributed and Parallel Learning

Exhibit Hall A

Moderator: Shivaram Venkataraman

Mon 29 Aug 1 p.m. PDT — 2:15 p.m. PDT


Chat is not available.

Mon 29 Aug. 13:00 - 13:18 PDT

BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling

Cheng Wan · Cheng Wan · Youjie Li · Ang Li · Ang Li · Nam Sung Kim · Nam Sung Kim · Yingyan Lin

Graph Convolutional Networks (GCNs) have emerged as the state-of-the-art method for graph-based learning tasks. However, training GCNs at scale is still challenging, hindering both the exploration of more sophisticated GCN architectures and their applications to real-world large graphs. While it might be natural to consider graph partition and distributed training for tackling this challenge, this direction has only been slightly scratched the surface in the previous works due to the limitations of existing designs. In this work, we first analyze why distributed GCN training is ineffective and identify the underlying cause to be the excessive number of boundary nodes of each partitioned subgraph, which easily explodes the memory and communication costs for GCN training. Furthermore, we propose a simple yet effective method dubbed BNS-GCN that adopts random Boundary-Node-Sampling to enable efficient and scalable distributed GCN training. Experiments and ablation studies consistently validate the effectiveness of BNS-GCN, e.g., boosting the throughput by up to 16.2× and reducing the memory usage by up to 58%, while maintaining a full-graph accuracy. Furthermore, both theoretical and empirical analysis show that BNS-GCN enjoys a better convergence than existing sampling-based methods. We believe that our BNS-GCN has opened up a new paradigm for enabling GCN training at scale. The code is available at

Mon 29 Aug. 13:18 - 13:36 PDT

Sequential Aggregation and Rematerialization: Distributed Full-batch Training of Graph Neural Networks on Large Graphs

Hesham Mostafa

We present the Sequential Aggregation and Rematerialization (SAR) scheme for distributed full-batch training of Graph Neural Networks (GNNs) on large graphs. Large-scale training of GNNs has recently been dominated by sampling-based methods and methods based on non-learnable message passing. SAR on the other hand is a distributed technique that can train any GNN type directly on an entire large graph. The key innovation in SAR is the distributed sequential rematerialization scheme which sequentially re-constructs then frees pieces of the prohibitively large GNN computational graph during the backward pass. This results in excellent memory scaling behavior where the memory consumption per worker goes down linearly with the number of workers, even for densely connected graphs. Using SAR, we report the largest applications of full-batch GNN training to-date, and demonstrate large memory savings as the number of workers increases. We also present a general technique based on kernel fusion and attention-matrix rematerialization to optimize both the runtime and memory efficiency of attention-based models. We show that, coupled with SAR, our optimized attention kernels lead to significant speedups and memory savings in attention-based GNNs.

Mon 29 Aug. 13:36 - 13:54 PDT

PAPAYA: Practical, Private, and Scalable Federated Learning

Dzmitry Huba · John Nguyen · Kshitiz Malik · Ruiyu Zhu · Mike Rabbat · Ashkan Yousefpour · Carole-Jean Wu · Hongyuan Zhan · Pavel Ustinov · Harish Srinivas · Kaikai Wang · Anthony Shoumikhin · Jesik Min · Mani Malek

Cross-device Federated Learning (FL) is a distributed learning paradigm with several challenges that differentiate it from traditional distributed learning: variability in the system characteristics on each device, and millions of clients coordinating with a central server being primary ones. Most FL systems described in the literature are synchronous in nature --- they perform a synchronized aggregation of model updates from individual clients. Scaling synchronous FL is challenging since increasing the number of clients training in parallel leads to diminishing returns in training speed, analogous to large-batch training. Moreover, synchronous FL can be slow due to stragglers. In this work, we describe the design of a production asynchronous FL system to tackle the aforementioned issues, sketch some of the system design challenges and their solutions, and touch upon principles that emerged from building the production system for millions of clients. Empirically, we demonstrate that asynchronous FL is significantly faster than synchronous FL when training across millions of devices. In particular, in high concurrency settings, asynchronous FL converges 5$\times$ faster while being nearly 8$\times$ more resource efficient than synchronous FL.

Mon 29 Aug. 13:54 - 14:12 PDT

LightSecAgg: a Lightweight and Versatile Design for Secure Aggregation in Federated Learning

Jinhyun So · · Chien-Sheng Yang · Songze Li · Qian Yu · Ramy E. Ali · Basak Guler · Salman Avestimehr

Secure model aggregation is a key component of federated learning (FL) that aims at protecting the privacy of each user’s individual model while allowing for their global aggregation. It can be applied to any aggregation-based FL approach for training a global or personalized model. Model aggregation needs to also be resilient against likely user dropouts in FL systems, making its design substantially more complex. State-of-the-art secure aggregation protocols rely on secret sharing of the random-seeds used for mask generations at the users to enable the reconstruction and cancellation of those belonging to the dropped users. The complexity of such approaches, however, grows substantially with the number of dropped users. We propose a new approach, named LightSecAgg, to overcome this bottleneck by changing the design from random-seed reconstruction of the dropped users'' toone-shot aggregate-mask reconstruction of the active users via mask encoding/decoding''. We show that LightSecAgg achieves the same privacy and dropout-resiliency guarantees as the state-of-the-art protocols while significantly reducing the overhead for resiliency against dropped users. We also demonstrate that, unlike existing schemes, LightSecAgg can be applied to secure aggregation in the asynchronous FL setting. Furthermore, we provide a modular system design and optimized on-device parallelization for scalable implementation, by enabling computational overlapping between model training and on-device encoding, as well as improving the speed of concurrent receiving and sending of chunked masks. We evaluate LightSecAgg via extensive experiments for training diverse models (logistic regression, shallow CNNs, MobileNetV3, and EfficientNet-B0) on various datasets (MNIST, FEMNIST, CIFAR-10, GLD-23K) in a realistic FL system with large number of users and demonstrate that LightSecAgg significantly reduces the total training time.