Poster

In-network Aggregation for Shared Machine Learning Clusters

Nadeen Gebara · Manya Ghobadi · Paolo Costa

Keywords: Generative Models Theory Learning Theory Deep Learning -> Adversarial Networks; Deep Learning -> Deep Autoencoders; Deep Learning

Outstanding Paper Award

2021 Poster

Abstract

We present PANAMA, a network architecture for machine learning (ML) workloads on shared clusters where a variety of training jobs co-exist. PANAMA consists of two key components: (i) an efficient in-network hardware accelerator designed to accelerate large data-parallel training transfers; and (ii) a lightweight congestion control protocol to enable fair sharing of network resources across different flows. Our congestion control protocol exploits the unique communication pattern in training to ensure that large in-network aggregation transfers do not negatively impact short latency-sensitive flows. To evaluate the feasibility of PANAMA, we build an FPGA-based prototype with 10 Gbps transceivers and show that our hardware datapath achieves line-rate aggregation. Our large-scale simulations demonstrate that PANAMA improves the mean and 99th-percentile completion time of latency-sensitive short flows by a factor of 2–4.5 while reducing the average training time of large jobs by a factor of 1.25.
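As a rough illustration of what in-network aggregation means for data-parallel training, the sketch below models a switch that sums equally sized gradient chunks from several workers and returns the same total to each of them. It is not PANAMA's hardware datapath or protocol; the function names `aggregate_in_network` and `all_reduce_via_switch`, the worker identifiers, and the use of plain Python lists are all assumptions made for this example.

```python
from typing import Dict, List


def aggregate_in_network(worker_chunks: Dict[str, List[float]]) -> List[float]:
    """Element-wise sum of one gradient chunk per worker (hypothetical model
    of the reduction an aggregating switch would perform)."""
    chunks = list(worker_chunks.values())
    if not chunks:
        raise ValueError("no worker chunks to aggregate")
    length = len(chunks[0])
    if any(len(chunk) != length for chunk in chunks):
        raise ValueError("all chunks must have the same length")
    # zip(*chunks) groups the i-th element of every worker's chunk together.
    return [sum(values) for values in zip(*chunks)]


def all_reduce_via_switch(worker_chunks: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Return the aggregated chunk to every worker, as a multicast would."""
    total = aggregate_in_network(worker_chunks)
    return {worker: list(total) for worker in worker_chunks}


if __name__ == "__main__":
    gradients = {"w0": [1.0, 2.0], "w1": [3.0, 4.0], "w2": [5.0, 6.0]}
    print(all_reduce_via_switch(gradients))
```

In this model each worker sends its chunk once and receives one aggregated chunk back, rather than exchanging chunks with every other worker, which is the traffic reduction that aggregating inside the network aims for.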
