In-network Aggregation for Shared Machine Learning Clusters

Nadeen Gebara, Manya Ghobadi, Paolo Costa

Outstanding Paper Award
Session 3: Communication and Storage
Tue 6 Apr 1:50 p.m. — 2:10 p.m. PDT


We present PANAMA, a network architecture for machine learning (ML) workloads on shared clusters where a variety of training jobs co-exist. PANAMA consists of two key components: (i) an efficient in-network hardware accelerator designed to speed up large data-parallel training transfers; and (ii) a lightweight congestion control protocol that enables fair sharing of network resources across different flows. Our congestion control protocol exploits the unique communication pattern of training to ensure that large in-network aggregation transfers do not negatively impact short latency-sensitive flows. To evaluate the feasibility of PANAMA, we build an FPGA-based prototype with 10 Gbps transceivers and show that our hardware datapath achieves line-rate aggregation. Our large-scale simulations demonstrate that PANAMA improves the mean and 99th-percentile completion times of latency-sensitive short flows by a factor of 2–4.5 while reducing the average training time of large jobs by a factor of 1.25.
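To illustrate the core idea of in-network aggregation, the sketch below shows the reduction a switch-resident accelerator would perform: element-wise summing of gradient chunks arriving from the workers of a data-parallel job, so that only the aggregated result is multicast back. This is a minimal illustrative model only; the function name and list-based representation are assumptions, not PANAMA's actual hardware datapath.

```python
def aggregate(worker_chunks):
    """Element-wise sum of equal-length gradient chunks, one per worker.

    Models the reduction an in-network aggregator applies before
    multicasting the result back to all workers, replacing N separate
    worker-to-worker exchanges with a single in-network sum.
    (Illustrative sketch, not PANAMA's real implementation.)
    """
    length = len(worker_chunks[0])
    assert all(len(c) == length for c in worker_chunks), "chunks must align"
    # zip(*...) groups the i-th element of every worker's chunk together.
    return [sum(vals) for vals in zip(*worker_chunks)]

# Example: three workers each contribute a 4-element gradient chunk.
chunks = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
    [2.0, 1.0, 0.0, -1.0],
]
print(aggregate(chunks))  # → [3.5, 3.5, 3.5, 3.5]
```

Because the sum is computed inside the network, each worker sends and receives its gradient once per aggregation round, which is what allows the hardware datapath to sustain line rate.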
