Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

Ningning Xie · Tamara Norman · Dominik Grewe · Dimitrios Vytiniotis

Keywords: [ distributed and parallel learning ] [ programming languages for ml ]

[ Abstract ]
[ Paper PDF
Oral presentation: Testing, Debugging and Monitoring & Security
Tue 30 Aug 4 p.m. PDT — 5:30 p.m. PDT


We present a novel characterization of the mapping of multiple parallelism forms(e.g. data and model parallelism) onto hierarchical accelerator systems that ishierarchy-aware and greatly reduces the space of software-to-hardware mapping.We experimentally verify the substantial effect of these mappings on all-reduceperformance (up to 448x). We offer a novel syntax-guided programsynthesis framework that is able to decompose reductions over one or moreparallelism axes to sequences of collectives in a hierarchy- and mapping-awareway. For 69% of parallelism placements and user requested reductions, ourframework synthesizes programs that outperform the default all-reduceimplementation when evaluated on different GPU hierarchies (max 2.04x,average 1.27x). We complement our synthesis tool with a simulatorexceeding 90% top-10 accuracy, which therefore reduces the need for massiveevaluations of synthesis results to determine a small set of optimal programsand mappings.

Chat is not available.