Modern Deep Learning systems heavily rely on distributed training over customized high-performance accelerator (e.g.,
TPU, GPU)-based hardware platforms connected via high-performance interconnects (e.g., NVlinks). Examples today
include NVIDIA’s DGX-2, Google’s Cloud TPU and Facebook’s Zion. Deep Neural Network (DNN) training involves a
complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective
communication algorithm, network topology, and the accelerator endpoint, as shown in the figure above.
Collective communications (e.g., all-reduce, all-to-all, reduce-scatter, all-gather) are initiated at different phases for different parallelism approaches – and play a crucial role in overall runtime, if not hidden efficiently behind compute. This problem becomes paramount as recent models for NLP such as GPT-3 and Recommendations such as DLRM have billions to trillions of parameters and need to be scaled across tens to hundreds to thousands of accelerator nodes. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex design-space to (i) architect future platforms and (ii) develop novel parallelism schemes to support efficient training of future DNN models.
As an ongoing collaboration between Intel, Facebook and Georgia Tech, we have been jointly developing a detailed cycle- …