Modern Deep Learning systems heavily rely on distributed training over customized high-performance accelerator (e.g., TPU, GPU)-based hardware platforms connected via high-performance interconnects (e.g., NVLink). Examples today include NVIDIA’s DGX-2, Google’s Cloud TPU, and Facebook’s Zion. Deep Neural Network (DNN) training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the accelerator endpoint. Collective communications (e.g., all-reduce, all-to-all, reduce-scatter, all-gather) are initiated at different phases for different parallelism approaches and play a crucial role in overall runtime if they are not hidden efficiently behind compute. This problem becomes paramount as recent NLP models such as GPT-3 and recommendation models such as DLRM have billions to trillions of parameters and need to be scaled across tens to hundreds or thousands of accelerator nodes. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex design space to (i) architect future platforms and (ii) develop novel parallelism schemes to support efficient training of future DNN models.
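To make the compute-communication interplay concrete, the sketch below estimates the per-iteration gradient all-reduce time for data-parallel training under a ring algorithm and compares it against a compute budget. This is a back-of-the-envelope illustration only; the model size, bandwidth, and compute-time numbers are assumptions chosen for the example, not measurements from any of the platforms named above.

```python
# Illustrative back-of-the-envelope model (not ASTRA-sim): estimate whether the
# data-parallel gradient all-reduce can be hidden behind backward-pass compute.
# All numbers below are assumptions chosen only to show the mechanics.

def ring_allreduce_time_s(message_bytes: float, nodes: int, link_bw_gbps: float) -> float:
    """Bandwidth term of a ring all-reduce: each node sends/receives
    2*(p-1)/p of the message over its link (latency term ignored)."""
    bytes_on_wire = 2.0 * (nodes - 1) / nodes * message_bytes
    return bytes_on_wire / (link_bw_gbps * 1e9 / 8)  # Gb/s -> bytes/s

params          = 175e9   # assumed parameter count (GPT-3 scale)
bytes_per_grad  = 2       # assumed fp16 gradients
nodes           = 1024    # assumed data-parallel accelerator count
link_bw_gbps    = 300     # assumed per-node interconnect bandwidth
backward_time_s = 1.5     # assumed backward-pass compute time per step

comm_s = ring_allreduce_time_s(params * bytes_per_grad, nodes, link_bw_gbps)
exposed_s = max(0.0, comm_s - backward_time_s)  # time not hidden behind compute

print(f"all-reduce: {comm_s:.2f}s, exposed (not overlapped): {exposed_s:.2f}s")
```

In practice, frameworks overlap such collectives with compute at a finer granularity (per bucket or per layer), and it is exactly this kind of scheduling interaction that motivates the simulation methodology described next.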
As an ongoing collaboration between Intel, Facebook, and Georgia Tech, we have been jointly developing a detailed cycle-accurate distributed training simulator called ASTRA-sim. ASTRA-sim models the co-design space described above and schedules the compute-communication interactions from distributed training over plug-and-play compute and network simulators. It enables a systematic study of software- and hardware-level bottlenecks when scaling training, as well as end-to-end design-space exploration for running large DNN models over future training platforms. Papers detailing ASTRA-sim were presented at ISPASS 2020 and Hot Interconnects 2020. Currently, ASTRA-sim provides two compute models (a roofline model and SCALE-sim, a Google TPU-like simulator) and a suite of network models (an analytical network model, Garnet from gem5, and NS3), allowing users to go from simple analytical estimates to detailed cycle-accurate simulation of large-scale training platforms. To the best of our knowledge, ASTRA-sim is the first open-source simulator for modeling future distributed training platforms.
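The "plug-and-play" structure can be pictured with a small sketch: a scheduler issues compute and communication events against interchangeable backends. The class and function names below are hypothetical illustrations of the idea, not ASTRA-sim's actual API; the real simulator wires SCALE-sim, Garnet, or NS3 behind analogous boundaries, and the hardware numbers in the usage example are made up.

```python
# Hypothetical sketch of a plug-and-play simulator boundary (not ASTRA-sim's API):
# the scheduler only sees abstract compute/network backends, so an analytical
# model and a cycle-accurate simulator can be swapped behind the same interface.
from abc import ABC, abstractmethod

class ComputeBackend(ABC):
    @abstractmethod
    def layer_time_s(self, flops: float) -> float: ...

class NetworkBackend(ABC):
    @abstractmethod
    def collective_time_s(self, kind: str, message_bytes: float, nodes: int) -> float: ...

class RooflineCompute(ComputeBackend):
    def __init__(self, peak_flops: float):
        self.peak_flops = peak_flops
    def layer_time_s(self, flops: float) -> float:
        return flops / self.peak_flops  # compute-bound roofline estimate

class AnalyticalNetwork(NetworkBackend):
    def __init__(self, link_bw_bytes_s: float):
        self.bw = link_bw_bytes_s
    def collective_time_s(self, kind: str, message_bytes: float, nodes: int) -> float:
        # Ring-style bandwidth term for all-reduce; other collectives omitted,
        # so 'kind' is ignored in this toy model.
        return 2 * (nodes - 1) / nodes * message_bytes / self.bw

def simulate_step(layers, compute: ComputeBackend, network: NetworkBackend, nodes: int) -> float:
    """Naive serialized schedule: compute each layer, then all-reduce its gradients."""
    total = 0.0
    for flops, grad_bytes in layers:
        total += compute.layer_time_s(flops)
        total += network.collective_time_s("all-reduce", grad_bytes, nodes)
    return total

# Example usage with made-up layer sizes and hardware parameters.
layers = [(4e12, 200e6), (8e12, 400e6)]  # (FLOPs, gradient bytes) per layer
t = simulate_step(layers, RooflineCompute(300e12), AnalyticalNetwork(25e9), nodes=64)
print(f"estimated step time: {t * 1e3:.1f} ms")
```

Swapping AnalyticalNetwork for a detailed cycle-accurate backend, or replacing the naive serialized schedule with an overlapping one, changes only a single component; that separation is what the description above refers to as plug-and-play.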
In this tutorial, we will educate the research community about the challenges in the emerging domain of distributed training, demonstrate the capabilities of ASTRA-sim with examples, and discuss ongoing development efforts.
Wed 1:00 p.m. - 2:00 p.m. | Introduction to Distributed DL Training (Talk) | Tushar Krishna
Wed 2:00 p.m. - 2:20 p.m. | Challenges on Distributed Training Systems (Talk)
Wed 2:20 p.m. - 3:30 p.m. | Introduction to ASTRA-sim simulator (Talk) | Saeed Rashidi
Wed 3:30 p.m. - 4:00 p.m. | Coffee Break
Wed 4:00 p.m. - 4:50 p.m. | Hands-on Exercises on Using ASTRA-sim (Talk) | William Won · Taekyung Heo
Wed 4:50 p.m. - 5:00 p.m. | Closing Remarks and Future Developments (Closing Remarks) | Taekyung Heo
Author Information
Tushar Krishna (Georgia Institute of Technology)