Tutorial: Sparsity in ML: Understanding and Optimizing Sparsity in Neural Networks Running on Heterogeneous Systems
Future work and closing remarks
Vikram Sharma Mailthody
Model parallelism addresses the ever-increasing demand for compute and memory capacity in deep learning training and inference by partitioning a model across devices, which allows a workload to scale to hundreds or thousands of GPUs. The communication pattern often depends on the model and can result in sparse, irregular accesses to neighboring GPUs, especially when computing sparse layers or graph workloads. When frequent communication is required, it can dominate execution time. In this session, we discuss future work on optimizing sparse communication for massive sparse matrices. We target the communication architecture of multi-GPU nodes, in which GPUs within the same node are connected by a high-bandwidth interconnect. We present a proof of concept that alleviates the communication bottleneck of sparse scatter and gather operations by over 60% on OLCF’s Summit supercomputer. We conclude the tutorial with final remarks on open problems and the future outlook.
See Vikram's bio here.