Many ML models are either sparsified (such as deep neural networks) to reduce memory footprint and FLOPs, or are inherently sparse due to their unstructured nature (such as graph neural networks). Nevertheless, even though sparsity is desirable in theory, it often hampers performance in practice because existing heterogeneous systems (such as GPUs and FPGAs) fall short on irregular computations. For example, because GPU architectures are optimized for regular, dense computations, only a small fraction of the theoretical peak GPU performance is realized when performing sparse computation. In this tutorial, we discuss the sources of sparsity in deep neural networks as well as key techniques for mapping sparse computation onto heterogeneous systems to support high-performance inference and training. We will conclude the tutorial with a discussion of future work on model parallelism, focusing on optimizing sparse communication for large-scale sparse ML models.
Tue 1:00 p.m. - 1:50 p.m. | Opening remarks and overview of sparsity in ML (Introduction)
We will give an overview of the sparsity problems in ML and summarize some of the latest work that addresses the challenges of handling sparsity from both the systems and the algorithm perspective. We will discuss sparsity both as a result of pruning dense models and as an inherent property of other ML models. We will then discuss the challenges of implementing sparse algorithms on heterogeneous systems. We will conclude this session with an overview of the runtime software libraries and tools, such as EMOGI, Pytorch-DGL, Tiled SpMM, BaM, and 2:4 sparsity, that we have developed recently to address the compute and memory challenges of handling sparsity in heterogeneous computing systems. This session aims to give the audience the foundation for understanding the rest of the sessions. See Wen-mei's bio here. See Jinjun's bio here.
Wen-Mei Hwu · Jinjun Xiong
Tue 1:50 p.m. - 2:45 p.m. | Tiled SpMM and its performance model on GPUs (Session)
Sparse matrix-dense matrix multiplication (SpMM) is a common operation in ML computations. In this session, we will give a tutorial on our loop reordering and tiling strategies for optimizing SpMM on GPUs. We will present extensive benchmark results on A100 GPUs showing that the proposed Tiled SpMM mechanism outperforms previous approaches and reaches the theoretical peak performance given by the sparsity pattern and the underlying architecture. We will then explain how a high-fidelity performance model based on memory bandwidth can be used to understand the measured performance of sparse-matrix tiling strategies and to identify additional optimizations such as load balancing and row/column permutation of the sparse matrix. For demonstration, we will use sparse deep neural network (DNN) inference with the MIT/Amazon/IEEE Graph Challenge benchmark networks as a running example throughout this session. A minimal code sketch of the tiling idea follows this entry. See Mert's bio here. Graph Challenge Codebase: here. Graph Challenge Publication: here. MLSys Presentation Slides: here.
Mert Hidayetoglu
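To make the tiling idea concrete, here is a minimal SciPy/NumPy sketch of column-tiled SpMM. It illustrates the loop tiling concept only, not the Tiled SpMM CUDA kernels presented in the session; the matrix sizes and the `tile_cols` parameter are illustrative assumptions.

```python
# Illustrative column-tiled SpMM: C = A @ B with a sparse A in CSR format.
# On a GPU, each tile of B rows would be staged in shared memory so the
# irregular accesses driven by A's sparsity pattern stay on-chip.
import numpy as np
import scipy.sparse as sp

def tiled_spmm(A_csr, B, tile_cols=1024):
    m, k = A_csr.shape
    C = np.zeros((m, B.shape[1]), dtype=B.dtype)
    for start in range(0, k, tile_cols):
        stop = min(start + tile_cols, k)
        A_tile = A_csr[:, start:stop]        # sparse column slice of A
        C += A_tile @ B[start:stop, :]       # accumulate the partial product
    return C

A = sp.random(4096, 4096, density=0.2, format="csr", dtype=np.float32)
B = np.random.rand(4096, 128).astype(np.float32)
assert np.allclose(tiled_spmm(A, B), A @ B, rtol=1e-3, atol=1e-3)
```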
Tue 2:45 p.m. - 3:45 p.m. | Sparse deep neural network inference on FPGAs (Session)
This session presents the design and implementation of a highly flexible sparse DNN inference accelerator on FPGAs using high-level synthesis (HLS). We will explain how custom sparse computation hardware synthesized from C/C++ and Python can achieve higher energy efficiency than CPUs and GPUs. The proposed inference engine can be easily configured for both mobile/edge computing and high-performance computing scenarios. Our evaluation shows that it effectively accelerates sparse DNNs and outperforms a CPU solution by up to 4.7x in energy efficiency. We will conclude with a survey of sparse support in related FPGA and ASIC accelerators. A sketch of the per-layer computation such an engine implements follows this entry. See Sitao's bio here.
Sitao Huang
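As context for what such an inference engine computes, here is a minimal SciPy sketch of the per-layer sparse DNN inference step, y = ReLU(W y + b) with a sparse W. The layer shapes, layer count, and bias value are illustrative assumptions, not the configuration evaluated on the FPGA.

```python
# Illustrative per-layer computation of a sparse DNN inference engine
# (software reference only, not HLS): each layer applies a sparse weight
# matrix, adds a bias, and clips negative values with ReLU.
import numpy as np
import scipy.sparse as sp

layers = [sp.random(1024, 1024, density=0.05, format="csr", dtype=np.float32)
          for _ in range(4)]
bias = np.float32(-0.3)

def sparse_dnn_infer(x):
    # x holds one column per input sample (features x batch).
    y = x
    for W in layers:
        y = np.maximum(W @ y + bias, 0.0)   # sparse matmul, bias, ReLU
    return y

batch = np.random.rand(1024, 32).astype(np.float32)
out = sparse_dnn_infer(batch)
```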
Tue 3:45 p.m. - 4:00 p.m. | Coffee break
Tue 4:00 p.m. - 4:45 p.m. | 2:4 Sparsity on GPU Tensor Cores (Session)
Recent NVIDIA GPUs have introduced support for 2:4 sparsity in their Tensor Cores to better support sparsified deep neural network models. In this session, we will first explain what the 2:4 sparsity pattern is and why it is a good choice for both performance and accuracy (regular vs. irregular/unstructured sparsity, fine-grained vs. coarse-grained). We will then explain how the speedup is achieved in hardware, along with performance numbers, followed by details of the associated training process and accuracy numbers. We will discuss new techniques that search for permutations of model parameters to improve the efficiency of the hardware execution. The session will end with practical ways and best practices to tap into 2:4 sparsity in deep learning frameworks; a small sketch of the pattern follows this entry. See Jeff's bio here. See Rakesh's bio here.
Rakesh Nagi · Jeff Pool
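As a concrete illustration of the pattern, here is a minimal NumPy sketch that enforces 2:4 sparsity on a weight matrix by keeping the two largest-magnitude values in every contiguous group of four along a row. It shows the pattern only; the training recipe and the Tensor Core execution covered in the session are not modeled here.

```python
# Illustrative 2:4 pruning: in each group of four consecutive weights along a
# row, zero the two smallest-magnitude entries so at most two nonzeros remain.
import numpy as np

def prune_2_4(W):
    rows, cols = W.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    groups = W.reshape(rows, cols // 4, 4)
    drop = np.argsort(np.abs(groups), axis=-1)[..., :2]   # two smallest per group
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=-1)
    return pruned.reshape(rows, cols)

W = np.random.randn(4, 8).astype(np.float32)
W_24 = prune_2_4(W)
# Each group of four now has at most two nonzeros (50% fine-grained sparsity).
assert ((W_24.reshape(4, -1, 4) != 0).sum(axis=-1) <= 2).all()
```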
Tue 4:45 p.m. - 5:00 p.m. | Future work and closing remarks (Conclusion)
Model parallelism is a technique used to address the ever-increasing demand for compute and memory capacity in deep learning training and inference; it allows models to scale to hundreds or thousands of GPUs. The communication pattern often depends on the model and can result in sparse, irregular accesses to neighboring GPUs, especially when computing sparse layers or graph operations. When frequent communication is required, it can dominate execution time. In this session, we discuss future work on optimizing sparse communication for massive sparse matrices. We target the communication architecture of multi-GPU nodes, where GPUs in the same node are connected with a high-bandwidth interconnect. We will present a proof of concept that reduces the communication bottleneck of sparse scatter and gather operations by over 60% on OLCF's Summit supercomputer. Finally, we will conclude this tutorial with remarks on open problems and the future outlook. A small sketch of the sparse-gather idea follows this entry. See Vikram's bio here.
Vikram Sharma Mailthody
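To illustrate why sparse gather helps, below is a minimal NumPy/SciPy sketch (no actual GPU communication) that compares the volume of a dense all-gather with gathering only the input rows a rank actually needs, as determined by the nonzero columns of its local sparse partition. The partition sizes, feature width, and density are illustrative assumptions.

```python
# Illustrative comparison of dense vs. sparse gather volume for a partitioned
# sparse layer: a rank only needs the remote input rows indexed by the nonzero
# columns of its local partition A_local.
import numpy as np
import scipy.sparse as sp

n_local_rows, n_global_rows, feat = 512, 8192, 64
A_local = sp.random(n_local_rows, n_global_rows, density=0.001,
                    format="csr", dtype=np.float32)

needed = np.unique(A_local.indices)          # input rows this rank touches

dense_bytes  = n_global_rows * feat * 4      # all-gather every input row
sparse_bytes = needed.size * feat * 4        # gather only the rows we use
print(f"dense gather: {dense_bytes / 2**20:.2f} MiB, "
      f"sparse gather: {sparse_bytes / 2**20:.2f} MiB "
      f"({100 * sparse_bytes / dense_bytes:.0f}% of dense)")
```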