Tutorial: Sparsity in ML: Understanding and Optimizing Sparsity in Neural Networks Running on Heterogeneous Systems
Tiled SpMM and its performance model on GPUs
Sparse matrix-dense matrix multiplication (SpMM) is a common operation in ML computations. In this session, we will give a tutorial on our loop reordering and tiling strategies for optimizing SpMM on GPUs. We will present extensive benchmark results on A100 GPUs showing that the proposed Tiled SpMM mechanism outperforms previous approaches and reaches the theoretical peak performance determined by the sparsity pattern and the underlying architecture. We will then explain how a high-fidelity performance model based on memory bandwidth can be used to understand the measured performance of sparse-matrix tiling strategies and to identify additional optimizations such as load balancing and row/column permutation of the sparse matrix. For demonstration, we will use sparse deep neural network (DNN) inference with the MIT/Amazon/IEEE Graph Challenge benchmark networks as a running example throughout this session.
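To make the two ideas above concrete, here is a minimal CPU-side sketch of column-tiled SpMM over a CSR matrix, together with a simple bandwidth-roofline time estimate. This is an illustration of the general technique, not the tutorial's actual GPU kernel or its exact traffic model; `tile_cols`, the per-nonzero byte counts, and the helper names are assumptions made for this sketch. The 1555 GB/s default is the published HBM2e bandwidth of a 40 GB A100.

```python
import numpy as np

def dense_to_csr(A):
    """Convert a dense matrix to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.nonzero(row)[0]
        indices.extend(nz.tolist())
        data.extend(row[nz].tolist())
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def tiled_spmm(indptr, indices, data, m, B, tile_cols=32):
    """Compute C = A @ B, tiling over columns of the dense matrix B.

    Streaming one column tile of B at a time mimics the shared-memory
    blocking a GPU kernel would use; tile_cols is a hypothetical tuning
    parameter, not a value from the tutorial.
    """
    n = B.shape[1]
    C = np.zeros((m, n))
    for j0 in range(0, n, tile_cols):
        B_tile = B[:, j0:j0 + tile_cols]
        for i in range(m):  # on a GPU, rows map to thread blocks/warps
            row = slice(indptr[i], indptr[i + 1])
            C[i, j0:j0 + tile_cols] = data[row] @ B_tile[indices[row], :]
    return C

def memory_bound_time(nnz, m, n, bandwidth_gbs=1555.0):
    """Bandwidth-roofline estimate in seconds: bytes moved / peak bandwidth.

    Simplified traffic model (an assumption, not the tutorial's): each
    nonzero of A costs an 8-byte value, a 4-byte column index, and one
    8-byte read from B; the output C is written once.
    """
    bytes_moved = nnz * (8 + 4 + 8) + m * n * 8
    return bytes_moved / (bandwidth_gbs * 1e9)
```

Comparing measured SpMM time against `memory_bound_time` shows how close a kernel is to the memory-bandwidth roofline; a large gap suggests optimizations such as load balancing or row/column permutation.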
See Mert's bio here.
Graph Challenge Codebase: here.
Graph Challenge Publication: here.
MLSys Presentation Slides: here.