Many ML models are either sparsified (such as deep neural networks) to reduce memory footprint and FLOPs, or inherently sparse due to their unstructured nature (such as graph neural networks). Nevertheless, although sparsity is desirable in theory, it often hampers performance in practice because existing heterogeneous systems (such as GPUs and FPGAs) handle irregular computation poorly. For example, because GPU architectures are optimized for regular, dense computation, only a small fraction of a GPU's theoretical peak performance is realized in sparse computation. In this tutorial, we discuss the sources of sparsity in deep neural networks as well as key techniques for mapping sparse computation onto heterogeneous systems to support high-performance inference and training. We conclude the tutorial with a discussion of future work on model parallelism for optimizing sparse communication in large-scale sparse ML models.
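To make the irregularity concrete, the sketch below (not part of the tutorial materials; the matrix sizes and density are illustrative assumptions) writes out a sparse-matrix-dense-matrix product (SpMM) over a CSR matrix in Python. The per-row loop bounds and scattered column indices are the kind of data-dependent access pattern that dense-optimized GPU pipelines handle poorly.

```python
# Minimal sketch: CSR SpMM written out explicitly to show the irregular,
# row-dependent memory accesses behind sparse kernels. Sizes and density
# are illustrative, not taken from the tutorial.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(1024, 1024, density=0.01, format="csr", random_state=rng)  # ~1% nonzeros
B = rng.standard_normal((1024, 64))

# Library SpMM: only the stored nonzeros are multiplied.
C = A @ B

# The same computation over the raw CSR arrays: the inner-loop length
# (row_end - row_start) and the column indices differ per row, so parallel
# threads get unequal work and gather rows of B from scattered addresses.
C_ref = np.zeros_like(C)
for i in range(A.shape[0]):
    row_start, row_end = A.indptr[i], A.indptr[i + 1]
    for k in range(row_start, row_end):
        C_ref[i] += A.data[k] * B[A.indices[k]]

assert np.allclose(C, C_ref)
```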
Schedule
Tue 1:00 p.m. - 1:50 p.m.  |  Opening remarks and overview of sparsity in ML (Introduction)  |  Wen-Mei Hwu · Jinjun Xiong
Tue 1:50 p.m. - 2:45 p.m.  |  Tiled SpMM and its performance model on GPUs (Session)  |  Mert Hidayetoglu
Tue 2:45 p.m. - 3:45 p.m.  |  Sparse deep neural network inference on FPGAs (Session)  |  Sitao Huang
Tue 3:45 p.m. - 4:00 p.m.  |  Coffee break
Tue 4:00 p.m. - 4:45 p.m.  |  2:4 Sparsity on GPU Tensor Cores (Session)  |  Rakesh Nagi · Jeff Pool
Tue 4:45 p.m. - 5:00 p.m.  |  Future work and closing remarks (Conclusion)  |  Vikram Sharma Mailthody