Skip to yearly menu bar Skip to main content


Efficient GPU Kernels for N:M-Sparse Weights in Deep Learning

Bin Lin · Ningxin Zheng · Lei Wang · Shijie Cao · Lingxiao Ma · Quanlu Zhang · Yi Zhu · Ting Cao · Jilong Xue · Yuqing Yang · Fan Yang

Ballroom B - Position 19


N:M sparsity is becoming increasingly popular due to its promise to achieve both high model accuracy and computational efficiency for deep learning. However, the real-world benefit of N:M sparsity is limited as there is a lack of dedicated GPU kernel implementations for general N:M sparsity with various sparsity ratios. In this work, we present nmSPARSE, a library of efficient GPU kernels for two fundamental operations in neural networks with N:M sparse weights: sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). By leveraging the intrinsic balance characteristic of N:M sparsity, nmSPARSE kernels rearrange irregular computation and scattered memory accesses in sparse matrix multiplication into hardware-aligned regular computation and conflict-free memory accesses at runtime. Evaluated on NVIDIA A100 GPU, nmSPARSE kernels achieve up to 5.2× speedup on SpMV and 6.0× speedup on SpMM over the fastest baseline. End-to-end studies on transformer models demonstrate that using nmSPARSE outperforms other baselines.

Chat is not available.