Attribution-based Sparse Activation in Large Language Models
Abstract
LLM inference is computationally expensive due to LLMs' large parameter counts. Existing techniques reduce the computing cost via model retraining, but cannot readily adapt to different downstream tasks or varying input data at runtime. To avoid such retraining efforts while retaining runtime adaptability, a better option is \emph{sparse activation}, which selectively deactivates an input-dependent set of neurons during inference. However, current methods of \emph{lossless} sparse activation deactivate only neurons with zero output magnitudes, and are ineffective on recent LLMs with higher parameter efficiency. In this paper, we present attribution-based sparse activation, a \emph{lossy} sparse activation technique that deactivates neurons with low attribution scores and aims to achieve the best tradeoff between model accuracy and computing cost. To ensure optimal sparse activation, we quantify the large errors that existing attribution metrics incur when used for sparse activation, which arise from the interdependency among attribution scores of different neurons, and propose a new attribution metric that provably corrects such errors. Experiments show that our technique achieves 70\% model sparsity on difficult generative tasks such as question answering and text summarization with less than 5\% loss in model accuracy. Such high model sparsity enables us to reduce the computing latency and memory usage of LLM inference by 35\% and 40\%, respectively.
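To make the idea concrete, the following is a minimal sketch of attribution-based sparse activation for a single feed-forward layer. It is illustrative only and does not reproduce the paper's corrected attribution metric: the score used here (hidden activation magnitude weighted by the output-weight norm, a simple first-order attribution proxy) and all function names are our own assumptions.

```python
import numpy as np

def attribution_scores(x, W_in, W_out):
    # Hypothetical first-order attribution proxy: a hidden neuron's score is
    # its activation magnitude times the norm of its outgoing weight row,
    # approximating its contribution to the layer output.
    h = np.maximum(x @ W_in, 0.0)            # ReLU hidden activations
    return np.abs(h) * np.linalg.norm(W_out, axis=1)

def sparse_forward(x, W_in, W_out, sparsity=0.7):
    # Deactivate the lowest-attribution neurons for this input, then compute
    # only the surviving neurons (this is where the cost saving comes from).
    scores = attribution_scores(x, W_in, W_out)
    k = int(round(len(scores) * (1.0 - sparsity)))  # neurons kept active
    keep = np.argsort(scores)[-k:]                  # highest-attribution set
    h = np.maximum(x @ W_in[:, keep], 0.0)          # partial hidden layer
    return h @ W_out[keep]
```

Because the kept set depends on the input `x`, the sparsity pattern adapts per token without any retraining; a lossless method would instead keep every neuron with nonzero activation.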