
Poster

Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

Marco Federici · Davide Belli · Mart van Baalen · Amir Jalalirad · Andrii Skliar · Bence Major · Markus Nagel · Paul Whatmough


Abstract:

While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. To remedy the situation, previous work has proposed to leverage the natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU. In this paper, we show that SwiGLU has little to no inherent sparsity, and that while SwiGLU activations can be pruned based on magnitude, the resulting sparsity pattern is very difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparse DRAM cache approach which preserves accuracy with minimal finetuning. By adding a lightweight LoRA adapter, we can recover part of the accuracy lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which couples the sparsity mask to the cache state to further increase the cache hit rate and the LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory, and throughput trade-offs across hardware settings. On Phi-3-Medium, DIP achieves a 46% reduction in memory and a 40% increase in throughput with <0.1 loss in perplexity.
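To make the idea concrete, below is a minimal sketch of magnitude-based dynamic input pruning on a SwiGLU FFN with a cache-aware mask, based only on the abstract's high-level description rather than the paper's exact algorithm. All names (swiglu_dip, keep_ratio, cached_cols, cache_bonus) and the specific scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: magnitude-based dynamic input pruning (DIP-style) with a
# cache-aware mask on a SwiGLU FFN. Function and parameter names are
# hypothetical; the paper's method may differ in its selection rule.
import torch
import torch.nn.functional as F


def swiglu_dip(x, w_gate, w_up, w_down, keep_ratio=0.25,
               cached_cols=None, cache_bonus=0.0):
    """Apply a SwiGLU FFN while keeping only the highest-magnitude
    intermediate columns, optionally biasing selection toward columns
    whose down-projection rows are already resident in the DRAM cache."""
    # Dense gate/up projections produce the intermediate activation.
    h = F.silu(x @ w_gate) * (x @ w_up)          # shape: (batch, d_ff)

    # Score columns by activation magnitude; adding a bonus for cached
    # columns makes the sparsity mask depend on cache state
    # (cache-aware masking), which raises the cache hit rate.
    scores = h.abs()
    if cached_cols is not None:
        scores = scores + cache_bonus * cached_cols  # broadcast over batch

    k = max(1, int(keep_ratio * h.shape[-1]))
    topk = scores.topk(k, dim=-1).indices
    mask = torch.zeros_like(h).scatter_(-1, topk, 1.0)

    # Only the selected columns contribute; in a real deployment the
    # unselected rows of w_down would never need to be fetched from flash.
    return (h * mask) @ w_down


# Toy usage: d_model=8, d_ff=32, keep 25% of intermediate neurons.
torch.manual_seed(0)
x = torch.randn(2, 8)
w_gate, w_up, w_down = torch.randn(8, 32), torch.randn(8, 32), torch.randn(32, 8)
cached = (torch.rand(32) < 0.5).float()   # pretend half the columns are cached
y = swiglu_dip(x, w_gate, w_up, w_down, keep_ratio=0.25,
               cached_cols=cached, cache_bonus=0.1)
print(y.shape)  # torch.Size([2, 8])
```

In this sketch the bandwidth saving comes from skipping the down-projection rows of unselected columns; the cache bonus trades a small amount of mask fidelity for a higher chance that the selected rows are already cached, which is the dependence between cache state and sparsity mask that the abstract describes.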