Large Language Models 1

Mission B4 & B14
Tue 14 May 1:30 p.m. PDT — 3 p.m. PDT


Tue 14 May 13:30 - 13:50 PDT

Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

Zhenyu Zhang · Shiwei Liu · Runjin Chen · Bhavya Kailkhura · Beidi Chen · Atlas Wang

This paper focuses on addressing the substantial memory footprints and bandwidth costs associated with the deployment of Large Language Models (LLMs). LLMs, characterized by their extensive context length (e.g., $\geq$4096), inherently demand vast memory resources and traffic to store and load the attention key and value embeddings within self-attention modules, referred to as the KV cache. In an effort to alleviate these resource-intensive aspects of LLM inference, techniques such as sparsification and quantization for KV cache reduction have been investigated as separate endeavors within the realm of LLMs. However, this paper illuminates the critical importance of considering the compound effects of these techniques when employed together, as a simplistic amalgamation of sparsification and quantization can yield sub-optimal performance. For instance, the "Heavy Hitter Oracle" has demonstrated that preserving just 20\% of the KV cache attributed to pivotal tokens, denoted as "Heavy Hitters", can yield substantial memory savings while upholding the model's original performance. Furthermore, the KV cache of these "Heavy Hitter" tokens, which are identified as those with the highest accumulated attention scores, can be further quantized with encouraging throughput savings. Nevertheless, our investigation uncovers two primary deficiencies in such unrefined post-sparsification quantization in low-bit scenarios: (1) the application of low-bit KV cache quantization, specifically $\leq$ 4-bit, significantly diminishes the accuracy of Heavy Hitter selection during the generation phase, particularly in deeper layers; (2) tokens selected by the "Heavy Hitter Oracle" are not necessarily well-suited for quantization, and their quantization can lead to sub-optimal performance. To surmount these challenges, we propose a novel rule-of-thumb for token selection during LLM generation, termed Q-Hitter.
This approach combines both accumulated attention scores and "Quantization Friendliness" metrics for different layers, identifying tokens that are not only pivotal for preserving the generalization capabilities of LLMs but are also more amenable to KV cache quantization. Q-Hitter naturally offers a free lunch of KV cache quantization and can further escalate the affordability of state-of-the-art LLMs. Additionally, Q-Hitter empowers LLMs to effectively handle inputs of infinite sequence length. Extensive experiments conducted across various LLMs and tasks substantiate the superiority of the proposed Q-Hitter framework over the original $\mathsf{H_2O}$ framework. Remarkably, Q-Hitter achieves full model quality preservation while delivering up to a 20$\times$ reduction in memory usage and up to 33$\times$, 33$\times$, 4$\times$ and 1.3$\times$ throughput improvements compared with Hugging Face Accelerate, DeepSpeed, FlexGen and $\mathsf{H_2O}$, respectively. The code will be public upon acceptance.
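
The selection idea in this abstract can be illustrated with a small sketch. The abstract does not define the "Quantization Friendliness" metric or how it is combined with attention scores, so everything below is an assumption made for illustration: friendliness is approximated by the error of a uniform min-max quantizer, the two signals are blended with a hypothetical weight `alpha`, and the helper names (`quant_error`, `select_tokens`) are invented here. Setting `alpha=1.0` reduces the rule to plain heavy-hitter selection by accumulated attention.

```python
def quantize_dequantize(vec, bits):
    """Uniform min-max quantization of a list of floats, followed by dequantization."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in vec]


def quant_error(vec, bits):
    """Mean squared error introduced by quantizing vec to the given bit width."""
    deq = quantize_dequantize(vec, bits)
    return sum((a - b) ** 2 for a, b in zip(vec, deq)) / len(vec)


def select_tokens(acc_attention, kv_vectors, keep_ratio=0.2, bits=4, alpha=0.5):
    """Keep the top keep_ratio of tokens under a blended score.

    ASSUMPTION: the blend of normalized attention and (1 - normalized
    quantization error) is a stand-in for the paper's undisclosed metric.
    """
    n = len(acc_attention)
    errors = [quant_error(v, bits) for v in kv_vectors]
    max_att = max(acc_attention) or 1.0
    max_err = max(errors) or 1.0
    scores = [
        alpha * (a / max_att) + (1 - alpha) * (1 - e / max_err)
        for a, e in zip(acc_attention, errors)
    ]
    k = max(1, int(n * keep_ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


# Toy data: accumulated attention per token and one cached vector per token.
acc = [0.1, 0.9, 0.3, 0.05, 0.8, 0.2, 0.15, 0.25, 0.12, 0.07]
kv = [[0.0, 0.3 * i, 0.7, 1.0 + i] for i in range(10)]
kept = select_tokens(acc, kv, keep_ratio=0.2, alpha=1.0)
print(kept)
```

With `keep_ratio=0.2`, two of the ten toy tokens are retained, mirroring the 20\% budget quoted above.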

Tue 14 May 13:50 - 14:10 PDT

Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems

Yunhao Yang · Neel P. Bhatt · Tyler Ingebrand · William Ward · Steven Carr · Atlas Wang · Ufuk Topcu

Although pre-trained language models encode generic knowledge beneficial for planning and control, they may fail to generate appropriate control policies for domain-specific tasks. Existing fine-tuning methods use human feedback to address this limitation; however, sourcing human feedback is labor-intensive and costly. We present a fully automated approach to fine-tune pre-trained language models for applications in autonomous systems, bridging the gap between generic knowledge and domain-specific requirements while reducing cost. The method synthesizes automaton-based controllers from pre-trained models guided by natural language task descriptions. These controllers are verifiable against independently provided specifications within a world model, which can be abstract or obtained from a high-fidelity simulator. Controllers with high compliance with the desired specifications receive higher ranks, guiding the iterative fine-tuning process. We provide quantitative evidence, primarily in autonomous driving, to demonstrate the method's effectiveness across multiple tasks. The results indicate an improvement in the percentage of specifications satisfied by the controller from 60\% to 90\%.
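
The ranking step described in this abstract can be sketched roughly as follows. This is not the authors' pipeline: formal verification against a world model is replaced here by simple predicate checks over a single recorded trace, and the names `compliance` and `rank_and_pair`, the traffic-light traces, and the two specifications are all hypothetical. The sketch only shows how a compliance score could turn into a ranking and into preferred/rejected pairs for a fine-tuning step.

```python
from itertools import combinations


def compliance(trace, specs):
    """Fraction of specifications (predicates over a trace) that the trace satisfies."""
    return sum(1 for spec in specs if spec(trace)) / len(specs)


def rank_and_pair(candidates, specs):
    """Rank candidate controllers by compliance and emit (preferred, rejected) pairs."""
    scored = sorted(
        ((compliance(trace, specs), name) for name, trace in candidates.items()),
        reverse=True,
    )
    ranking = [name for _, name in scored]
    pairs = [
        (better_name, worse_name)
        for (better, better_name), (worse, worse_name) in combinations(scored, 2)
        if better > worse  # ties give no preference signal
    ]
    return ranking, pairs


# Hypothetical specifications over traces of (light, action) steps.
def never_go_on_red(trace):
    return all(not (light == "red" and action == "go") for light, action in trace)


def eventually_go(trace):
    return any(action == "go" for _, action in trace)


specs = [never_go_on_red, eventually_go]
candidates = {
    "A": [("red", "stop"), ("green", "go")],    # satisfies both specifications
    "B": [("red", "go"), ("green", "go")],      # runs the red light
    "C": [("red", "stop"), ("green", "stop")],  # never proceeds
}
ranking, pairs = rank_and_pair(candidates, specs)
print(ranking, pairs)
```

Controllers that tie on compliance produce no pair, so only strict improvements contribute a preference signal in this simplified setup.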

Tue 14 May 14:20 - 14:40 PDT

Punica: Multi-Tenant LoRA Serving

Lequn Chen · Zihao Ye · Yongji Wu · Danyang Zhuo · Luis Ceze · Arvind Krishnamurthy

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token.
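
The computation that such batching targets can be written down in a few lines. The sketch below is a plain-Python stand-in, not Punica's CUDA kernel: it assumes the standard LoRA form, where each request's output is the shared base projection plus a low-rank update from that request's own adapter, and the names `matvec` and `batched_lora_forward` are illustrative. The point is that a single copy of the base weight `W` serves every request in the batch, while only the small `(A, B)` pair differs per request.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def batched_lora_forward(batch, adapter_ids, W, adapters):
    """Compute x @ W + (x @ A) @ B per request, sharing one base weight W.

    adapters maps an adapter id to a pair (A, B) with A of shape d_in x r
    and B of shape r x d_out; requests in one batch may use different ids.
    """
    outputs = []
    for x, adapter_id in zip(batch, adapter_ids):
        base = matvec(x, W)                  # shared pre-trained weight
        A, B = adapters[adapter_id]          # per-request low-rank adapter
        delta = matvec(matvec(x, A), B)      # rank-r update
        outputs.append([b + d for b, d in zip(base, delta)])
    return outputs


W = [[1, 0], [0, 1]]                          # toy base weight (identity)
adapters = {
    "tenant_a": ([[1], [0]], [[0, 1]]),       # rank-1 adapter
    "tenant_b": ([[0], [1]], [[1, 0]]),       # a different rank-1 adapter
}
batch = [[1, 2], [3, 4]]
out = batched_lora_forward(batch, ["tenant_a", "tenant_b"], W, adapters)
print(out)
```

A GPU implementation would replace the Python loop with a kernel that gathers each request's adapter weights, but the arithmetic per request is the same.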

Tue 14 May 14:40 - 15:00 PDT

SLoRA: Scalable Serving of Thousands of LoRA Adapters

Ying Sheng · Shiyi Cao · Dacheng Li · Coleman Hooper · Nicholas Lee · Shuo Yang · Christopher Chou · Banghua Zhu · Lianmin Zheng · Kurt Keutzer · Joseph Gonzalez · Ion Stoica

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present SLoRA, a system designed for the scalable serving of many LoRA adapters. SLoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, SLoRA proposes a unified memory pool. This memory pool uses a unified paging mechanism to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, SLoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for batched LoRA computation. Collectively, these features enable SLoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), SLoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, SLoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services.
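
The unified paging idea can be sketched as a single free list of fixed-size pages shared by both kinds of tensors. The code below is a simplified illustration, not the system's allocator: the class name `UnifiedPagePool` and its methods are hypothetical, and it assumes one page holds one hidden-size vector, so that a rank-r adapter occupies r pages and a sequence of s cached tokens occupies s pages. Because pages need not be contiguous, memory freed by an evicted adapter can be reused directly for KV cache, which is how a shared pool limits fragmentation.

```python
class UnifiedPagePool:
    """Toy page allocator shared by adapter weights and KV cache entries."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owned = {}  # tag -> list of page indices held by that tensor

    def alloc(self, tag, num_pages):
        """Reserve num_pages (not necessarily contiguous) for the given tag."""
        if num_pages > len(self.free_pages):
            raise MemoryError(f"need {num_pages} pages, only {len(self.free_pages)} free")
        pages = [self.free_pages.pop() for _ in range(num_pages)]
        self.owned.setdefault(tag, []).extend(pages)
        return pages

    def release(self, tag):
        """Return every page held by tag to the shared free list."""
        self.free_pages.extend(self.owned.pop(tag, []))


pool = UnifiedPagePool(num_pages=16)
pool.alloc(("adapter", "a1"), 8)   # ASSUMPTION: rank-8 adapter -> 8 pages
pool.alloc(("kv", "req1"), 5)      # ASSUMPTION: 5 cached tokens -> 5 pages
pool.release(("adapter", "a1"))    # evict the adapter when no query uses it
pool.alloc(("kv", "req2"), 10)     # freed adapter pages now back the KV cache
print(len(pool.free_pages))
```

With separate pools for adapters and KV cache, the last allocation could fail even though enough total memory is free; the shared free list avoids that case in this toy model.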