

Timezone: US/Pacific

Registration Desk: Registration Check-in Desk Tue 14 May 07:00 a.m.  


Opening Remarks Tue 14 May 08:45 a.m.  


Poster: Quantization and Compression 1 Tue 14 May 09:00 a.m.  

Poster
Ji Lin · Jiaming Tang · Haotian Tang · Shang Yang · Wei-Ming Chen · Wei-Chen Wang · Guangxuan Xiao · Xingyu Dang · Chuang Gan · Song Han


Abstract

Large language models (LLMs) have shown excellent performance on various tasks, but the astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability well across different domains and modalities, without overfitting to the calibration set. AWQ outperforms existing work on various language modeling and domain-specific benchmarks. Thanks to better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than 3x speedup over the Huggingface FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B LLaMA-2 model on mobile GPUs.
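
The abstract describes searching for per-channel scales from activation statistics. Below is a minimal, illustrative PyTorch sketch of that idea, not the authors' released implementation: the grid over alpha, the group size of 128, and the simulated (fake) quantization are all assumptions, and a real deployment would fold the inverse scale into a preceding operator rather than dividing activations at runtime.

    import torch

    def pseudo_quantize(w, n_bits=4, group_size=128):
        # Simulated per-group weight quantization (quantize, then dequantize).
        # Assumes the number of weight elements is divisible by group_size.
        shape = w.shape
        w = w.reshape(-1, group_size)
        w_max, w_min = w.amax(dim=1, keepdim=True), w.amin(dim=1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-5) / (2 ** n_bits - 1)
        zero = torch.round(-w_min / scale)
        w_int = torch.clamp(torch.round(w / scale) + zero, 0, 2 ** n_bits - 1)
        return ((w_int - zero) * scale).reshape(shape)

    def search_awq_scales(w, x_calib, n_grid=20):
        # Grid-search a per-input-channel scale s = act_mag ** alpha that minimizes
        # the output error of the quantized layer on calibration activations (no backprop).
        act_mag = x_calib.abs().mean(dim=0)           # per-channel activation magnitude
        y_ref = x_calib @ w.t()                       # full-precision reference output
        best_err, best_s = float("inf"), None
        for alpha in torch.linspace(0, 1, n_grid):
            s = act_mag.clamp(min=1e-5) ** alpha
            s = s / (s.max() * s.min()).sqrt()        # keep scales centered around 1
            w_q = pseudo_quantize(w * s)              # salient channels scaled up before quantization
            err = ((x_calib / s) @ w_q.t() - y_ref).pow(2).mean().item()
            if err < best_err:
                best_err, best_s = err, s
        return best_s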

Poster
Elias Frantar · Dan Alistarh


Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The anonymized code is …
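
The headline memory figures can be sanity-checked with a quick back-of-the-envelope calculation (assuming 16-bit weights for the uncompressed baseline):

    params = 1.6e12                                  # SwitchTransformer-c2048 parameters
    print(params * 2 / 1e12, "TB uncompressed")      # 16-bit weights -> ~3.2 TB
    print(params * 0.8 / 8 / 1e9, "GB compressed")   # 0.8 bits/parameter -> ~160 GB, ~20x smaller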

Poster
Yilong Zhao · Chien-Yu Lin · Kan Zhu · Zihao Ye · Lequn Chen · Size Zheng · Luis Ceze · Arvind Krishnamurthy · Tianqi Chen · Baris Kasikci


Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to the FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.
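
As a rough illustration of the fine-grained (group-wise) low-bit quantization the abstract refers to, here is a sketch of symmetric per-group INT4 quantization. The group size, the simulated dequantization, and the example shapes are assumptions; Atom's mixed-precision outlier handling and fused GPU kernels are not shown.

    import torch

    def quantize_int4_groupwise(t, group_size=128):
        # Symmetric per-group INT4 quantization: each group of 128 values shares one scale.
        # Assumes the tensor's element count is divisible by group_size.
        g = t.reshape(-1, group_size)
        scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7   # INT4 range [-8, 7]
        q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
        return q, scale

    def dequantize_int4_groupwise(q, scale, shape):
        return (q.float() * scale).reshape(shape)

    # Usage: quantize both weights and activations so the matmul could run on INT4 units.
    w = torch.randn(4096, 4096)
    q, s = quantize_int4_groupwise(w)
    w_hat = dequantize_int4_groupwise(q, s, w.shape)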


Invited Talk: Yejin Choi

Possible Impossibilities and Impossible Possibilities

Yejin Choi

 

Yejin Choi is Wissner-Slivka Professor and a MacArthur Fellow at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. She is also a senior director at AI2 overseeing the project Mosaic and a Distinguished Research Fellow at the Institute for Ethics in AI at the University of Oxford. Her research investigates if (and how) AI systems can learn commonsense knowledge and reasoning, if machines can (and should) learn moral reasoning, and various other problems in NLP, AI, and Vision including neuro-symbolic integration, language grounding with vision and interactions, and AI for social good. She is a co-recipient of 2 Test of Time Awards (at ACL 2021 and ICCV 2021), 8 Best/Outstanding Paper Awards (at ACL 2023, EMNLP 2023, NAACL 2022, ICML 2022, NeurIPS 2021, AAAI 2019, and ICCV 2013), the Borg Early Career Award (BECA) in 2018, the inaugural Alexa Prize Challenge in 2017, and IEEE AI's 10 to Watch in 2016.



Poster: Large Language Models 1 Tue 14 May 01:30 p.m.  

Poster
Zhenyu Zhang · Shiwei Liu · Runjin Chen · Bhavya Kailkhura · Beidi Chen · Atlas Wang


Abstract
This paper focuses on addressing the substantial memory footprints and bandwidth costs associated with the deployment of Large Language Models (LLMs). LLMs, characterized by their extensive context length (e.g., $\geq$4096), inherently demand vast memory resources and traffic to store and load the attention key and value embeddings within self-attention modules, referred to as the KV cache. In an effort to alleviate these resource-intensive aspects of LLM inference, techniques such as sparsification and quantization for KV cache reduction have been investigated as separate endeavors within the realm of LLMs. However, this paper illuminates the critical importance of considering the compound effects of these techniques when employed together, as a simplistic amalgamation of sparsification and quantization can yield sub-optimal performance. For instance, the "Heavy Hitter Oracle" has demonstrated that preserving just 20\% of the KV cache attributed to pivotal tokens, denoted as "Heavy Hitters", can yield substantial memory savings while upholding the model's original performance. Furthermore, the KV cache of these "Heavy Hitter" tokens, which are identified as those with the highest accumulated attention scores, can be further quantized with encouraging throughput savings. Nevertheless, our investigation uncovers two primary deficiencies in such unrefined post-sparsification quantization in low-bit scenarios: (1) the application of low-bit KV …
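
To make the setup concrete, the sketch below shows the naive combination the abstract cautions against: keep only the "Heavy Hitter" tokens with the highest accumulated attention scores, then uniformly quantize their KV entries to a few bits. The shapes, the 20% keep fraction, and the per-token quantization granularity are illustrative assumptions, not the paper's method.

    import torch

    def compress_kv(keys, values, attn_scores, keep_frac=0.2, n_bits=4):
        # keys/values: (seq, d); attn_scores: (queries, seq) accumulated over decoding steps.
        acc = attn_scores.sum(dim=0)                     # accumulated attention per token
        k = max(1, int(keep_frac * keys.shape[0]))
        keep = acc.topk(k).indices.sort().values         # heavy-hitter token positions
        k_kept, v_kept = keys[keep], values[keep]

        def quant(t):
            # Symmetric per-token low-bit quantization of the kept KV entries.
            scale = t.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / (2 ** (n_bits - 1) - 1)
            q = torch.clamp(torch.round(t / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
            return q.to(torch.int8), scale

        return quant(k_kept), quant(v_kept), keep
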
Poster
Yunhao Yang · Neel P. Bhatt · Tyler Ingebrand · William Ward · Steven Carr · Atlas Wang · Ufuk Topcu


Abstract

Although pre-trained language models encode generic knowledge beneficial for planning and control, they may fail to generate appropriate control policies for domain-specific tasks. Existing fine-tuning methods use human feedback to address this limitation; however, sourcing human feedback is labor-intensive and costly. We present a fully automated approach to fine-tune pre-trained language models for applications in autonomous systems, bridging the gap between generic knowledge and domain-specific requirements while reducing cost. The method synthesizes automaton-based controllers from pre-trained models guided by natural language task descriptions. These controllers are verifiable against independently provided specifications within a world model, which can be abstract or obtained from a high-fidelity simulator. Controllers with high compliance with the desired specifications receive higher ranks, guiding the iterative fine-tuning process. We provide quantitative evidence, primarily in autonomous driving, to demonstrate the method's effectiveness across multiple tasks. The results indicate an improvement in the percentage of specifications satisfied by the controller from 60\% to 90\%.

Poster
Lequn Chen · Zihao Ye · Yongji Wu · Danyang Zhuo · Luis Ceze · Arvind Krishnamurthy


Abstract

Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token.
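
The computation being batched is sketched below with plain einsums: every request shares the single resident copy of the base weight, and only its small LoRA factors are gathered per request. Punica's actual contribution is a custom CUDA kernel (and a scheduler) for this gather-and-multiply pattern; the tensor names and shapes here are illustrative, and the LoRA scaling factor is omitted.

    import torch

    def batched_lora_forward(x, w_base, lora_A, lora_B, adapter_ids):
        # x: (batch, d_in); w_base: (d_out, d_in), held once in GPU memory;
        # lora_A: (n_adapters, r, d_in); lora_B: (n_adapters, d_out, r);
        # adapter_ids: (batch,) index of the LoRA model each request uses.
        y = x @ w_base.t()                           # dense base matmul, shared by all requests
        a = lora_A[adapter_ids]                      # (batch, r, d_in), gathered per request
        b = lora_B[adapter_ids]                      # (batch, d_out, r)
        delta = torch.einsum("bri,bi->br", a, x)     # A_i @ x for each request
        y += torch.einsum("bor,br->bo", b, delta)    # + B_i @ (A_i @ x)
        return y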

Poster
Ying Sheng · Shiyi Cao · Dacheng Li · Coleman Hooper · Nicholas Lee · Shuo Yang · Christopher Chou · Banghua Zhu · Lianmin Zheng · Kurt Keutzer · Joseph Gonzalez · Ion Stoica


Abstract

The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present SLoRA, a system designed for the scalable serving of many LoRA adapters. SLoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, SLoRA proposes a unified memory pool. This memory pool uses a unified paging mechanism to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths.Additionally, SLoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for batched LoRA computation. Collectively, these features enable SLoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), SLoRA can improve the throughput …


Poster: Parallel and Distributed 1 Tue 14 May 03:30 p.m.  

Poster
Ye Tian · Zhen Jia · Ziyue Luo · Yida Wang · Chuan Wu


Abstract

Diffusion models have emerged as dominant performers for image generation. To support training large diffusion models, this paper studies pipeline parallel training of diffusion models and proposes DiffusionPipe, a synchronous pipeline training system that advocates an innovative pipeline bubble filling technique catering to the structural characteristics of diffusion models. State-of-the-art diffusion models typically include trainable (the backbone) and non-trainable (e.g., frozen input encoders) parts. We first unify optimal stage partitioning and pipeline scheduling of single and multiple backbones in representative diffusion models with a dynamic programming approach. We then propose to fill idle periods of the backbones' pipeline training with the computation of non-trainable model parts using an efficient greedy algorithm, thus achieving high training throughput. Extensive experiments show that DiffusionPipe can achieve up to 1.41x speedup over pipeline parallel methods and 1.28x speedup over data parallel training on popular diffusion models.

Poster
Alok Tripathy · Katherine Yelick · Aydin Buluc


Abstract
Graph Neural Networks (GNNs) offer a compact and computationally efficient way to learn embeddings and classifications on graph data. GNN models are frequently large, making distributed minibatch training necessary. The primary contribution of this paper is new methods for reducing communication in the sampling step for distributed GNN training. Here, we propose a *matrix-based bulk sampling* approach that expresses sampling as a sparse matrix multiplication (SpGEMM) and samples multiple minibatches at once. When the input graph topology does not fit on a single device, our method distributes the graph and uses communication-avoiding SpGEMM algorithms to scale GNN minibatch sampling, enabling GNN training on much larger graphs than those that can fit into a single device's memory. When the input graph topology (but not the embeddings) fits in the memory of one GPU, our approach (1) performs sampling without communication, (2) amortizes the overheads of sampling a minibatch, and (3) can represent multiple sampling algorithms by simply using different matrix constructions. In addition to new methods for sampling, we show that judiciously replicating feature data with a simple *all-to-all* exchange can outperform current methods for the feature extraction step in distributed GNN training. We provide experimental results on the largest Open Graph …
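
A small single-machine sketch of the matrix-based view of sampling (using SciPy rather than the paper's distributed, communication-avoiding SpGEMM): a one-hot selection matrix for the minibatch is multiplied with the adjacency matrix, producing every batch node's candidate neighbors in one SpGEMM, which are then subsampled to a fixed fanout. Function and variable names are illustrative assumptions.

    import numpy as np
    import scipy.sparse as sp

    def bulk_sample_neighbors(adj, batch_nodes, fanout, rng=np.random.default_rng(0)):
        # adj: CSR adjacency matrix (n x n); batch_nodes: array of minibatch vertex ids.
        q = sp.csr_matrix(
            (np.ones(len(batch_nodes)), (np.arange(len(batch_nodes)), batch_nodes)),
            shape=(len(batch_nodes), adj.shape[0]),
        )
        frontier = q @ adj                    # SpGEMM: candidate neighbors for every batch node
        sampled = []
        for row in range(frontier.shape[0]):
            nbrs = frontier.indices[frontier.indptr[row]:frontier.indptr[row + 1]]
            take = min(fanout, len(nbrs))
            sampled.append(rng.choice(nbrs, size=take, replace=False) if take else nbrs)
        return sampled
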
Poster
Ilia Markov · Kaveh Alim · Elias Frantar · Dan Alistarh


Abstract
Data-parallel distributed training of deep neural networks (DNN) has gained widespread adoption, but can still experience communication bottlenecks. To address this issue, entire families of compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, improving overall compression and leading to substantial speedups without sacrificing accuracy. Our framework, called L-GreCo, is based on an adaptive algorithm which automatically picks the optimal compression parameters for each layer, guaranteeing the best compression ratio while satisfying an error constraint. Extensive experiments over image classification and language modeling tasks show that L-GreCo is effective across all existing families of compression methods, and achieves up to 2.5$\times$ training speedup and up to 5$\times$ compression improvement over efficient implementations of existing approaches, while recovering full accuracy. Moreover, L-GreCo is complementary to existing adaptive algorithms, improving their compression ratio by 50\% and practical throughput by 66\%. An …
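
In the spirit of the per-layer adaptation described above, here is a toy selector that greedily moves each layer to a more aggressive compression level while a global error budget holds. The option lists, the additive error model, and the greedy loop are illustrative assumptions; L-GreCo's actual algorithm and error metric are not reproduced here.

    def pick_layer_compression(layers, error_budget):
        # layers: name -> list of (bits_per_value, estimated_error), least aggressive first.
        choice = {name: 0 for name in layers}
        used = sum(layers[n][choice[n]][1] for n in layers)
        improved = True
        while improved:
            improved = False
            for name, opts in layers.items():
                nxt = choice[name] + 1
                if nxt < len(opts):
                    delta = opts[nxt][1] - opts[choice[name]][1]
                    if used + delta <= error_budget:
                        used += delta
                        choice[name] = nxt
                        improved = True
        return {name: layers[name][choice[name]][0] for name in choice}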

Poster: Privacy and security Tue 14 May 04:30 p.m.  

Poster
Kiwan Maeng · G. Edward Suh


Abstract
Secure multi-party computation (MPC) allows users to offload machine learning inference on untrusted servers without having to share their privacy-sensitive data. Despite its strong security properties, MPC-based private inference has not been widely adopted due to its high communication overhead, mostly incurred when evaluating non-linear layers. This paper presents HummingBird, an MPC framework that reduces the ReLU communication overhead significantly. HummingBird leverages the insight that determining whether a value is positive or negative usually does not require communicating the full bit-width. With its theoretical analyses and an efficient search engine, HummingBird discards 66--72% of the bits during ReLU without altering the outcome, and discards 87--91% when some accuracy can be degraded. On a realistic MPC setup, HummingBird achieves on average a 2.03--2.67$\times$ end-to-end speedup without introducing any errors, and up to 8.42$\times$ when some accuracy degradation is tolerated.
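
The core observation can be checked with a toy experiment on additively shared fixed-point values: both parties drop a block of low-order bits from their shares before the sign test, and the reconstructed sign is still almost always correct, because the missing carry only matters for values very close to zero. The ring size, fixed-point format, and number of dropped bits below are illustrative assumptions; HummingBird's actual protocol, error analysis, and search engine are not reproduced.

    import numpy as np

    K, FRAC, DROP = 32, 16, 12          # ring 2^K, fractional bits, low-order bits discarded
    MASK = (1 << K) - 1
    rng = np.random.default_rng(0)

    # Fixed-point activations, additively secret-shared between two parties mod 2^K.
    x = rng.normal(0.0, 1.0, 100_000)
    x_ring = np.round(x * 2**FRAC).astype(np.int64) & MASK
    s0 = rng.integers(0, 1 << K, size=x.shape, dtype=np.int64)
    s1 = (x_ring - s0) & MASK

    # Exact ReLU decision: most significant bit of the reconstructed value.
    exact_neg = (((s0 + s1) & MASK) >> (K - 1)).astype(bool)

    # Approximation: both parties drop DROP low-order bits before the comparison,
    # so only K - DROP bits per value enter the protocol.
    hi_mask = (1 << (K - DROP)) - 1
    approx_neg = ((((s0 >> DROP) + (s1 >> DROP)) & hi_mask) >> (K - DROP - 1)).astype(bool)

    print(f"ReLU sign agrees on {(exact_neg == approx_neg).mean():.1%} of values")
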
Poster
Jingtian Dang · Jianming Tong · Anupam Golder · Cong "Callie" Hao · Arijit Raychowdhury · Tushar Krishna


Abstract

As machine learning (ML) permeates fields like healthcare, facial recognition, and blockchain, the need to protect sensitive data intensifies. Fully Homomorphic Encryption (FHE) allows inference on encrypted data, preserving the privacy of both the data and the ML model. However, it slows down non-secure inference by up to five orders of magnitude, with a root cause being the replacement of non-polynomial operators (ReLU and MaxPooling) with high-degree Polynomial Approximated Functions (PAFs). We propose SmartPAF, a framework to replace non-polynomial operators with low-degree PAFs and then recover the accuracy of the PAF-approximated model through four techniques: (1) Coefficient Tuning (CT) -- adjust PAF coefficients based on the input distributions before training, (2) Progressive Approximation (PA) -- progressively replace one non-polynomial operator at a time followed by fine-tuning, (3) Alternate Training (AT) -- alternate the training between PAFs and other linear operators in a decoupled manner, and (4) Dynamic Scale (DS) / Static Scale (SS) -- dynamically scale the PAF input value to [-1, 1] during training, and fix the scale to the running max value in FHE deployment. The synergistic effect of CT, PA, AT, and DS/SS enables SmartPAF to enhance the accuracy of models approximated by PAFs of various low degrees across multiple datasets. For ResNet-18 …
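
As a minimal illustration of the PAF idea (not SmartPAF's actual functions or training recipe), the sketch below fits a low-degree polynomial to ReLU after scaling calibration inputs into [-1, 1], loosely mirroring Coefficient Tuning and Dynamic Scale; in FHE deployment only the polynomial's additions and multiplications would need to be evaluated.

    import numpy as np

    def fit_relu_paf(samples, degree=2):
        # Least-squares fit of a low-degree polynomial to ReLU over calibration inputs,
        # after scaling them into [-1, 1]; degree and scaling rule are assumptions.
        scale = np.abs(samples).max()
        z = samples / scale
        coeffs = np.polyfit(z, np.maximum(z, 0.0), degree)
        return coeffs, scale

    def paf_relu(x, coeffs, scale):
        # Polynomial-only replacement for ReLU (FHE-friendly operations only).
        return scale * np.polyval(coeffs, x / scale)

    calib = np.random.default_rng(0).normal(0, 1, 10_000)   # calibration activations
    coeffs, scale = fit_relu_paf(calib)
    print(paf_relu(np.array([-2.0, 0.0, 2.0]), coeffs, scale))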

Poster
Yubo Gao · Maryam Haghifam · Christina Giannoula · Renbo Tu · Gennady Pekhimenko · Nandita Vijaykumar


Abstract
Deep learning (DL) models have revolutionized numerous domains, yet optimizing them for computational efficiency remains a challenging endeavor. Development of new DL models typically involves two parties: the model developers and performance optimizers. The exchange between the parties often necessitates exposing the model architecture and computational graph to the optimizers. However, this exposure is undesirable since the model architecture is important intellectual property, and its innovations require significant investments and expertise. During the exchange, the model is also vulnerable to adversarial attacks via model stealing. This paper presents Proteus, a novel mechanism that enables model optimization by an independent party while preserving the confidentiality of the model architecture. Proteus obfuscates the protected model by partitioning its computational graph into subgraphs and concealing each subgraph within a large pool of generated realistic subgraphs that cannot be easily distinguished from the original. We evaluate Proteus on a range of DNNs, demonstrating its efficacy in preserving confidentiality without compromising performance optimization opportunities. Proteus effectively hides the model as one alternative among up to $10^{32}$ possible model architectures, and is resilient against attacks by a learning-based adversary. We also demonstrate that heuristic-based and manual approaches are ineffective in identifying the protected model. To our knowledge, …

Reception & Poster Session Tue 14 May 05:30 p.m.