

Timezone: US/Pacific

Registration Desk: Registration and Check-in Wed 14 May 08:00 a.m.  


Poster: Session 5: LLM training and fine-tuning Wed 14 May 08:30 a.m.  

Poster
Tianle Zhong · Jiechen Zhao · Qiang Su · Geoffrey Fox

[ Mission City Ballroom ]

Abstract
Large language model (LLM) training is extremely data-intensive, often involving trillions of tokens. Although LLM datasets are usually ingested and stored in columnar formats, they often need to be converted into another format for training, which incurs significant storage and maintenance costs due to extra data copies. While eliminating the conversion would save tens of terabytes of space in costly high-performance storage, this work identifies challenges that drive us to rethink the entire data pipeline. Without conversion, we find that fine-grained random access patterns cause efficiency drops of hundreds of times. Specifically, existing data pipelines have two fundamental drawbacks: (1) they cannot efficiently consume data directly in columnar format because of their default coarse-grained I/O; (2) solutions to the first drawback sacrifice memory footprint to cache datasets. In this paper, we present Youmu, a new data pipeline that directly feeds fine-grained columnar data into GPUs, enabling cost-efficient LLM training. Meanwhile, Youmu maintains high training accuracy, reducing perplexity by 0.3-0.7 relative to the widely adopted local shuffle for pretraining. Compared to performance-optimal, state-of-the-art distributed memory-based pipelines, Youmu achieves comparable throughput with $\sim$80\% less memory footprint.
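The coarse- versus fine-grained I/O contrast the abstract describes can be illustrated with a small, self-contained sketch (hypothetical file name, row counts, and row-group size; this is not Youmu's pipeline): streaming whole row groups from a Parquet file is what columnar readers are built for, while fetching individual rows in shuffled order forces a coarse read per sample.

# Sketch of coarse- vs fine-grained access to a columnar (Parquet) dataset.
# Hypothetical data and sizes; illustrates the access-pattern gap, not Youmu itself.
import random
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small columnar dataset of "token sequences".
table = pa.table({"tokens": [list(range(i, i + 64)) for i in range(10_000)]})
pq.write_table(table, "corpus.parquet", row_group_size=1_000)

pf = pq.ParquetFile("corpus.parquet")

# Coarse-grained I/O: stream whole row groups sequentially.
for rg in range(pf.num_row_groups):
    batch = pf.read_row_group(rg)                    # one large sequential read per group

# Fine-grained random access: each sample forces a read of its enclosing row group,
# so shuffled per-sample access re-reads ~row_group_size rows to obtain one row.
rows_per_group = 1_000
for idx in random.sample(range(10_000), k=100):
    rg = idx // rows_per_group
    group = pf.read_row_group(rg)                    # coarse read...
    sample = group.slice(idx % rows_per_group, 1)    # ...for a single row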
Poster
Jinghan Yao · Sam Jacobs · Masahiro Tanaka · Olatunji Ruwase · Hari Subramoni · Dhabaleswar Panda

[ Mission City Ballroom ]

Abstract
Large Language Models (LLMs) with long-context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long-context capabilities via downstream fine-tuning or adaptations impose significant design limitations. In this paper, we propose the Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with outstanding hardware efficiency. For GPT and Llama models, we achieve a 16x increase in the sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence-chunk pipeline design, we can now train an 8B LLM with a 2-million-token sequence length on only 4 GPUs while maintaining over 55% MFU. FPDT is agnostic to existing training techniques and works efficiently across different LLM models. The code is available.
Poster
Yujin Wang · Shunan Dong · Zongle Huang · Yichen You · Liu He · Huazhong Yang · Yongpan Liu · Hongyang Jia

[ Mission City Ballroom ]

Abstract
Large Language Models (LLMs) are widely used in applications like conversation and text summarization. With growing demand for model customization and privacy, lightweight fine-tuning methods for large models have begun to receive widespread attention. Low-Rank Adaptation (LoRA) is one of the most widely used fine-tuning algorithms; it significantly reduces the tunable weights and the associated optimizer memory when transferring pre-trained LLMs to downstream tasks. However, prior work has paid little attention to the overhead of buffered activations in low-rank adaptation, leading to suboptimal system memory usage. To reduce buffered-activation memory consumption and enable a memory-efficient on-device fine-tuning system, we propose \textbf{HyC-LoRA}, a variant of the LoRA training method that uses a hybrid compression framework to enable nearly 2-bit buffered-activation quantization in all operators. HyC-LoRA observes that the activations temporarily buffered for backpropagation dominate the memory consumption of the LoRA fine-tuning process, and that those in non-linear modules are the dominant memory consumers and the most challenging to quantize. Based on this, HyC-LoRA proposes a hybrid compression mechanism with two tiers: \textbf{(1)} \textit{\textbf{Intra-operator hybrid compression}}: HyC-LoRA detects extreme outliers in buffered activations and mitigates the quantization error with structured outlier storage; \textbf{(2)} \textit{\textbf{Inter-operator hybrid compression}}: HyC-LoRA utilizes the LoRA adapter to compensate for quantization errors …
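The general idea of quantizing buffered activations to very low bit-width while storing a few extreme outliers separately can be sketched in plain numpy (group size, outlier fraction, and function names below are illustrative assumptions, not HyC-LoRA's actual scheme):

# Rough sketch: low-bit quantization of a buffered activation with separate
# storage for extreme outliers. Group size, outlier fraction, and names are
# illustrative assumptions, not HyC-LoRA's actual mechanism.
import numpy as np

def quantize_with_outliers(x, bits=2, outlier_frac=0.01, group=64):
    flat = x.reshape(-1).astype(np.float32)
    assert flat.size % group == 0
    k = max(1, int(outlier_frac * flat.size))
    out_idx = np.argsort(np.abs(flat))[-k:]          # largest-magnitude entries
    outliers = flat[out_idx].copy()
    flat[out_idx] = 0.0                              # quantize the rest only

    groups = flat.reshape(-1, group)
    scale = np.abs(groups).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    q = np.clip(np.round(groups / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q.astype(np.int8), scale, out_idx, outliers, x.shape

def dequantize(q, scale, out_idx, outliers, shape):
    flat = (q.astype(np.float32) * scale).reshape(-1)
    flat[out_idx] = outliers                          # restore outliers exactly
    return flat.reshape(shape)

act = np.random.randn(128, 64).astype(np.float32)
act[3, 5] = 40.0                                      # an extreme outlier
packed = quantize_with_outliers(act)
recovered = dequantize(*packed)
print("max abs error:", np.abs(recovered - act).max())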
Poster
Hanqing Zhu · Zhenyu Zhang · Wenyan Cong · Xi Liu · Sem Park · Vikas Chandra · Bo Long · David Pan · Atlas Wang · Jinwon Lee

[ Mission City Ballroom ]

Abstract
Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still-substantial memory overhead of optimization states needed to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini). In this work, we investigate the redundancy in AdamW's learning-rate adaptation rule and find that it can be coarsened into a structured learning-rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates channel-wise learning-rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning-rate update rule makes APOLLO highly tolerant to further memory reduction at lower rank, halving the rank while delivering similar pre-training performance. Moreover, we propose an extreme memory-efficient version, APOLLO-MINI, …
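The core idea of replacing element-wise Adam scaling with a channel-wise scale estimated in a randomly projected low-rank space can be sketched roughly as follows (toy shapes, a plain Gaussian projection, and made-up hyperparameters; a conceptual illustration only, not APOLLO's implementation):

# Toy sketch of channel-wise learning-rate scaling estimated from a random
# low-rank projection of the gradient (conceptual only; not APOLLO's actual
# update rule or hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 256, 512, 16
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-3

P = rng.standard_normal((d_in, rank)) / np.sqrt(rank)   # fixed random projection
m = np.zeros((d_out, rank))                              # low-rank first moment
v = np.zeros((d_out, rank))                              # low-rank second moment
W = rng.standard_normal((d_out, d_in)) * 0.02

for step in range(1, 11):
    G = rng.standard_normal((d_out, d_in))               # stand-in for a gradient
    R = G @ P                                            # project: (d_out, rank)
    m = beta1 * m + (1 - beta1) * R
    v = beta2 * v + (1 - beta2) * R * R
    adam_update = m / (np.sqrt(v) + eps)
    # Channel-wise scale: how much an Adam-style rule would rescale each output
    # channel, measured in the compact projected space, then applied to the full gradient.
    scale = np.linalg.norm(adam_update, axis=1) / (np.linalg.norm(R, axis=1) + eps)
    W -= lr * scale[:, None] * G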
Poster
Mingyu Liang · Hiwot Kassa · Wenyin Fu · Brian Coutinho · Louis Feng · Christina Delimitrou

[ Mission City Ballroom ]

Abstract
Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model’s behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time, along with other runtime details, with an average error of just 3.3% across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
Poster
Zhiyu Mei · WEI FU · Kaiwei Li · Guangju Wang · Huanchen Zhang · Yi Wu

[ Mission City Ballroom ]

Abstract
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for empowering large language model (LLM) applications. Compared with the supervised training process of LLMs, the RLHF training process is much more sophisticated, requiring a diverse range of computation workloads with intricate dependencies between multiple LLM instances. Therefore, simply adopting the fixed parallelization strategies from supervised LLM training can be insufficient for RLHF and result in low training efficiency. To overcome this limitation, we propose a novel technique named parameter ReaLlocation, which dynamically adapts the parallelization strategies for different workloads during training by redistributing LLM parameters across the training cluster. Building upon this idea, we introduce ReaL, a pioneering system for efficient RLHF training. ReaL introduces the concept of an execution plan, which defines a fine-grained resource allocation and parallelization strategy particularly designed for RLHF training. Based on this concept, ReaL employs a tailored search algorithm with a lightweight run-time estimator to automatically discover an efficient execution plan for an instance of an RLHF experiment. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaL on LLaMA models with up to 70 billion parameters and 128 GPUs. The experimental results demonstrate that …

Invited Talk: Animashree Anandkumar

Hardware-aware training and inference for large-scale AI

The scaling of large language models has led to impressive gains in language understanding, but at the cost of insatiable memory and bandwidth requirements. We take a principled approach to designing optimization and quantization algorithms that can reduce memory requirements without sacrificing accuracy. This includes gradient compression methods (GaLore, SignSGD) and a logarithmic number system for representation. We also design fine-grained memory-reduction schemes such as KV cache compression, chunking, and offloading to overcome memory bottlenecks in language models, especially in the reasoning mode, where current memory requirements are massive. These principles are broadly applicable and especially relevant to physical AI, where the memory and bandwidth requirements are even greater than for frontier LLMs.

Animashree Anandkumar

 

Professor Anandkumar's research interests are in the areas of large-scale machine learning, non-convex optimization and high-dimensional statistics. In particular, she has been spearheading the development and analysis of tensor algorithms for machine learning. Tensor decomposition methods are embarrassingly parallel and scalable to enormous datasets. They are guaranteed to converge to the global optimum and yield consistent estimates for many probabilistic models such as topic models, community models, and hidden Markov models. More generally, Professor Anandkumar has been investigating efficient techniques to speed up non-convex optimization such as escaping saddle points efficiently.



Poster: Session 6: Edge and Cloud Systems Wed 14 May 01:15 p.m.  

Poster
Kasper Overgaard Mortensen · Konstantinos Skitsas · Emil Morre Christensen · Mohammad Sadegh Talebi · Andreas Pavlogiannis · Davide Mottin · Panagiotis Karras

[ Mission City Ballroom ]

Abstract
Markov decision processes (MDPs) find application wherever a decision-making agent acts and learns in an uncertain environment, from facility management to healthcare and service provisioning. However, finding the optimal policy such an agent should follow incurs a high computational cost, calling for solutions that scale to large numbers of actions and states. In this paper, we propose SwiftVI, a suite of algorithms that solve MDPs scalably by organizing the set of actions for each state in priority queues and deriving bounds for backup Q-values. Our championed solution prunes the set of actions at each state using a tight upper bound and a single priority queue. A thorough experimental study confirms that the SwiftVI algorithms achieve high efficiency gains robustly across model parameters.
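The general pattern of pruning actions during a value-iteration backup with an upper bound and a priority queue can be shown on a toy random MDP (a generic illustration of bounded backups, not SwiftVI's exact algorithms or bounds):

# Generic sketch: value-iteration backup that prunes actions with an upper bound
# and a priority queue (toy random MDP; not SwiftVI's exact algorithms).
import heapq
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 50, 20, 0.95
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
R = rng.random((S, A))

V = np.zeros(S)
for sweep in range(100):
    V_max = V.max()
    V_new = np.empty(S)
    for s in range(S):
        # Upper bound: any expected next-state value is at most V_max.
        heap = [(-(R[s, a] + gamma * V_max), a) for a in range(A)]
        heapq.heapify(heap)
        best = -np.inf
        while heap:
            ub, a = heapq.heappop(heap)
            if -ub <= best:                      # no remaining action can beat the best exact Q
                break
            q = R[s, a] + gamma * P[s, a] @ V    # exact backup only when needed
            best = max(best, q)
        V_new[s] = best
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-6:
        break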
Poster
Lu Wang · Mayukh Das · Fangkai Yang · Bo Qiao · Hang Dong · Si Qin · Victor Ruehle · Chetan Bansal · Eli Cortez · Íñigo Goiri · S R · Qingwei Lin · Dongmei Zhang

[ Mission City Ballroom ]

Abstract
Safe optimization of operating costs is one of the holy grails of successful revenue-generating cloud systems, and capacity/resource efficiency is a key factor in making that a reality. Among the strategies for resource efficiency used by major cloud providers, oversubscription is an extremely prevalent practice in which more virtual resources are offered than the actual physical capacity, to minimize the revenue lost to redundant capacity. While the resources can be of any type, including compute, memory, power, or network bandwidth, we highlight the scenario of virtual CPU (vCPU) oversubscription, since vCPU cores are the primary billable units for cloud services and have a substantial impact on the business as well as on users. Suitable policies for controlling oversubscription margins are crucial for a seamless cloud experience that remains cost-efficient for the provider. Narrow margins lead to redundant expenditure on under-utilized capacity, while wider margins lead to under-provisioning, where customer workloads may suffer from resource contention. Most oversubscription policies today are engineered either with tribal knowledge or with static heuristics about the system, which leads to catastrophic overloading or to stranded/under-utilized resources. Designing smart oversubscription policies that adapt to demand/utilization patterns across time and granularity, jointly optimizing cost benefits and risks, is a non-trivial and largely unsolved problem. We …
Poster
Chenxi Yang · Yan Li · Martin Maas · Mustafa Uysal · Ubaid Hafeez · Arif Merchant · Richard McDougall

[ Mission City Ballroom ]

Abstract
Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on the overall system's efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data center deployments at $AnonCorp$, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world data center deployments. We propose a cross-layer approach that moves ML out of the storage system and performs it in the application running on top of it, co-designed with a scheduling algorithm at the storage layer that consumes predictions from these application-level models. This approach combines small, interpretable models with a co-designed heuristic that adapts to different online environments. We build a proof-of-concept of this approach in a production distributed computation framework at $AnonCorp$. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as 3.47$\times$ in …
Poster
Baichuan Huang · Amir Aminifar

[ Mission City Ballroom ]

Abstract
The training of state-of-the-art Deep Neural Networks (DNNs) consumes massive amounts of energy, while the human brain learns new tasks with remarkable efficiency. Currently, the training of DNNs relies almost exclusively on Backpropagation (BP). However, BP is criticized for its biological implausibility, underscoring the significant disparity in performance and energy efficiency between DNNs and the human brain. Forward-only algorithms have been proposed as biologically plausible alternatives to BP that better mimic the learning process of the human brain and enhance energy efficiency. In this paper, we propose a biologically plausible forward-only algorithm (Bio-FO) that not only targets the biological-implausibility issues associated with BP but also outperforms state-of-the-art forward-only algorithms. We extensively evaluate Bio-FO against other forward-only algorithms and demonstrate its performance across diverse datasets, including two real-world medical applications on resource-limited wearable devices as well as relatively large-scale datasets such as mini-ImageNet. We also implement our on-device learning algorithm on the NVIDIA Jetson Nano and demonstrate its efficiency compared to other state-of-the-art forward-only algorithms. The code is available at https://github.com/whubaichuan/Bio-FO.
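Forward-only and related local-learning methods generally train each block against a local objective so that no gradient crosses block boundaries. The sketch below shows only that generic pattern with a toy model and data (local heads trained with layer-local backprop); it is not Bio-FO's actual update rule:

# Generic layer-local training sketch: each block has its own local head and no
# gradient crosses block boundaries. Toy model/data; not Bio-FO's actual rule.
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(256, 32)                     # toy inputs
y = torch.randint(0, 10, (256,))             # toy labels

blocks = nn.ModuleList([nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
                        nn.Sequential(nn.Linear(64, 64), nn.ReLU())])
heads = nn.ModuleList([nn.Linear(64, 10), nn.Linear(64, 10)])   # local objectives
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):
    h_in = x
    for block, head, opt in zip(blocks, heads, opts):
        h_out = block(h_in.detach())         # detach: no backprop across blocks
        loss = loss_fn(head(h_out), y)
        opt.zero_grad()
        loss.backward()                      # gradients stay local to this block
        opt.step()
        h_in = h_out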
Poster
Shu Liu · Asim Biswal · Audrey Cheng · Amog Kamsetty · Luis Gaspar Schroeder · Liana Patel · Shiyi Cao · Xiangxi Mo · Ion Stoica · Joseph Gonzalez · Matei Zaharia

[ Mission City Ballroom ]

Abstract
Batch data analytics has become a growing application for Large Language Models (LLMs). LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. However, LLM inference is highly expensive in both computational and monetary costs: for example, an NVIDIA L4 GPU running Llama3-8B can only process 6 KB of text per second, taking about a day to handle 15 GB of data; and processing a similar amount of data costs around $10K on OpenAI’s GPT-4o. In this paper, we propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads. Our key contribution is developing efficient algorithms for reordering the rows and the fields within each row of an input table to maximize key-value (KV) cache reuse when performing LLM serving. Our approach can be easily applied to existing analytics systems and serving platforms. Evaluations show that our solution can yield up to 3.4× improvement in end-to-end latency on a benchmark of diverse LLM-based queries using Llama 3 models. Our solutions also achieve 32% cost savings under OpenAI's and Anthropic's prefix-cache pricing models.
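The reordering idea can be shown with a small sketch: if each LLM call's prompt is built by serializing a row's fields, placing highly shared fields first and sorting rows by that serialization makes consecutive prompts share long prefixes, which a serving system's prefix (KV) cache can reuse. The table contents and the cardinality-based field ordering below are invented for illustration, not the paper's exact algorithm.

# Sketch of prompt reordering to improve prefix/KV-cache reuse for batch
# LLM-over-table workloads. Data and heuristic are illustrative assumptions.
rows = [
    {"country": "US", "category": "tools", "review": "Great drill."},
    {"country": "DE", "category": "tools", "review": "Solid hammer."},
    {"country": "US", "category": "toys",  "review": "Kids loved it."},
    {"country": "US", "category": "tools", "review": "Bit broke fast."},
]

# Put low-cardinality (highly shared) fields first so serialized prompts agree
# on a long prefix; the free-text field goes last.
fields = sorted(rows[0], key=lambda f: len({r[f] for r in rows}))

def prompt(row):
    body = "\n".join(f"{f}: {row[f]}" for f in fields)
    return f"Classify the sentiment of this record:\n{body}\n"

# Sort rows by the shared-prefix fields so identical prefixes become adjacent,
# letting an LLM server's prefix (KV) cache serve them back to back.
ordered = sorted(rows, key=lambda r: tuple(r[f] for f in fields[:-1]))
prompts = [prompt(r) for r in ordered]

def shared_prefix(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

print([shared_prefix(prompts[i], prompts[i + 1]) for i in range(len(prompts) - 1)])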

Poster: Session 7: Quantization and Sparsity Wed 14 May 02:40 p.m.  

Poster
Shang Yang · Junxian Guo · Haotian Tang · Qinghao Hu · Guangxuan Xiao · Jiaming Tang · Yujun Lin · Zhijian Liu · Yao Lu · Song Han

[ Mission City Ballroom ]

Abstract
Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via unified sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. For Llama-3-8B, LServe accelerates LLM prefilling by an average of 2.4x and decoding by up to 3.3x over TensorRT-LLM, maintaining long-context accuracy. The code will be released upon …
Poster
Francesco Daghero · Daniele Jahier Pagliari · Francesco Conti · Luca Benini · Massimo Poncino · Alessio Burrello

[ Mission City Ballroom ]

Abstract
The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area and power constraints of these devices. In this work, we make a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1$\times$ and 3.4$\times$ faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect loads and non-zero-index decompression operations required by our kernels, obtaining up to 1.9$\times$ extra speedup at the cost of a 5\% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21$\times$ and 1.81$\times$ on a ResNet18 and a Vision Transformer (ViT), with less than 1.5\% accuracy drop compared to a dense baseline.
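N:M sparsity keeps at most N non-zero weights in every group of M, so a compressed layout stores only the kept values plus their in-group indices. The numpy sketch below shows that layout and a matvec over it as reference semantics only (nothing like the optimized RISC-V kernels or the ISA extension):

# Reference (non-optimized) sketch of N:M structured sparsity: compress a weight
# matrix to values + in-group indices, then run a matvec on the compressed form.
import numpy as np

def nm_compress(W, n=1, m=8):
    rows, cols = W.shape
    assert cols % m == 0
    groups = W.reshape(rows, cols // m, m)
    keep = np.argsort(np.abs(groups), axis=2)[:, :, -n:]        # top-n per group
    vals = np.take_along_axis(groups, keep, axis=2)
    return vals, keep                                            # (rows, cols/m, n)

def nm_matvec(vals, idx, x, m):
    _, n_groups, _ = vals.shape
    xg = x.reshape(n_groups, m)                                  # group the input
    # Gather the input entries matching each kept weight, multiply, accumulate.
    gathered = xg[np.arange(n_groups)[None, :, None], idx]
    return (vals * gathered).sum(axis=(1, 2))

W = np.random.randn(64, 128)
x = np.random.randn(128)
vals, idx = nm_compress(W, n=1, m=8)
y_sparse = nm_matvec(vals, idx, x, m=8)

# Check against a dense matvec with the same 1:8 mask applied.
mask = np.zeros_like(W)
np.put_along_axis(mask.reshape(64, 16, 8), idx, 1.0, axis=2)
print(np.allclose(y_sparse, (W * mask) @ x))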
Poster
Qianchao Zhu · Jiangfei Duan · Chang Chen · Siran Liu · Xiuhong Li · Guanyu Feng · Xin Lv · Xiao Chuanfu · Dahua Lin · Chao Yang

[ Mission City Ballroom ]

Abstract
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing sparse attention approaches employ either a static sparse pattern or a fixed sparsity ratio to exploit the high attention sparsity, failing to capture the adaptive sparsity ratio and dynamic sparse pattern across attention heads, input contents, and model architectures. To balance accuracy and performance efficiently, we introduce a robust accuracy indicator, Cumulative Residual Attention (CRA), which measures the percentage of attention mass recalled. Leveraging this key insight, we present SampleAttention, which employs a novel two-stage query-guided key-value filtering approach to efficiently and dynamically select a minimal set of important column and slash strips that meets a desired CRA threshold, maximizing efficiency while preserving accuracy. Comprehensive evaluations show that SampleAttention establishes a new Pareto frontier in the accuracy-efficiency trade-off and reduces TTFT by up to $5.29\times$ compared with FlashAttention2.
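The idea of keeping the smallest key set whose recalled attention mass reaches a threshold can be illustrated with a toy single-head numpy computation (sampled queries, no column/slash structure; this is a conceptual illustration, not SampleAttention's two-stage filtering):

# Toy illustration of picking a minimal key set that preserves a target fraction
# of attention mass (the cumulative-residual-attention idea). Single head only.
import numpy as np

rng = np.random.default_rng(0)
seq, d, cra_target = 4096, 64, 0.95
Q = rng.standard_normal((seq, d))
K = rng.standard_normal((seq, d))

# Score keys using only a small sample of queries (cheap approximation).
sample = rng.choice(seq, size=64, replace=False)
logits = Q[sample] @ K.T / np.sqrt(d)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
key_mass = probs.mean(axis=0)                 # average attention each key receives

# Keep the fewest keys whose cumulative mass reaches the CRA threshold.
order = np.argsort(key_mass)[::-1]
cumulative = np.cumsum(key_mass[order])
kept = order[: int(np.searchsorted(cumulative, cra_target)) + 1]
print(f"kept {kept.size}/{seq} keys for CRA >= {cra_target}")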
Poster
Marco Federici · Davide Belli · Mart van Baalen · Amir Jalalirad · Andrii Skliar · Bence Major · Markus Nagel · Paul Whatmough

[ Mission City Ballroom ]

Abstract
While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase the cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory, and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46\% reduction in memory and a 40\% increase in throughput with $<$ 0.1 loss in perplexity compared to streaming the dense model from flash.
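The magnitude-based dynamic sparsification the abstract builds on can be sketched in a few lines: compute the SwiGLU intermediate activation for a token, keep only its largest-magnitude channels, and read only the matching rows of the down-projection. Toy sizes below; DIP's cache-aware masking and LoRA compensation are omitted.

# Minimal sketch of magnitude-based dynamic activation sparsity in a SwiGLU MLP.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, keep = 512, 2048, 256          # keep 12.5% of intermediate channels

W_gate = rng.standard_normal((d_model, d_ff)) * 0.02
W_up   = rng.standard_normal((d_model, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d_model)) * 0.02
x = rng.standard_normal(d_model)

def silu(z):
    return z / (1.0 + np.exp(-z))

gate = silu(x @ W_gate)
up = x @ W_up
h = gate * up                                  # SwiGLU intermediate activation

# Dynamic input pruning: keep only the largest-magnitude channels for this token,
# so only the matching rows of W_down need to be read from memory.
top = np.argsort(np.abs(h))[-keep:]
y_sparse = h[top] @ W_down[top]                # reads keep/d_ff of W_down
y_dense = h @ W_down
rel_err = np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense)
print(f"relative error with {keep}/{d_ff} channels: {rel_err:.3f}")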
Poster
Md Saidul Hoque Anik · Ariful Azad

[ Mission City Ballroom ]

Abstract
Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embeddings can take a long time, especially for larger datasets. Our analysis shows that the gradient computation of the embeddings is one of the dominant functions in the translation-based KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations into a single operation, reducing training time and memory usage. We create a general framework for training KG models using sparse kernels and implement four models, namely TransE, TransR, TransH, and TorusE. Our sparse implementations exhibit up to 5.3x speedup on the CPU and up to 4.2x speedup on the GPU with a significantly lower GPU memory footprint. The speedups are consistent across large and small datasets for a given model. Our proposed sparse approach can be extended to accelerate other translation-based (such as TransC, TransM, etc.) and non-translational (such as DistMult, ComplEx, RotatE, etc.) models as well. An implementation of the SpTransX framework is publicly available as a Python package at https://github.com/HipGraph/SpTransX.
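The gather/scatter-to-SpMM substitution can be illustrated with scipy: accumulating per-entity gradients over a batch of triples is a scatter-add, which is exactly a sparse-matrix times dense-matrix product with a triple-to-entity incidence matrix. The toy TransE-style example below is conceptual only (entity gradients for an L2 score), not the SpTransX kernels.

# Toy illustration of replacing scatter-style gradient accumulation with SpMM.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n_entities, n_relations, dim, batch = 1000, 50, 64, 4096
E = rng.standard_normal((n_entities, dim))
R = rng.standard_normal((n_relations, dim))

heads = rng.integers(0, n_entities, batch)
rels  = rng.integers(0, n_relations, batch)
tails = rng.integers(0, n_entities, batch)

# TransE residual h + r - t; with an L2 score its gradient w.r.t. h is the
# residual itself (and the negative residual w.r.t. t).
resid = E[heads] + R[rels] - E[tails]          # (batch, dim)

# Scatter version: unbuffered adds per triple (what SpMM replaces).
grad_scatter = np.zeros_like(E)
np.add.at(grad_scatter, heads, resid)
np.add.at(grad_scatter, tails, -resid)

# SpMM version: one sparse incidence matrix mapping triples to entities.
rows = np.concatenate([heads, tails])
cols = np.concatenate([np.arange(batch), np.arange(batch)])
vals = np.concatenate([np.ones(batch), -np.ones(batch)])
A = sp.csr_matrix((vals, (rows, cols)), shape=(n_entities, batch))
grad_spmm = A @ resid                          # single SpMM call

print(np.allclose(grad_scatter, grad_spmm))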

Poster: Session 8: LLM and Diffusion Model Serving Wed 14 May 04:30 p.m.  

Poster
Qidong Su · Wei Zhao · Xin Li · Muralidhar Andoorveedu · Chenhao Jiang · Zhanda Zhu · Kevin Song · Christina Giannoula · Gennady Pekhimenko

[ Mission City Ballroom ]

Abstract
To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics inherent in the two stages of LLM inference—prefilling and decoding—render a single static parallelization strategy insufficient for the effective optimization of both stages. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that facilitates the dynamic reconfiguration of parallelization strategies across stages, thereby maximizing throughput in both phases. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work synergistically to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78$\times$ (1.36$\times$ on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.
Poster
Jiacheng Yang · Jun Wu · Zhen Zhang · Xinwei Fu · Zhiying Xu · Zhen Jia · Yida Wang · Gennady Pekhimenko

[ Mission City Ballroom ]

Abstract
Recent advancements in training diffusion models have made generating high-quality videos possible. Particularly, the spatial-temporal diffusion transformers (ST-DiTs) emerge as a promising diffusion model architecture for generating videos of high-resolution (1080p) and long duration (20 seconds). However, the quadratic scaling of compute cost with respect to resolution and duration, primarily due to spatial-temporal attention layers processing longer sequences, results in high inference latency of ST-DiTs. This hinders their applicability in time-sensitive scenarios. Existing sequence parallelism techniques, such as DeepSpeed-Ulysses and RingAttention, are not optimally scalable for ST-DiT inference across multiple GPU machines due to cross-machine communication overheads. To address this challenge, we introduce ScaleFusion, a scalable inference engine designed to optimize ST-DiT inference for high-resolution, long video generation. By leveraging the inherent structure of spatial-temporal attention layers, ScaleFusion effectively hides cross-machine communication overhead through novel intra-layer and inter-layer communication scheduling algorithms. This enables strong scaling of 3.60$\times$ on 4 Amazon EC2 p4d.24xlarge machines (32 A100 GPUs) against 1 machine (8 A100 GPUs). Our experiments demonstrate that ScaleFusion surpasses state-of-the-art techniques, achieving an average speedup of 1.36$\times$ (up to 1.58$\times$).
Poster
Hao Kang · Srikant Bharadwaj · James Hensman · Tushar Krishna · Victor Ruehle · Saravan Rajmohan

[ Mission City Ballroom ]

Abstract
Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanisms. While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operations. We present TurboAttention, a comprehensive approach that enables quantized execution of attention and simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves a 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline, while outperforming state-of-the-art quantization and compression techniques across various datasets and models.
Poster
Seonjin Na · Geonhwa Jeong · Byung Hoon Ahn · Aaron Jezghani · Jeffrey Young · Christopher Hughes · Tushar Krishna · Hyesoon Kim

[ Mission City Ballroom ]

Abstract
LLMs have achieved remarkable performance across various fields, prompting data centers to use costly, high-performance accelerators like GPUs and NPUs for model training and inference. However, LLMs' large model sizes and the associated key-value (KV) caches create significant memory capacity challenges. To address this, offloading-based techniques leverage CPU memory for storing model weights and KV cache, allowing models larger than GPU memory to be served. However, these approaches often encounter performance bottlenecks due to PCIe transfer latency and fail to effectively leverage the potential of CPU computation. To address the performance limitations of existing offloading-based LLM inference on CPU- and memory-limited single-GPU systems, this paper proposes FlexInfer. FlexInfer uses a performance estimator to dynamically select the most appropriate execution policy for each phase (prefill and decode) based on their distinct characteristics. Our evaluation results show that by selecting optimal policies for these phases, FlexInfer can reduce end-to-end latency by 75.2% and 77% on average across two different server configurations for various models such as OPT and LLaMA, compared to FlexGen, the state-of-the-art offload-based LLM inference technique.
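The phase-aware policy selection described here can be reduced to a simple cost model: estimate, per phase, the time of each candidate execution policy from a few hardware numbers and pick the minimum. All policy names and numbers below are invented placeholders for illustration, not FlexInfer's estimator.

# Toy cost-model sketch of phase-aware policy selection for offloaded inference.
def estimate(policy, phase, cfg):
    w = cfg["weight_bytes"]
    if policy == "gpu_compute_with_weight_streaming":
        transfer = w / cfg["pcie_GBps"]
        compute = cfg["flops"][phase] / cfg["gpu_flops"]
        return max(transfer, compute)          # transfers overlap with compute
    if policy == "cpu_compute_on_resident_weights":
        return cfg["flops"][phase] / cfg["cpu_flops"]
    raise ValueError(policy)

cfg = {
    "weight_bytes": 14e9,                      # e.g. a 7B model in FP16
    "pcie_GBps": 25e9,
    "gpu_flops": 150e12,
    "cpu_flops": 3e12,
    # Prefill is compute-heavy (whole prompt); decode does little work per token.
    "flops": {"prefill": 2 * 7e9 * 4096, "decode": 2 * 7e9 * 1},
}

policies = ["gpu_compute_with_weight_streaming", "cpu_compute_on_resident_weights"]
for phase in ("prefill", "decode"):
    best = min(policies, key=lambda p: estimate(p, phase, cfg))
    print(phase, "->", best, f"({estimate(best, phase, cfg):.3f}s)")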
Poster
Ke Hong · Xiuhong Li · Lufang Chen · Qiuli Mao · Guohao Dai · Xuefei Ning · Shengen Yan · Yun Liang · Yu Wang

[ Mission City Ballroom ]

Abstract
Serving large language models (LLMs) efficiently requires elaborate request scheduling to satisfy service-level objectives (SLOs). In the context of LLM serving, SLOs include constraints on Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT). Existing serving systems apply coarse-grained request scheduling that follows a fixed principle across the iterations of the serving procedure, leading to (1) a significant distribution bias between TTFT and TPOT and (2) a significant distribution variance among different requests, as shown in Fig. 1(a), and hence causing disappointing SLO attainment. We identify that fine-grained scheduling based on a formal description of the design space addresses these issues. To this end, we first formulate a scheduling design space with flexible control over the request execution order and the workload at each iteration. Based on that, we introduce a state-aware scheduling strategy that is aware of two kinds of state, per-request state and system-wide state, and balances TTFT against TPOT as well as different requests against one another to improve SLO attainment, as shown in Fig. 2. We implement these insights in SOLA. The evaluation shows that SOLA raises SLO attainment from 45.5\% to 99.4\%, thus serving more requests. Given …

Session: Poster Session - Optional Wed 14 May 06:00 p.m.