Poster
Yujun Lin · Haotian Tang · Shang Yang · Zhekai Zhang · Guangxuan Xiao · Chuang Gan · Song Han
[ Mission City Ballroom ]
Abstract
Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs. To address this challenge, we introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for quattuor-octō-quattuor, which represents 4-8-4 in Latin. QoQ is implemented by the QServe inference library that achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores. Building upon this insight, in the QoQ algorithm we introduce progressive quantization, which allows low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also transfer the theoretical memory saving brought by KV4 attention into measured …
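As a rough illustration of the progressive quantization idea (a sketch only, not the QoQ/QServe kernels): weights are first quantized to INT8 with a per-channel scale, and the INT8 values are then re-quantized to 4-bit with per-group scales and zero points, so the W4A8 GEMM only needs to restore INT4 values to INT8 in its inner loop. The real kernels arrange the second-level scales so that this restoration uses cheap integer arithmetic, which this NumPy sketch does not reproduce; all names below are illustrative.

```python
import numpy as np

def progressive_quantize(W, group=128):
    """W: (out, in) float weights; `in` assumed divisible by `group`."""
    # Stage 1: per-output-channel symmetric INT8 quantization.
    s8 = np.abs(W).max(axis=1, keepdims=True) / 127.0
    w8 = np.clip(np.round(W / s8), -127, 127).astype(np.int8)
    # Stage 2: per-group asymmetric 4-bit quantization on top of the INT8 values.
    w8g = w8.astype(np.int16).reshape(W.shape[0], -1, group)   # int16 avoids int8 overflow
    lo = w8g.min(axis=2, keepdims=True)
    hi = w8g.max(axis=2, keepdims=True)
    gscale = np.maximum(hi - lo, 1) / 15.0
    q4 = np.clip(np.round((w8g - lo) / gscale), 0, 15).astype(np.uint8)
    return q4, s8, gscale, lo

def restore_int8(q4, gscale, lo):
    # In a real W4A8 kernel this step stays inside the GEMM mainloop; here it is plain NumPy.
    return np.clip(np.round(q4 * gscale + lo), -127, 127).astype(np.int8)
```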
Poster
Kasper Overgaard Mortensen · Konstantinos Skitsas · Emil Morre Christensen · Mohammad Sadegh Talebi · Andreas Pavlogiannis · Davide Mottin · Panagiotis Karras
[ Mission City Ballroom ]
Abstract
Markov decision processes (MDPs) find application wherever a decision-making agent acts and learns in an uncertain environment, from facility management to healthcare and service provisioning. However, finding the optimal policy such an agent should follow incurs high computational cost, calling for solutions that scale to large numbers of actions and states. In this paper, we propose SwiftVI, a suite of algorithms that solve MDPs scalably by organizing the set of actions for each state in priority queues and deriving bounds for backup Q-values. Our championed solution prunes the set of actions at each state utilizing a tight upper bound and a single priority queue. A thorough experimental study confirms that SwiftVI algorithms achieve high efficiency gains robustly across model parameters.
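A minimal NumPy sketch of the action-pruning idea described above (illustrative only, not the SwiftVI implementation): each state's actions sit in a priority queue keyed by an upper bound on their backup Q-value, and exact backups stop as soon as the best exact value seen beats the next upper bound. The transition tensor `P`, reward matrix `R`, and function names are hypothetical.

```python
import heapq
import numpy as np

def pruned_value_iteration(P, R, gamma=0.95, iters=200):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        v_max = V.max()
        V_new = np.empty_like(V)
        for s in range(S):
            # Upper bound on Q(s, a): sum_s' P(s'|s,a) V(s') <= max_s' V(s').
            ub = R[s] + gamma * v_max
            heap = [(-ub[a], a) for a in range(A)]
            heapq.heapify(heap)
            best = -np.inf
            while heap:
                neg_ub, a = heapq.heappop(heap)
                if -neg_ub <= best:          # no remaining action can beat the best exact Q
                    break
                q_exact = R[s, a] + gamma * P[s, a] @ V
                best = max(best, q_exact)
            V_new[s] = best
        V = V_new
    return V
```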
Poster
Zaifeng Pan · Yitong Ding · Yue Guan · Zheng Wang · Zhongkai Yu · Xulong Tang · Yida Wang · Yufei Ding
[ Mission City Ballroom ]
Abstract
Tree-structured prefix sharing is prevalent in recent large language model (LLM) applications. Existing LLM serving systems use a radix tree to organize the global key-value (KV) cache, facilitating cache reuse across different queries and thus reducing unnecessary memory use. Despite this, these systems still rely on conventional computation patterns for attention operations, resulting in redundant memory loads and GPU tensor core underutilization. To address these limitations, we present FastTree, which introduces GPU kernels tailored for efficiently processing queries that share contexts through the radix tree. To effectively employ the FastTree kernels, a significant challenge arises in finding optimal context-query groups for a given KV cache tree, as the varying shared prefixes between queries create a giant decision space. To tackle this, we propose tree structure-adaptive runtime optimization within FastTree, applying a greedy heuristic to partition the tree to minimize overhead and splitting lengthy contexts to mitigate the tail effect. FastTree is built upon SGLang, and extensive experiments demonstrate that it improves the throughput of SGLang by up to 2.2×. FastTree’s code is available at https://github.com/PanZaifeng/FastTree-Artifact.
Poster
Lu Wang · Mayukh Das · Fangkai Yang · Bo Qiao · Hang Dong · Si Qin · Victor Ruehle · Chetan Bansal · Eli Cortez · Íñigo Goiri · S R · Qingwei Lin · Dongmei Zhang
[ Mission City Ballroom ]
Abstract
Safe optimization of operating costs is one of the holy grails of successful revenue-generating cloud systems, and capacity/resource efficiency is a key factor in making that a reality. Among other strategies for resource efficiency across major cloud providers, oversubscription is an extremely prevalent practice in which more virtual resources are offered than actual physical capacity, to minimize revenue loss against redundant capacity. While resources can be of any type, including compute, memory, power or network bandwidth, we highlight the scenario of virtual CPU (vCPU) oversubscription since vCPU cores are primarily the billable units for cloud services and have substantial impact on business as well as users. For a seamless cloud experience, while being cost-efficient for the provider, suitable policies for controlling oversubscription margins are crucial. Narrow margins lead to redundant expenditure on under-utilized resource capacity, and wider margins lead to under-provisioning where customer workloads may suffer from resource contention. Most oversubscription policies today are engineered either with tribal knowledge or with static heuristics about the system, which lead to catastrophic overloading or stranded/under-utilized resources. Designing smart oversubscription policies that can adapt to demand/utilization patterns across time and granularity to jointly optimize cost benefits and risks is a non-trivial and largely unsolved problem. We …
Poster
Yue Gao · Ilia Shumailov · Kassem Fawaz
[ Mission City Ballroom ]
Abstract
Machine learning (ML) systems are increasingly vulnerable to supply-chain attacks that exploit the intricate dependencies inherent in open-source software (OSS). However, securing the ML ecosystem remains challenging due to regular paradigmatic changes in the ecosystem, their dynamic runtime environments, and a lack of security awareness in open-source ML projects. In this paper, we introduce a novel class of supply-chain attacks that specifically target ML models, relying on the inherent insecurity of Python as a programming language. Such attacks leverage traditional supply-chain vulnerabilities to inject innocuous-looking code that weakens the ML model's robustness. We then conduct an LLM-assisted analysis of discussions from the top 50 ML projects on GitHub to understand the current state of supply-chain security awareness among contributors. Despite the need for a higher standard of security practices, our findings reveal a similar level of security awareness between the ML and non-ML communities, highlighting the need for enhanced safeguards against ML-specific supply-chain attacks.
Poster
Chendong Wang · Anlan Zhang · Yifan Yang · Lili Qiu · Yuqing Yang · XINYANG JIANG · Feng Qian · Suman Banerjee
[ Mission City Ballroom ]
Abstract
3D volumetric video provides an immersive experience and is gaining traction in digital media. Despite its rising popularity, the streaming of volumetric video content poses significant challenges due to the high data bandwidth requirement. A natural approach to mitigate the bandwidth issue is to reduce the volumetric video's data rate by downsampling the content prior to transmission. The video can then be upsampled at the receiver's end using a super-resolution (SR) algorithm to reconstruct the high-resolution details. While super-resolution techniques have been extensively explored and advanced for 2D video content, there is limited work on SR algorithms tailored for volumetric videos. To address this gap and the growing need for efficient volumetric video streaming, we have developed VoLUT with a new SR algorithm specifically designed for volumetric content. Our algorithm uniquely harnesses the power of lookup tables (LUTs) to facilitate the efficient and accurate upscaling of low-resolution volumetric data. The use of LUTs enables our algorithm to quickly reference precomputed high-resolution values, thereby significantly reducing the computational complexity and time required for upscaling. We further apply an adaptive bitrate (ABR) algorithm to dynamically determine the downsampling rate according to the network condition and stream the selected video rate to the …
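A toy NumPy sketch of the LUT principle (illustrative only; what VoLUT actually indexes and stores is not specified here, and the feature binning, offsets, and function names are invented): a quantized local feature selects a precomputed high-resolution refinement from a table, so upscaling at streaming time is a lookup rather than a network inference.

```python
import numpy as np

BITS = 4
LUT_SIZE = 2 ** BITS

def build_lut(features, offsets):
    """Average known high-res offsets per quantized feature bin (offline step)."""
    idx = np.clip((features * LUT_SIZE).astype(int), 0, LUT_SIZE - 1)
    lut = np.zeros((LUT_SIZE, offsets.shape[1]))
    for b in range(LUT_SIZE):
        hit = idx == b
        if hit.any():
            lut[b] = offsets[hit].mean(axis=0)
    return lut

def upsample(points, features, lut):
    """Each low-res point is refined by a table lookup, with no inference at streaming time."""
    idx = np.clip((features * LUT_SIZE).astype(int), 0, LUT_SIZE - 1)
    return points + lut[idx]
```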
Poster
Zichao Yue · Chenhui Deng · Zhiru Zhang
[ Mission City Ballroom ]
Abstract
Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15$\times$ over the PP-GNN baselines, with speedup of up to 2 orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at https://github.com/cornell-zhang/preprop-gnn.
Poster
Neel P. Bhatt · Yunhao Yang · Rohan Siva · Daniel Milan · Ufuk Topcu · Atlas Wang
[ Mission City Ballroom ]
Abstract
Multimodal foundation models offer a promising framework for robotic perception and planning by processing sensory inputs to generate actionable plans. However, addressing uncertainty in both perception (sensory interpretation) and decision-making (plan generation) remains a critical challenge for ensuring task reliability. This paper presents a comprehensive framework to disentangle, quantify, and mitigate these two forms of uncertainty. We first introduce a framework for uncertainty $\textit{disentanglement}$, isolating $\textit{perception uncertainty}$ arising from limitations in visual understanding and $\textit{decision uncertainty}$ relating to the robustness of generated plans. To quantify each type of uncertainty, we propose methods tailored to the unique properties of perception and decision-making: we use conformal prediction to calibrate perception uncertainty and introduce Formal-Methods-Driven Prediction (FMDP) to quantify decision uncertainty, leveraging formal verification techniques for theoretical guarantees. Building on this quantification, we implement two targeted $\textit{intervention}$ mechanisms: an active sensing process that dynamically re-observes high-uncertainty scenes to enhance visual input quality and an automated refinement procedure that fine-tunes the model on high-certainty data, improving its capability to meet task specifications. Empirical validation in real-world and simulated robotic tasks demonstrates that our uncertainty disentanglement framework reduces variability by up to 40\% and enhances task success rates by 5\% compared to baselines. These improvements are …
Poster
Tianle Zhong · Jiechen Zhao · Qiang Su · Geoffrey Fox
[ Mission City Ballroom ]
Abstract
Large language model (LLM) training is extremely data-intensive, often involving trillions of tokens. Although LLM datasets are usually ingested and stored in columnar formats, they often need to be converted into another format for training, which incurs significant storage and maintenance costs due to extra data copies. While eliminating the conversion would save tens of terabytes of space in costly high performance storage, this work identifies challenges that drive us to re-think the entire data pipeline. Without conversion, we find that fine-grained random access patterns incur efficiency drops of hundreds of times. Specifically, existing data pipelines have two fundamental drawbacks: (1) They cannot efficiently support directly ingesting data in columnar format due to their default coarse-grained I/O; (2) Solutions to the first drawback sacrifice memory footprint to cache datasets. In this paper, we present Youmu, a new data pipeline that directly feeds fine-grained columnar data into GPUs, enabling cost-efficient LLM training. Meanwhile, Youmu maintains high training accuracy, reducing perplexity by 0.3-0.7 for pretraining compared to the widely adopted local shuffle. Compared to performance-optimal, state-of-the-art distributed memory-based pipelines, Youmu achieves comparable throughput with $\sim$80\% less memory footprint.
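To make the coarse-grained I/O issue concrete, here is an illustrative sketch (not Youmu) of serving shuffled sample indices straight from a Parquet file with pyarrow: the smallest unit a conventional reader fetches is a whole row group per requested column, so a randomly shuffled batch touches many row groups and amplifies I/O even when only a handful of rows are needed. The file path and column name are placeholders.

```python
import pyarrow.parquet as pq
from collections import defaultdict

def fetch_samples(path, sample_indices, column="input_ids"):
    pf = pq.ParquetFile(path)
    # Group the requested global row indices by the row group that contains them,
    # so each row group is read once rather than once per sample.
    starts, owner = [], defaultdict(list)
    offset = 0
    for g in range(pf.num_row_groups):
        starts.append(offset)
        offset += pf.metadata.row_group(g).num_rows
    for i in sorted(sample_indices):
        g = max(g for g, s in enumerate(starts) if s <= i)
        owner[g].append(i - starts[g])
    out = []
    for g, local_rows in owner.items():
        table = pf.read_row_group(g, columns=[column])   # coarse-grained I/O unit
        col = table.column(column)
        out.extend(col[r].as_py() for r in local_rows)
    return out
```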
Poster
Chenxi Yang · Yan Li · Martin Maas · Mustafa Uysal · Ubaid Hafeez · Arif Merchant · Richard McDougall
[ Mission City Ballroom ]
Abstract
Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on the overall system's efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data center deployments at $AnonCorp$, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world data center deployments. We propose a cross-layer approach that moves ML out of the storage system and performs it in the application running on top of it, co-designed with a scheduling algorithm at the storage layer that consumes predictions from these application-level models. This approach combines small, interpretable models with a co-designed heuristic that adapts to different online environments. We build a proof-of-concept of this approach in a production distributed computation framework at $AnonCorp$. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as 3.47$\times$ in …
Poster
Shang Yang · Junxian Guo · Haotian Tang · Qinghao Hu · Guangxuan Xiao · Jiaming Tang · Yujun Lin · Zhijian Liu · Yao Lu · Song Han
[ Mission City Ballroom ]
Abstract
Large language models (LLMs) have shown remarkable potential in processing long sequences, yet efficiently serving these long-context models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via unified sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. For Llama-3-8B, LServe accelerates LLM prefilling by an average of 2.4x and decoding by up to 3.3x over TensorRT-LLM, maintaining long-context accuracy. The code will be released upon …
Poster
Sohaib Ahmad · Qizheng Yang · Haoliang Wang · Ramesh Sitaraman · Hui Guan
[ Mission City Ballroom ]
Abstract
Text-to-image generation using diffusion models has gained increasing popularity due to their ability to produce high-quality, realistic images based on text prompts. However, efficiently serving these models is challenging due to their computation-intensive nature and the variation in query demands. In this paper, we aim to address both problems simultaneously through query-aware model scaling. The core idea is to construct model cascades so that easy queries can be processed by more lightweight diffusion models without compromising image generation quality. Based on this concept, we develop an end-to-end text-to-image diffusion model serving system, DiffServe, which automatically constructs model cascades from available diffusion model variants and allocates resources dynamically in response to demand fluctuations. Our empirical evaluations demonstrate that DiffServe achieves up to 24\% improvement in response quality while maintaining 19-70\% lower latency violation rates compared to state-of-the-art model serving systems.
Poster
Rya Sanovar · Srikant Bharadwaj · Renée St. Amant · Victor Ruehle · Saravan Rajmohan
[ Mission City Ballroom ]
Abstract
Transformer-based large language models are memory hungry and incur significant inference latencies even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. To that end, we propose LeanAttention, a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode-phase of transformer-based models. LeanAttention enables scaling the attention mechanism for the challenging case of long context lengths by re-designing the attention execution flow for the decode-phase. As a result, we achieve an average of 1.73x speedup in attention execution compared to FlashDecoding, with up to 2.18x speedup for 256k context length.
Poster
Jinghan Yao · Sam Jacobs · Masahiro Tanaka · Olatunji Ruwase · Hari Subramoni · Dhabaleswar Panda
[ Mission City Ballroom ]
Abstract
Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology, such as text generation and protein sequence analysis. However, training LLMs directly on extremely long contexts demands considerable GPU resources and increased memory, leading to higher costs and greater complexity. Alternative approaches that introduce long context capabilities via downstream finetuning or adaptations impose significant design limitations. In this paper, we propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs with outstanding hardware efficiency. For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware compared to current state-of-the-art solutions. With our dedicated sequence chunk pipeline design, we can now train an 8B LLM with a 2-million-token sequence length on only 4 GPUs, while also maintaining over 55% of MFU. Our proposed FPDT is agnostic to existing training techniques and is proven to work efficiently across different LLM models. The code is available.
Poster
Francesco Daghero · Daniele Jahier Pagliari · Francesco Conti · Luca Benini · Massimo Poncino · Alessio Burrello
[ Mission City Ballroom ]
Abstract
The acceleration of pruned Deep Neural Networks (DNNs) on edge devices such as Microcontrollers (MCUs) is a challenging task, given the tight area and power constraints of these devices. In this work, we propose a three-fold contribution to address this problem. First, we design a set of optimized software kernels for N:M pruned layers, targeting ultra-low-power, multicore RISC-V MCUs, which are up to 2.1$\times$ and 3.4$\times$ faster than their dense counterparts at 1:8 and 1:16 sparsity, respectively. Then, we implement a lightweight Instruction-Set Architecture (ISA) extension to accelerate the indirect load and non-zero indices decompression operations required by our kernels, obtaining up to 1.9$\times$ extra speedup, at the cost of a 5\% area overhead. Lastly, we extend an open-source DNN compiler to utilize our sparse kernels for complete networks, showing speedups of 3.21$\times$ and 1.81$\times$ on a ResNet18 and a Vision Transformer (ViT), with less than 1.5\% accuracy drop compared to a dense baseline.
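A toy NumPy sketch of the data layout behind such kernels (illustrative, not the RISC-V implementation): at 1:8 sparsity each group of eight weights stores one value plus a 3-bit index, and the mat-vec gathers the matching activation with an indirect load before the multiply-accumulate, which is the kind of operation the ISA extension above accelerates.

```python
import numpy as np

def compress_1_8(W):
    """W: (rows, cols), cols divisible by 8, at most one non-zero per group of 8 (assumed)."""
    groups = W.reshape(W.shape[0], -1, 8)
    idx = np.abs(groups).argmax(axis=2)                    # 3-bit position per group
    vals = np.take_along_axis(groups, idx[..., None], axis=2)[..., 0]
    return vals, idx.astype(np.uint8)

def sparse_matvec_1_8(vals, idx, x):
    xg = x.reshape(-1, 8)                                  # activations grouped by 8
    gathered = xg[np.arange(xg.shape[0])[None, :], idx]    # indirect load per group
    return (vals * gathered).sum(axis=1)

rng = np.random.default_rng(0)
W = np.zeros((4, 32))
for r in range(4):
    for g in range(4):
        W[r, g * 8 + rng.integers(8)] = rng.standard_normal()
vals, idx = compress_1_8(W)
x = rng.standard_normal(32)
print(np.allclose(sparse_matvec_1_8(vals, idx, x), W @ x))  # True
```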
Poster
Beichen Huang · Yueming Yuan · ZELEI SHAO · Minjia Zhang
[ Mission City Ballroom ]
Abstract
A critical approach for efficiently deploying Mixture-of-Experts (MoE) models with massive parameters is quantization. However, state-of-the-art MoE models suffer from non-negligible accuracy loss with extreme quantization, such as under 4 bits. To address this, we introduce MiLo, a novel method that augments highly quantized MoEs with a mixture of low-rank compensators. These compensators consume only a small amount of additional memory but significantly recover accuracy loss from extreme quantization. MiLo also identifies that MoE models exhibit distinctive characteristics across weights due to their hybrid dense-sparse architectures, and employs adaptive rank selection policies along with iterative optimizations to close the accuracy gap. MiLo does not rely on calibration data, allowing it to generalize to different MoE models and datasets without overfitting to a calibration set. To avoid the hardware inefficiencies of extreme quantization, such as 3-bit, MiLo develops Tensor Core-friendly 3-bit kernels, enabling measured latency speedups on 3-bit quantized MoE models. Our evaluation shows that MiLo outperforms existing methods on SoTA MoE models across various tasks.
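A minimal NumPy sketch of the low-rank compensator idea (not MiLo's algorithm, and without its adaptive rank selection or iterative optimization): quantize a weight matrix aggressively, then approximate the quantization residual with a truncated SVD so that W is reconstructed as dequant(Q(W)) + A @ B with small low-rank factors A and B.

```python
import numpy as np

def quantize_symmetric(W, bits=3):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale                           # dequantized view, for simplicity

def low_rank_compensator(W, bits=3, rank=8):
    W_q = quantize_symmetric(W, bits)
    residual = W - W_q
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    A = U[:, :rank] * s[:rank]                 # (out, rank)
    B = Vt[:rank]                              # (rank, in)
    return W_q, A, B

W = np.random.randn(256, 128)
W_q, A, B = low_rank_compensator(W)
print(np.linalg.norm(W - W_q), np.linalg.norm(W - (W_q + A @ B)))  # compensator shrinks the residual
```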
Poster
Minxue Tang · Yitu Wang · Jingyang Zhang · Louis DiValentin · Aolin Ding · Amin Hass · Yiran Chen · Hai Li
[ Mission City Ballroom ]
Abstract
Federated Adversarial Training (FAT) can supplement robustness against adversarial examples to Federated Learning (FL), promoting a meaningful step toward trustworthy AI. However, FAT requires large models to preserve high accuracy while achieving strong robustness, incurring high memory-swapping latency when training on memory-constrained edge devices. Existing memory-efficient FL methods suffer from poor accuracy and weak robustness due to inconsistent local and global models. In this paper, we propose FedProphet, a novel FAT framework that can achieve memory efficiency, robustness, and consistency simultaneously. FedProphet reduces the memory requirement in local training while guaranteeing adversarial robustness by adversarial cascade learning with strong convexity regularization, and we show that the strong robustness also implies low inconsistency in FedProphet. We also develop a training coordinator on the server of FL, with Adaptive Perturbation Adjustment for utility-robustness balance and Differentiated Module Assignment for objective inconsistency mitigation. FedProphet significantly outperforms other baselines under different experimental settings, maintaining the accuracy and robustness of end-to-end FAT with 80% memory reduction and up to 10.8x speedup in training time.
Poster
Baichuan Huang · Amir Aminifar
[ Mission City Ballroom ]
Abstract
The training of state-of-the-art Deep Neural Networks (DNNs) consumes massive amounts of energy, while the human brain learns new tasks with remarkable efficiency. Currently, the training of DNNs relies almost exclusively on Backpropagation (BP). However, BP faces criticism due to its biologically implausible nature, underscoring the significant disparity in performance and energy efficiency between DNNs and the human brain. Forward-only algorithms are proposed to be the biologically plausible alternatives to BP, to better mimic the learning process of the human brain and enhance energy efficiency. In this paper, we propose a biologically-plausible forward-only algorithm (Bio-FO), not only targeting the biological-implausibility issues associated with BP, but also outperforming the state-of-the-art forward-only algorithms. We extensively evaluate our proposed Bio-FO against other forward-only algorithms and demonstrate its performance across diverse datasets, including two real-world medical applications on wearable devices with limited resources and relatively large-scale datasets such as mini-ImageNet. At the same time, we implement our proposed on-device learning algorithm on the NVIDIA Jetson Nano and demonstrate its efficiency compared to other state-of-the-art forward-only algorithms. The code is available at https://github.com/whubaichuan/Bio-FO.
Poster
Ahmad Faraz Khan · Samuel Fountain · Ahmed Mohamed Abdelmoniem Sayed · Ali R. Butt · Ali Anwar
[ Mission City Ballroom ]
Abstract
Federated Learning (FL) is an approach for privacy-preserving Machine Learning (ML), enabling model training across multiple clients without centralized data collection, with an aggregator server coordinating training, aggregating model updates, and storing metadata across rounds. In addition to training, a substantial part of FL systems is the non-training workloads such as scheduling, personalization, clustering, debugging, and incentivization. Most existing systems rely on the aggregator to handle non-training workloads and use cloud services for data storage. This results in high latency and increased costs, as non-training workloads rely on large volumes of metadata, including weight parameters from client updates, hyperparameters, and aggregated updates across rounds, making the situation even worse. We propose FLStore, a serverless framework for efficient FL non-training workloads and storage. FLStore unifies the data and compute planes on a serverless cache, enabling locality-aware execution via tailored caching policies to reduce latency and costs. Per our evaluations, compared to a cloud object store based aggregator server, FLStore reduces average per-request latency by $71$% and costs by $92.45$%, with peak improvements of $99.7$% and $98.8$%, respectively. Compared to an in-memory cloud cache based aggregator server, FLStore reduces average latency by $64.6$% and costs by $98.83$%, with peak improvements of $98.8$% …
Poster
Geonhwa Jeong · Po-An Tsai · Abhimanyu Rajeshkumar Bambhaniya · Stephen Keckler · Tushar Krishna
[ Mission City Ballroom ]
Abstract
Exploiting sparsity in deep neural networks (DNNs) has been a promising area for meeting the growing computation requirements. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparsity support, but it provides limited flexibility and requires extra model fine-tuning. Moreover, any sparse model fine-tuned for certain structured sparse HW cannot be accelerated by other structured hardware. To enable acceleration using unstructured sparsity of DNNs on structured sparse hardware, we propose an approximation method leveraging the distributive property in linear algebra to turn any sparse tensor into a series of structured sparse tensors. We also develop a software framework, TASDER, to apply high-quality structured approximation on weights and activations of DNNs. Our method accelerates dense and sparse DNNs without fine-tuning and improves energy-delay-product (EDP) by up to 83% and 74%. It achieves up to 39% speed-up on a real system.
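An illustrative NumPy sketch of the decomposition idea (not TASDER itself): by distributivity, W @ x equals (W1 + W2 + ...) @ x, so an unstructured sparse weight can be split into a few tensors that each satisfy 2:4 structured sparsity and can run on structured-sparse hardware. The greedy peeling below is a simple stand-in for the paper's approximation.

```python
import numpy as np

def split_into_2_4_terms(W, max_terms=3):
    """Greedily peel off 2:4-structured tensors until the residual is empty."""
    remaining = W.copy()
    terms = []
    for _ in range(max_terms):
        term = np.zeros_like(W)
        groups = remaining.reshape(-1, 4)
        keep = np.argsort(-np.abs(groups), axis=1)[:, :2]      # top-2 per group of 4
        rows = np.arange(groups.shape[0])[:, None]
        term.reshape(-1, 4)[rows, keep] = groups[rows, keep]
        terms.append(term)
        remaining = remaining - term
        if not remaining.any():
            break
    return terms

W = np.random.randn(8, 16) * (np.random.rand(8, 16) > 0.5)    # unstructured sparsity
terms = split_into_2_4_terms(W)
x = np.random.randn(16)
print(np.allclose(sum(t @ x for t in terms), W @ x))          # distributivity: True
```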
Poster
Shu Liu · Asim Biswal · Audrey Cheng · Amog Kamsetty · Luis Gaspar Schroeder · Liana Patel · Shiyi Cao · Xiangxi Mo · Ion Stoica · Joseph Gonzalez · Matei Zaharia
[ Mission City Ballroom ]
Abstract
Batch data analytics has become a growing application for Large Language Models (LLMs). LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. However, LLM inference is highly expensive in both computational and monetary costs: for example, an NVIDIA L4 GPU running Llama3-8B can only process 6 KB of text per second, taking about a day to handle 15 GB of data; and processing a similar amount of data costs around $10K on OpenAI’s GPT-4o. In this paper, we propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads. Our key contribution is developing efficient algorithms for reordering the rows and the fields within each row of an input table to maximize key-value (KV) cache reuse when performing LLM serving. Our approach can be easily applied to existing analytics systems and serving platforms. Evaluations show that our solution can yield up to 3.4× improvement in end-to-end latency on a benchmark of diverse LLM-based queries using Llama 3 models. Our solutions also achieve 32% cost savings using OpenAI and Anthropic prefix cache pricing models.
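A hedged sketch of the prompt-layout idea (not the paper's algorithm): place the fields whose values repeat most often first within each row's prompt, then sort rows so that consecutive requests share the longest possible prefix, which a prefix-caching LLM server can reuse instead of recomputing the KV cache. The serialization format and heuristic below are invented for illustration.

```python
def order_for_prefix_reuse(rows):
    """rows: list of dicts with the same keys; returns serialized prompts."""
    fields = list(rows[0].keys())
    # Fields with fewer distinct values go first: their tokens repeat across rows.
    cardinality = {f: len({r[f] for r in rows}) for f in fields}
    field_order = sorted(fields, key=lambda f: cardinality[f])
    prompts = [
        "".join(f"{f}: {r[f]}\n" for f in field_order) + "Question: is this row relevant?"
        for r in rows
    ]
    prompts.sort()          # lexicographic sort clusters shared prefixes together
    return prompts

rows = [
    {"country": "US", "category": "books", "title": "A"},
    {"country": "US", "category": "books", "title": "B"},
    {"country": "DE", "category": "toys", "title": "C"},
]
for p in order_for_prefix_reuse(rows):
    print(repr(p))
```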
Poster
Rui Pan · Zhuang Wang · Zhen Jia · Can Karakus · Luca Zancato · Tri Dao · Yida Wang · Ravi Netravali
[ Mission City Ballroom ]
Abstract
Hybrid models that combine the capabilities of Attention layers with the efficiency of recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1\% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
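An illustrative sketch of the kind of eviction scoring described above (not Marconi's exact policy): each cache entry is ranked by a blend of recency and the compute a future hit would save per byte of memory it occupies, and the lowest-scoring entry is evicted. The weighting, normalization, and field names are placeholders.

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    prefix_tokens: int                      # length of the cached prefix
    bytes_used: int                         # KV + recurrent-state footprint
    last_hit: float = field(default_factory=time.monotonic)

def eviction_victim(entries, alpha=0.5):
    now = time.monotonic()
    def score(e):
        # FLOPs saved by a hit grow with prefix length; normalize by memory footprint.
        savings_per_byte = e.prefix_tokens / e.bytes_used
        recency = 1.0 / (1.0 + now - e.last_hit)
        return alpha * recency + (1 - alpha) * savings_per_byte
    return min(entries, key=score)          # evict the least valuable entry
```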
Poster
Juechu Dong · BOYUAN FENG · Driss Guessous · Yanbo Liang · Horace He
[ Mission City Ballroom ]
Abstract
Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants --- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, which resist traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the "hypercube problem" of attention variants.
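To illustrate the programming model (a toy NumPy reference, not the FlexAttention API or its compiled kernels): the user supplies a small score_mod function that rewrites one attention score at a time, so variants like ALiBi plus causal masking become a few lines rather than a new fused kernel.

```python
import numpy as np

def reference_attention(q, k, v, score_mod):
    """q, k, v: (heads, seq, dim). score_mod(score, h, q_idx, kv_idx) -> new score."""
    H, S, D = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)
    for h in range(H):
        for i in range(S):
            for j in range(S):
                scores[h, i, j] = score_mod(scores[h, i, j], h, i, j)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# ALiBi with causal masking expressed as a score_mod:
def alibi_causal(score, h, q_idx, kv_idx, slope=0.5):
    if kv_idx > q_idx:
        return -np.inf
    return score - slope * (h + 1) * (q_idx - kv_idx)

q = k = v = np.random.randn(2, 8, 16)
out = reference_attention(q, k, v, alibi_causal)
```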
Poster
Zihao Ye · Lequn Chen · Ruihang Lai · Wuwei Lin · Yineng Zhang · Stephanie Wang · Tianqi Chen · Baris Kasikci · Vinod Grover · Arvind Krishnamurthy · Luis Ceze
[ Mission City Ballroom ]
Abstract
Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM applications demand flexible and high-performance attention solutions. We present FlashInfer: a customizable and efficient attention engine for LLM serving. FlashInfer tackles KV-cache storage heterogeneity using block-sparse format and composable formats to optimize memory access and reduce redundancy. It also offers a customizable attention template, enabling adaptation to various settings through Just-In-Time (JIT) compilation. Additionally, FlashInfer’s load-balanced scheduling algorithm adjusts to the dynamism of user requests while maintaining compatibility with CUDAGraph, which requires static configuration. FlashInfer has been integrated into leading LLM serving frameworks like SGLang, vLLM and MLC-Engine. Comprehensive kernel-level and end-to-end evaluations demonstrate FlashInfer’s ability to significantly boost kernel performance across diverse inference scenarios: compared to state-of-the-art LLM serving solutions, FlashInfer achieves a 29-69% inter-token-latency reduction over compiler backends on an LLM serving benchmark, a 28-30% latency reduction for long-context inference, and a 13-17% speedup for LLM serving with parallel generation.
Poster
Qianchao Zhu · Jiangfei Duan · Chang Chen · Siran Liu · Xiuhong Li · Guanyu Feng · Xin Lv · Xiao Chuanfu · Dahua Lin · Chao Yang
[ Mission City Ballroom ]
Abstract
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing sparse attention approaches employ either a static sparse pattern or a fixed sparsity ratio to exploit the high attention sparsity, failing to capture the adaptive sparsity ratio and dynamic sparse pattern across attention heads, input contents and model architectures. To balance accuracy and performance efficiently, we introduce a robust indicator for accuracy, Cumulative Residual Attention (CRA), which measures the percentage of attention recall. Leveraging this key insight, we present SampleAttention, which employs a novel two-stage query-guided key-value filtering approach to efficiently and dynamically select a minimal set of important column and slash strips to meet a desired CRA threshold, thus maximizing efficiency while preserving accuracy. Comprehensive evaluations show that SampleAttention can establish a new Pareto frontier in the accuracy-efficiency trade-off, and reduces TTFT by up to $5.29\times$ compared with FlashAttention2.
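A minimal NumPy sketch of the CRA criterion (not SampleAttention's two-stage filtering or its kernels): for a query, keep the smallest set of keys whose softmax attention mass reaches a target recall, which yields a sparsity ratio that adapts per head and per input instead of being fixed in advance.

```python
import numpy as np

def keys_for_cra(q, K, threshold=0.95):
    """q: (d,), K: (seq, d). Return indices of the keys needed to reach the CRA threshold."""
    scores = K @ q / np.sqrt(q.shape[0])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                       # most-attended keys first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, threshold) + 1
    return order[:cutoff]

K = np.random.randn(4096, 64)
q = np.random.randn(64)
kept = keys_for_cra(q, K)
print(f"kept {len(kept)} of {K.shape[0]} keys")      # adaptive sparsity ratio
```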
Poster
Anxhelo Xhebraj · Sean Lee · Hanfeng Chen · Vinod Grover
[ Mission City Ballroom ]
Abstract
We present JaxPP, a system for efficiently scaling the training of large deep learning models with flexible pipeline parallelism. We introduce a seamless programming model that allows implementing user-defined pipeline schedules for gradient accumulation. JaxPP automatically distributes tasks, corresponding to pipeline stages, over a cluster of nodes and automatically infers the communication among them. We implement an MPMD runtime for asynchronous execution of SPMD tasks. The pipeline parallelism implementation of JaxPP improves hardware utilization by up to $1.16\times$ with respect to the best performing SPMD configuration.
Poster
Mingkai Zheng · Zhao Zhang
[ Mission City Ballroom ]
Abstract
We present Radius, a gradient sparsity algorithm and system to accelerate large foundation model (FM) training while preserving downstream task performance. Radius leverages two key insights in large FM pre-training: 1) only a small portion of gradients contribute to the model updates in each iteration, and 2) the spatial distribution of the gradients with large magnitude is stable over time. Radius overcomes the scaling problem of existing top-k sparsity methods, as it maintains the structure of sparse gradients, which avoids dense communication in later phases of the existing top-k sparsity approaches. We examine the convergence and speed of Radius on pre-training GPT models (355M and 2.0B) in data-parallel and compare it with the existing top-$k$ sparsification method. Our results show that using the existing top-$k$ method with the AdamW optimizer fails to converge, and the expected training speed improvement with sparse communication is marginal. In contrast, when pre-training the GPT-2.0B model with 64 NVIDIA A100 GPUs, Radius, with sparsity set to 40\%, can reduce the per-step training time by 21\% and overall pre-training time by 19\%, respectively, without degradation on the evaluation scores of the downstream tasks.
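A hedged NumPy sketch of the key idea (not Radius itself): because the locations of large-magnitude gradients are stable over time, the top-k mask is refreshed only every few hundred steps and reused in between, so the sparse communication pattern and its buffers stay fixed instead of changing every iteration. The class name and refresh interval are illustrative.

```python
import numpy as np

class StableTopKSparsifier:
    def __init__(self, density=0.4, refresh_every=100):
        self.density = density
        self.refresh_every = refresh_every
        self.mask = None
        self.step = 0

    def sparsify(self, grad):
        if self.mask is None or self.step % self.refresh_every == 0:
            k = max(1, int(self.density * grad.size))
            threshold = np.partition(np.abs(grad).ravel(), -k)[-k]
            self.mask = np.abs(grad) >= threshold     # structure reused across steps
        self.step += 1
        return grad * self.mask                       # only masked entries are communicated
```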
Poster
Amy Yang · Jingyi Yang · Aya Ibrahim · Xinfeng Xie · Bangsheng Tang · Grigory Sizov · Jongsoo Park · Jianyu Huang
[ Mission City Ballroom ]
Abstract
We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. Particularly, our method achieves 1M context prefill with the Llama3 405B model in 77s (93\% parallelization efficiency, 63\% FLOPS utilization) and 128K context prefill in 3.8s. We develop two lossless exact ring attention variants, pass-KV and pass-Q, to cover a wide range of use cases with state-of-the-art performance: full prefill, persistent KV prefill and decode. Benchmarks on H100 GPU hosts inter-connected with RDMA and TCP both show similar scalability for long-context prefill, demonstrating that our method scales well in common commercial data centers with medium-to-low inter-host bandwidth.
Poster
Yujin Wang · Shunan Dong · Zongle Huang · Yichen You · Liu He · Huazhong Yang · Yongpan Liu · Hongyang Jia
[ Mission City Ballroom ]
Abstract
Large Language Models (LLMs) are widely used in applications like conversation and text summarization. With the demand for model customization and privacy, lightweight fine-tuning methods for large models have begun to receive widespread attention. Low-Rank Adaptation (LoRA) is one of the most widely used fine-tuning algorithms, which significantly reduces the tunable weights and associated optimizer memory when transferring pre-trained LLMs to downstream tasks. However, past works have paid little attention to the overhead of buffered activations in low-rank adaptation, leading to suboptimal system memory usage. To reduce buffered activation memory consumption and further enable an on-device memory-efficient fine-tuning system, we propose \textbf{HyC-LoRA}, a variant of the LoRA training method using a hybrid compression framework enabling almost 2-bit buffered activation quantization in all operators. HyC-LoRA observes that the temporarily buffered activations for backpropagation dominate the memory consumption in the LoRA fine-tuning process, and that those in non-linear modules act as the dominant memory consumers, whose quantization is more challenging. Based on this, HyC-LoRA proposes a hybrid compression mechanism with two tiers: \textbf{(1)} \textit{\textbf{Intra-operator hybrid compression}}: HyC-LoRA detects extreme outliers in buffered activations and mitigates the quantization error by structured outlier storage; \textbf{(2)} \textit{\textbf{Inter-operator hybrid compression}}: HyC-LoRA utilizes the LoRA adapter to achieve compensation for quantization errors …
Poster
Qidong Su · Wei Zhao · Xin Li · Muralidhar Andoorveedu · Chenhao Jiang · Zhanda Zhu · Kevin Song · Christina Giannoula · Gennady Pekhimenko
[ Mission City Ballroom ]
Abstract
To improve the efficiency of distributed large language model (LLM) inference, various parallelization strategies, such as tensor and pipeline parallelism, have been proposed. However, the distinct computational characteristics inherent in the two stages of LLM inference, prefilling and decoding, render a single static parallelization strategy insufficient for the effective optimization of both stages. In this work, we present Seesaw, an LLM inference engine optimized for throughput-oriented tasks. The key idea behind Seesaw is dynamic model re-sharding, a technique that facilitates the dynamic reconfiguration of parallelization strategies across stages, thereby maximizing throughput at both phases. To mitigate re-sharding overhead and optimize computational efficiency, we employ tiered KV cache buffering and transition-minimizing scheduling. These approaches work synergistically to reduce the overhead caused by frequent stage transitions while ensuring maximum batching efficiency. Our evaluation demonstrates that Seesaw achieves a throughput increase of up to 1.78$\times$ (1.36$\times$ on average) compared to vLLM, the most widely used state-of-the-art LLM inference engine.
Poster
Jiacheng Yang · Jun Wu · Zhen Zhang · Xinwei Fu · Zhiying Xu · Zhen Jia · Yida Wang · Gennady Pekhimenko
[ Mission City Ballroom ]
Abstract
Recent advancements in training diffusion models have made generating high-quality videos possible. Particularly, the spatial-temporal diffusion transformers (ST-DiTs) emerge as a promising diffusion model architecture for generating videos of high-resolution (1080p) and long duration (20 seconds). However, the quadratic scaling of compute cost with respect to resolution and duration, primarily due to spatial-temporal attention layers processing longer sequences, results in high inference latency of ST-DiTs. This hinders their applicability in time-sensitive scenarios. Existing sequence parallelism techniques, such as DeepSpeed-Ulysses and RingAttention, are not optimally scalable for ST-DiT inference across multiple GPU machines due to cross-machine communication overheads. To address this challenge, we introduce ScaleFusion, a scalable inference engine designed to optimize ST-DiT inference for high-resolution, long video generation. By leveraging the inherent structure of spatial-temporal attention layers, ScaleFusion effectively hides cross-machine communication overhead through novel intra-layer and inter-layer communication scheduling algorithms. This enables strong scaling of 3.60$\times$ on 4 Amazon EC2 p4d.24xlarge machines (32 A100 GPUs) against 1 machine (8 A100 GPUs). Our experiments demonstrate that ScaleFusion surpasses state-of-the-art techniques, achieving an average speedup of 1.36$\times$ (up to 1.58$\times$).
Poster
Marco Federici · Davide Belli · Mart van Baalen · Amir Jalalirad · Andrii Skliar · Bence Major · Markus Nagel · Paul Whatmough
[ Mission City Ballroom ]
Abstract
While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe a novel cache-aware masking strategy, which considers the cache state and activation magnitude to further increase cache hit rate, improving LLM token rate on mobile devices. DIP outperforms other methods in terms of accuracy, memory and throughput trade-offs across simulated hardware settings. On Phi-3-Medium, DIP achieves a 46\% reduction in memory and 40\% increase in throughput with $<$ 0.1 loss in perplexity when compared to streaming the dense model from Flash.
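An illustrative sketch of magnitude-based dynamic input pruning with a cache-aware bias (not the DIP implementation): keep the top-k hidden activations per token, but nudge the selection toward neurons whose FFN weight rows are already resident in the device cache, so a hit avoids a DRAM fetch. The bias term, keep ratio, and names are placeholders.

```python
import numpy as np

def dip_style_mask(hidden, cached_neurons, keep_ratio=0.5, cache_bonus=0.1):
    """hidden: (d_ffn,) SwiGLU activations; cached_neurons: boolean (d_ffn,) residency map."""
    score = np.abs(hidden) + cache_bonus * cached_neurons * np.abs(hidden).mean()
    k = max(1, int(keep_ratio * hidden.size))
    keep = np.argsort(-score)[:k]
    mask = np.zeros_like(hidden, dtype=bool)
    mask[keep] = True
    return mask                                      # rows of the down-projection to actually load

hidden = np.random.randn(8192)
cached = np.random.rand(8192) < 0.25                 # pretend 25% of rows are cached
mask = dip_style_mask(hidden, cached)
print(mask.sum(), "of", hidden.size, "neurons kept")
```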
Poster
Hao Kang · Srikant Bharadwaj · James Hensman · Tushar Krishna · Victor Ruehle · Saravan Rajmohan
[ Mission City Ballroom ]
Abstract
Large language model (LLM) inference demands a significant amount of computation and memory, especially in the key attention mechanisms. While techniques such as quantization and acceleration algorithms like FlashAttention have improved the efficiency of overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operations. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of the KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during the exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves a 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.
Poster
Yinfang Chen · Manish Shetty · Gagan Somashekar · Minghua Ma · Yogesh Simmhan · Jonathan Mace · Chetan Bansal · Rujia Wang · S R
[ Mission City Ballroom ]
Abstract
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. While traditional DevOps tools and AIOps algorithms often focus on addressing isolated operational tasks, recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. This paper envisions a future where AI agents autonomously manage operational tasks throughout the entire incident lifecycle, leading to self-healing cloud systems, a paradigm we term AgentOps. Realizing this vision requires a comprehensive framework to guide the design, development, and evaluation of these agents. To this end, we present AIOPSLAB, a framework that not only deploys diverse cloud environments, injects faults, generates workloads, and exports telemetry data but also orchestrates these components and provides interfaces for interacting with and evaluating agents. We discuss the key requirements for such a holistic framework and demonstrate how AIOPSLAB can facilitate the evaluation of next-generation AIOps agents. Through evaluations of state-of-the-art LLM agents within the benchmark created by AIOPSLAB, we provide insights into their capabilities and limitations in handling complex operational tasks in cloud environments.
Poster
Sandeep Polisetty · Juelin Liu · Yi Fung · Seung-Hwan Lim · Hui Guan · Marco Serafini
[ Mission City Ballroom ]
Abstract
Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. Data parallel approaches contain redundant work as subgraphs sampled by different GPUs contain significant overlap. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant work by splitting the sampling, loading and training of each mini-batch across multiple GPUs. Split parallelism however introduces communication overheads that can be more than the savings from removing redundant work. We further present a lightweight partitioning algorithm that probabilistically minimizes these overheads. We implement split parallelism in Spa and show that it outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and P3.
Poster
Size Zheng · Jin Fang · Xuegui Zheng · Qi Hou · Wenlei Bao · Ningxin Zheng · Ziheng Jiang · Dongyang Wang · Jianxi Ye · Haibin Lin · Li-Wen Chang · Xin Liu
[ Mission City Ballroom ]
Abstract
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The overlapping can be achieved through either operator decomposition or kernel fusion. While decomposing operators is straightforward to implement, it often results in suboptimal performance. On the other hand, fusing communication kernels with compute kernels demands significant expertise and is error-prone. In this paper, we propose TileLink to enable efficient compilation and generation of overlapped compute-communication kernels. TileLink is composed of a frontend and a backend. In the frontend, TileLink decouples the design space of communication and computation, linking these two parts via tile-centric primitives. In the backend, TileLink translates these primitives into low-level communication instructions, integrating the communication and computation components to achieve overlapped execution. In experiments, TileLink achieves from $1.17\times$ to $20.76\times$ speedup over a non-overlapping baseline and achieves performance comparable to state-of-the-art overlapping libraries on GPUs.
Poster
Vithursan Thangarasa · Ganesh Venkatesh · Mike Lasby · Nish Sinnadurai · Sean Lie
[ Mission City Ballroom ]
Abstract
Large language models have driven significant progress in natural language processing, but their deployment requires substantial compute and memory resources. As models scale, compression techniques become essential for balancing model quality with computational efficiency. Structured pruning, which removes less critical components of the model, is a promising strategy for reducing complexity. However, one-shot pruning often results in significant quality degradation, particularly in tasks requiring multi-step reasoning. To recover lost quality, supervised fine-tuning (SFT) is commonly applied, but it can lead to catastrophic forgetting by shifting the model's learned data distribution. Therefore, addressing the degradation from both pruning and SFT is essential to preserve the original model's quality. In this work, we utilize self-data distilled fine-tuning to address these challenges. Our approach leverages the original, unpruned model to generate a distilled dataset that preserves semantic richness and mitigates catastrophic forgetting by maintaining alignment with the base model's knowledge. Empirically, we demonstrate that self-data distillation consistently outperforms standard SFT, improving average accuracy by up to 8% on the HuggingFace OpenLLM Leaderboard v1. Specifically, when pruning six decoder blocks on Llama3.1-8B Instruct (i.e., 32 to 26 layers, reducing the model size from 8.03B to 6.72B parameters), our method retains 91.2% of the original model's …
Poster
Shulai Zhang · Ningxin Zheng · Haibin Lin · Ziheng Jiang · Wenlei Bao · Chengquan Jiang · Qi Hou · Weihao Cui · Size Zheng · Li-Wen Chang · Quan Chen · Xin Liu
[ Mission City Ballroom ]
Abstract
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of an MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Therefore, existing methods suggest pipelining the communication in an MoE layer with the computation for overlapping. However, these coarse-grained overlapping schemes introduce a notable impairment of computational efficiency, and the latency concealment is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$ and, for end-to-end execution, delivers a $1.71\times$ speedup on average. COMET has been adopted in the production environment of clusters with tens of thousands of GPUs, achieving savings of millions of GPU hours.
Poster
Mohammadali Shakerdargah · Shan Lu · Chao Gao · Di Niu
[ Mission City Ballroom ]
Abstract
The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision and other domains. The attention mechanism has become an essential component of foundation models due to its superb capability of capturing correlations in a sequence. However, attention results in quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators, leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and stringent on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators, by parallelizing the utilization of heterogeneous compute units, i.e., vector processing units and matrix processing units. Our method involves scheduling workloads onto these different compute units in a multi-tiered tiling scheme to process tiled vector workloads and matrix workloads in attention as two streams, respecting the workload dependencies. We search for tiling factors to maximize the parallelization of both compute units while considering I/O overhead, and propose a proactive cache overwrite strategy to avoid undesirable cache spills in practice. Extensive results based …
Poster
Abhishek Moitra · Arkapravo Ghosh · Shrey Agrawal · Aporva Amarnath · Karthik Swaminathan · Priyadarshini Panda
[ Mission City Ballroom ]
Abstract
The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior LLM-targeted quantization and sparse acceleration works have significantly mitigated the memory and computation bottleneck, they do so assuming high-power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths, and employ a general matrix multiplication (GEMM) execution of all the layers in the decoder. In such a GEMM-based execution, data is fetched from an off-chip memory, computed and stored back. However, at reduced off-chip memory capacities, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth-constrained memory further aggravates the memory bottleneck problem. To this end, we introduce MEADOW, a framework that significantly reduces the off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing, which performs lossless decomposition of large weight matrices into their unique elements, thereby reducing the enormous weight fetch latency. MEADOW demonstrates 1.5$\times$ and 2.5 $\times$ lower decode and prefill …
Poster
Zhiqiang Xie · Hao Kang · Ying Sheng · Tushar Krishna · Kayvon Fatahalian · Christos Kozyrakis
[ Mission City Ballroom ]
Abstract
With more advanced natural language understanding and reasoning capabilities, agents powered by large language models (LLMs) are increasingly developed in simulated environments to perform complex tasks, interact with other agents, and exhibit emerging behaviors relevant to social science research and innovative gameplay development. However, current multi-agent simulations frequently suffer from inefficiencies due to the limited parallelism caused by false dependencies, resulting in a performance bottleneck. In this paper, we introduce AI Metropolis, a simulation engine that improves the efficiency of LLM agent simulations by incorporating out-of-order execution scheduling. By dynamically tracking real dependencies between agents, AI Metropolis minimizes false dependencies, enhances parallelism, and maximizes hardware utilization. Our evaluations demonstrate that AI Metropolis achieves speedups from 1.3× to 4.15× over standard parallel simulation with global synchronization, approaching optimal performance as the number of agents increases.
Poster
Tianshu Huang · Arjun Ramesh · Emily Ruppel · Nuno Pereira · Anthony Rowe · Carlee Joe-Wong
[ Mission City Ballroom ]
Abstract
Accurately estimating workload runtime is a longstanding goal in computer systems, and plays a key role in efficient resource provisioning, latency minimization, and various other system management tasks. Runtime prediction is particularly important for managing increasingly complex distributed systems in which more sophisticated processing is pushed to the edge in search of better latency. Previous approaches for runtime prediction in edge systems suffer from poor data efficiency or require intensive instrumentation; these challenges are compounded in heterogeneous edge computing environments, where historical runtime data may be sparsely available and instrumentation is often challenging. Moreover, edge computing environments often feature multi-tenancy due to limited resources at the network edge, potentially leading to interference between workloads and further complicating the runtime prediction problem. Drawing from insights across machine learning and computer systems, we design a matrix factorization-inspired method that generates accurate interference-aware predictions with tight provably-guaranteed uncertainty bounds. We validate our method on a novel WebAssembly runtime dataset collected from 24 unique devices, achieving a prediction error of 5.2\% - 2x better than a naive application of existing methods.
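A minimal sketch of the matrix-factorization idea under stated assumptions: observed (device, workload) runtimes are factorized into low-rank device and workload factors by alternating least squares, and unobserved runtimes are predicted from the factors. The rank, regularization, and function names are illustrative; the paper's interference-awareness and provable uncertainty bounds are not reproduced here.

```python
import numpy as np

def als_runtime_predictor(R, mask, rank=3, reg=0.1, iters=50, seed=0):
    """Factorize a partially observed runtime matrix R (devices x workloads).
    mask[i, j] = True where a runtime was actually measured."""
    rng = np.random.default_rng(seed)
    n_dev, n_wl = R.shape
    U = rng.standard_normal((n_dev, rank))
    V = rng.standard_normal((n_wl, rank))
    eye = reg * np.eye(rank)
    for _ in range(iters):
        for i in range(n_dev):                      # update device factors
            j = mask[i]
            U[i] = np.linalg.solve(V[j].T @ V[j] + eye, V[j].T @ R[i, j])
        for j in range(n_wl):                       # update workload factors
            i = mask[:, j]
            V[j] = np.linalg.solve(U[i].T @ U[i] + eye, U[i].T @ R[i, j])
    return U @ V.T                                  # predicted runtime matrix

# Toy example: 6 devices x 8 workloads, roughly 40% of runtimes observed.
rng = np.random.default_rng(1)
runtimes = np.exp(rng.standard_normal((6, 1))) @ np.exp(rng.standard_normal((1, 8)))
observed = rng.random((6, 8)) < 0.4
predicted = als_runtime_predictor(runtimes, observed)
```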
Poster
Hanqing Zhu · Zhenyu Zhang · Wenyan Cong · Xi Liu · Sem Park · Vikas Chandra · Bo Long · David Pan · Atlas Wang · Jinwon Lee
[ Mission City Ballroom ]
Abstract
Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still-substantial memory overhead of optimization states in order to maintain competitive performance (e.g., 1/4 rank in GaLore, and full-rank first momentum in Adam-mini). In this work, we investigate the redundancy in AdamW's learning rate adaptation rule and identify that it can be coarsened into a structured learning rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO), which approximates the channel-wise learning rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning rate update rule makes APOLLO highly tolerant to further memory reduction with lower rank, halving the rank while delivering similar pre-training performance. Moreover, we further propose an extreme memory-efficient version, APOLLO-MINI, …
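A sketch of the core idea under illustrative assumptions: AdamW-style moment statistics are kept only for a randomly projected, low-rank view of the gradient and are used solely to derive a per-channel scaling factor applied to the raw gradient. The class name, shapes, and the norm-ratio scaling rule below are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

class ChannelScaledSGD:
    """Keep AdamW-style moments only for a low-rank random projection of the
    gradient, and use them to derive a per-channel (column-wise) scaling."""

    def __init__(self, shape, rank=8, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, seed=0):
        d_out, _ = shape
        rng = np.random.default_rng(seed)
        self.P = rng.standard_normal((d_out, rank)) / np.sqrt(rank)  # random projection
        self.m = np.zeros((rank, shape[1]))     # first moment, projected space
        self.v = np.zeros((rank, shape[1]))     # second moment, projected space
        self.lr, self.betas, self.eps, self.t = lr, betas, eps, 0

    def step(self, W, grad):
        self.t += 1
        b1, b2 = self.betas
        g_low = self.P.T @ grad                  # project gradient to rank-r space
        self.m = b1 * self.m + (1 - b1) * g_low
        self.v = b2 * self.v + (1 - b2) * g_low ** 2
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)
        adam_low = m_hat / (np.sqrt(v_hat) + self.eps)
        # Per-channel scaling estimated purely from the projected statistics.
        scale = np.linalg.norm(adam_low, axis=0) / (np.linalg.norm(g_low, axis=0) + self.eps)
        return W - self.lr * grad * scale        # scaled raw gradient, full rank

# Toy usage on a least-squares problem.
rng = np.random.default_rng(1)
X, W_true = rng.standard_normal((256, 32)), rng.standard_normal((16, 32))
W = np.zeros_like(W_true)
opt = ChannelScaledSGD(W.shape, lr=5e-2)
for _ in range(200):
    grad = (W @ X.T - W_true @ X.T) @ X / len(X)
    W = opt.step(W, grad)
```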
Poster
Mingyu Liang · Hiwot Kassa · Wenyin Fu · Brian Coutinho · Louis Feng · Christina Delimitrou
[ Mission City Ballroom ]
Abstract
Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model’s behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
Poster
YOUHE JIANG · Fangcheng Fu · Xiaozhe Yao · Taiyi Wang · Bin CUI · Ana Klimovic · Eiko Yoneki
[ Mission City Ballroom ]
Abstract
Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and being more cost-effective. However, the diversity of network environments and the variety of GPU types on the cloud make it difficult to achieve high-performance serving. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm that optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism designed to adapt to fluctuating online conditions (e.g., node failures, workload shifts) without the need for costly restarts of ongoing services. Empirical results in both heterogeneous cloud and homogeneous in-house environments reveal that ThunderServe delivers up to a 2.1$\times$ and on average a $1.7\times$ increase in throughput and achieves up to a 2.5$\times$ and on average a $1.5\times$ reduction in latency deadlines compared with state-of-the-art systems given the same price budget, suggesting that opting for cloud services provides a more cost-efficient solution.
Poster
Jiachen Liu · Fan Lai · Eric Ding · Yiwen Zhang · Mosharaf Chowdhury
[ Mission City Ballroom ]
Abstract
In recent years, collaborative learning (CL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. As the deployment of CL jobs increases, they inevitably contend for limited resources. However, efficient resource scheduling in this context is challenging because of the *ephemeral nature and resource heterogeneity of devices*, coupled with the *overlapping resource requirements of diverse CL jobs*. Existing resource managers often assign devices to CL jobs randomly for simplicity and scalability, but this approach compromises job efficiency. In this paper, we present Auxo, a CL resource manager that efficiently schedules ephemeral, heterogeneous devices among multiple CL jobs to reduce the average job completion time (JCT). Auxo formulates the *Intersection Resource Scheduling (IRS)* problem to identify complex resource contention among multiple CL jobs. It then proposes a contention-aware scheduling heuristic to minimize the average scheduling delay. Furthermore, it proposes a resource-aware device-to-job matching heuristic to optimize response collection time by mitigating stragglers. Our evaluation shows that, compared to the state-of-the-art CL resource managers, Auxo improves the average JCT by up to $1.88\times$. The code is available at https://github.com/SymbioticLab/Venn.
Poster
Huaifeng Zhang · Ahmed Ali-Eldin Hassan
[ Mission City Ballroom ]
Abstract
Software bloat refers to code and features that are not used by software at runtime. For Machine Learning (ML) systems, bloat is a major contributor to their technical debt, leading to decreased performance and resource wastage. In this work, we present Negativa-ML, a novel tool to identify and remove bloat in ML frameworks by analyzing their shared libraries. Our approach includes novel techniques to detect and locate unnecessary code within GPU code, a key area overlooked by existing research. We evaluate Negativa-ML using four popular ML frameworks across ten workloads over 300 shared libraries. Our results demonstrate that ML frameworks are highly bloated in both their GPU and CPU code, with GPU code being a primary source of bloat within ML frameworks. On average, Negativa-ML reduces the GPU code size by up to 75\% and the CPU code size by up to 72\%, resulting in total file size reductions of up to 55\%. Through debloating, we achieve reductions in peak CPU memory usage, peak GPU memory usage, and execution time by up to 74.6\%, 69.6\%, and 44.6\%, respectively.
Poster
Man Tsung Yeung · Penghui Qi · Min Lin · Xinyi Wan
[ Mission City Ballroom ]
Abstract
Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5\% to 51\% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially for large vocabulary scenarios.
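The basic partitioning step, splitting the output-vocabulary projection evenly across pipeline devices so each computes partial logits for its own vocabulary shard, can be sketched as below; devices are simulated with NumPy arrays, and the communication-barrier reductions and schedule integration described in the abstract are not modeled.

```python
import numpy as np

def shard_vocab(W_vocab, n_devices):
    """Split the (vocab_size x hidden) output projection row-wise so each
    pipeline device owns an equal slice of the vocabulary."""
    return np.array_split(W_vocab, n_devices, axis=0)

def vocab_parallel_logits(hidden, shards):
    """Each device computes logits for its own vocabulary shard; a final
    concatenation (an all-gather in a real system) restores full logits."""
    partial = [hidden @ shard.T for shard in shards]   # runs on device i
    return np.concatenate(partial, axis=-1)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 256))            # (tokens, hidden)
W_vocab = rng.standard_normal((32000, 256))       # large vocabulary layer
shards = shard_vocab(W_vocab, n_devices=4)
assert np.allclose(vocab_parallel_logits(hidden, shards), hidden @ W_vocab.T)
```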
Poster
Wei Gao · Xinyu Zhou · Peng Sun · Tianwei Zhang · Yonggang Wen
[ Mission City Ballroom ]
Abstract
Key-Value cache (\texttt{KV} \texttt{cache}) compression has emerged as a promising technique to optimize Large Language Model (LLM) serving. It primarily decreases the memory consumption of \texttt{KV} \texttt{cache} to reduce the computation cost. Despite the development of many compression algorithms, their applications in production environments are still not prevalent. In this paper, we revisit mainstream \texttt{KV} \texttt{cache} compression solutions from a practical perspective. Our contributions are three-fold. First, we comprehensively review existing algorithmic designs and benchmark studies for \texttt{KV} \texttt{cache} compression and identify missing pieces in their performance measurement, which could hinder their adoption in practice. Second, we empirically evaluate representative \texttt{KV} \texttt{cache} compression methods to uncover two key issues that affect the computational efficiency: (1) while compressing \texttt{KV} \texttt{cache} can reduce memory consumption, current implementations (e.g., FlashAttention, PagedAttention) do not optimize for production-level LLM serving, resulting in suboptimal throughput performance; (2) compressing \texttt{KV} \texttt{cache} may lead to longer outputs, resulting in increased end-to-end latency. We further investigate the accuracy performance of individual samples rather than the overall performance, revealing the intrinsic limitations in \texttt{KV} \texttt{cache} compression when handling specific LLM tasks. Third, we provide tools to shed light on future \texttt{KV} \texttt{cache} compression studies and facilitate their practical deployment in …
Poster
Yixin Dong · Charlie Ruan · Yaxing Cai · Ziyi Xu · Yilong Zhao · Ruihang Lai · Tianqi Chen
[ Mission City Ballroom ]
Abstract
The applications of LLM agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires going through several stack states over all tokens in the vocabulary at runtime, bringing non-negligible overhead to structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted at runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU execution. Evaluation results show that XGrammar can be more than 10x faster than existing solutions for structured generation. Combined with an LLM inference engine, it can deliver near-zero-overhead structured generation in low-latency inference scenarios …
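The vocabulary-partitioning idea can be illustrated with a deliberately tiny toy (not XGrammar's pushdown-automaton machinery): a token whose accept/reject verdict is identical across all reachable grammar states gets a precomputed mask bit, while the rest must be checked against the runtime stack. The two-state "grammar" and all names below are hypothetical.

```python
def partition_vocabulary(vocab, states, accepts):
    """Split tokens into context-independent ones (same accept/reject verdict
    in every grammar state, so the mask bit can be precomputed) and
    context-dependent ones (verdict varies with the stack, checked at runtime).

    `accepts(state, token)` is a stand-in for running the grammar automaton.
    """
    independent, dependent = {}, []
    for tok in vocab:
        verdicts = {accepts(s, tok) for s in states}
        if len(verdicts) == 1:
            independent[tok] = verdicts.pop()   # precomputed mask bit
        else:
            dependent.append(tok)               # interpret against the stack at runtime
    return independent, dependent

# Toy "grammar": inside a JSON string any character except an unescaped quote
# is allowed; outside a string only structural tokens are allowed.
states = ["in_string", "outside"]
def accepts(state, tok):
    if state == "in_string":
        return tok != '"'
    return tok in {'{', '}', '[', ']', ':', ',', '"'}

vocab = ['{', '}', '[', ']', ':', ',', '"', 'a', 'b', '1', ' ']
independent, dependent = partition_vocabulary(vocab, states, accepts)
# Structural tokens other than '"' are valid in both states, so their mask
# bits are precomputed; '"' and the literal characters depend on the state
# and land in the runtime-checked (context-dependent) set.
```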
Poster
Seonjin Na · Geonhwa Jeong · Byung Hoon Ahn · Aaron Jezghani · Jeffrey Young · Christopher Hughes · Tushar Krishna · Hyesoon Kim
[ Mission City Ballroom ]
Abstract
LLMs have achieved remarkable performance across various fields, prompting data centers to use high-cost compute accelerators like GPUs and NPUs for model training and inference. However, LLMs' large model sizes and the associated key-value (KV) caches create significant memory capacity challenges. To address this, offloading-based techniques leverage CPU memory for storing model weights and the KV cache, allowing models larger than GPU memory to be served. However, these approaches often encounter performance bottlenecks due to PCIe transfer latency and fail to effectively leverage the potential of CPU computation. To address the performance limitations of existing offloading-based LLM inference in CPU- and memory-limited single-GPU systems, this paper proposes FlexInfer. FlexInfer uses a performance estimator to dynamically select the most appropriate execution policy for each phase (prefill and decode) based on their distinct characteristics. Our evaluation results show that by selecting optimal policies for these phases, FlexInfer can significantly reduce end-to-end latency, by 75.2% and 77% on average across two different server configurations, for various models such as OPT and LLaMA compared to FlexGen, the state-of-the-art offloading-based LLM inference technique.
Poster
Maximilian Böther · Abe Sebastian · Pranjal Awasthi · Ana Klimovic · Srikumar Ramalingam
[ Mission City Ballroom ]
Abstract
Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high-quality subset helps reduce training costs and enhance model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and prior distributed methods require at least one central machine to fit the target subset in DRAM. At billion-datapoint scale, even the subset may not fit on a single machine, and the sequential algorithms are prohibitively slow. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high-quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We discuss how to implement these algorithms in a distributed data processing framework and empirically analyze different configurations. We find high-quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset …
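For contrast with the paper's bounding algorithm, the sketch below shows a conventional single-round, partition-based distributed greedy baseline (GreeDi-style) for a facility-location-type objective, the kind of routine the paper falls back to for the remaining subset. The similarity construction, partition count, and function names are illustrative assumptions.

```python
import numpy as np

def coverage_gain(selected_max, candidate_sim):
    """Marginal gain of a facility-location objective: total improvement over
    the current best similarity for every data point."""
    return np.maximum(candidate_sim - selected_max, 0.0).sum()

def greedy(sim, k):
    """Standard greedy submodular maximization on one machine (one partition)."""
    selected, best = [], np.zeros(sim.shape[1])
    for _ in range(k):
        gains = [coverage_gain(best, sim[i]) if i not in selected else -1.0
                 for i in range(sim.shape[0])]
        i = int(np.argmax(gains))
        selected.append(i)
        best = np.maximum(best, sim[i])
    return selected

def distributed_greedy(sim, k, n_partitions=4, seed=0):
    """Partition candidates across machines, run greedy locally, then run a
    final greedy over the union of local solutions on a central machine
    (the requirement the paper's bounding algorithm removes)."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(sim.shape[0]), n_partitions)
    candidates = np.concatenate([np.asarray(p)[greedy(sim[p], k)] for p in parts])
    final = greedy(sim[candidates], k)
    return candidates[final].tolist()

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 16))
sim = np.clip(X @ X.T, 0, None)         # candidate-to-point similarities
subset = distributed_greedy(sim, k=10)
```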
Poster
Xinyi Zhang · Hanyu Zhao · Wencong Xiao · Xianyan Jia · Fei Xu · Yong Li · Wei Lin · Fangming Liu
[ Mission City Ballroom ]
Abstract
The era of large deep learning models has given rise to advanced training strategies such as 3D parallelism and the ZeRO series. The combination of these strategies enables various (re-)configurable execution plans for a training job, each exhibiting remarkably different requirements across multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they rely on users to choose execution plans statically, and then allocate resources without considering the chosen plans and their resource requirements. This approach results in mismatches between execution plans and resources, making both training performance and cluster utilization far from optimal. We introduce Rubick, a cluster scheduling system for deep learning training that exploits this reconfigurability to improve job performance and cluster efficiency. Rubick incorporates job execution planning as a new dimension in cluster scheduling, by continuously reconfiguring jobs' execution plans and tuning multi-resource allocations across jobs jointly. Such co-optimization is navigated by a performance model that understands the diverse resource requirements and performance characteristics of different jobs and execution plans. Rubick exploits such a model to make performance-aware scheduling decisions to maximize cluster throughput while providing performance guarantees to individual jobs. Evaluations on a 64-GPU high-performance training cluster show …
Poster
Ke Hong · Xiuhong Li · Lufang Chen · Qiuli Mao · Guohao Dai · Xuefei Ning · Shengen Yan · Yun Liang · Yu Wang
[ Mission City Ballroom ]
Abstract
Serving large language models (LLMs) efficiently requires elaborate request scheduling to satisfy service-level objectives (SLOs). In the context of LLM serving, SLOs include constraints on Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT). Existing serving systems apply coarse-grained request scheduling that follows a fixed principle at different iterations during the serving procedure, leading to (1) a significant distribution bias between TTFT and TPOT and (2) a significant distribution variance among different requests, as shown in Fig. 1(a), and hence causing disappointing SLO attainment. We identify that fine-grained scheduling based on a formal description of the design space addresses the issues mentioned above. To this end, we first formulate a scheduling design space with flexible control of the request execution order and the workload at each iteration. Based on that, we introduce a state-aware scheduling strategy that is aware of two kinds of states, states from the single-request perspective and states from the systemic perspective, and further balances TTFT against TPOT and balances among different requests to improve SLO attainment, as shown in Fig. 2. We implement SOLA with the above insights. The evaluation shows that SOLA enhances the SLO attainment from 45.5\% to 99.4\%, thus serving more requests. Given …
Poster
Xuanlin Jiang · Yang Zhou · Shiyi Cao · Ion Stoica · Minlan Yu
[ Mission City Ballroom ]
Abstract
Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to the GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on an A10G GPU. To facilitate future research, we open-source our code at https://github.com/NEO-MLSys25/NEO.
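The load-aware scheduling component can be pictured as deciding which requests' attention (and KV cache) to place on the CPU so that the estimated CPU-side time roughly matches the GPU-side time for the rest of the batch, keeping both sides of the asymmetric pipeline busy. The throughput estimates and the greedy balancing rule below are illustrative assumptions, not NEO's actual policy.

```python
def split_batch(requests, gpu_tok_per_ms, cpu_tok_per_ms):
    """Partition decode requests between GPU and CPU attention so the two
    sides of the asymmetric pipeline finish at roughly the same time.

    `requests` maps request id -> current context length (KV-cache tokens the
    attention step must read). The token rates are illustrative estimates."""
    order = sorted(requests, key=requests.get, reverse=True)  # longest first
    gpu, cpu = list(order), []

    def makespan(g, c):
        t_gpu = sum(requests[r] for r in g) / gpu_tok_per_ms
        t_cpu = sum(requests[r] for r in c) / cpu_tok_per_ms
        return max(t_gpu, t_cpu)

    # Move the shortest-context requests to the CPU while that lowers the
    # estimated finish time of the slower side.
    improved = True
    while improved and gpu:
        improved = False
        cand = gpu[-1]
        if makespan(gpu[:-1], cpu + [cand]) < makespan(gpu, cpu):
            gpu.pop()
            cpu.append(cand)
            improved = True
    return gpu, cpu

requests = {f"req{i}": ctx for i, ctx in enumerate([4096, 2048, 1024, 512, 512, 256, 128])}
gpu_side, cpu_side = split_batch(requests, gpu_tok_per_ms=400.0, cpu_tok_per_ms=60.0)
```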
Poster
Daiyaan Arfeen · Zhen Zhang · Xinwei Fu · Gregory R. Ganger · Yida Wang
[ Mission City Ballroom ]
Abstract
Training Deep Neural Networks (DNNs) with billions of parameters generally involves pipeline-parallel (PP) execution. Unfortunately, PP model training can use GPUs inefficiently, especially at large scale, due to idle GPU time caused by pipeline bubbles, which are often 15-30% and can exceed 60% of the training job's GPU allocation. To improve the GPU utilization of PP model training, this paper describes PipeFill, which fills pipeline bubbles with execution of other pending jobs. By leveraging bubble GPU time, PipeFill reduces the GPU utilization sacrifice associated with scaling-up of large-model training. To context-switch between fill jobs and the main training job with minimal overhead to the main job, and maximize fill job efficiency, PipeFill carefully fits fill job work to measured bubble durations and GPU memory availability, introduces explicit pipeline-bubble instructions, and orchestrates placement and execution of fill jobs in pipeline bubbles. Experiments show that PipeFill can increase overall utilization by up to 63% for GPUs used in large-scale LLM training, with <2% slowdown of the training job, and 5-15% even for low-scale LLM training. For large-scale LLM training on 8K GPUs, the 63% increase translates to up to 2.6K additional GPUs worth of work completed.
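The bubble-filling step can be viewed as a small packing problem: fit fill-job micro-tasks, each with a duration and memory footprint, into measured bubbles with known durations and free GPU memory. The greedy heuristic below is an illustrative stand-in for PipeFill's planner; the names and units are assumptions.

```python
def pack_fill_jobs(bubbles, tasks):
    """Greedily pack fill-job micro-tasks into measured pipeline bubbles.

    bubbles: list of dicts {"ms": duration, "mem_gb": free GPU memory}
    tasks:   list of dicts {"ms": duration, "mem_gb": working-set size}
    Returns {bubble index: [task indices]}. Purely illustrative: the real
    system also inserts explicit bubble instructions and handles context
    switching between the fill jobs and the main training job."""
    assignment = {i: [] for i in range(len(bubbles))}
    remaining_ms = [b["ms"] for b in bubbles]
    # Longest tasks first, each into the feasible bubble with the most slack.
    for t_idx in sorted(range(len(tasks)), key=lambda i: -tasks[i]["ms"]):
        task = tasks[t_idx]
        candidates = [i for i in range(len(bubbles))
                      if remaining_ms[i] >= task["ms"]
                      and bubbles[i]["mem_gb"] >= task["mem_gb"]]
        if candidates:
            best = max(candidates, key=lambda i: remaining_ms[i])
            assignment[best].append(t_idx)
            remaining_ms[best] -= task["ms"]
    return assignment

bubbles = [{"ms": 120, "mem_gb": 18}, {"ms": 45, "mem_gb": 6}, {"ms": 300, "mem_gb": 30}]
tasks = [{"ms": 100, "mem_gb": 10}, {"ms": 40, "mem_gb": 4},
         {"ms": 150, "mem_gb": 20}, {"ms": 30, "mem_gb": 2}]
plan = pack_fill_jobs(bubbles, tasks)
```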
Poster
Carlo Siebenschuh · Kyle Hippe · Ozan Gokdemir · Alexander Brace · Arham Khan · Khalid Hossain · Yadu Babuji · Nicholas Chia · Venkatram Vishwanath · Arvind Ramanathan · Rick Stevens · Ian Foster · Robert Underwood
[ Mission City Ballroom ]
Abstract
Language models for scientific tasks are trained on text from scientific publications---most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML‑driven systems (for complex or degraded ones). The choice of the ``best'' parser for a particular document depends on 1) its computational cost and 2) the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and (aligned) predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by 17$\times$ while still achieving comparable accuracy (actually, 0.2\% better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at \url{https://github.com/7shoe/AdaParse/}.
Poster
Zhiyu Mei · WEI FU · Kaiwei Li · Guangju Wang · Huanchen Zhang · Yi Wu
[ Mission City Ballroom ]
Abstract
Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for empowering large language model (LLM) applications. Compared with the supervised training process of LLMs, the RLHF training process is much more sophisticated, requiring a diverse range of computation workloads with intricate dependencies between multiple LLM instances. Therefore, simply adopting the fixed parallelization strategies from supervised training for LLMs can be insufficient for RLHF and result in low training efficiency. To overcome this limitation, we propose a novel technique named parameter ReaLlocation, which dynamically adapts the parallelization strategies for different workloads during training by redistributing LLM parameters across the training cluster. Building upon this idea, we introduce ReaL, a pioneering system for efficient RLHF training. ReaL introduces the concept of an execution plan, which defines a fine-grained resource allocation and parallelization strategy particularly designed for RLHF training. Based on this concept, ReaL employs a tailored search algorithm with a lightweight run-time estimator to automatically discover an efficient execution plan for an instance of an RLHF experiment. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaL on the LLaMA models with up to 70 billion parameters and 128 GPUs. The experimental results demonstrate that …
Poster
Md Saidul Hoque Anik · Ariful Azad
[ Mission City Ballroom ]
Abstract
Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embeddings can take a significantly long time, especially for larger datasets. Our analysis shows that the gradient computation of embeddings is one of the dominant functions in the translation-based KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations into a single operation, reducing training time and memory usage. We create a general framework for training KG models using sparse kernels and implement four models, namely TransE, TransR, TransH, and TorusE. Our sparse implementations exhibit up to 5.3x speedup on the CPU and up to 4.2x speedup on the GPU with a significantly lower GPU memory footprint. The speedups are consistent across large and small datasets for a given model. Our proposed sparse approach can be extended to accelerate other translation-based (such as TransC, TransM, etc.) and non-translational (such as DistMult, ComplEx, RotatE, etc.) models as well. An implementation of the SpTransX framework is publicly available as a Python package at https://github.com/HipGraph/SpTransX.
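The SpMM reformulation can be illustrated for TransE: the per-triple gathers of head, relation, and tail embeddings and the elementwise combination h + r - t collapse into a single sparse-dense matrix multiplication against the stacked embedding table. The SciPy-based sketch below is illustrative, not the SpTransX implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix

def transE_scores_spmm(triples, ent_emb, rel_emb):
    """Compute TransE translation vectors h + r - t for a batch of triples
    with a single SpMM, instead of three gathers and two elementwise ops.

    Each sparse row selects +head, +relation, -tail from the stacked
    [entity; relation] embedding table."""
    n_ent = ent_emb.shape[0]
    n_trip = len(triples)
    rows = np.repeat(np.arange(n_trip), 3)
    cols = np.concatenate([[h, n_ent + r, t] for h, r, t in triples])
    vals = np.tile([1.0, 1.0, -1.0], n_trip)
    incidence = csr_matrix((vals, (rows, cols)),
                           shape=(n_trip, n_ent + rel_emb.shape[0]))
    table = np.vstack([ent_emb, rel_emb])      # dense embedding table
    trans = incidence @ table                  # the single SpMM
    return np.linalg.norm(trans, axis=1)       # TransE distance scores

rng = np.random.default_rng(0)
ent_emb = rng.standard_normal((1000, 64))
rel_emb = rng.standard_normal((20, 64))
triples = [(3, 5, 42), (17, 0, 999), (7, 19, 7)]
scores = transE_scores_spmm(triples, ent_emb, rel_emb)
# Matches the gather-based formulation.
ref = [np.linalg.norm(ent_emb[h] + rel_emb[r] - ent_emb[t]) for h, r, t in triples]
assert np.allclose(scores, ref)
```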
Poster
Jianheng Ling · Pratik Worah · Yawen Wang · Yunchuan Kong · Chunlei Wang · Clifford Stein · Diwakar Gupta · Jason Behmer · Logan Bush · Prakash Ramanan · Rajesh Kumar · Thomas Chestna · Yajing Liu · Ying Liu · Ye Zhao · Kathryn S. McKinley · Meeyoung Park · Martin Maas
[ Mission City Ballroom ]
Abstract
Scheduling virtual machines (VMs) to hosts in cloud data centers dictates efficiency and is an NP-hard problem with incomplete information. Prior work improved VM scheduling with predicted VM lifetimes. Our work further improves lifetime-aware scheduling using repredictions with lifetime distributions rather than one-shot predictions. The approach repredicts and adjusts VM and host lifetimes when incorrect predictions emerge. We also present novel approaches for defragmentation and regular system maintenance, which are essential to our data center reliability and optimizations and are unexplored in prior work. We show that repredictions deliver a fundamental advance in effectiveness over one-shot prediction. We call our novel combination of distribution-based lifetime predictions and scheduling algorithms Lifetime Aware VM Allocation (LAVA). LAVA reduces resource stranding and increases the number of empty hosts, which are critical for large VM scheduling, cloud system updates, and reducing dynamic energy consumption. Our approach runs in production within AnonCorp's hyperscale cloud data centers, where it improves efficiency by decreasing stranded compute and memory resources by ~3% and ~2% respectively, and increases availability for large VMs and cloud system updates by increasing empty hosts by 2.3-9.2 pp in production. We also show a reduction in VM migrations for host defragmentation and maintenance. In addition to our …
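The reprediction idea can be sketched as conditioning a predicted lifetime distribution on the VM's current age: the expected remaining lifetime is recomputed each time the VM outlives earlier estimates, whereas a one-shot prediction stays fixed. The bucketed distribution and numbers below are hypothetical.

```python
import numpy as np

def expected_remaining_lifetime(buckets, probs, age):
    """Expected remaining lifetime conditioned on the VM having already
    survived `age`, given a discrete predicted lifetime distribution.
    A one-shot prediction would keep using the unconditional mean instead."""
    alive = buckets > age
    if not alive.any():                       # outlived every bucket: fall back
        return 0.0
    p = probs[alive] / probs[alive].sum()     # renormalize over surviving buckets
    return float(np.sum(p * (buckets[alive] - age)))

# Hypothetical predicted lifetime distribution (days) for one VM class.
buckets = np.array([1.0, 7.0, 30.0, 90.0, 365.0])
probs = np.array([0.45, 0.25, 0.15, 0.10, 0.05])

one_shot = float(np.sum(probs * buckets))     # unconditional mean, never updated
repredictions = [expected_remaining_lifetime(buckets, probs, age)
                 for age in (0, 2, 10, 40)]
# The repredicted remaining lifetime grows as the VM keeps surviving, which is
# what lets the scheduler revise placement, defragmentation, and maintenance plans.
```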
Poster
Lorenzo Sani · Alex Iacob · Zeyu Cao · Royson Lee · Bill Marino · Yan Gao · Wanru Zhao · Dongqi Cai · Zexi Li · Xinchi Qiu · Nic Lane
[ Mission City Ballroom ]
Abstract
Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly connected GPUs or weakly connected clusters of GPUs if they can effectively be used for pre-training. Building robust low-bandwidth training systems can: (a) significantly reduce communication infrastructure costs, (b) minimize the impact of hardware failures, (c) widen the pool of usable GPUs, (d) enable collaborative training over the internet, and (e) allow dynamic compute sourcing based on factors like electricity prices. Such advancements would lessen the dependence on specialized data centers, making large-scale AI training more accessible, cost-effective, and adaptable to real-time demands. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to $7$B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to …