

Timezone: US/Pacific

Registration Desk: Registration and Check-in Thu 15 May 08:00 a.m.  


Poster: Session 9: Parallel and Distributed Systems Thu 15 May 08:30 a.m.  

Poster
Anxhelo Xhebraj · Sean Lee · Hanfeng Chen · Vinod Grover

[ Mission City Ballroom ]

Abstract
We present JaxPP, a system for efficiently scaling the training of large deep learning models with flexible pipeline parallelism. We introduce a seamless programming model that allows implementing user-defined pipeline schedules for gradient accumulation. JaxPP automatically distributes tasks, corresponding to pipeline stages, over a cluster of nodes and automatically infers the communication among them. We implement an MPMD runtime for asynchronous execution of SPMD tasks. The pipeline parallelism implementation of JaxPP improves hardware utilization by up to $1.16\times$ with respect to the best-performing SPMD configuration.
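A minimal single-process sketch of the gradient-accumulation idea behind user-defined pipeline schedules (not JaxPP's API; the stage split, schedule format, and sizes below are hypothetical): the model is split into stage functions, a schedule lists which microbatch visits which stage, and gradients accumulate across microbatches before a single optimizer step.

```python
# Minimal single-process sketch of pipeline-style gradient accumulation.
# Illustration of the concept only, not JaxPP's actual API; stage split,
# schedule format, and sizes are hypothetical.
import torch
import torch.nn as nn

stages = nn.ModuleList([                      # two pipeline "stages"
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 10)),
])
opt = torch.optim.SGD(stages.parameters(), lr=1e-2)

x = torch.randn(16, 32)                       # one global batch ...
y = torch.randint(0, 10, (16,))
microbatches = list(zip(x.chunk(4), y.chunk(4)))   # ... split into 4 microbatches

# A "schedule" is just an ordered list of (microbatch_id, stage_id) tasks.
# A real system would place stages on different devices and overlap tasks;
# here we simply execute them sequentially.
schedule = [(m, s) for m in range(len(microbatches)) for s in range(len(stages))]

activations = {}
opt.zero_grad()
for m, s in schedule:
    inp = microbatches[m][0] if s == 0 else activations[m]
    out = stages[s](inp)
    if s < len(stages) - 1:
        activations[m] = out
    else:
        loss = nn.functional.cross_entropy(out, microbatches[m][1]) / len(microbatches)
        loss.backward()                       # gradients accumulate across microbatches
opt.step()
print(f"last microbatch loss: {loss.item():.3f}")
```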
Poster
Size Zheng · Jin Fang · Xuegui Zheng · Qi Hou · Wenlei Bao · Ningxin Zheng · Ziheng Jiang · Dongyang Wang · Jianxi Ye · Haibin Lin · Li-Wen Chang · Xin Liu

[ Mission City Ballroom ]

Abstract
Large deep learning models have achieved state-of-the-art performance in a wide range of tasks. These models often necessitate distributed systems for efficient training and inference. The fundamental building blocks for distributed model execution are intra-layer parallel operators. The most effective approach to enhancing the performance of intra-layer parallel operators involves overlapping computation with communication. The overlapping can be achieved through either operator decomposition or kernel fusion. While decomposing operators is straightforward to implement, it often results in suboptimal performance. On the other hand, fusing communication kernels with compute kernels demands significant expertise and is error-prone. In this paper, we propose TileLink to enable efficient compilation and generation of overlapped compute-communication kernels. TileLink is composed of a frontend and a backend. In the frontend, TileLink decouples the design space of communication and computation, linking these two parts via tile-centric primitives. In the backend, TileLink translates these primitives into low-level communication instructions, integrating the communication and computation components to achieve overlapped execution. In experiments, TileLink achieves $1.17\times$ to $20.76\times$ speedup over the non-overlapping baseline and achieves performance comparable to state-of-the-art overlapping libraries on GPUs.
Poster
Shulai Zhang · Ningxin Zheng · Haibin Lin · Ziheng Jiang · Wenlei Bao · Chengquan Jiang · Qi Hou · Weihao Cui · Size Zheng · Li-Wen Chang · Quan Chen · Xin Liu

[ Mission City Ballroom ]

Abstract
Mixture-of-experts (MoE) has been extensively employed to scale large language models to trillion-plus parameters while maintaining a fixed computational cost. The development of large MoE models in the distributed scenario encounters the problem of large communication overhead. The inter-device communication of an MoE layer can occupy 47% of the entire model execution time with popular models and frameworks. Therefore, existing methods suggest pipelining the communication in an MoE layer with the computation so that the two overlap. However, these coarse-grained overlapping schemes introduce a notable impairment of computational efficiency, and the latency hiding is sub-optimal. To this end, we present COMET, an optimized MoE system with fine-grained communication-computation overlapping. Leveraging data dependency analysis and task rescheduling, COMET achieves precise fine-grained overlapping of communication and computation. Through adaptive workload assignment, COMET effectively eliminates fine-grained communication bottlenecks and enhances its adaptability across various scenarios. Our evaluation shows that COMET accelerates the execution of a single MoE layer by $1.96\times$, and for end-to-end execution, COMET delivers a $1.71\times$ speedup on average. COMET has been adopted in the production environment of clusters with tens of thousands of GPUs, achieving savings of millions of GPU hours.
Poster
Man Tsung Yeung · Penghui Qi · Min Lin · Xinyi Wan

[ Mission City Ballroom ]

Abstract
Pipeline parallelism is widely used to scale the training of transformer-based large language models, and various works have been done to improve its throughput and memory footprint. In this paper, we address a frequently overlooked issue: the vocabulary layers can cause imbalanced computation and memory usage across pipeline stages, worsening pipeline bubbles and the memory bottleneck. To tackle this, we partition the vocabulary layers evenly across pipeline devices and group the computation into pipeline passes. To reduce the activation memory overhead, we propose several algorithms to reduce communication barriers within vocabulary layers. Additionally, we utilize a generalizable method to integrate Vocabulary Parallelism with existing pipeline schedules. By combining these techniques, our methods effectively balance the computation and parameter memory, with only a small constant activation memory overhead. Notably, when combined with activation memory-balanced schedules like V-Half, our approach achieves perfect balance in both memory and computation. Extensive evaluations demonstrate that our method achieves computation and memory balance regardless of the vocabulary size, resulting in a 5\% to 51\% improvement in throughput compared to naive approaches, while significantly reducing peak memory usage, especially for large-vocabulary scenarios.
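The arithmetic behind sharding the vocabulary (output projection plus cross-entropy) across devices can be illustrated with a single-process sketch: each shard computes logits only for its slice of the vocabulary, and only per-row scalar statistics need to be combined, which is what an all-reduce would exchange. This is an illustration of the idea, not the paper's implementation; sizes are arbitrary.

```python
# Single-process sketch of vocabulary-sharded softmax cross-entropy.
# In a real system each shard lives on a different device and the "combine"
# step is an all-reduce over scalar statistics. Sizes are hypothetical.
import torch

torch.manual_seed(0)
hidden, vocab, shards, batch = 64, 1024, 4, 8
shard_size = vocab // shards
h = torch.randn(batch, hidden)
W = torch.randn(vocab, hidden)                  # full output projection
targets = torch.randint(0, vocab, (batch,))

shard_stats = []                                # (local max, local sum-exp) per shard
target_logit = torch.zeros(batch)
for i, Wi in enumerate(W.chunk(shards, dim=0)): # each "device" holds shard_size rows
    logits_i = h @ Wi.T                         # (batch, shard_size)
    local_max = logits_i.max(dim=1).values
    local_sumexp = torch.exp(logits_i - local_max[:, None]).sum(dim=1)
    shard_stats.append((local_max, local_sumexp))
    lo = i * shard_size
    in_shard = (targets >= lo) & (targets < lo + shard_size)
    target_logit[in_shard] = logits_i[in_shard, targets[in_shard] - lo]

# "All-reduce" step: combine per-shard max and sum-exp into a global log-sum-exp.
gmax = torch.stack([m for m, _ in shard_stats]).max(dim=0).values
sumexp = sum(s * torch.exp(m - gmax) for m, s in shard_stats)
loss = (torch.log(sumexp) + gmax - target_logit).mean()

# Matches the unsharded reference cross-entropy.
assert torch.allclose(loss, torch.nn.functional.cross_entropy(h @ W.T, targets), atol=1e-3)
print(f"sharded cross-entropy: {loss.item():.4f}")
```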
Poster
Maximilian Böther · Abe Sebastian · Pranjal Awasthi · Ana Klimovic · Srikumar Ramalingam

[ Mission City Ballroom ]

Abstract
Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high-quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and prior distributed methods require at least one central machine to fit the target subset in DRAM. At billion-datapoint scale, even the subset may not fit on a single machine, and the sequential algorithms are prohibitively slow. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high-quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We discuss how to implement these algorithms in a distributed data processing framework and empirically analyze different configurations. We find high-quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset …
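A toy, single-process sketch of the partition-based greedy idea (in the spirit of classical two-round distributed submodular maximization, not this paper's exact algorithm): each partition runs greedy locally on its shard, the local picks are pooled, and a final greedy pass over the pooled candidates selects the subset. The coverage utility below is a stand-in for whatever submodular function one would actually use.

```python
# Toy partition-based distributed greedy for a submodular coverage function.
# Illustrative only; utility function, data, and partitioning are hypothetical.
import random

random.seed(0)
universe = range(100)
# Each "datapoint" covers a random set of labels; utility = size of the covered union.
points = {i: frozenset(random.sample(universe, 8)) for i in range(400)}

def gain(p, covered):
    return len(points[p] - covered)

def greedy(candidates, k):
    chosen, covered = [], set()
    for _ in range(k):
        best = max(candidates, key=lambda p: gain(p, covered))
        chosen.append(best)
        covered |= points[best]
        candidates = [p for p in candidates if p != best]
    return chosen

k, num_partitions = 10, 4
ids = list(points)
random.shuffle(ids)
partitions = [ids[i::num_partitions] for i in range(num_partitions)]

# Round 1: each partition ("machine") greedily picks k candidates from its shard.
pooled = [p for part in partitions for p in greedy(part, k)]
# Round 2: a final greedy pass over the pooled candidates selects the subset.
subset = greedy(pooled, k)
covered = set().union(*(points[p] for p in subset))
print(f"selected {len(subset)} points covering {len(covered)} labels")
```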

Invited Talk: Ling Liu

Responsible Finetuning of Large Language Models

The human-like generative ability of Large Language Models (LLMs) has ushered in a new era of foundational models and generative AI, unlocking new possibilities and driving cross-domain innovations. However, the transformative potential of LLMs has been seriously challenged by the problematic hallucinations of LLMs, which may lead to misinformation, biases, and harmful content, making responsible finetuning of LLMs a grand challenge. Safety alignment of pretrained LLMs represents an important step forward to ensure their outputs are helpful, harmless, and honest, respecting human preferences and societal values. However, recent studies have shown that many safety-aligned LLMs suffer from security, privacy, and ethics risks of user finetuning: well-aligned LLMs can easily be broken and produce harmful, unhelpful, or untruthful content in the presence of a small amount of harmful finetuning data. In this keynote, I will discuss some potential vulnerabilities and risks of existing safety alignment and finetuning techniques, and share some of our recent research efforts towards developing a responsible framework and techniques for more robust alignment and finetuning of LLMs.

Ling Liu

 

Ling Liu is a Professor in the School of Computer Science at Georgia Institute of Technology. She directs the research programs in the Distributed Data Intensive Systems Lab (DiSL), examining various aspects of Internet-scale, big-data-powered artificial intelligence (AI) systems, algorithms, and analytics, including performance, reliability, privacy, security, and trust. Her research in the ML systems area is mainly centered on efficient AI systems and algorithms, as well as trustworthy AI through developing AI security and AI privacy guardrails. Prof. Ling Liu's current research is primarily supported by the National Science Foundation under CISE programs, Cisco, and IBM.



Poster: Session 10: LLM and Diffusion Model Serving Thu 15 May 01:15 p.m.  

Poster
Rui Pan · Zhuang Wang · Zhen Jia · Can Karakus · Luca Zancato · Tri Dao · Yida Wang · Ravi Netravali

[ Mission City Ballroom ]

Abstract
Hybrid models that combine the capabilities of Attention layers with the efficiency of recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1\% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
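A minimal sketch of the kind of eviction scoring the abstract describes, weighing recency against compute savings per byte of cache footprint; the scoring formula, weights, and example numbers are hypothetical, not Marconi's actual policy.

```python
# Hypothetical prefix-cache eviction scoring: combine recency with an estimate
# of prefill compute saved per byte of footprint. Illustrative only; not
# Marconi's actual admission/eviction policy.
from dataclasses import dataclass

@dataclass
class CacheEntry:
    key: str
    prefix_tokens: int      # prefill compute a hit on this entry would skip
    bytes_used: int         # memory footprint (KV cache + recurrent state)
    last_access: float      # timestamp of last hit

def eviction_victim(entries, now, alpha=0.5):
    # Normalize recency and savings-per-byte across the current entries so the
    # two signals are comparable, then evict the lowest combined score.
    max_rec = max(1.0 / (1.0 + now - e.last_access) for e in entries)
    max_spb = max(e.prefix_tokens / e.bytes_used for e in entries)
    def score(e):
        rec = (1.0 / (1.0 + now - e.last_access)) / max_rec
        spb = (e.prefix_tokens / e.bytes_used) / max_spb
        return alpha * rec + (1 - alpha) * spb
    return min(entries, key=score)

entries = [
    CacheEntry("chat-1",  prefix_tokens=4096, bytes_used=8 << 20,  last_access=90.0),
    CacheEntry("agent-7", prefix_tokens=512,  bytes_used=64 << 20, last_access=99.0),
]
# Pure LRU would evict "chat-1"; weighing compute savings per byte flips the choice.
print("evict:", eviction_victim(entries, now=100.0).key)
```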
Poster
Juechu Dong · BOYUAN FENG · Driss Guessous · Yanbo Liang · Horace He

[ Mission City Ballroom ]

Abstract
Over the past 7 years, attention has become one of the most important primitives in deep learning. The primary approach to optimize attention is FlashAttention, which fuses the operation together, drastically improving both the runtime and the memory consumption. However, the importance of FlashAttention combined with its monolithic nature poses a problem for researchers aiming to try new attention variants --- a "software lottery". This problem is exacerbated by the difficulty of writing efficient fused attention kernels, which resist traditional compiler-based approaches. We introduce FlexAttention, a novel compiler-driven programming model that allows implementing the majority of attention variants in a few lines of idiomatic PyTorch code. We demonstrate that many existing attention variants (e.g. Alibi, Document Masking, PagedAttention, etc.) can be implemented via FlexAttention, and that we achieve competitive performance compared to these handwritten kernels. Finally, we demonstrate how FlexAttention allows for easy composition of attention variants, solving the "hypercube problem" of attention variants.
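For instance, the ALiBi variant mentioned above reduces to a few lines under this programming model; the sketch below assumes PyTorch ≥ 2.5, where `torch.nn.attention.flex_attention` is available (exact API behavior may differ across versions and backends).

```python
# ALiBi expressed as a score modification, in the style FlexAttention enables.
# Assumes PyTorch >= 2.5 (torch.nn.attention.flex_attention); shapes are arbitrary.
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

# Precompute one ALiBi slope per head.
alibi_slopes = torch.exp2(-torch.arange(1, H + 1, dtype=torch.float32))

def alibi(score, b, h, q_idx, kv_idx):
    # Penalize attention to distant keys, with a per-head slope.
    return score + alibi_slopes[h] * (kv_idx - q_idx)

out = flex_attention(q, k, v, score_mod=alibi)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```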
Poster
YOUHE JIANG · Fangcheng Fu · Xiaozhe Yao · Taiyi Wang · Bin CUI · Ana Klimovic · Eiko Yoneki

[ Mission City Ballroom ]

Abstract
Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and being more cost-effective. However, the diversity of network environments and various GPU types on the cloud bring difficulties to achieving high-performance serving. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism, designed to adapt to fluctuating online conditions (e.g., node failures, workload shifts) without the need for costly restarts of ongoing services. Empirical results in both heterogeneous cloud and homogeneous in-house environments reveal that ThunderServe delivers up to a 2.1$\times$ and on average a $1.7\times$ increase in throughput and achieves up to a 2.5$\times$ and on average a $1.5\times$ reduction in latency deadlines compared with state-of-the-art systems given the same price budget, suggesting that opting for cloud services provides a more cost-efficient solution.
Poster
Yixin Dong · Charlie Ruan · Yaxing Cai · Ziyi Xu · Yilong Zhao · Ruihang Lai · Tianqi Chen

[ Mission City Ballroom ]

Abstract
The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can be parsed into code, structured function calls, and embodied agent commands. These developments bring significant demands for structured generation in LLM inference. Context-free grammar is a flexible approach to enable structured generation via constrained decoding. However, executing a context-free grammar requires going through several stack states over all tokens in the vocabulary during runtime, bringing non-negligible overhead for structured generation. In this paper, we propose XGrammar, a flexible and efficient structured generation engine for large language models. XGrammar accelerates context-free grammar execution by dividing the vocabulary into context-independent tokens that can be prechecked and context-dependent tokens that need to be interpreted during runtime. We further build transformations to expand the grammar context and reduce the number of context-dependent tokens. Additionally, we build an efficient persistent stack to accelerate the context-dependent token checks. Finally, we co-design the grammar engine with the LLM inference engine to overlap grammar computation with GPU executions. Evaluation results show that XGrammar can be more than 10x faster than existing solutions for structured generation. Combined with an LLM inference engine, it can deliver near-zero-overhead structured generation in low-latency inference scenarios …
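The vocabulary-partitioning idea can be shown with a toy grammar of balanced parentheses (an illustration, not XGrammar's implementation): tokens containing characters outside the grammar's alphabet are invalid regardless of the parser's stack, tokens that only open parentheses are valid regardless of the stack, and only the remaining tokens need a runtime stack check.

```python
# Toy illustration of splitting a vocabulary into context-independent tokens
# (validity decidable once, offline) and context-dependent tokens (need the
# runtime stack). The "grammar" is just balanced parentheses, not a real CFG engine.
vocab = ["((", "(", ")", "))", "()", ")(", "ab", "(a"]

def classify(token):
    if any(ch not in "()" for ch in token):
        return "always_invalid"          # uses symbols outside the grammar
    if ")" not in token:
        return "always_valid"            # only pushes, never pops the stack
    return "context_dependent"           # validity depends on stack depth

precomputed = {t: classify(t) for t in vocab}   # done once, offline

def valid_at_runtime(token, depth):
    # Only called for context-dependent tokens: simulate pushes and pops.
    for ch in token:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

depth = 2  # current parser state: two unmatched '('
mask = {t: (c == "always_valid") or
           (c == "context_dependent" and valid_at_runtime(t, depth))
        for t, c in precomputed.items()}
print(mask)
```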
Poster
Xuanlin Jiang · Yang Zhou · Shiyi Cao · Ion Stoica · Minlan Yu

[ Mission City Ballroom ]

Abstract
Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, the limited GPU memory has largely limited the batch size achieved in practice, leaving significant GPU compute resources wasted. We present NEO, an online LLM inference system that offloads part of attention compute and KV cache states from the GPU to the local host CPU, effectively increasing the GPU batch size and thus inference throughput. To this end, NEO proposes asymmetric GPU-CPU pipelining and load-aware scheduling to balance GPU and CPU loads and fully utilize their compute and memory resources. We evaluate NEO on a wide range of workloads (i.e., code generation, text summarization), GPUs (i.e., T4, A10G, H100), and LLM models (i.e., 7B, 8B, 70B). NEO achieves up to 7.5×, 26%, and 14% higher throughput compared to the GPU-only approach on T4, A10G, and H100 GPUs, respectively, while maintaining the same latency; with more powerful CPUs, NEO achieves up to 79.3% throughput gain on the A10G GPU. To facilitate future research, we open-source our code at https://github.com/NEO-MLSys25/NEO.

Poster: Session 11: Federated Learning Thu 15 May 02:40 p.m.  

Poster
Minxue Tang · Yitu Wang · Jingyang Zhang · Louis DiValentin · Aolin Ding · Amin Hass · Yiran Chen · Hai Li

[ Mission City Ballroom ]

Abstract
Federated Adversarial Training (FAT) can supplement robustness against adversarial examples to Federated Learning (FL), promoting a meaningful step toward trustworthy AI. However, FAT requires large models to preserve high accuracy while achieving strong robustness, incurring high memory-swapping latency when training on memory-constrained edge devices. Existing memory-efficient FL methods suffer from poor accuracy and weak robustness due to inconsistent local and global models. In this paper, we propose FedProphet, a novel FAT framework that can achieve memory efficiency, robustness, and consistency simultaneously. FedProphet reduces the memory requirement in local training while guaranteeing adversarial robustness by adversarial cascade learning with strong convexity regularization, and we show that the strong robustness also implies low inconsistency in FedProphet. We also develop a training coordinator on the server of FL, with Adaptive Perturbation Adjustment for utility-robustness balance and Differentiated Module Assignment for objective inconsistency mitigation. FedProphet significantly outperforms other baselines under different experimental settings, maintaining the accuracy and robustness of end-to-end FAT with 80% memory reduction and up to 10.8x speedup in training time.
Poster
Ahmad Faraz Khan · Samuel Fountain · Ahmed Mohamed Abdelmoniem Sayed · Ali R. Butt · Ali Anwar

[ Mission City Ballroom ]

Abstract
Federated Learning (FL) is an approach for privacy-preserving Machine Learning (ML), enabling model training across multiple clients without centralized data collection, with an aggregator server coordinating training, aggregating model updates, and storing metadata across rounds. In addition to training, a substantial part of FL systems is the non-training workloads, such as scheduling, personalization, clustering, debugging, and incentivization. Most existing systems rely on the aggregator to handle non-training workloads and use cloud services for data storage. This results in high latency and increased costs, and the reliance of non-training workloads on large volumes of metadata (including weight parameters from client updates, hyperparameters, and aggregated updates across rounds) makes the situation even worse. We propose FLStore, a serverless framework for efficient FL non-training workloads and storage. FLStore unifies the data and compute planes on a serverless cache, enabling locality-aware execution via tailored caching policies to reduce latency and costs. Per our evaluations, compared to a cloud object store based aggregator server, FLStore reduces per-request average latency by $71$% and costs by $92.45$%, with peak improvements of $99.7$% and $98.8$%, respectively. Compared to an in-memory cloud cache based aggregator server, FLStore reduces average latency by $64.6$% and costs by $98.83$%, with peak improvements of $98.8$% …
Poster
Mohammadali Shakerdargah · Shan Lu · Chao Gao · Di Niu

[ Mission City Ballroom ]

Abstract
The advent of foundation models has revolutionized various fields, enabling unprecedented task accuracy and flexibility in computational linguistics, computer vision, and other domains. The attention mechanism has become an essential component of foundation models due to its superb capability of capturing correlations in a sequence. However, attention results in quadratic complexity in memory and compute as the context length grows. Although many fusion-based exact attention acceleration algorithms have been developed for datacenter-grade GPUs and accelerators, leveraging multi-core parallelism and data locality, it remains a significant challenge to accelerate attention on resource-constrained edge neural accelerators with limited compute units and stringent on-chip caches. In this paper, we propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators by parallelizing the utilization of heterogeneous compute units, i.e., vector processing units and matrix processing units. Our method involves scheduling workloads onto these different compute units in a multi-tiered tiling scheme to process tiled vector workloads and matrix workloads in attention as two streams, respecting the workload dependencies. We search for tiling factors to maximize the parallelization of both compute units while considering I/O overhead, and propose a proactive cache overwrite strategy to avoid undesirable cache spills in practice. Extensive results based …
Poster
Jiachen Liu · Fan Lai · Eric Ding · Yiwen Zhang · Mosharaf Chowdhury

[ Mission City Ballroom ]

Abstract
In recent years, collaborative learning (CL) has emerged as a promising approach for machine learning (ML) and data science across distributed edge devices. As the deployment of CL jobs increases, they inevitably contend for limited resources.However, efficient resource scheduling in this context is challenging because of the *ephemeral nature and resource heterogeneity of devices*, coupled with the *overlapping resource requirements of diverse CL jobs*.Existing resource managers often assign devices to CL jobs randomly for simplicity and scalability, but this approach compromises job efficiency.In this paper, we present Auxo, a CL resource manager that efficiently schedules ephemeral, heterogeneous devices among multiple CL jobs to reduce the average job completion time (JCT). Auxo formulates the *Intersection Resource Scheduling (IRS)* problem to identify complex resource contention among multiple CL jobs. It then proposes a contention-aware scheduling heuristic to minimize the average scheduling delay. Furthermore, it proposes a resource-aware device-to-job matching heuristic to optimize response collection time by mitigating stragglers. Our evaluation shows that, compared to the state-of-the-art CL resource managers, Auxo improves the average JCT by up to $1.88\times$. The code is available at https://github.com/SymbioticLab/Venn.
Poster
Lorenzo Sani · Alex Iacob · Zeyu Cao · Royson Lee · Bill Marino · Yan Gao · Wanru Zhao · Dongqi Cai · Zexi Li · Xinchi Qiu · Nic Lane

[ Mission City Ballroom ]

Abstract
Scaling large language models (LLMs) demands extensive data and computing resources, which are traditionally constrained to data centers by the high-bandwidth requirements of distributed training. Low-bandwidth methods like federated learning (FL) could enable collaborative training of larger models across weakly connected GPUs or weakly connected clusters of GPUs if they can effectively be used for pre-training. Building robust low-bandwidth training systems can: (a) significantly reduce communication infrastructure costs, (b) minimize the impact of hardware failures, (c) widen the pool of usable GPUs, (d) enable collaborative training over the internet, and (e) allow dynamic compute sourcing based on factors like electricity prices. Such advancements would lessen the dependence on specialized data centers, making large-scale AI training more accessible, cost-effective, and adaptable to real-time demands. To achieve this, we introduce Photon, the first complete system for federated end-to-end LLM training, leveraging cross-silo FL for global-scale training with minimal communication overheads. Using Photon, we train the first federated family of decoder-only LLMs from scratch. We show that: (1) Photon can train model sizes up to $7$B in a federated fashion while reaching an even better perplexity than centralized pre-training; (2) Photon model training time decreases with available compute, achieving a similar compute-time trade-off to …

Poster: Session 12: Edge and Cloud Systems Thu 15 May 04:30 p.m.  

Poster
Yue Gao · Ilia Shumailov · Kassem Fawaz

[ Mission City Ballroom ]

Abstract
Machine learning (ML) systems are increasingly vulnerable to supply-chain attacks that exploit the intricate dependencies inherent in open-source software (OSS). However, securing the ML ecosystem remains challenging due to regular paradigmatic changes in the ecosystem, its dynamic runtime environments, and a lack of security awareness in open-source ML projects. In this paper, we introduce a novel class of supply-chain attacks that specifically target ML models, relying on the inherent insecurity of Python as a programming language. Such attacks leverage traditional supply-chain vulnerabilities to inject innocuous-looking code that weakens the ML model's robustness. We then conduct an LLM-assisted analysis of discussions from the top 50 ML projects on GitHub to understand the current state of supply-chain security awareness among contributors. Despite the need for a higher standard of security practices, our findings reveal a similar level of security awareness between the ML and non-ML communities, highlighting the need for enhanced safeguards against ML-specific supply-chain attacks.
Poster
Chendong Wang · Anlan Zhang · Yifan Yang · Lili Qiu · Yuqing Yang · XINYANG JIANG · Feng Qian · Suman Banerjee

[ Mission City Ballroom ]

Abstract
3D volumetric video provides an immersive experience and is gaining traction in digital media. Despite its rising popularity, the streaming of volumetric video content poses significant challenges due to the high data bandwidth requirement. A natural approach to mitigate the bandwidth issue is to reduce the volumetric video's data rate by downsampling the content prior to transmission. The video can then be upsampled at the receiver's end using a super-resolution (SR) algorithm to reconstruct the high-resolution details. While super-resolution techniques have been extensively explored and advanced for 2D video content, there is limited work on SR algorithms tailored for volumetric videos. To address this gap and the growing need for efficient volumetric video streaming, we have developed VoLUT with a new SR algorithm specifically designed for volumetric content. Our algorithm uniquely harnesses the power of lookup tables (LUTs) to facilitate the efficient and accurate upscaling of low-resolution volumetric data. The use of LUTs enables our algorithm to quickly reference precomputed high-resolution values, thereby significantly reducing the computational complexity and time required for upscaling. We further apply an adaptive bitrate (ABR) algorithm to dynamically determine the downsampling rate according to the network condition and stream the selected video rate to the …
Poster
Zichao Yue · Chenhui Deng · Zhiru Zhang

[ Mission City Ballroom ]

Abstract
Graph neural networks (GNNs) are widely used for learning node embeddings in graphs, typically adopting a message-passing scheme. This approach, however, leads to the neighbor explosion problem, with exponentially growing computational and memory demands as layers increase. Graph sampling has become the predominant method for scaling GNNs to large graphs, mitigating but not fully solving the issue. Pre-propagation GNNs (PP-GNNs) represent a new class of models that decouple feature propagation from training through pre-processing, addressing neighbor explosion in theory. Yet, their practical advantages and system-level optimizations remain underexplored. This paper provides a comprehensive characterization of PP-GNNs, comparing them with graph-sampling-based methods in training efficiency, scalability, and accuracy. While PP-GNNs achieve comparable accuracy, we identify data loading as the key bottleneck for training efficiency and input expansion as a major scalability challenge. To address these issues, we propose optimized data loading schemes and tailored training methods that improve PP-GNN training throughput by an average of 15$\times$ over the PP-GNN baselines, with speedups of up to two orders of magnitude compared to sampling-based GNNs on large graph benchmarks. Our implementation is publicly available at https://github.com/cornell-zhang/preprop-gnn.
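The decoupling that PP-GNNs rely on can be shown in a few lines (a generic SGC-style sketch, not this paper's specific models): feature propagation over the normalized adjacency is performed once as pre-processing, after which training only touches fixed node-wise feature vectors.

```python
# Generic pre-propagation sketch: propagate features offline, then train a
# plain MLP on the precomputed features. Graph, sizes, and hop count are hypothetical.
import torch
import torch.nn as nn

n, d, classes, hops = 200, 16, 4, 2
edges = torch.randint(0, n, (2, 800))                      # random toy graph
A = torch.zeros(n, n)
A[edges[0], edges[1]] = 1.0
A = ((A + A.T + torch.eye(n)) > 0).float()                 # symmetrize + self-loops
deg = A.sum(dim=1)
A_hat = A / torch.sqrt(deg[:, None] * deg[None, :])        # D^-1/2 A D^-1/2

X = torch.randn(n, d)
y = torch.randint(0, classes, (n,))

# Pre-processing: K-hop propagation done once, no gradients needed.
with torch.no_grad():
    X_prop = X.clone()
    for _ in range(hops):
        X_prop = A_hat @ X_prop

# Training then looks like ordinary mini-batch MLP training on X_prop,
# which is why data loading (not message passing) becomes the bottleneck.
mlp = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, classes))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-2)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(mlp(X_prop), y)
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.3f}")
```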
Poster
Abhishek Moitra · Arkapravo Ghosh · Shrey Agrawal · Aporva Amarnath · Karthik Swaminathan · Priyadarshini Panda

[ Mission City Ballroom ]

Abstract
The computational and memory challenges of large language models (LLMs) have sparked several optimization approaches towards their efficient implementation. While prior works on LLM-targeted quantization and sparse acceleration have significantly mitigated the memory and computation bottleneck, they do so assuming high-power platforms such as GPUs and server-class FPGAs with large off-chip memory bandwidths, and they employ a generalized matrix multiplication (GEMM) execution of all the layers in the decoder. In such a GEMM-based execution, data is fetched from an off-chip memory, computed on, and stored back. However, at reduced off-chip memory capacities, as is the case with low-power edge devices, this implementation strategy significantly increases the attention computation latency owing to the repeated storage and fetch of large intermediate tokens to and from the off-chip memory. Moreover, fetching the weight matrices from a bandwidth-constrained memory further aggravates the memory bottleneck problem. To this end, we introduce MEADOW, a framework that significantly reduces the off-chip memory access for LLMs with a novel token-parallel head-sequential (TPHS) dataflow. Additionally, MEADOW applies weight packing, which performs loss-less decomposition of large weight matrices into their unique elements, thereby reducing the enormous weight fetch latency. MEADOW demonstrates 1.5$\times$ and 2.5$\times$ lower decode and prefill …
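A minimal sketch of what a loss-less unique-element decomposition can look like (illustrative only; MEADOW's actual packing format is not described here): store the unique values once plus small integer indices, and reconstruct the weight matrix exactly on demand.

```python
# Illustrative loss-less decomposition of a weight matrix into its unique
# elements plus an index map; not MEADOW's actual packing format.
import numpy as np

rng = np.random.default_rng(0)
# A low-precision weight matrix has few distinct values, e.g. int4-quantized.
W = rng.integers(-8, 8, size=(1024, 1024)).astype(np.float16)

values, indices = np.unique(W, return_inverse=True)
indices = indices.astype(np.uint8).reshape(W.shape)   # 16 unique values fit in one byte

W_reconstructed = values[indices]                     # exact reconstruction
assert np.array_equal(W_reconstructed, W)
print(f"unique values: {len(values)}, "
      f"packed bytes: {values.nbytes + indices.nbytes}, original: {W.nbytes}")
```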
Poster
Jianheng Ling · Pratik Worah · Yawen Wang · Yunchuan Kong · Chunlei Wang · Clifford Stein · Diwakar Gupta · Jason Behmer · Logan Bush · Prakash Ramanan · Rajesh Kumar · Thomas Chestna · Yajing Liu · Ying Liu · Ye Zhao · Kathryn S. McKinley · Meeyoung Park · Martin Maas

[ Mission City Ballroom ]

Abstract
Scheduling virtual machines (VMs) to hosts in cloud data centers dictates efficiency and is an NP-hard problem with incomplete information. Prior work improved VM scheduling with predicted VM lifetimes. Our work further improves lifetime-aware scheduling using repredictions with lifetime distributions vs. one-shot prediction. The approach repredicts and adjusts VM and host lifetimes when incorrect predictions emerge. We also present novel approaches for defragmentation and regular system maintenance, which are essential to our data center reliability and optimizations, and are unexplored in prior work. We show that repredictions deliver a fundamental advance in effectiveness over one-shot prediction. We call our novel combination of distribution-based lifetime predictions and scheduling algorithms Lifetime Aware VM Allocation (LAVA). LAVA reduces resource stranding and increases the number of empty hosts, which are critical for large VM scheduling, cloud system updates, and reducing dynamic energy consumption. Our approach runs in production within AnonCorp’s hyperscale cloud data centers, where it improves efficiency by decreasing stranded compute and memory resources by ~3% and ~2% respectively, and increases availability for large VMs and cloud system updates by increasing empty hosts by 2.3-9.2 pp in production. We also show a reduction in VM migrations for host defragmentation and maintenance. In addition to our …
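The reprediction idea can be illustrated with a small sketch (hypothetical data and distribution, not LAVA's production predictor): given a predicted lifetime distribution, the expected remaining lifetime of a VM is recomputed conditionally on how long it has already run, rather than keeping the one-shot point estimate.

```python
# Sketch of distribution-based lifetime reprediction: condition the predicted
# lifetime distribution on the VM's current uptime. Data and distribution are
# hypothetical, not LAVA's production predictor.
import numpy as np

rng = np.random.default_rng(0)
# Pretend the predictor outputs samples from a heavy-tailed lifetime distribution (hours).
lifetime_samples = rng.lognormal(mean=2.0, sigma=1.5, size=100_000)

one_shot = lifetime_samples.mean()  # one-shot prediction: unconditional expectation

def repredict_remaining(uptime_hours: float) -> float:
    """Expected remaining lifetime given the VM has already survived `uptime_hours`."""
    survivors = lifetime_samples[lifetime_samples > uptime_hours]
    return (survivors - uptime_hours).mean()

for uptime in (0, 10, 100, 1000):
    print(f"uptime {uptime:>5} h -> expected remaining {repredict_remaining(uptime):8.1f} h "
          f"(one-shot said {one_shot:.1f} h total)")
```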

Closing Remarks Thu 15 May 06:00 p.m.