


Session 6: Edge and Cloud Systems

Wed 14 May 1:15 p.m. PDT — 2:40 p.m. PDT


A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers

Chenxi Yang · Yan Li · Martin Maas · Mustafa Uysal · Ubaid Hafeez · Arif Merchant · Richard McDougall

Storage systems account for a major portion of the total cost of ownership (TCO) of warehouse-scale computers, and thus have a major impact on the overall system's efficiency. Machine learning (ML)-based methods for solving key problems in storage system efficiency, such as data placement, have shown significant promise. However, there are few known practical deployments of such methods. Studying this problem in the context of real-world hyperscale data center deployments at $AnonCorp$, we identify a number of challenges that we believe cause this lack of practical adoption. Specifically, prior work assumes a monolithic model that resides entirely within the storage layer, an unrealistic assumption in real-world data center deployments. We propose a cross-layer approach that moves ML out of the storage system and into the applications running on top of it, co-designed with a scheduling algorithm at the storage layer that consumes predictions from these application-level models. This approach combines small, interpretable models with a co-designed heuristic that adapts to different online environments. We build a proof-of-concept of this approach in a production distributed computation framework at $AnonCorp$. Evaluations in a test deployment and large-scale simulation studies using production traces show improvements of as much as 3.47$\times$ in TCO savings compared to state-of-the-art baselines. We believe this work represents a significant step towards more practical ML-driven storage placement in warehouse-scale computers.
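
The abstract describes the cross-layer split only at a high level; the sketch below is a minimal illustration of that shape, assuming an application-level model that attaches a placement hint to each write and a storage-layer heuristic that combines the hint with its own online state. The feature names, the toy logistic scorer, the SSD/HDD tiers, and the adaptive threshold are all illustrative assumptions, not the paper's actual design.

```python
# Minimal illustration of the cross-layer split (assumed interface, not the
# paper's actual design): the application computes a predicted access score,
# and the storage layer's heuristic decides placement from that hint plus its
# own online state.

import math
from dataclasses import dataclass

def application_side_hint(features: dict) -> float:
    """Small, interpretable model inside the application: a toy logistic
    scorer over hypothetical features (temporary output? past re-read ratio?)."""
    w = {"bias": -0.5, "is_temp_output": -2.0, "reread_ratio": 3.0}
    z = (w["bias"]
         + w["is_temp_output"] * features.get("is_temp_output", 0.0)
         + w["reread_ratio"] * features.get("reread_ratio", 0.0))
    return 1.0 / (1.0 + math.exp(-z))

@dataclass
class WriteRequest:
    path: str
    size_bytes: int
    predicted_hot_prob: float  # hint produced by the application-level model

class PlacementScheduler:
    """Storage-layer heuristic consuming application-level predictions."""

    def __init__(self, ssd_capacity_bytes: int):
        self.ssd_capacity = ssd_capacity_bytes
        self.ssd_free = ssd_capacity_bytes
        self.threshold = 0.5

    def place(self, req: WriteRequest) -> str:
        # Adapt to the online environment: demand a more confident prediction
        # as the fast tier fills up.
        fill = 1.0 - self.ssd_free / self.ssd_capacity
        effective_threshold = self.threshold + 0.4 * fill
        if req.predicted_hot_prob >= effective_threshold and req.size_bytes <= self.ssd_free:
            self.ssd_free -= req.size_bytes
            return "SSD"
        return "HDD"

scheduler = PlacementScheduler(ssd_capacity_bytes=1 << 40)  # 1 TiB fast tier
hint = application_side_hint({"is_temp_output": 0.0, "reread_ratio": 0.8})
tier = scheduler.place(WriteRequest("/data/part-0001", 256 << 20, hint))
```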


Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm

Baichuan Huang · Amir Aminifar

The training of state-of-the-art Deep Neural Networks (DNNs) consumes massive amounts of energy, while the human brain learns new tasks with remarkable efficiency. Currently, the training of DNNs relies almost exclusively on Backpropagation (BP). However, BP faces criticism due to its biologically implausible nature, underscoring the significant disparity in performance and energy efficiency between DNNs and the human brain. Forward-only algorithms have been proposed as biologically plausible alternatives to BP, to better mimic the learning process of the human brain and enhance energy efficiency. In this paper, we propose a biologically-plausible forward-only algorithm (Bio-FO) that not only targets the biological-implausibility issues associated with BP, but also outperforms state-of-the-art forward-only algorithms. We extensively evaluate Bio-FO against other forward-only algorithms and demonstrate its performance across diverse datasets, including two real-world medical applications on wearable devices with limited resources, as well as relatively large-scale datasets such as mini-ImageNet. At the same time, we implement our proposed on-device learning algorithm on the NVIDIA Jetson Nano and demonstrate its efficiency compared to other state-of-the-art forward-only algorithms. The code is available at https://github.com/whubaichuan/Bio-FO.
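
Bio-FO's exact update rule is not spelled out in the abstract; the sketch below only illustrates the general shape of forward-only, layer-local training that such algorithms share: each block is trained against its own local objective during the forward pass, so no end-to-end backward pass through the whole network is needed. The block sizes, local cross-entropy heads, and SGD optimizers are placeholder assumptions, not Bio-FO itself.

```python
# Generic forward-only / layer-local training sketch (assumed structure, not
# Bio-FO itself): each block has its own local classifier head and optimizer;
# gradients never flow between blocks, so there is no end-to-end backpropagation.

import torch
import torch.nn as nn

class LocalBlock(nn.Module):
    def __init__(self, in_dim, out_dim, num_classes):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.head = nn.Linear(out_dim, num_classes)  # local auxiliary classifier

    def forward(self, x):
        return self.body(x)

def forward_only_step(blocks, optimizers, x, y, criterion):
    """One training step: each block learns from a purely local loss."""
    h = x
    for block, opt in zip(blocks, optimizers):
        h = block(h.detach())               # detach: cut the graph to earlier blocks
        loss = criterion(block.head(h), y)  # local objective for this block only
        opt.zero_grad()
        loss.backward()                     # gradients stay inside this block
        opt.step()
    return h

blocks = [LocalBlock(784, 256, 10), LocalBlock(256, 128, 10)]
optimizers = [torch.optim.SGD(b.parameters(), lr=0.01) for b in blocks]
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
forward_only_step(blocks, optimizers, x, y, criterion)
```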


Optimizing LLM Queries in Relational Data Analytics Workloads

Shu Liu · Asim Biswal · Audrey Cheng · Amog Kamsetty · Luis Gaspar Schroeder · Liana Patel · Shiyi Cao · Xiangxi Mo · Ion Stoica · Joseph Gonzalez · Matei Zaharia

Batch data analytics has become a growing application for Large Language Models (LLMs). LLMs enable users to perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. However, LLM inference is highly expensive in both computational and monetary costs: for example, an NVIDIA L4 GPU running Llama3-8B can only process 6 KB of text per second, taking about a day to handle 15 GB of data; and processing a similar amount of data costs around $10K on OpenAI’s GPT-4o. In this paper, we propose novel techniques that can significantly reduce the cost of LLM calls for relational data analytics workloads. Our key contribution is developing efficient algorithms for reordering the rows and the fields within each row of an input table to maximize key-value (KV) cache reuse when performing LLM serving. Our approach can be easily applied to existing analytics systems and serving platforms. Evaluations show that our solution can yield up to 3.4× improvement in end-to-end latency on a benchmark of diverse LLM-based queries using Llama 3 models. Our solutions also achieve 32% cost savings using OpenAI and Anthropic prefix cache pricing models.
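
The core algorithmic idea, reordering columns so that fields shared across many rows form a common prompt prefix and then ordering rows so that neighboring requests share the longest possible prefix, can be sketched roughly as below. The low-cardinality-first column ordering and the lexicographic row sort are illustrative heuristics, not necessarily the paper's exact algorithms.

```python
# Sketch of prefix-maximizing reordering for LLM-over-table workloads
# (illustrative heuristic, not the paper's exact algorithm): put low-cardinality
# fields first so their values repeat across rows, then sort rows
# lexicographically so consecutive prompts share long prefixes and can hit the
# serving system's KV (prefix) cache.

from typing import Dict, List

def reorder_for_prefix_reuse(rows: List[Dict[str, str]]) -> List[List[str]]:
    if not rows:
        return []
    fields = list(rows[0].keys())
    # Fields with few distinct values go first: they are most likely to be
    # shared between adjacent rows and thus served from the prefix cache.
    fields.sort(key=lambda f: len({r[f] for r in rows}))
    reordered = [[r[f] for f in fields] for r in rows]
    # Lexicographic sort groups rows with identical leading field values,
    # maximizing the shared prompt prefix between consecutive LLM calls.
    reordered.sort()
    return reordered

rows = [
    {"country": "US", "category": "toys",  "review": "Great product"},
    {"country": "US", "category": "toys",  "review": "Broke quickly"},
    {"country": "DE", "category": "books", "review": "Sehr gut"},
]
for values in reorder_for_prefix_reuse(rows):
    prompt = "Classify sentiment. " + " | ".join(values)
    # send `prompt` to the serving system; shared prefixes reuse cached KV state
```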


ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud

Lu Wang · Mayukh Das · Fangkai Yang · Bo Qiao · Hang Dong · Si Qin · Victor Ruehle · Chetan Bansal · Eli Cortez · Íñigo Goiri · S R · Qingwei Lin · Dongmei Zhang

Safe optimization of operating costs is one of the holy grails of successful revenue-generating cloud systems, and capacity/resource efficiency is a key factor in making that a reality. Among the strategies for resource efficiency used by major cloud providers, oversubscription is an extremely prevalent practice in which more virtual resources are offered than actual physical capacity, to minimize revenue loss from redundant capacity. While resources can be of any type, including compute, memory, power, or network bandwidth, we highlight the scenario of virtual CPU (vCPU) oversubscription, since vCPU cores are the primary billable units for cloud services and have a substantial impact on both business and users. Suitable policies for controlling oversubscription margins are crucial for a seamless cloud experience that is also cost-efficient for the provider. Narrow margins lead to redundant expenditure on under-utilized resource capacity, while wider margins lead to under-provisioning, where customer workloads may suffer from resource contention. Most oversubscription policies today are engineered either with tribal knowledge or with static heuristics about the system, which leads to catastrophic overloading or stranded/under-utilized resources. Designing smart oversubscription policies that adapt to demand/utilization patterns across time and granularity, jointly optimizing cost benefits and risks, is a non-trivial and largely unsolved problem. We address this challenge with our proposed Prototypical Risk-cognizant Active Imitation Learning (ProtoRAIL) framework, which exploits approximate symmetries in utilization patterns to learn suitable policies. An active knowledge-in-the-loop (KITL) module de-risks the learned policies. Our empirical investigations and real deployments on Microsoft's internal (1$^{st}$-party) cloud service show an orders-of-magnitude reduction ($\approx \geq 90\times$) in risk and a significant increase in benefits (saved stranded resources in the range of $\approx 7$-$10\%$).
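
ProtoRAIL's learned policy and KITL module are not specified in the abstract; the sketch below only illustrates the shape of the decision being made: propose an oversubscription margin from recent vCPU utilization, then let a risk check veto or clamp it before it is applied. The percentile rule, the margin mapping, and the overload budget are all placeholder assumptions, not the paper's method.

```python
# Illustrative shape of an adaptive vCPU oversubscription decision (assumed,
# not ProtoRAIL's actual policy): choose a margin from recent utilization,
# then let a risk guard reject anything likely to overload the node.

import numpy as np

def propose_margin(util_history: np.ndarray) -> float:
    """Map a node's recent peak utilization to an oversubscription ratio.

    util_history: fraction of physical cores in use over a trailing window.
    Returns how many virtual cores to offer per physical core (>= 1.0).
    """
    peak = np.percentile(util_history, 99)   # near-worst-case observed demand
    headroom = max(0.0, 1.0 - peak)
    return 1.0 + 2.0 * headroom              # placeholder mapping

def risk_guard(margin: float, util_history: np.ndarray,
               max_overload_prob: float = 0.01) -> float:
    """Knowledge-in-the-loop style check: reject margins whose implied
    overload probability (estimated from history) exceeds a budget."""
    overload_prob = float(np.mean(util_history * margin > 1.0))
    return margin if overload_prob <= max_overload_prob else 1.0  # no oversubscription

history = np.clip(np.random.normal(0.45, 0.1, size=24 * 60), 0.0, 1.0)
safe_margin = risk_guard(propose_margin(history), history)
```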


SwiftVI: Time-Efficient Planning and Learning with MDPs

Kasper Overgaard Mortensen · Konstantinos Skitsas · Emil Morre Christensen · Mohammad Sadegh Talebi · Andreas Pavlogiannis · Davide Mottin · Panagiotis Karras

Markov decision processes (MDPs) find application wherever a decision-making agent acts and learns in an uncertain environment, from facility management to healthcare and service provisioning. However, finding the optimal policy such an agent should follow incurs a high computational cost, calling for solutions that scale to large numbers of actions and states. In this paper, we propose SwiftVI, a suite of algorithms that solve MDPs scalably by organizing the set of actions for each state in priority queues and deriving bounds for backup Q-values. Our championed solution prunes the set of actions at each state utilizing a tight upper bound and a single priority queue. A thorough experimental study confirms that the SwiftVI algorithms achieve high efficiency gains robustly across model parameters.
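
The abstract's key idea, keeping each state's actions in a priority queue keyed by an upper bound on their backup Q-value and computing the exact backup only for the most promising actions, can be sketched as follows. The bound construction here (previous Q-value plus the maximum value change from the last sweep) is an illustrative choice, not necessarily the bound derived in SwiftVI.

```python
# Sketch of bound-based action pruning inside value iteration (illustrative;
# SwiftVI's exact bounds and queue maintenance may differ). Actions sit in a
# max-heap keyed by an upper bound on their Q-value; exact backups stop once
# the best exact value beats every remaining bound.

import heapq

def swift_like_backup(state, actions, Q_prev, P, R, V, gamma, delta):
    """One Bellman backup for `state` that prunes actions via upper bounds.

    Q_prev[(s, a)] : Q-values cached from the previous sweep (cheap lookup)
    P[(s, a)]      : list of (next_state, probability) transitions
    R[(s, a)]      : immediate reward
    V              : current state-value estimates
    delta          : maximum |change in V| observed in the previous sweep
    """
    def exact_q(a):
        return R[(state, a)] + gamma * sum(p * V[s2] for s2, p in P[(state, a)])

    # Illustrative upper bound: the old Q can have grown by at most gamma*delta.
    heap = [(-(Q_prev[(state, a)] + gamma * delta), a) for a in actions]
    heapq.heapify(heap)

    best = float("-inf")
    while heap:
        neg_bound, a = heapq.heappop(heap)
        if -neg_bound <= best:        # no remaining bound can beat `best`,
            break                     # so the rest of the queue is pruned
        best = max(best, exact_q(a))  # exact backup only for promising actions
    return best
```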