

Timezone: US/Pacific

Registration Desk: Registration Check-in Desk Thu 16 May 07:00 a.m.  


Poster: Performance and Memory Thu 16 May 09:00 a.m.  

Poster
Size Zheng · Renze Chen · Meng Li · Zihao Ye · Luis Ceze · Yun Liang


Abstract
IoT devices based on microcontroller units (MCU) provide ultra-low power consumption and ubiquitous computation for near-sensor deep learning models (DNN). However, the memory of an MCU is usually 2-3 orders of magnitude smaller than that of mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management and kernel implementation for MCUs and relies on coarse-grained memory management techniques such as in-place update to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of the MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment load and store while computing DNN layers. Memory consumption can be reduced because, with fine-grained segment-level memory control, we can overlap the memory footprints of different tensors without materializing them at the same time. Following this idea, we implement \ours{} for DNN inference on MCUs. Evaluation of single layers on ARM Cortex-M4 and Cortex-M7 processors shows that \ours{} can reduce RAM usage by 12.0% to 49.5% and energy consumption by 20.6% to 53.0% compared to state-of-the-art work. For full DNN evaluation, \ours{} …
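
A minimal Python sketch of the segment-level idea described above follows. The MemoryPool class, the pool size, and the pointwise kernel are hypothetical illustrations of how an input and an output tensor can overlap inside one kernel's memory footprint; they are not the \ours{} implementation evaluated in the paper.

class MemoryPool:
    """Tiny pool with explicit segment alloc/free over a flat buffer."""
    def __init__(self, size):
        self.buf = [0.0] * size
        self.free_segments = [(0, size)]  # (offset, length)

    def alloc(self, length):
        for i, (off, seg_len) in enumerate(self.free_segments):
            if seg_len >= length:
                self.free_segments[i] = (off + length, seg_len - length)
                return off
        raise MemoryError("pool exhausted")

    def free(self, offset, length):
        self.free_segments.append((offset, length))


def pointwise_kernel(pool, in_offsets, seg_len, op):
    """Process one segment at a time: free each input segment before
    allocating the matching output segment, so input and output tensors
    never need to be fully materialized at once."""
    out_offsets = []
    for off in in_offsets:
        data = pool.buf[off:off + seg_len]      # load this input segment
        pool.free(off, seg_len)                 # the segment is dead after the read
        out = pool.alloc(seg_len)               # typically reuses the slot just freed
        pool.buf[out:out + seg_len] = [op(x) for x in data]
        out_offsets.append(out)
    return out_offsets


if __name__ == "__main__":
    pool = MemoryPool(size=8)                   # smaller than input + output combined
    inputs = [pool.alloc(4), pool.alloc(4)]     # a toy tensor of two 4-element segments
    print(pointwise_kernel(pool, inputs, 4, lambda x: x + 1.0))

Running the sketch shows that an 8-element pool suffices for a layer whose input and output would otherwise require 16 elements, which is the footprint overlap the abstract refers to.
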
Poster
Zhixu Du · Shiyu Li · Yuhao Wu · Xiangyu Jiang · Jingwei Sun · Qilin Zheng · Yongkai Wu · Ang Li · Hai Li · Yiran Chen


Abstract

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage, i.e., enlarging model capacity without incurring notable computational overhead. Yet, realizing these benefits often results in ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. Addressing this, we introduce SiDA-MoE (Sparsity-inspired Data-Aware), an efficient inference approach tailored for large MoE models. SiDA-MoE judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory by capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA-MoE achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA-MoE attains a remarkable speedup in MoE inference, with up to a 3.93x throughput increase, up to 72% latency reduction, and up to 80% GPU memory saving, with a performance drop as low as 1%. This work paves the way for scalable and efficient deployment of large MoE models, even with constrained resources. Code is available at: https://github.com/timlee0212/SiDA-MoE.
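
The core placement idea, keeping every expert host-resident and holding only the experts a data-aware predictor expects to fire on the GPU, can be sketched as follows. The ExpertStore class, the toy experts, and the prefetch interface are illustrative assumptions, not SiDA-MoE's API; the real implementation is in the repository linked above.

class ExpertStore:
    """Keep all experts in host memory; cache only predicted-active ones on the GPU."""
    def __init__(self, experts, gpu_slots):
        self.host = dict(experts)    # expert_id -> weights, always resident
        self.gpu = {}                # small working set of "uploaded" experts
        self.gpu_slots = gpu_slots

    def prefetch(self, predicted_active):
        """Before a batch runs, move the experts its inputs are predicted to route to."""
        wanted = list(predicted_active)[: self.gpu_slots]
        for eid in list(self.gpu):
            if eid not in wanted:
                del self.gpu[eid]                      # evict a dormant expert
        for eid in wanted:
            self.gpu.setdefault(eid, self.host[eid])   # upload on demand

    def run(self, token, expert_id):
        # Fast path hits the GPU cache; the slow path falls back to host memory.
        weights = self.gpu.get(expert_id, self.host[expert_id])
        return token * weights


experts = {i: float(i + 1) for i in range(8)}    # eight toy experts
store = ExpertStore(experts, gpu_slots=2)
store.prefetch(predicted_active=[3, 5])          # predictor says this batch uses experts 3 and 5
print(store.run(2.0, 3), sorted(store.gpu))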

Poster
Pratik Fegade · Tianqi Chen · Phillip Gibbons · Todd Mowry


Abstract

Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, early exit from deep models, and so on. However, the resulting control flow divergence makes batching, an important performance optimization, difficult to perform manually. In this paper, we present ACRoBat, a framework that enables efficient automatic batching for dynamic deep learning computations by performing hybrid static+dynamic compiler optimizations and end-to-end tensor code generation. ACRoBat performs up to 8.5x better than DyNet, a state-of-the-art framework for automatic batching, on an Nvidia GeForce RTX 3070 GPU.
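
The batching problem itself can be pictured with a toy dynamic scheduler: pending operations that share the same signature are bucketed and executed as one batched kernel call. This is a generic illustration of automatic batching, not ACRoBat's hybrid static+dynamic compiler; the operator names and kernels below are made up.

from collections import defaultdict

def batched_execute(pending_ops):
    """pending_ops: list of (op_name, input_value); results are returned in order."""
    groups = defaultdict(list)
    for idx, (name, x) in enumerate(pending_ops):
        groups[name].append((idx, x))                 # bucket by operator signature

    kernels = {"relu": lambda xs: [max(0.0, x) for x in xs],
               "double": lambda xs: [2.0 * x for x in xs]}

    results = [None] * len(pending_ops)
    for name, items in groups.items():
        idxs, xs = zip(*items)
        for i, y in zip(idxs, kernels[name](list(xs))):   # one batched kernel call per group
            results[i] = y
    return results

print(batched_execute([("relu", -1.0), ("double", 3.0), ("relu", 2.0)]))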


Invited Talk: Jeff Dean

Exciting Directions in Systems for Machine Learning

Jeff Dean

 

Jeff is the Chief Scientist for Google Research and Google DeepMind, and co-leads the Gemini project. He has worked for many years at the intersection of computer systems and machine learning, including work on ML accelerators, low-level software and frameworks for machine learning, sparse model architectures, algorithms like distillation and neural architecture search, training of large language and multimodal models, and applications of machine learning to areas like ASIC design, healthcare, and translation. He is a recipient of the ACM Prize in Computing, the IEEE John von Neumann Medal, the Mark Weiser Award, and best paper awards at NeurIPS, OSDI, OOPSLA, PLDI, SOSP, and MLSys. He is a Fellow of the ACM and a member of the US National Academy of Engineering and the AAAS.



Poster: Measurement and Analysis Thu 16 May 01:30 p.m.  

Poster
Yifei Xu · Yuning Chen · Xumiao Zhang · Xianshang Lin · Pan Hu · Yunfei Ma · Songwu Lu · Wan Du · Zhuoqing Mao · Ennan Zhai · Dennis Cai


Abstract

Among the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. We further enhance the dataset to meet practical needs by rephrasing questions in a concise, abbreviated, and bilingual manner. The dataset consists of 1011 problems that took more than 1200 human hours to complete. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20x speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 12 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost.
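
The unit-test-driven scoring described above can be approximated in a few lines of Python. The problem, its test, and the candidate YAML below are invented examples rather than items from the CloudEval-YAML dataset, and the real benchmark executes tests in actual cloud environments rather than on a parsed document; the sketch assumes PyYAML is installed.

import yaml  # PyYAML

def passes_unit_test(generated_yaml, test):
    """Parse the model output and run the problem's unit test against it."""
    try:
        doc = yaml.safe_load(generated_yaml)
    except yaml.YAMLError:
        return False                     # unparseable output scores zero
    return bool(test(doc))

# Hypothetical problem: "create a Deployment named web with 3 replicas".
def test_web_deployment(doc):
    return (doc.get("kind") == "Deployment"
            and doc.get("metadata", {}).get("name") == "web"
            and doc.get("spec", {}).get("replicas") == 3)

candidate = """
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
"""
print(passes_unit_test(candidate, test_web_deployment))   # True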

Poster
Song Bian · Dacheng Li · Hongyi Wang · Eric Xing · Shivaram Venkataraman


Abstract

Foundation models (FMs) have superior performance across a wide array of machine learning tasks. The training of these models typically involves model parallelism (MP) to navigate the constraints of GPU memory capacity. However, MP strategies involve transmitting model activations between GPUs, which can hinder training speed in large clusters. Previous research has examined gradient compression in data-parallel contexts, but its applicability in MP settings remains largely unexplored. In this paper, we investigate the unique characteristics of compression in MP and study why strategies from gradient compression might not be directly applicable to MP scenarios. Subsequently, to systematically understand the capabilities and limitations of Model Parallelism Compression, we present a benchmarking framework, MCBench. MCBench includes not only four major categories of compression algorithms but also several widely used models spanning language and vision tasks, on a well-established distributed training framework, Megatron-LM. We initiate the first comprehensive empirical study using MCBench. Our empirical study encompasses both the fine-tuning and pre-training of FMs. We probe over 200 unique training configurations and present results using 10 widely used datasets. To comprehend the scalability of compression advantages with the expansion of model size and cluster size, we propose a novel cost model designed specifically …
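
As a concrete example of the kind of compression such a benchmark covers, the sketch below applies top-k sparsification to the activations one model-parallel stage would send to the next. The functions and the toy activation vector are illustrative; the algorithms, granularity, and APIs inside MCBench may differ.

def topk_compress(activations, k):
    """Keep only the k largest-magnitude values; transmit (index, value) pairs."""
    ranked = sorted(range(len(activations)), key=lambda i: abs(activations[i]), reverse=True)
    return [(i, activations[i]) for i in ranked[:k]], len(activations)

def topk_decompress(pairs, length):
    out = [0.0] * length
    for i, v in pairs:
        out[i] = v
    return out

acts = [0.1, -2.0, 0.05, 3.0, -0.2]            # activations at a stage boundary
packed, n = topk_compress(acts, k=2)           # what stage i actually transmits
print(packed, topk_decompress(packed, n))      # stage i+1 reconstructs a sparse tensor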

Poster
Isha Chaudhary · Alex Renda · Charith Mendis · Gagandeep Singh


Abstract

Cost models predict the cost of executing given assembly-code basic blocks on a specific microarchitecture. Recently, neural cost models have been shown to be fairly accurate and easy to construct. They can replace heavily engineered analytical cost models used in mainstream compiler workflows. However, their black-box nature discourages their adoption. In this work, we develop COMET, the first framework for generating faithful, generalizable, and intuitive explanations for neural cost models. We generate and compare COMET's explanations for the popular neural cost model Ithemal against those for an accurate CPU-simulation-based cost model, uiCA. Our empirical findings show an inverse correlation between the prediction errors of Ithemal and uiCA and the granularity of the basic-block features in COMET's explanations for them, thus indicating potential reasons for the higher error of Ithemal with respect to uiCA.
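
To make the notion of an explanation for a cost model concrete, the sketch below scores each instruction in a basic block by how much the predicted cost changes when that instruction is removed. This perturbation-style attribution only illustrates the kind of question such explanations answer; it is not COMET's method, and toy_cost_model is a stand-in rather than Ithemal or uiCA.

def toy_cost_model(block):
    """Stand-in for a learned cost model: pretend memory ops cost 4 cycles, others 1."""
    return sum(4.0 if "[" in insn else 1.0 for insn in block)

def attribution(block, cost_model):
    base = cost_model(block)
    return {insn: base - cost_model([x for j, x in enumerate(block) if j != i])
            for i, insn in enumerate(block)}

block = ["mov rax, [rbx]", "add rax, 1", "mov [rbx], rax"]
print(attribution(block, toy_cost_model))   # larger deltas point at costlier instructions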

Poster
Amey Agrawal · Nitin Kedia · Jayashree Mohan · Ashish Panwar · Nipun Kwatra · Bhargav Gulavani · Ramachandran Ramjee · Alexey Tumanov


Abstract

Large language models (LLMs) are widely used in various domains for their ability to perform tasks that require human-like skills. However, LLM inference is expensive today. Furthermore, optimizing LLM inference is challenging, as its performance depends on many configuration options such as model parallelization strategy, the batching algorithm, scheduling policy, maximum batch size allowed, etc. Identifying the optimal configuration for a large-scale cluster by experimentally running hundreds of configuration combinations is impractical due to the exorbitant time and monetary cost involved. To tackle this challenge, we present VIDUR and VIDUR-BENCH, the first large-scale, high-fidelity, collaborative, and easily extensible simulation framework for LLM inference alongside a benchmark suite. VIDUR carefully models the performance of various operators involved in LLM inference using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end model inference performance for different workloads by estimating several key performance metrics such as latency, throughput, and time-to-first-byte. We experimentally validate our simulator on several LLMs and show that it can estimate metrics such as inference latency and throughput with less than 5% error rate. VIDUR also helps answer large-scale deployment-related what-if questions such as: what is the best tensor-parallel dimension to maximize serving throughput of the LlaMa-7B model across 32 A100 GPUs? We will open-source the simulator code, along with …
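
At its core, simulator-style estimation composes per-operator latencies, which VIDUR obtains from profiling plus predictive modeling, into an end-to-end figure. The sketch below does only that composition; the operator names, latency numbers, and the prefill/decode split are fabricated for illustration and are not VIDUR's actual model.

OP_LATENCY_MS = {                       # hypothetical per-layer operator profile
    ("attention", "prefill"): 0.80, ("mlp", "prefill"): 1.10,
    ("attention", "decode"): 0.05, ("mlp", "decode"): 0.07,
    ("sampling", "decode"): 0.30,
}

def estimate_request_latency(num_layers, decode_tokens):
    prefill = num_layers * (OP_LATENCY_MS[("attention", "prefill")] +
                            OP_LATENCY_MS[("mlp", "prefill")])
    per_step = num_layers * (OP_LATENCY_MS[("attention", "decode")] +
                             OP_LATENCY_MS[("mlp", "decode")]) + OP_LATENCY_MS[("sampling", "decode")]
    return prefill + decode_tokens * per_step    # end-to-end request latency estimate

print(f"estimated latency: {estimate_request_latency(32, decode_tokens=128):.1f} ms")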


Poster: ML for Systems Thu 16 May 03:30 p.m.  

Poster
Yash Akhauri · Mohamed Abdelfattah


Abstract

Efficient deployment of neural networks (NN) requires the co-optimization of accuracy and latency. For example, hardware-aware neural architecture search has been used to automatically find NN architectures that satisfy a latency constraint on a specific hardware device. Central to these search algorithms is a prediction model that is designed to provide a hardware latency estimate for a candidate NN architecture. Recent research has shown that the sample efficiency of these predictive models can be greatly improved through pre-training on some training devices with many samples, and then transferring the predictor to the test (target) device. Transfer learning and meta-learning methods have been used for this, but often exhibit significant performance variability. Additionally, the evaluation of existing latency predictors has been largely done on hand-crafted training/test device sets, making it difficult to ascertain design features that compose a robust and general latency predictor. To address these issues, we introduce a comprehensive suite of latency prediction tasks obtained in a principled way through automated partitioning of hardware device sets. We then design a general latency predictor to comprehensively study (1) the predictor architecture, (2) NN sample selection methods, (3) hardware device representations, and (4) NN operation encoding schemes. Building on conclusions from our study, we present …
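
The pretrain-then-transfer recipe mentioned above can be reduced to a toy example: a predictor fit on a well-sampled training device is recalibrated with only a couple of measurements from the target device. The single FLOPs-like feature, the data points, and the scale-factor transfer rule are all fabricated for illustration and are not the predictor studied in the paper.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b, with x as a single architecture feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Many (feature, latency) samples from the training device.
train_x, train_y = [1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = fit_linear(train_x, train_y)

# Only two target-device measurements: fit a single scale factor on top of the predictor.
few_x, few_y = [2.0, 4.0], [6.1, 11.8]
scale = sum(y / (a * x + b) for x, y in zip(few_x, few_y)) / len(few_x)

predict_target = lambda x: scale * (a * x + b)
print(round(predict_target(3.0), 2))   # estimated latency of an unseen architecture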

Poster
Haoran Qiu · Weichao Mao · Archit Patke · Shengkun Cui · Chen Wang · Hubertus Franke · Zbigniew Kalbarczyk · Tamer Basar · Ravi Iyer


Abstract

The emergence of ML in various cloud system management tasks (e.g., workload autoscaling and job scheduling) has become a core driver of ML-centric cloud platforms. However, there are still numerous algorithmic and systems challenges that prevent ML-centric cloud platforms from being production-ready. In this paper, we focus on the challenges of model performance variability and costly model retraining, introduced by dynamic workload patterns and heterogeneous applications and infrastructures in cloud environments. To address these challenges, we present FLASH, an extensible framework for fast model adaptation in ML-based system management tasks. We show how FLASH leverages existing ML agents and their training data to learn to generalize across applications/environments with meta-learning. FLASH can be easily integrated with an existing ML-based system management agent with a unified API. We demonstrate the use of FLASH by implementing three existing ML agents that manage (1) resource configurations, (2) autoscaling, and (3) server power. Our experiments show that FLASH enables fast adaptation to new, previously unseen applications/environments (e.g., 5.5x faster than transfer learning in the autoscaling task), indicating significant potential for adopting ML-centric cloud platforms in production.
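
The "learn to adapt quickly" ingredient behind such fast adaptation is meta-learning; a bare-bones MAML-style loop over scalar tasks is sketched below. It is only a schematic of the underlying technique with made-up tasks and learning rates, not FLASH's agents or API.

TASKS = [2.0, 3.0, 5.0]            # each "task" is fitting y = target with a scalar parameter
INNER_LR, META_LR = 0.4, 0.1

def loss_grad(theta, target):       # gradient of 0.5 * (theta - target) ** 2
    return theta - target

theta = 0.0                         # meta-initialization shared across tasks
for _ in range(200):
    meta_grad = 0.0
    for target in TASKS:
        adapted = theta - INNER_LR * loss_grad(theta, target)       # one fast inner step
        meta_grad += (1 - INNER_LR) * loss_grad(adapted, target)    # gradient through that step
    theta -= META_LR * meta_grad / len(TASKS)

new_task = 4.0                      # a previously unseen application/environment
print(round(theta - INNER_LR * loss_grad(theta, new_task), 3))      # after one adaptation step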

Poster
Shan Yu · Zhenting Zhu · Yu Chen · Hanchen Xu · Pengzhan Zhao · Yang Wang · Arthi Padmanabhan · Hugo Latapie · Harry Xu


Abstract

Video analytics is widely used in contemporary systems and services. At the forefront of video analytics are video queries that users develop to find objects of particular interest. Building upon the insight that video objects (e.g., humans, animals, cars), the center of video analytics, are similar in spirit to objects modeled by traditional object-oriented languages, we propose to develop an object-oriented approach to video analytics. This approach, named VQPy, consists of a front end (a Python variant with constructs that make it easy for users to express video objects and their interactions) as well as an extensible backend that can automatically construct and optimize pipelines based on video objects. We have implemented and open-sourced VQPy, which is currently used in a major tech company as part of their DeepVision framework.
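
The flavor of an object-oriented video query can be conveyed with ordinary Python classes. The names below (VideoObject, Person, is_loitering, run_query) are hypothetical and chosen for illustration only; the real constructs are defined by the open-sourced VQPy front end and are not reproduced here.

class VideoObject:
    """What a front end might expose for each detected, tracked object."""
    def __init__(self, cls, bbox, track_id):
        self.cls, self.bbox, self.track_id = cls, bbox, track_id

class Person(VideoObject):
    def is_loitering(self, seconds_visible):
        return seconds_visible > 30          # user-defined property on the object

def run_query(detections, visible_seconds):
    """A backend could plan and optimize this pipeline; here we simply filter in Python."""
    return [d.track_id for d in detections
            if isinstance(d, Person) and d.is_loitering(visible_seconds[d.track_id])]

frame = [Person("person", (10, 10, 50, 120), track_id=7),
         VideoObject("car", (100, 40, 260, 140), track_id=8)]
print(run_query(frame, {7: 45.0, 8: 12.0}))   # -> [7]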

Poster
Yichen Qian · Yongyi He · Rong Zhu · Jintao Huang · Zhijian Ma · Haibin Wang · Yaohua Wang · Xiuyu Sun · Defu Lian · Bolin Ding · Jingren Zhou


Abstract

Designing effective data manipulation methods is a long-standing problem in data lakes. Traditional methods, which rely on rules or machine learning models, require extensive human effort on collecting training data and tuning models. Recent methods apply Large Language Models (LLMs) to resolve multiple data manipulation tasks. They show clear performance benefits but still require customized designs to fit each specific task. This is very costly and cannot keep up with the requirements of big data lake platforms. In this paper, inspired by the cross-task generality of LLMs on NLP tasks, we take a first step toward designing an automatic and general solution for data manipulation tasks. We propose UniDM, a unified framework which establishes a new paradigm to process data manipulation tasks using LLMs. UniDM formalizes a number of data manipulation tasks in a unified form and abstracts three main general steps to solve each task. We develop an automatic context retrieval step that allows the LLMs to retrieve data from data lakes that may contain evidence and factual information. For each step, we design effective prompts to guide LLMs to produce high-quality results. Through our comprehensive evaluation on a variety of benchmarks, our …
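
The unified-steps idea (retrieve context from the data lake, build a task-specific prompt, then query the LLM) can be outlined in a few functions. The retrieval rule, the prompt wording, the toy table, and the stubbed call_llm below are illustrative assumptions and do not reflect UniDM's actual prompts or interfaces.

def retrieve_context(table, query_terms, k=2):
    """Naive retrieval: return the rows sharing the most terms with the request."""
    scored = sorted(table, key=lambda row: -sum(t in " ".join(row.values()) for t in query_terms))
    return scored[:k]

def build_prompt(task, record, context):
    ctx = "\n".join(str(r) for r in context)
    return f"Task: {task}\nContext rows:\n{ctx}\nRecord: {record}\nAnswer:"

def call_llm(prompt):
    return "<model output>"            # stand-in for a real model call

table = [{"name": "IBM", "city": "Armonk"}, {"name": "Intel", "city": "Santa Clara"}]
record = {"name": "IBM", "city": ""}   # a data-imputation style task
prompt = build_prompt("fill in the missing city", record, retrieve_context(table, ["IBM"]))
print(call_llm(prompt))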


Closing Remarks Thu 16 May 05:00 p.m.