

Session

Measurement and Analysis

Mission B4 & B6
Thu 16 May 1:30 p.m. PDT — 3 p.m. PDT

Thu 16 May 13:30 - 13:50 PDT

2
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation

Yifei Xu · Yuning Chen · Xumiao Zhang · Xianshang Lin · Pan Hu · Yunfei Ma · Songwu Lu · Wan Du · Zhuoqing Mao · Ennan Zhai · Dennis Cai

Among the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. We further enhanced the dataset to meet practical needs by rephrasing questions in a concise, abbreviated, and bilingual manner. The dataset consists of 1011 problems that take more than 1200 human hours to complete. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20 times speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 12 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost.

Thu 16 May 13:50 - 14:10 PDT

35
Does Compressing Activations Help Model Parallel Training?

Song Bian · Dacheng Li · Hongyi Wang · Eric Xing · Shivaram Venkataraman

Foundation models have superior performance across a wide array of machine learning tasks. The training of these models typically involves model parallelism (MP) to navigate the constraints of GPU memory capacity. However, MP strategies involve transmitting model activations between GPUs, which can hinder training speed in large clusters. Previous research has examined gradient compression in data-parallel contexts, but its applicability in MP settings remains largely unexplored. In this paper, we investigate the unique characteristics of compression in MP and study why strategies from gradient compression might not be directly applicable to MP scenarios. Subsequently, to systematically understand the capabilities and limitations of Model Parallelism Compression, we present a benchmarking framework, MCBench. MCBench not only includes four major categories of compression algorithms but also includes several widely used models spanning language and vision tasks on a well-established distributed training framework, Megatron-LM. We initiate the first comprehensive empirical study by using MCBench. Our empirical study encompasses both the fine-tuning and pre-training of FMs. We probe over 200 unique training configurations and present results using 10 widely used datasets. To comprehend the scalability of compression advantages with the expansion of model size and cluster size, we propose a novel cost model designed specifically for training with MP compression. The insights derived from our findings can help direct the future development of new MP compression algorithms for distributed training.
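
The trade-off behind the abstract's question can be illustrated with a toy calculation. This is not the paper's cost model, only a minimal sketch under a simple assumption: compressing an activation pays off when the time saved on the wire exceeds the time spent encoding and decoding. The function names, parameters, and numbers are hypothetical.

```python
def transfer_time(num_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time to send num_bytes over a link, ignoring latency (an assumption)."""
    return num_bytes / bandwidth_bytes_per_s


def compression_helps(
    activation_bytes: float,
    bandwidth_bytes_per_s: float,
    compression_ratio: float,
    overhead_s: float,
) -> bool:
    """Return True if sending the compressed activation is faster overall.

    compression_ratio: original size divided by compressed size (hypothetical).
    overhead_s: combined encode and decode time in seconds (hypothetical).
    """
    baseline = transfer_time(activation_bytes, bandwidth_bytes_per_s)
    compressed = (
        transfer_time(activation_bytes / compression_ratio, bandwidth_bytes_per_s)
        + overhead_s
    )
    return compressed < baseline


# Slow link (1 GB/s): 1.0 s uncompressed vs 0.25 s + 0.5 s overhead.
slow_link = compression_helps(1e9, 1e9, 4, 0.5)
# Fast link (10 GB/s): 0.1 s uncompressed vs 0.025 s + 0.5 s overhead.
fast_link = compression_helps(1e9, 1e10, 4, 0.5)
```

In this toy setting the same compressor helps on the slow link and hurts on the fast one, which is one intuition for why benefits may depend on cluster interconnect and model size. The sketch deliberately ignores any effect of lossy compression on model accuracy.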

Thu 16 May 14:20 - 14:40 PDT

17
COMET: Neural Cost Model Explanation Framework

Isha Chaudhary · Alex Renda · Charith Mendis · Gagandeep Singh

Cost models predict the cost of executing given assembly code basic blocks on a specific microarchitecture. Recently, neural cost models have been shown to be fairly accurate and easy to construct. They can replace heavily engineered analytical cost models used in mainstream compiler workflows. However, their black-box nature discourages their adoption. In this work, we develop the first framework, COMET, for generating faithful, generalizable, and intuitive explanations for neural cost models. We generate and compare COMET’s explanations for the popular neural cost model, Ithemal, against those for an accurate CPU simulation-based cost model, uiCA. Our empirical findings show an inverse correlation between the prediction errors of Ithemal and uiCA and the granularity of basic block features in COMET’s explanations for them, thus indicating potential reasons for the higher error of Ithemal with respect to uiCA.

Thu 16 May 14:40 - 15:00 PDT

1
Vidur: A Large-Scale Simulation Framework for LLM Inference

Amey Agrawal · Nitin Kedia · Jayashree Mohan · Ashish Panwar · Nipun Kwatra · Bhargav Gulavani · Ramachandran Ramjee · Alexey Tumanov

Large language models (LLMs) are widely used in various domains for their ability to perform tasks that require human-like skills. However, LLM inference is expensive today. Furthermore, optimizing LLM inference is challenging, as its performance depends on many configuration options such as model parallelization strategy, the batching algorithm, scheduling policy, maximum batch size allowed, etc. Identifying the optimal configuration for a large-scale cluster by experimentally running hundreds of configuration combinations is impractical due to the exorbitant time and monetary cost involved. To tackle this challenge, we present VIDUR and VIDUR-BENCH, the first large-scale, high-fidelity, collaborative, and easily extensible simulation framework for LLM inference alongside a benchmark suite. VIDUR carefully models the performance of various operators involved in LLM inference using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end model inference performance for different workloads by estimating several key performance metrics such as latency, throughput, and time-to-first-byte. We experimentally validate our simulator on several LLMs and show that it can estimate metrics such as inference latency and throughput with less than 5% error rate. VIDUR also helps answer large-scale deployment related what-if questions such as what is the best tensor-parallel dimension to maximize serving throughput of the LlaMa-7B model across 32 A100 GPUs? We will open-source the simulator code, along with the workload benchmark suite, so that researchers and practitioners can collaboratively explore model and systems optimizations for efficient deployment of LLMs.
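
To make the what-if question above concrete, the candidate configurations can be enumerated before any simulation is run. The following is a minimal sketch, not Vidur's API: it assumes the only choice is how to split a fixed GPU budget into model replicas of a given tensor-parallel degree, and the function name is hypothetical.

```python
def tensor_parallel_options(total_gpus: int) -> list[tuple[int, int]]:
    """List (tensor_parallel_degree, num_replicas) pairs that use every GPU.

    Assumption: each replica uses exactly tensor_parallel_degree GPUs and
    there is no pipeline parallelism, so the degree must divide total_gpus.
    """
    return [
        (tp, total_gpus // tp)
        for tp in range(1, total_gpus + 1)
        if total_gpus % tp == 0
    ]


# Candidate splits for the 32-GPU scenario mentioned in the abstract.
options = tensor_parallel_options(32)
for tp_degree, replicas in options:
    print(f"tensor-parallel degree {tp_degree:2d} x {replicas:2d} replicas")
```

Even this one-dimensional search yields six candidates for 32 GPUs. Crossing it with batching algorithm, scheduling policy, and batch size is what makes exhaustive experimental evaluation expensive. Real deployments add further constraints, such as the model's attention head count and the number of GPUs per node, which this sketch does not model.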