Track: Research Track Oral Presentation: LLM Serving 2

Wed 20 May 8:30 - 8:45 PDT

BEAM: Joint Resource–Power Optimization for Energy-Efficient LLM Inference under SLO contraints

Hyunjae Lee ⋅ Sangjin Choi ⋅ Seungjae Lim ⋅ Youngjin Kwon

Large Language Model (LLM) serving is rapidly becoming one of the most power-intensive workloads in modern datacenters. Unlike training, where throughput dominates, inference must satisfy strict per-request latency targets such as Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT). Once an SLO is met, the remaining latency slack between the earliest possible completion and the deadline offers an opportunity for energy savings. Existing systems, however, exploit only one dimension of this trade-off: batching improves resource efficiency, while DVFS improves power efficiency. These two axes are tightly coupled, and optimizing one while fixing the other yields only a local optimum. We present BEAM, a fine-grained controller that dynamically co-optimizes resource and power efficiency under per-request SLOs. BEAM continuously allocates the available latency slack across both dimensions by jointly tuning GPU frequency, chunk size, and microbatch count in real time. Its event-driven design responds instantly to request arrivals and completions, while a lightweight predictive model enables sub-millisecond decision making with negligible overhead. Implemented atop the vLLM runtime, BEAM reduces end-to-end GPU energy consumption by up to 51\% compared to vLLM.

Wed 20 May 8:45 - 9:00 PDT

MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing

Zhaoyuan Su ⋅ Zeyu Zhang ⋅ Tingfeng Lan ⋅ Zirui Wang ⋅ Haiying Shen ⋅ Juncheng Yang ⋅ Yue Cheng

Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. To deal with these issues, we present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which repurposes the freed memory to dynamically expand KV cache capacity. These mechanisms enable state-preserving transitions that jointly coordinate weight precision and KV capacity at runtime. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45% and improves P95 TTFT by 2.2×–3.9× over full-precision serving, without compromising generation quality. Compared to planning-based quantization methods, MorphServe reduces average accuracy degradation by 41.3%, and lowers P95 TTFT by up to 2.4× over KV cache compression while maintaining higher generation quality. These results establish MorphServe as a practical and elastic solution that effectively navigates the accuracy–efficiency Pareto frontier under dynamic LLM serving workloads.

Wed 20 May 9:00 - 9:15 PDT

BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization

YOUHE JIANG ⋅ Fangcheng Fu ⋅ Eiko Yoneki

The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to enhance system cost-efficiency adopt two main perspectives: (\textbf{\underline{i}}) An \textit{algorithmic} perspective that exploits heterogeneous model capabilities to route simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (\textbf{\underline{ii}}) a \textit{systems} perspective that utilizes heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving necessitates sophisticated management: (\textbf{\underline{i}}) Determining optimal query routing strategies under latency and quality requirements, (\textbf{\underline{ii}}) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (\textbf{\underline{iii}}) co-optimizing routing and deployment decisions to maximize overall system performance. To address these challenges, we present BOute, a \textit{quality-aware scheduling system} that jointly exploits heterogeneous model and GPU capabilities for cost-efficient LLM serving. BOute employs a \textit{multi-objective Bayesian optimization (MOBO) framework} to co-optimize the routing strategy and model deployment, thereby maximizing the cost-efficiency of the serving system while guaranteeing response quality. Evaluation results demonstrate that \sys outperforms state-of-the-art LLM serving systems by up to 157\% and 59\% on average under \textit{identical} cost budgets and quality requirements, or reducing serving costs by 15\%-61\% (38\% on average) while maintaining the \textit{same} performance targets, validating its effectiveness in achieving cost-efficient LLM serving.

Wed 20 May 9:15 - 9:30 PDT

Locality-Aware Beam Scheduling for Efficient Test-Time Compute with a Consumer-grade GPU

Hsing-Ti Wang ⋅ Hung-Tso Shiao ⋅ Chia-Lin Yang

Large Language Models (LLMs) are central to modern NLP applications, yet their deployment on consumer-grade GPUs is limited by limited memory capacity and bandwidth. In typical single-batch inference on local devices, the key–value (KV) cache occupies only a small fraction of total memory, so prior studies have largely focused on model weights. The rise of test-time compute (TTC), however, introduces a new bottleneck: the rapidly expanding KV cache. In TTC methods such as step-wise beam search, concurrent decoding paths cause KV cache size and transfer costs to scale with exploration space, resulting in severe I/O stalls on consumer-grade GPUs. We identify two complementary forms of data locality in TTC workloads. Inter-token locality occurs within each decoding step, as consecutive tokens in the same beam access nearly identical KV cache data. Inter-beam locality arises across decoding steps, as beams that share common prefixes reuse overlapping KV segments. Building on these observations, we propose Locality-Aware Beam Scheduling, which exploits these locality patterns to reduce redundant KV cache transfers. It also employs balanced grouping with prefetching to overlap data movement with computation. Evaluated on OPT-6.7B, LLaMA-2-7B, and Qwen-7B, our method reduces KV cache transfer volume by over 95\% and achieves consistent end-to-end speedups of 3.39×–9.72×, 3.60×–8.74×, and 4.17×–7.99×, respectively, compared to layer-wise offloading.

Wed 20 May 9:30 - 9:45 PDT

Breaking the Ice: Analyzing Cold Start Latency in vLLM

Huzaifa Shaaban Kabakibo ⋅ Animesh Trivedi ⋅ Lin Wang

As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de-facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study on the startup latency of its engine. With major architectural innovations under it (e.g., the V1 API, introduction of torch.compile), in this paper, we present the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that this process is predominantly CPU-bound. Each step exhibits consistent and interpretable scaling trends with respect to model- and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM's startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All our benchmarking datasets, analysis tools, and prediction scripts are open-sourced at: https://github.com/upb-cn/vllm-startup-profiler

Wed 20 May 9:45 - 10:00 PDT

FaaScale: Unlocking Fast LLM Scaling for Serverless Inference

Minchen Yu ⋅ Rui Yang ⋅ Chaobo Jia ⋅ Zhaoyuan Su ⋅ Sheng Yao ⋅ Tingfeng Lan ⋅ Yuchen Yang ⋅ Zirui Wang ⋅ Yue Cheng ⋅ Wei Wang ⋅ Ao Wang ⋅ Ruichuan Chen

Serverless computing is an attractive paradigm for cloud-based large language model (LLM) inference, but scaling LLMs on demand remains a major challenge due to high data transfer cost. We present FaaScale, a serverless LLM system that enables fast and resource-efficient model scaling. The key idea is a co-design principle—pipelined multicast inference—which synergizes multicast with dynamic, cross-node pipeline-parallel execution during model transfer. FaaScale implements this design through PipeCast, a model scaling scheme that adaptively multicasts model blocks and dynamically forms inference pipelines on the fly. Coupled with efficient memory management across GPU and host memory, FaaScale handles bursty LLM inference workloads effectively, achieving up to 5× lower tail time-to-first-token latency and 31.3% cost reduction on real-world LLM traces.

Main Navigation

Session