

Session

Research-Track Oral Presentation: R1: LLM Serving

Grand Ballroom 2
Tue 19 May 2:45 p.m. PDT — 4:15 p.m. PDT

Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-tokens (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to \textbf{39\%} and inflate energy consumption. We propose \textbf{layered prefill}, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to \textbf{70\%}, end-to-end latency by \textbf{41\%}, and per-token energy by up to \textbf{22\%}. Evaluations show that layered prefill consistently improves the TTFT--TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.
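As a rough intuition for why chunking inflates MoE memory traffic, the following toy model counts expert-weight loads per prompt under the two schedules. The function names, layer counts, and chunk counts are illustrative assumptions, not figures from the paper; the sketch only captures the abstract's claim that chunked prefill reloads expert weights once per chunk while layered prefill runs the whole prompt through each layer group once.

```python
# Toy count of MoE expert-weight loads per prompt under two prefill
# schedules. All quantities are illustrative assumptions.

def chunked_prefill_loads(num_layers: int, num_chunks: int) -> int:
    # Chunked prefill runs every layer once per chunk, so each layer's
    # expert weights are (re)loaded once per chunk.
    return num_layers * num_chunks

def layered_prefill_loads(num_layers: int, num_groups: int) -> int:
    # Layered prefill partitions layers into contiguous groups and pushes
    # the whole prompt through each group once, so each layer's expert
    # weights are loaded exactly once regardless of the grouping.
    assert num_layers % num_groups == 0
    return num_layers
```

For a hypothetical 32-layer model split into 4 chunks, this toy model gives 128 loads under chunked prefill versus 32 under layered prefill.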


HELIOS: Adaptive Model and Early-Exit Selection for Efficient LLM Inference Serving

Avinash Kumar ⋅ Shashank Nag ⋅ Jason Clemons ⋅ Lizy John ⋅ Poulami Das

Early-Exit Large Language Models (EE-LLMs) enable high-throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput gains are limited by incomplete computational and memory savings. Existing EE-LLM frameworks rely on a single model, so their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limits the ability to scale batch sizes. We propose \textit{HELIOS}, a framework that improves both token generation latency and batch sizes to enable high-throughput inference in EE-LLMs. HELIOS exploits two insights. \textit{First}, early exits are often complementary across models: tokens that do not exit early on one model often take an early exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early and minimize token generation latencies. \textit{Second}, even when a predicted token does not exit early due to low confidence, it often remains unchanged after additional layer traversal. HELIOS greedily allows such tokens to exit early and loads only the weights of the layers most likely to be used, yielding memory savings that are then repurposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify early-exit distributions, and adaptively switches between models by tracking tokens in real time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\times$ higher throughput and $15.14\times$ larger batch sizes compared to existing EE-LLM frameworks.
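The complementarity insight can be sketched with a toy exit profile. The exit layers, threshold, and function names below are hypothetical illustrations of why switching between models can raise the overall early-exit rate; they are not HELIOS's actual policy or data.

```python
# Toy illustration of complementary early exits across two models.
# Exit layers and the threshold are made-up profiling data (assumptions).

EXIT_LAYER = {
    "model_a": {"tok1": 4, "tok2": 24, "tok3": 6},
    "model_b": {"tok1": 20, "tok2": 5, "tok3": 22},
}
EARLY = 8  # exits at or below this layer count as "early"

def early_exit_rate(model: str, tokens) -> float:
    # Fraction of tokens that exit early on a single fixed model.
    return sum(EXIT_LAYER[model][t] <= EARLY for t in tokens) / len(tokens)

def switched_rate(tokens) -> float:
    # Idealized switching: per token, take whichever model exits earlier.
    best = (min(EXIT_LAYER[m][t] for m in EXIT_LAYER) for t in tokens)
    return sum(layer <= EARLY for layer in best) / len(tokens)
```

With this toy data, each model alone exits early on at most two of the three tokens, while the switched schedule covers all three.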


PLA-Serve: A Prefill-Length-Aware LLM Serving System

Jianshu She ⋅ Zonghang Li ⋅ Hongchao Du ⋅ Shangyu Wu ⋅ Wenhao Zheng ⋅ Eric Xing ⋅ Zhengzhong Liu ⋅ Huaxiu Yao ⋅ Chun Jason Xue ⋅ Qirong Ho

PLA-Serve identifies and disaggregates requests with different prompt lengths in LLM serving to reduce TTFT. While recent systems have decoupled the prefill and decode stages to improve throughput, they still rely on unified scheduling policies that fail to adapt to heterogeneous workload characteristics. We observe that prompt-length variations lead to distinct performance bottlenecks, motivating an adaptive scheduling strategy. PLA-Serve disaggregates multi-round long-prefill requests from short-prefill ones and introduces a length-aware smart batching mechanism for short-prefill workloads. It adopts a dual-queue design that supports temporal disaggregation on a single prefill instance or spatial disaggregation across multiple instances. For short-prefill batches, a batch waiting window and CUDA Graph–based clustering mitigate interference from heterogeneous computation, reducing batching delay and lowering average latency. In real multi-turn workloads, PLA-Serve reduces short-prefill latency by over 30% compared to vanilla SGLang under prefill–decode disaggregation, and decreases SLO violations by 28% in multi-instance deployments. Compared to the SGLang router with load balancing, it further lowers SLO violations by 12% in multi-GPU settings. Under high concurrency and mixed-request scenarios, PLA-Serve improves throughput by up to 35% for prefill instances, demonstrating its effectiveness in optimizing heterogeneous LLM serving workloads.
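A minimal sketch of the dual-queue, length-aware routing idea, assuming a hypothetical token-length cutoff; the real system's batch waiting window and CUDA Graph clustering are not modeled here.

```python
from collections import deque

# Hypothetical cutoff separating "short" from "long" prefill requests;
# the actual policy and threshold in PLA-Serve may differ.
LONG_PREFILL_THRESHOLD = 2048  # tokens

def route(requests):
    """Split (request_id, prompt_len) pairs into short/long queues."""
    short_q, long_q = deque(), deque()
    for req_id, prompt_len in requests:
        q = long_q if prompt_len >= LONG_PREFILL_THRESHOLD else short_q
        q.append(req_id)
    return short_q, long_q
```

The two queues can then be drained by one prefill instance in alternation (temporal disaggregation) or pinned to separate instances (spatial disaggregation).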


Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token

Rajveer Bachkaniwala ⋅ Richard So ⋅ Divya Mahajan ⋅ Kexin Rong

Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Recent work mitigates this via streaming (overlapping retrieval with inference), but prior systems focus on single-request settings and overlook challenges in multi-tenant deployments where concurrent requests contend for GPU memory and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a system that extends vLLM to support streaming prompts with adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). Stream2LLM decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses cache invalidation based on longest common prefix matching to minimize redundant computation when prompts change dynamically. To evaluate Stream2LLM, we collect and characterize two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that the streaming architecture delivers up to 11× TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, while maintaining throughput parity with non-streaming baselines.
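The longest-common-prefix invalidation rule can be sketched directly. The helper names below are hypothetical, and a real KV cache would track cache blocks rather than raw token counts, but the core idea is the same: when an updated prompt diverges from the cached one, only the shared token prefix remains reusable.

```python
def lcp_len(old_tokens, new_tokens) -> int:
    # Length of the longest common token prefix of two token sequences.
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def invalidate(cached_tokens, updated_tokens):
    # Returns (reusable, invalidated) token counts: everything past the
    # shared prefix must be recomputed when the prompt changes.
    keep = lcp_len(cached_tokens, updated_tokens)
    return keep, len(cached_tokens) - keep
```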

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, a high-performance rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlock the full potential of Superchips for responsive LLM serving.
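One plausible reading of a proactive, SLO-aware rotation decision can be sketched as a toy victim-selection rule: when the KV cache is exhausted, offload the running request whose next-token deadline is furthest away, since it can best absorb a round trip over the CPU link. This is an assumption-laden illustration, not RotaSched's actual algorithm.

```python
# Toy victim selection for rotation under memory pressure (hypothetical;
# the real scheduler's inputs and policy may differ).

def pick_victim(running, now: float):
    # running: list of (request_id, next_token_deadline) pairs, where the
    # deadline is derived from each request's TBT SLO. The request with
    # the most slack is rotated out to CPU memory.
    return max(running, key=lambda r: r[1] - now)[0]
```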

Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of $20\%$ even over GPUs connected via NVLink. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these computation subtasks. However, as of this writing, none of the open-source LLM serving systems (vLLM, SGLang, TensorRT-LLM) support compute-communication overlap for LLMs served using tensor parallelism. This is because the number of tokens processed per iteration is kept small to support low latency serving and decomposing these smaller workloads to enable communication overlap results in worse performance. We present TOKENBLEND, the first system to enable efficient compute-communication overlap for tensor-parallel models for token lengths as small as 1024. TOKENBLEND identifies RMSNorm, a previously overlooked operation, as crucial and optimizes it along with communication by implementing a novel fused \textbf{AllReduce--RMSNorm} kernel. Further, this kernel leverages the multimem feature available on modern GPUs (e.g., Hopper, Blackwell) to jointly perform communication and RMSNorm efficiently using only 2--8 SMs. Our evaluations demonstrate up to $\boldsymbol{1.28\times}$ speedup in latency and $\boldsymbol{1.19\times}$ higher throughput across multiple models and workloads. In several settings, TOKENBLEND delivers \textit{better performance than an equivalent model with all communication removed}. The source code of TOKENBLEND is available at https://anonymous.4open.science/r/tokenblend-mlsys/.
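The fused kernel's correctness criterion is that fusion preserves the semantics of an all-reduce followed by RMSNorm. The NumPy sketch below states that reference contract; the "fused" variant here is only a semantic stand-in for a GPU kernel, and the epsilon and shapes are assumptions.

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # Standard RMSNorm: scale x by the reciprocal root-mean-square of its
    # last axis, then apply the learned per-channel weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def allreduce_then_rmsnorm(partials, weight):
    # Reference path: all-reduce (sum of per-rank partial outputs, axis 0),
    # then RMSNorm as a separate step.
    return rmsnorm(partials.sum(axis=0), weight)

def fused_allreduce_rmsnorm(partials, weight):
    # Semantic stand-in for a fused kernel: a real implementation would
    # interleave the reduction and normalization on a few SMs (e.g., via
    # multimem); only the input/output contract is modeled here.
    return rmsnorm(np.add.reduce(partials, axis=0), weight)
```

Any fused implementation must produce (numerically close to) the same output as the two-step reference on every input.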