Session
Industry-Track Oral Presentation: I1: LLM Serving
Grand Ballroom 1
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
Zhen Zheng ⋅ Xin Ji ⋅ Taosong Fang ⋅ Fanghao Zhou ⋅ Chuanjie Liu
Large language models (LLMs) play an increasingly important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, and the relevant performance indicator is throughput. These tasks usually exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and show limitations in supporting large batched tasks with the prefix-sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests, but under this implicit cache management, KV context that is about to be reused may be prematurely evicted. Besides, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address these problems. BatchLLM explicitly identifies common prefixes globally; requests sharing the same prefix are scheduled together to best reuse the KV context. BatchLLM reorders the requests, scheduling those with a larger ratio of decoding first to better mix the decoding tokens with the later prefill chunks, and applies memory-centric token batching to enlarge token-batch sizes, which increases GPU utilization. Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\times$ to $10.8\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments.
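The scheduling idea in the abstract can be sketched as follows: group requests by a shared prefix so the KV context is computed once per group, then order groups so decode-heavy ones run first. This is a toy illustration, not BatchLLM's actual implementation; `prefix_len` and `decode_ratio` are hypothetical stand-ins for BatchLLM's global prefix identification and its decode-ratio estimate.

```python
from collections import defaultdict

def group_by_prefix(requests, prefix_len):
    """Group prompts (token sequences) that share the same first
    `prefix_len` tokens, so the prefix KV context can be reused."""
    groups = defaultdict(list)
    for prompt in requests:
        groups[tuple(prompt[:prefix_len])].append(prompt)
    return groups

def schedule(requests, prefix_len, decode_ratio):
    """Schedule prefix-sharing groups together, ordering groups with a
    higher average decode ratio first so their decoding tokens can be
    mixed with later groups' prefill chunks."""
    groups = group_by_prefix(requests, prefix_len)
    ordered = sorted(
        groups.values(),
        key=lambda g: -sum(decode_ratio(p) for p in g) / len(g),
    )
    return [p for g in ordered for p in g]
```

A usage example: with prompts `(1, 2, 3)`, `(1, 2, 4)`, and `(5, 6, 7)` and a 2-token prefix, the first two prompts form one group and reuse a single prefix computation.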
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Tiyasa Mitra ⋅ Ritika Borkar ⋅ Nidhi Bhatia ⋅ Shivam Raj ⋅ Hongkuan Zhou ⋅ Yan Ru Pei ⋅ Kyle ⋅ Ramon Matas ⋅ Dheevatsa Mudigere ⋅ Ritchie Zhao ⋅ Maximilian Golub ⋅ Arpan Dutta ⋅ Sailaja Madduri ⋅ Dharmesh Jani ⋅ Brian Pharris ⋅ Itay Neeman ⋅ Bita Darvish Rouhani
As inference scales to multi-node deployments, prefill-decode disaggregation — splitting inference into distinct phases — offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, large-scale deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. These insights, in conjunction with the deployment flexibility offered by NVIDIA Dynamo, provide a foundation to navigate the trade-off between system throughput and interactivity in efficient disaggregated deployments.
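The rate-matching idea the study highlights can be illustrated with a toy search over static GPU splits between prefill and decode pools (a sketch only; the per-GPU throughputs are made-up inputs, and systems such as NVIDIA Dynamo rebalance the split elastically at runtime rather than fixing it):

```python
def rate_match(prefill_tps_per_gpu, decode_tps_per_gpu,
               avg_prompt_tokens, avg_output_tokens, total_gpus):
    """Split GPUs between prefill and decode pools so the request rates
    the two pools can sustain are balanced; the end-to-end rate is the
    minimum of the two."""
    best = None
    for p in range(1, total_gpus):
        d = total_gpus - p
        prefill_rps = p * prefill_tps_per_gpu / avg_prompt_tokens
        decode_rps = d * decode_tps_per_gpu / avg_output_tokens
        sustained = min(prefill_rps, decode_rps)
        if best is None or sustained > best[0]:
            best = (sustained, p, d)
    return best  # (requests/sec, prefill GPUs, decode GPUs)
```

For prefill-heavy traffic (long prompts, short outputs) the balanced split shifts more GPUs to prefill, which is consistent with the paper's finding that disaggregation pays off most for prefill-heavy patterns.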
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Nicholas Santavas ⋅ Kareem Eissa ⋅ Piotr Florek ⋅ Matteo Nulli ⋅ Stefan Vasilev ⋅ Seyyed Hashemi ⋅ Antonios Gasteratos ⋅ Shahram Khadivi
Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2× GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.
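The staged-pipeline-with-automatic-cleanup pattern mentioned in the abstract can be sketched with Python's `contextlib.ExitStack` (a hypothetical illustration of the pattern, not OPTIKIT's actual API): each stage registers its cleanup before running, so resources from earlier stages are released in reverse order even if a later stage fails.

```python
import contextlib

def run_pipeline(stages):
    """Run (run_fn, cleanup_fn) stages in order. Cleanups execute in
    LIFO order on exit, whether the pipeline succeeds or a stage
    raises, so no GPU or scratch resource is leaked."""
    results = []
    with contextlib.ExitStack() as stack:
        for run, cleanup in stages:
            stack.callback(cleanup)   # registered before run, so it
            results.append(run())     # fires even if run() raises
    return results
```

`ExitStack` is what makes the cleanup ordering automatic: the framework code never has to enumerate which stages completed before a failure.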
Optimizing Deployment Configurations for LLM Inference
Sungmin Cho ⋅ Jaewon Lee ⋅ Chunqiang Tang ⋅ Yejin Lee ⋅ Geonhwa Jeong ⋅ Scott Batura ⋅ Sijia Chen ⋅ Bradley Davis ⋅ Summer Deng ⋅ Emad El-Haraty ⋅ Lu Fang ⋅ Joshua Fromm ⋅ Liangpeng Guo ⋅ Jianyu Huang ⋅ Aya Ibrahim ⋅ Hongyi Jia ⋅ Changkyu Kim ⋅ Xiaozhu Meng ⋅ Vlad Tiberiu Mihailescu ⋅ Maxim Naumov ⋅ Michal Ostrowski ⋅ Sarunya Pumma ⋅ Jeremy Francis Reizenstein ⋅ Rajasi Saha ⋅ Ruan Silva ⋅ Jon Swenson ⋅ Chris Thi ⋅ Yunfan Wang ⋅ Pengchao Wang ⋅ Wenchen Wang ⋅ Bram Wasti ⋅ Jingyi Yang ⋅ Jing Zhang ⋅ Yi Zhen
Meta's Large Language Models (LLMs)---the Llama model family---serve nearly one billion monthly active users. Deploying these models for inference involved navigating a complex design space that spanned diverse hardware options (e.g., H100, H200, MI300X), multiple parallelism strategies (tensor, pipeline, expert, context, and data parallelism), and nuanced runtime choices (e.g., continuous batching versus prefill-decode disaggregation)---all while leveraging workload-specific characteristics and meeting stringent service level objectives (SLOs). This paper presents insights we gained from developing and applying a systematic approach to analyze millions of deployment configurations and identify those that maximize throughput while meeting latency SLOs. We share lessons learned from our experience operating Llama inference at scale, including trade-offs among runtime designs, the phase-specific nature of parallelism strategies, opportunities for leveraging hardware heterogeneity, platform scaling behaviors, and system-level implications of model architectures such as Mixture-of-Experts (MoE). We hope our production experience offers practical insights for the broader LLM inference community.
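The core search the paper describes, maximizing throughput subject to latency SLOs, reduces to a filter-then-argmax over candidate configurations. The sketch below assumes each evaluated configuration carries a modeled throughput plus time-to-first-token and time-per-output-token latencies; the field names are illustrative, not the paper's schema.

```python
def best_config(configs, ttft_slo_ms, tpot_slo_ms):
    """Return the highest-throughput configuration among those meeting
    both latency SLOs, or None if nothing is feasible. Each config is
    a dict with 'throughput' (tokens/s), 'ttft_ms' (time to first
    token), and 'tpot_ms' (time per output token)."""
    feasible = [
        c for c in configs
        if c["ttft_ms"] <= ttft_slo_ms and c["tpot_ms"] <= tpot_slo_ms
    ]
    return max(feasible, key=lambda c: c["throughput"], default=None)
```

At the scale the paper reports (millions of configurations spanning hardware, parallelism, and runtime choices), the hard part is producing accurate throughput and latency estimates per configuration; the selection step itself stays this simple.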
Scaling Up Large Language Models Serving Systems for Semantic Job Search
Kayhan Behdin ⋅ Qingquan Song ⋅ Sriram Vasudevan ⋅ Jian Sheng ⋅ Xiaojing Ma ⋅ Zhengze Zhou ⋅ Chuanrui Zhu ⋅ Guoyao Li ⋅ Chanh Nguyen ⋅ Hejian Sang ⋅ Ata Fatahi ⋅ Xiaoqing Wang ⋅ Qing Lan ⋅ Qi Guo ⋅ Caleb Johnson ⋅ Zhipeng Wang
Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by more than 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, this allows us to increase our system's throughput by 10x in a real-world deployment, while meeting our quality bar.
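One common form of the pruning the abstract mentions is unstructured magnitude pruning, sketched below in plain Python; this is a generic illustration of the technique, and the paper's actual pruning method and sparsity pattern may differ.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude entries of a flat weight list.
    `sparsity` is the fraction of weights to remove; in practice this
    is applied tensor-by-tensor and followed by fine-tuning to recover
    accuracy."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    # k-th smallest absolute value becomes the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Ties at the threshold can remove slightly more than the requested fraction; production pruners typically handle this per-tensor and may enforce hardware-friendly structured sparsity instead.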
The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment—Groq’s public cloud—serving hundreds of billions of tokens daily. This paper reviews Groq’s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios—key to real-world LLM serving.