Session
Industry Track Oral Presentation: LLM Serving 6
Grand Ballroom 1
Moderator: Esha Choukse
SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving
Andrew Bitar ⋅ Aravind Vayalapra ⋅ Baorui Zhou ⋅ Matthew Boyd ⋅ Charlie Wang ⋅ Sahil Parmar ⋅ Eugene Sha ⋅ Gautam Rayaprolu ⋅ Peter Hicks ⋅ Alex Bowe ⋅ Roberto DiCecco ⋅ Santosh Raghavan ⋅ Evan Patrick ⋅ Josip Smolcic ⋅ David Han ⋅ Kris Kang ⋅ Andy Rock ⋅ Josh Hay ⋅ Mohamed Eldafrawy ⋅ Mikhail Kandel ⋅ Daulet Zhanguzin ⋅ Omar Kilani ⋅ Liming Gong ⋅ Andrew Paprotskyi ⋅ Arash Taheri-Dezfouli ⋅ Josh Fender ⋅ Andrew Ling
The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment—Groq’s public cloud—serving hundreds of billions of tokens daily. This paper reviews Groq’s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios—key to real-world LLM serving.
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Tiyasa Mitra ⋅ Ritika Borkar ⋅ Nidhi Bhatia ⋅ Shivam Raj ⋅ Hongkuan Zhou ⋅ Yan Ru Pei ⋅ Vishwanath Venkatesan ⋅ Kyle Kranen ⋅ Ramon Matas ⋅ Dheevatsa Mudigere ⋅ Ritchie Zhao ⋅ Maximilian Golub ⋅ Arpan Dutta ⋅ Suresh Nambi ⋅ Sailaja Madduri ⋅ Dharmesh Jani ⋅ Brian Pharris ⋅ Itay Neeman ⋅ Bita Darvish Rouhani
As inference scales to multi-node deployments, prefill-decode disaggregation — splitting inference into distinct phases — offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, large-scale deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. These insights, in conjunction with the deployment flexibility offered by NVIDIA Dynamo, provide a foundation to navigate the trade-off between system throughput and interactivity in efficient disaggregated deployments.
Optimizing Deployment Configurations for LLM Inference
Sungmin Cho ⋅ Jaewon Lee ⋅ Chunqiang Tang ⋅ Yejin Lee ⋅ Geonhwa Jeong ⋅ Anca Agape ⋅ Scott Batura ⋅ Vincent Boivin ⋅ Stephen Chen ⋅ Renfei Chen ⋅ Sijia Chen ⋅ Yan Cui ⋅ Bradley Davis ⋅ Summer Deng ⋅ Nick Egebo ⋅ Emad El-Haraty ⋅ Sebastien Estienne ⋅ Lu Fang ⋅ Lu Fang ⋅ Joshua Fromm ⋅ Raj Ganapathy ⋅ Vedanuj Goswami ⋅ Liangpeng Guo ⋅ Ye Hu ⋅ Chenheli Hua ⋅ Jianyu Huang ⋅ Aya Ibrahim ⋅ Niranjan Jagannath ⋅ Hongyi Jia ⋅ Changkyu Kim ⋅ Shikai Li ⋅ Brandon Liu ⋅ Jiawen Liu ⋅ Ajit Mathews ⋅ Xiaozhu Meng ⋅ Vlad Tiberiu Mihailescu ⋅ Amit Nagpal ⋅ Maxim Naumov ⋅ Michal Ostrowski ⋅ Jialin Ouyang ⋅ Jason Park ⋅ Sarunya Pumma ⋅ Ye Qi ⋅ Zixi Qi ⋅ Jeremy Francis Reizenstein ⋅ Rajasi Saha ⋅ Nandhini Santhanam ⋅ Zhan Shu ⋅ Ruan Silva ⋅ Grigory Sizov ⋅ Jon Swenson ⋅ Brandon Taylor ⋅ Chris Thi ⋅ Adolfo Victoria ⋅ Yunfan Wang ⋅ Pengchao Wang ⋅ Wenchen Wang ⋅ Xiaodong Wang ⋅ Bram Wasti ⋅ Wei Xu ⋅ Qirui Yang ⋅ Jingyi Yang ⋅ Hector Yuen ⋅ Zhengyuan Zhang ⋅ Jing Zhang ⋅ Yi Zhen ⋅ Yanjun Zhou
Meta's Large Language Models (LLMs), the Llama model family, serve nearly one billion monthly active users. Deploying these models for inference involves navigating a complex design space that spans diverse hardware options (e.g., H100, H200, MI300X), multiple parallelism strategies (tensor, pipeline, expert, context, and data parallelism), and nuanced runtime choices (e.g., continuous batching versus prefill-decode disaggregation), all while leveraging workload-specific characteristics and meeting stringent service level objectives (SLOs). This paper presents insights we gained from developing and applying a systematic approach to analyze millions of deployment configurations and identify those that maximize throughput while meeting latency SLOs. We share lessons learned from our experience operating Llama inference at scale, including trade-offs among runtime designs, the phase-specific nature of parallelism strategies, opportunities for leveraging hardware heterogeneity, platform scaling behaviors, and system-level implications of model architectures such as Mixture-of-Experts (MoE). We hope our production experience offers practical insights for the broader LLM inference community.
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Nicholas Santavas ⋅ Kareem Eissa ⋅ Patrycja Cieplicka ⋅ Piotr Florek ⋅ Matteo Nulli ⋅ Stefan Vasilev ⋅ Seyyed Hashemi ⋅ Antonios Gasteratos ⋅ Shahram Khadivi
Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2× GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. Finally, we open-source the system to enable external contributions and broader reproducibility.
Scaling Up Large Language Models Serving Systems for Semantic Job Search
Kayhan Behdin ⋅ Qingquan Song ⋅ Sriram Vasudevan ⋅ Jian Sheng ⋅ Xiaojing Ma ⋅ Zhengze Zhou ⋅ Chuanrui Zhu ⋅ Guoyao Li ⋅ Chanh Nguyen ⋅ Sayan Ghosh ⋅ Hejian Sang ⋅ Ata Fatahi ⋅ Sundara Ramachandran ⋅ Xiaoqing Wang ⋅ Qing Lan ⋅ Vinay S ⋅ Qi Guo ⋅ Caleb Johnson ⋅ Zhipeng Wang ⋅ Fedor Borisyuk
Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deploying such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based, decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. In particular, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by more than 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. Taken together, these techniques allow us to increase our system’s throughput by 10x in a real-world deployment while meeting our quality bar.