Optimizing Deployment Configurations for LLM Inference
Abstract
Meta's Large Language Models (LLMs)---the Llama model family---serve nearly one billion monthly active users. Deploying these models for inference involves navigating a complex design space that spans diverse hardware options (e.g., H100, H200, MI300X), multiple parallelism strategies (tensor, pipeline, expert, context, and data parallelism), and nuanced runtime choices (e.g., continuous batching versus prefill-decode disaggregation)---all while leveraging workload-specific characteristics and meeting stringent service level objectives (SLOs). This paper presents insights we gained from developing and applying a systematic approach to analyze millions of deployment configurations and identify those that maximize throughput while meeting latency SLOs. We share lessons learned from operating Llama inference at scale, including trade-offs among runtime designs, the phase-specific nature of parallelism strategies, opportunities for exploiting hardware heterogeneity, platform scaling behaviors, and system-level implications of model architectures such as Mixture-of-Experts (MoE). We hope our production experience offers practical insights for the broader LLM inference community.