Poster 16

Beyond the Buzz: A Pragmatic Take on Inference Disaggregation

Tiyasa Mitra ⋅ Ritika Borkar ⋅ Nidhi Bhatia ⋅ Shivam Raj ⋅ hongkuan zhou ⋅ Yan Ru Pei ⋅ Vishwanath Venkatesan ⋅ Kyle Kranen ⋅ Ramon Matas ⋅ Dheevatsa Mudigere ⋅ Ritchie Zhao ⋅ Maximilian Golub ⋅ Arpan Dutta ⋅ Suresh Nambi ⋅ Sailaja Madduri ⋅ Dharmesh Jani ⋅ Brian Pharris ⋅ Itay Neeman ⋅ Bita Darvish Rouhani

Abstract

As inference scales to multi-node deployments, prefill-decode disaggregation — splitting inference into distinct phases — offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, large-scale deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. These insights, in conjunction with the deployment flexibility offered by NVIDIA Dynamo, provide a foundation to navigate the trade-off between system throughput and interactivity in efficient disaggregated deployments.