

Poster

SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling

Ke Hong · Xiuhong Li · Lufang Chen · Qiuli Mao · Guohao Dai · Xuefei Ning · Shengen Yan · Yun Liang · Yu Wang


Abstract: Serving large language models (LLMs) efficiently requires elaborate request scheduling to satisfy service-level objectives (SLOs). In the context of LLM serving, SLOs include constraints on Time-to-First-Token (TTFT) and Time-per-Output-Token (TPOT). Existing serving systems apply coarse-grained request scheduling that follows a fixed principle across iterations of the serving procedure, leading to (1) a significant distribution bias between TTFT and TPOT and (2) a significant distribution variance among different requests, as shown in Fig. 1(a), and hence to disappointing SLO attainment. We identify that fine-grained scheduling based on a formal description of the design space addresses these issues. To this end, we first formulate a scheduling design space with flexible control of the request execution order and the workload at each iteration. Based on that, we introduce a state-aware scheduling strategy that tracks two kinds of states, those of each individual request and those of the system as a whole, and uses them to balance TTFT against TPOT and to balance across requests, improving SLO attainment as shown in Fig. 2. We implement SOLA with the above insights. The evaluation shows that SOLA enhances the SLO attainment from 45.5\% to 99.4\%, thus serving more requests. Given SLO constraints, SOLA serves 1.04-1.27$\times$ more requests than state-of-the-art systems on average.
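To make the two scheduling controls concrete, the sketch below shows one possible state-aware scheduling iteration in Python. This is an illustrative assumption, not SOLA's implementation: the names (Request, slack, schedule_iteration, token_budget, slo_ttft, slo_tpot) and the specific slack heuristic are hypothetical, and are shown only to convey how per-request state (prefill vs. decode) can drive both the execution order and the per-iteration workload cap described in the abstract.

```python
# Minimal illustrative sketch (not SOLA's code): a state-aware scheduling
# iteration that orders requests by SLO slack and caps per-iteration workload.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Request:
    prompt_len: int                           # tokens to prefill
    arrival: float                            # arrival timestamp (seconds)
    generated: int = 0                        # output tokens produced so far
    first_token_time: Optional[float] = None  # set once the first token is emitted

def slack(req: Request, now: float, slo_ttft: float, slo_tpot: float) -> float:
    """Headroom (seconds) before the request violates its SLO; smaller = more urgent."""
    if req.first_token_time is None:
        # Prefill state: the TTFT constraint applies.
        return slo_ttft - (now - req.arrival)
    # Decode state: the TPOT constraint applies to the average time per output token.
    elapsed = now - req.first_token_time
    return slo_tpot * max(req.generated, 1) - elapsed

def schedule_iteration(waiting: List[Request], running: List[Request],
                       now: float, slo_ttft: float, slo_tpot: float,
                       token_budget: int) -> List[Request]:
    """Pick requests for one iteration using two controls:
    (1) execution order: least-slack-first over both prefill and decode requests;
    (2) per-iteration workload: a token budget so that newly admitted prefills
        do not inflate TPOT for requests that are already decoding.
    """
    candidates = sorted(waiting + running,
                        key=lambda r: slack(r, now, slo_ttft, slo_tpot))
    batch, used = [], 0
    for req in candidates:
        cost = req.prompt_len if req.first_token_time is None else 1
        if used + cost > token_budget:
            continue  # skip requests that would exceed this iteration's budget
        batch.append(req)
        used += cost
    return batch
```

In this sketch, waiting prefills and running decodes compete in a single slack-ordered queue, so whichever request is closest to violating its TTFT or TPOT constraint is served first, while the token budget limits how much prefill work is admitted in a given iteration.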
