PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving
Abstract
Optimizing large language model (LLM) training and serving on large-scale distributed systems is a significant challenge. This difficulty stems from the rapidly evolving LLM landscape, the requirement for deep domain expertise, and the need for workload-specific optimization strategies. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce \textbf{PROMPTS}, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning to deliver system-level optimization in far fewer trials. Key components of the framework include an \textit{Analyzer Agent}, which diagnoses performance bottlenecks by synthesizing profiler data, and a \textit{Proposal Agent}, which leverages a knowledge base through retrieval-augmented generation (RAG) to produce optimized sharding configurations with detailed justifications. Experiments across eight real-world LLM workloads demonstrate that PROMPTS provides valid reasoning and accurate recommendations by accounting for LLM workload characteristics and backend hardware features, delivering performance improvements of up to \textbf{434\%}. These workloads span Mixture-of-Experts (MoE) and dense models, system configurations ranging from 2 to 512 TPU chips with 2D/3D Torus interconnects, and the full LLM lifecycle: pre-training, post-training, and serving. To validate the agent's proposals, we benchmarked them against production configurations previously optimized by experts, either through extensive manual analysis or automated black-box searches. In every case, the agent independently identified the expert-validated solution within its top three recommendations from a \textbf{single invocation}.
Furthermore, the agent's top-ranked recommendation matched the production solution in \textbf{87.5\%} of cases, demonstrating that it not only finds optimized configurations but also ranks the optimization candidates correctly.