PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving
Abstract
Optimizing large language model (LLM) training and serving on large-scale distributed systems is a significant challenge. This difficulty stems from the rapidly evolving LLM landscape, the requirement for deep domain expertise, and the need for workload-specific optimization strategies. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce \textbf{PROMPTS}, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning to deliver system-level optimization in far fewer trials. Key components of the framework include an \textit{Analyzer Agent}, which diagnoses performance bottlenecks by synthesizing profiler data, and a \textit{Proposal Agent}, which leverages a knowledge base through retrieval-augmented generation (RAG) to produce optimized sharding configurations with detailed justifications. Experiments across eight real-world LLM workloads demonstrate that PROMPTS provides valid reasoning and accurate recommendations by accounting for LLM workload characteristics and backend hardware features, delivering performance improvements of up to \textbf{434\%}. These workloads span Mixture-of-Experts (MoE) and dense models, system configurations ranging from 2 to 512 TPU chips with 2D/3D Torus interconnects, and the full LLM lifecycle: pre-training, post-training, and serving. To validate the agent's proposals, we benchmarked them against production configurations previously optimized by experts, either through extensive manual analysis or automated black-box searches. In every case, the agent independently identified the expert-validated solution within its top three recommendations from a \textbf{single invocation}.
Furthermore, the agent's top-ranked recommendation matched the production solution in \textbf{87.5\%} of cases, demonstrating that it not only finds optimized configurations but also ranks the optimization candidates correctly.