Sparing Strategies to Minimize Reliability Impact on Large Training Jobs
Abstract
Training large language models (LLMs) on Meta's AI clusters requires running long, distributed jobs that are vulnerable to hardware failures. To maintain high availability and efficiency, production systems use a sparing strategy, i.e., pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing strategy (including compute block size, number of spare blocks, and number of spare GPU trays) is complex and directly impacts cluster performance and reliability. We present an analytical framework with closed-form expressions to guide sparing strategy decisions, yielding practical, first-order recommendations for production environments. We also develop a simulation component to cross-validate the analytical model. Applied in Meta's hyperscale infrastructure, this framework helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training. A real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to Meta's AI operations.