FlexScale: Flexible and High-Performance FSDP at Scale
Zezhou Wang ⋅ Youjie Li ⋅ Zhiqi Lin ⋅ Jiacheng Yang ⋅ Cong Xie ⋅ Zheng Zhong ⋅ Hongyu Zhu ⋅ Zhi Zhang ⋅ Xin Liu ⋅ Yanghua Peng
Abstract
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, prized for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with optimizers such as Shampoo and Muon used in cutting-edge models (e.g., Gemini, Kimi K2): FSDP's fixed element- or row-wise sharding formats conflict with their block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce FlexScale, a redesigned FSDP framework that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. FlexScale natively supports the efficient data placement FSDP requires while accommodating non-element-wise optimizers and block-wise quantization. As a result, FlexScale achieves 5$\sim$66\% higher throughput and 16$\sim$30\% lower memory usage than existing FSDP systems, while scaling efficiently to 30K GPUs. FlexScale has been battle-tested in production and will be open-sourced to the MLSys community upon acceptance.
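To make the sharding conflict concrete, the minimal sketch below (not FlexScale's actual API; the matrix size, block size, and tile layout are illustrative assumptions) shows why flat, element-wise FSDP sharding breaks block-wise quantization, whereas a block-aligned layout in the spirit of RaggedShard keeps each quantization block on a single rank.

```python
import torch

# Illustrative sketch only, not FlexScale's API: element-wise FSDP
# sharding vs. a block-aligned layout for block-wise quantization.
world_size = 4
weight = torch.randn(8, 8)
block = 4  # quantization block size: one scale per 4x4 tile (assumed)

# Standard FSDP flattens the parameter and splits it evenly by element
# count. Here each rank gets 16 contiguous elements, i.e., two rows of
# the 8x8 matrix, but every 4x4 tile spans four rows, so no rank holds
# a complete tile and per-tile abs-max scales need extra communication.
flat = weight.flatten()
shards = flat.chunk(world_size)
print([s.numel() for s in shards])  # [16, 16, 16, 16]

# A block-aligned ("ragged") layout instead assigns whole tiles to
# ranks, so each per-tile scale is computable rank-locally.
tiles = [weight[i:i + block, j:j + block]
         for i in range(0, 8, block) for j in range(0, 8, block)]
scales = [t.abs().max() for t in tiles]  # one scale per tile, no comm
print(len(tiles), [f"{s:.3f}" for s in scales])
```

The same mismatch applies to non-element-wise optimizers such as Shampoo and Muon, whose updates operate on whole matrices or blocks rather than on independent elements.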