FlexScale: Flexible and High-Performance FSDP at Scale
Zezhou Wang ⋅ Youjie Li ⋅ Zhiqi Lin ⋅ Jiacheng Yang ⋅ Cong Xie ⋅ Zheng Zhong ⋅ Hongyu Zhu ⋅ Zhi Zhang ⋅ Xin Liu ⋅ Yanghua Peng
Abstract
Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, prized for its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods (e.g., block-wise quantized training) and with optimizers such as Shampoo and Muon used in cutting-edge models (e.g., Gemini, Kimi K2): FSDP's fixed element- or row-wise sharding formats conflict with their block-structured computations. In addition, today's implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce FlexScale, a redesigned FSDP framework that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. FlexScale natively supports the efficient data placement FSDP requires while accommodating non-element-wise optimizers and block-wise quantization. As a result, FlexScale achieves 5$\sim$66\% higher throughput and 16$\sim$30\% lower memory usage than existing FSDP systems, while scaling efficiently to 30K GPUs. FlexScale has been battle-tested in production and will be open-sourced to the MLSys community upon acceptance.
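To make the sharding conflict concrete, the minimal sketch below (not FlexScale's actual API; the matrix size, block size, and tile layout are illustrative assumptions) shows why flat, element-wise FSDP sharding breaks block-wise quantization, whereas a block-aligned layout in the spirit of RaggedShard keeps each quantization block on a single rank.

```python
import torch

# Illustrative sketch only, not FlexScale's API: element-wise FSDP
# sharding vs. a block-aligned layout for block-wise quantization.
world_size = 4
weight = torch.randn(8, 8)
block = 4  # quantization block size: one scale per 4x4 tile (assumed)

# Standard FSDP flattens the parameter and splits it evenly by element
# count. Here each rank gets 16 contiguous elements, i.e., two rows of
# the 8x8 matrix, but every 4x4 tile spans four rows, so no rank holds
# a complete tile and per-tile abs-max scales need extra communication.
flat = weight.flatten()
shards = flat.chunk(world_size)
print([s.numel() for s in shards])  # [16, 16, 16, 16]

# A block-aligned ("ragged") layout instead assigns whole tiles to
# ranks, so each per-tile scale is computable rank-locally.
tiles = [weight[i:i + block, j:j + block]
         for i in range(0, 8, block) for j in range(0, 8, block)]
scales = [t.abs().max() for t in tiles]  # one scale per tile, no comm
print(len(tiles), [f"{s:.3f}" for s in scales])
```

The same mismatch applies to non-element-wise optimizers such as Shampoo and Muon, whose updates operate on whole matrices or blocks rather than on independent elements.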