

Oral

NodeSweep: Practical Straggler Detection and Health Monitoring for Large-Scale Foundation Model Training

Guanliang Liu ⋅ Zoe Zeng ⋅ ⋅ Cong Cheng ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ ⋅ Alexander Zhipa ⋅ Ashvin Nihalani ⋅ Binxuan Huang ⋅ ⋅

Abstract
