The paper proposes and optimizes CPR, a partial recovery training system for recommendation models. CPR relaxes the consistency requirement by enabling non-failed nodes to proceed without loading checkpoints when a node fails during training, reducing failure-related overheads. To the best of our knowledge, the paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, and it identifies a trade-off between accuracy and performance. Motivated by the analysis, we present CPR, a partial recovery training system that can reduce the training time and maintain the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing the saving of updates to more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2–8.5% under full recovery to 0.53–0.68%, on a setup emulating the failure pattern and overhead of a production-scale cluster. While reducing overhead significantly, CPR achieves model quality on par with the more expensive full recovery scheme when training a state-of-the-art recommendation model on Criteo’s Terabyte CTR dataset. Our results also suggest that CPR can speed up training on a real production-scale cluster without notably degrading the accuracy.
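As a rough illustration of point (3), the sketch below saves only the most frequently accessed rows of a toy embedding table. It is not the paper's implementation: the function names `select_hot_rows` and `partial_checkpoint`, the `budget` parameter, and all data values are assumptions made for this example.

```python
from collections import Counter


def select_hot_rows(access_log, budget):
    """Pick up to `budget` row ids, most frequently accessed first."""
    counts = Counter(access_log)
    return [row for row, _ in counts.most_common(budget)]


def partial_checkpoint(table, access_log, budget):
    """Copy only the selected rows; all other rows are left out of this checkpoint."""
    return {row: list(table[row]) for row in select_hot_rows(access_log, budget)}


# Toy embedding table: row id -> parameter vector (values are made up).
table = {1: [0.1, 0.2], 3: [0.3, 0.4], 7: [0.5, 0.6], 9: [0.7, 0.8]}
# Row ids touched by recent training batches (also made up).
access_log = [3, 7, 3, 1, 3, 7, 9]

checkpoint = partial_checkpoint(table, access_log, budget=2)
print(sorted(checkpoint))
```

With a budget of two rows, only rows 3 and 7 are written, so the checkpoint is smaller at the cost of leaving rarely accessed rows at an older saved state.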
Author Information
Kiwan Maeng (Carnegie Mellon University)
Shivam Bharuka (Facebook Inc.)
I am a production engineer supporting the AI infrastructure at Facebook. My research interests include exploring new computer architectures and designing scalable distributed systems. Before joining Facebook, I graduated with a master’s and a bachelor’s degree in Computer Engineering from the University of Illinois at Urbana-Champaign.
Isabel Gao (Facebook)
Mark Jeffrey (University of Toronto)
Vikram Saraph (The Johns Hopkins University Applied Physics Laboratory)
Bor-Yiing Su (Facebook)
Caroline Trippel (Stanford / Facebook)
Jiyan Yang (Facebook Inc.)
Mike Rabbat (Facebook FAIR)
Brandon Lucia (Carnegie Mellon University)
Carole-Jean Wu (Facebook AI)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Poster: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery
  09 Apr 12:00 AM Room Virtual
More from the Same Authors
- 2022 Poster: Sustainable AI: Environmental Implications, Challenges and Opportunities
  Carole-Jean Wu · Ramya Raghavendra · Udit Gupta · Bilge Acun · Newsha Ardalani · Kiwan Maeng · Gloria Chang · Fiona Aga · Jinshi Huang · Charles Bai · Michael Gschwind · Anurag Gupta · Myle Ott · Anastasia Melnikov · Salvatore Candido · David Brooks · Geeta Chauhan · Benjamin Lee · Hsien-Hsin Lee · Bugra Akyildiz · Maximilian Balandat · Joe Spisak · Ravi Jain · Mike Rabbat · Kim Hazelwood
- 2022 Poster: PAPAYA: Practical, Private, and Scalable Federated Learning
  Dzmitry Huba · John Nguyen · Kshitiz Malik · Ruiyu Zhu · Mike Rabbat · Ashkan Yousefpour · Carole-Jean Wu · Hongyuan Zhan · Pavel Ustinov · Harish Srinivas · Kaikai Wang · Anthony Shoumikhin · Jesik Min · Mani Malek
- 2022 Break: Closing Remarks
  Diana Marculescu · Yuejie Chi · Carole-Jean Wu
- 2022 Oral: Sustainable AI: Environmental Implications, Challenges and Opportunities
  Carole-Jean Wu · Ramya Raghavendra · Udit Gupta · Bilge Acun · Newsha Ardalani · Kiwan Maeng · Gloria Chang · Fiona Aga · Jinshi Huang · Charles Bai · Michael Gschwind · Anurag Gupta · Myle Ott · Anastasia Melnikov · Salvatore Candido · David Brooks · Geeta Chauhan · Benjamin Lee · Hsien-Hsin Lee · Bugra Akyildiz · Maximilian Balandat · Joe Spisak · Ravi Jain · Mike Rabbat · Kim Hazelwood
- 2022 Oral: PAPAYA: Practical, Private, and Scalable Federated Learning
  Dzmitry Huba · John Nguyen · Kshitiz Malik · Ruiyu Zhu · Mike Rabbat · Ashkan Yousefpour · Carole-Jean Wu · Hongyuan Zhan · Pavel Ustinov · Harish Srinivas · Kaikai Wang · Anthony Shoumikhin · Jesik Min · Mani Malek
- 2022 Break: Opening Remarks
  Diana Marculescu · Yuejie Chi · Carole-Jean Wu
- 2021 : Panel Session - Lizy John (UT Austin), David Kaeli (Northeastern University), Tushar Krishna (Georgia Tech), Peter Mattson (Google), Brian Van Essen (LLNL), Venkatram Vishwanath (ANL), Carole-Jean Wu (Facebook)
  Tom St John · Lizy John · Tushar Krishna · Peter Mattson · Venkatram Vishwanath · Carole-Jean Wu · David Kaeli · Brian Van Essen
- 2021 : Closing session
  Udit Gupta · Carole-Jean Wu
- 2021 : "Designing and Optimizing AI Systems for Deep Learning Recommendation and Beyond" - Carole-Jean Wu (Facebook)
  Carole-Jean Wu
- 2021 Workshop: Personalized Recommendation Systems and Algorithms
  Udit Gupta · Carole-Jean Wu · Gu-Yeon Wei · David Brooks
- 2021 : Welcome to the 3rd PeRSonAl workshop
  Udit Gupta · Carole-Jean Wu
- 2021 Poster: TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models
  Chunxing Yin · Bilge Acun · Carole-Jean Wu · Xing Liu
- 2021 Oral: TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models
  Chunxing Yin · Bilge Acun · Carole-Jean Wu · Xing Liu
- 2020 Oral: MLPerf Training Benchmark
  Peter Mattson · Christine Cheng · Gregory Diamos · Cody Coleman · Paulius Micikevicius · David Patterson · Hanlin Tang · Gu-Yeon Wei · Peter Bailis · Victor Bittorf · David Brooks · Dehao Chen · Debo Dutta · Udit Gupta · Kim Hazelwood · Andy Hock · Xinyuan Huang · Daniel Kang · David Kanter · Naveen Kumar · Jeffery Liao · Deepak Narayanan · Tayo Oguntebi · Gennady Pekhimenko · Lillian Pentecost · Vijay Janapa Reddi · Taylor Robie · Tom St John · Carole-Jean Wu · Lingjie Xu · Cliff Young · Matei Zaharia
- 2020 Poster: MLPerf Training Benchmark
  Peter Mattson · Christine Cheng · Gregory Diamos · Cody Coleman · Paulius Micikevicius · David Patterson · Hanlin Tang · Gu-Yeon Wei · Peter Bailis · Victor Bittorf · David Brooks · Dehao Chen · Debo Dutta · Udit Gupta · Kim Hazelwood · Andy Hock · Xinyuan Huang · Daniel Kang · David Kanter · Naveen Kumar · Jeffery Liao · Deepak Narayanan · Tayo Oguntebi · Gennady Pekhimenko · Lillian Pentecost · Vijay Janapa Reddi · Taylor Robie · Tom St John · Carole-Jean Wu · Lingjie Xu · Cliff Young · Matei Zaharia