Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute to gain insight into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models among jobs often have the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to operators and then trains them simultaneously on a shared accelerator. To show the generality of our solution, we apply HFTA to six DL models trained on state-of-the-art accelerators (GPUs and TPUs). Our results indicate that HFTA is highly effective in improving hardware utilization and achieves up to 15.1x higher training throughput vs. the standard practice of running each job on a separate accelerator.
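To illustrate the inter-model horizontal fusion idea described in the abstract, below is a minimal PyTorch sketch (not HFTA's actual API; the names and shapes are hypothetical). It shows how N hyper-parameter-tuning jobs that each train a Linear layer of identical shape can be served by one batched matmul over stacked per-job weights, which is mathematically equivalent to running the N layers separately.

```python
import torch

# Minimal sketch of horizontal fusion across jobs (hypothetical, not HFTA's API):
# N repetitive jobs each use a Linear layer with identical shapes.
# Instead of N separate matmuls, stack the per-job parameters along a leading
# "job" dimension and run a single batched matmul, keeping the accelerator busier.

N, B, in_f, out_f = 4, 32, 128, 64           # N fused jobs, batch size B

W = torch.randn(N, out_f, in_f, requires_grad=True)   # stacked per-job weights
b = torch.randn(N, 1, out_f, requires_grad=True)      # stacked per-job biases
x = torch.randn(N, B, in_f)                            # stacked per-job inputs

# One batched matmul replaces N independent Linear forward passes.
y_fused = torch.bmm(x, W.transpose(1, 2)) + b          # shape: (N, B, out_f)

# Sanity check: equivalent to running each job's Linear layer separately.
y_ref = torch.stack([x[i] @ W[i].T + b[i, 0] for i in range(N)])
assert torch.allclose(y_fused, y_ref, atol=1e-5)
```

The same reasoning extends to other operators with identical shapes (e.g., N independent convolutions of the same configuration map to a grouped convolution), which is the mathematical equivalence the abstract refers to.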
Author Information
Shang Wang (NVIDIA)
Peiming Yang (Shanghai Jiao Tong University)
Yuxuan Zheng (Intel)
Xin Li (Vector Institute)
Gennady Pekhimenko (University of Toronto)
Related Events (a corresponding poster, oral, or spotlight)
- 2021 Oral: Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models
  Thu. Apr 8th, 04:30 -- 04:50 PM
More from the Same Authors
- 2022 Poster: DietCode: Automatic Optimization for Dynamic Tensor Programs
  Bojian Zheng · Ziheng Jiang · Cody Hao Yu · Haichen Shen · Joshua Fromm · Yizhi Liu · Yida Wang · Luis Ceze · Tianqi Chen · Gennady Pekhimenko
- 2023 Poster: Hotline Profiler: Automatic Annotation and A Multi-Scale Timeline for Visualizing Time-Use in DNN Training
  Daniel Snider · Fanny Chevalier · Gennady Pekhimenko
- 2022 Symposium: Chips & Compilers
  Yida Wang · Gennady Pekhimenko
- 2022 Oral: DietCode: Automatic Optimization for Dynamic Tensor Programs
  Bojian Zheng · Ziheng Jiang · Cody Hao Yu · Haichen Shen · Joshua Fromm · Yizhi Liu · Yida Wang · Luis Ceze · Tianqi Chen · Gennady Pekhimenko
- 2021: Industry/Academia Panel
  Zachary C Lipton · Udit Gupta · Lillian Pentecost · Shagun Sodhani · Abhishek Gupta · Mayoore Jaiswal · Michael Carbin · Devi Parikh · Gennady Pekhimenko
- 2021: "Machine Learning Tools: Skyline and RL-Scope" - Gennady Pekhimenko and James Gleeson (University of Toronto)
  Gennady Pekhimenko
- 2021 Poster: Boveda: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick
  Isak Edo Vivancos · Sayeh Sharify · Daniel Ly-Ma · Ameer Abdelhadi · Ciaran Bannon · Milos Nikolic · Mostafa Mahmoud · Alberto Delmas Lascorz · Gennady Pekhimenko · Andreas Moshovos
- 2021 Oral: Boveda: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick
  Isak Edo Vivancos · Sayeh Sharify · Daniel Ly-Ma · Ameer Abdelhadi · Ciaran Bannon · Milos Nikolic · Mostafa Mahmoud · Alberto Delmas Lascorz · Gennady Pekhimenko · Andreas Moshovos
- 2021 Poster: RL-Scope: Cross-stack Profiling for Deep Reinforcement Learning Workloads
  James Gleeson · Srivatsan Krishnan · Moshe Gabel · Vijay Janapa Reddi · Eyal de Lara · Gennady Pekhimenko
- 2021 Poster: IOS: Inter-Operator Scheduler for CNN Acceleration
  Yaoyao Ding · Ligeng Zhu · Zhihao Jia · Gennady Pekhimenko · Song Han
- 2021 Oral: IOS: Inter-Operator Scheduler for CNN Acceleration
  Yaoyao Ding · Ligeng Zhu · Zhihao Jia · Gennady Pekhimenko · Song Han
- 2021 Oral: RL-Scope: Cross-stack Profiling for Deep Reinforcement Learning Workloads
  James Gleeson · Srivatsan Krishnan · Moshe Gabel · Vijay Janapa Reddi · Eyal de Lara · Gennady Pekhimenko
- 2020 Oral: MLPerf Training Benchmark
  Peter Mattson · Christine Cheng · Gregory Diamos · Cody Coleman · Paulius Micikevicius · David Patterson · Hanlin Tang · Gu-Yeon Wei · Peter Bailis · Victor Bittorf · David Brooks · Dehao Chen · Debo Dutta · Udit Gupta · Kim Hazelwood · Andy Hock · Xinyuan Huang · Daniel Kang · David Kanter · Naveen Kumar · Jeffery Liao · Deepak Narayanan · Tayo Oguntebi · Gennady Pekhimenko · Lillian Pentecost · Vijay Janapa Reddi · Taylor Robie · Tom St John · Carole-Jean Wu · Lingjie Xu · Cliff Young · Matei Zaharia
- 2020 Poster: MLPerf Training Benchmark
  Peter Mattson · Christine Cheng · Gregory Diamos · Cody Coleman · Paulius Micikevicius · David Patterson · Hanlin Tang · Gu-Yeon Wei · Peter Bailis · Victor Bittorf · David Brooks · Dehao Chen · Debo Dutta · Udit Gupta · Kim Hazelwood · Andy Hock · Xinyuan Huang · Daniel Kang · David Kanter · Naveen Kumar · Jeffery Liao · Deepak Narayanan · Tayo Oguntebi · Gennady Pekhimenko · Lillian Pentecost · Vijay Janapa Reddi · Taylor Robie · Tom St John · Carole-Jean Wu · Lingjie Xu · Cliff Young · Matei Zaharia
- 2020 Poster: BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
  Shang Wang · Yifan Bai · Gennady Pekhimenko
- 2020 Demonstration: Skyline: Interactive In-editor Performance Visualizations and Debugging for DNN Training
  Geoffrey Yu · Tovi Grossman · Gennady Pekhimenko
- 2020 Oral: BPPSA: Scaling Back-propagation by Parallel Scan Algorithm
  Shang Wang · Yifan Bai · Gennady Pekhimenko