Parallel training of an ensemble of deep neural networks (DNNs) on a cluster of nodes is an effective approach to shortening neural network architecture search and hyper-parameter tuning for a given learning task. Prior efforts have shown that data sharing, where a common data preprocessing stage is shared across the DNN training pipelines, saves computational resources and improves pipeline efficiency. The data sharing strategy, however, performs poorly for a heterogeneous set of DNNs, where each DNN has different computational needs and hence a different training rate and convergence speed. This paper proposes FLEET, a flexible ensemble DNN training framework for efficiently training a heterogeneous set of DNNs. We build FLEET through several technical innovations. We theoretically prove that finding an optimal resource allocation is NP-hard and propose a greedy algorithm that efficiently allocates resources for training each DNN with data sharing. We integrate data-parallel DNN training into ensemble training to mitigate the differences in training rates. We introduce checkpointing into this context to address the issue of differing convergence speeds. Experiments show that FLEET significantly improves the training efficiency of DNN ensembles without compromising the quality of the results.
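To make the greedy resource-allocation idea concrete, below is a minimal Python sketch. It is not FLEET's published algorithm: it assumes ideal linear data-parallel scaling and hypothetical per-GPU step times, and simply hands each remaining GPU to the currently slowest pipeline so that training rates across the ensemble even out. All names and numbers (`DNN`, `greedy_allocate`, the step times) are illustrative assumptions.

```python
from dataclasses import dataclass
import heapq

@dataclass
class DNN:
    name: str
    step_time: float  # estimated seconds per training step on one GPU (hypothetical profile)

def greedy_allocate(dnns, num_gpus):
    """Greedily assign GPUs so the slowest pipeline gets the next GPU.

    A simplified stand-in for FLEET's allocator: every DNN starts with one
    GPU, and each remaining GPU goes to the DNN whose current effective
    step time (step_time / gpus, assuming ideal data-parallel scaling) is
    largest, narrowing the gap in training rates across the ensemble.
    """
    assert num_gpus >= len(dnns), "need at least one GPU per DNN"
    # Max-heap keyed by effective step time; each DNN starts with 1 GPU.
    heap = [(-d.step_time, d.name, 1, d.step_time) for d in dnns]
    heapq.heapify(heap)
    for _ in range(num_gpus - len(dnns)):
        _, name, gpus, base = heapq.heappop(heap)
        gpus += 1
        heapq.heappush(heap, (-(base / gpus), name, gpus, base))
    return {name: gpus for _, name, gpus, _ in heap}

if __name__ == "__main__":
    ensemble = [DNN("resnet50", 0.30), DNN("vgg16", 0.45), DNN("mobilenet", 0.10)]
    print(greedy_allocate(ensemble, 8))  # -> {'vgg16': 4, 'resnet50': 3, 'mobilenet': 1}
```

FLEET's actual allocator must additionally account for data sharing among co-scheduled DNNs; the heap-based balancing above captures only the rate-equalization aspect described in the abstract.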
Hui Guan (North Carolina State University)
Laxmikant Kishor Mokadam (North Carolina State University)
Xipeng Shen (North Carolina State University)
Prof. Xipeng Shen joined North Carolina State University in August 2014 as a Chancellor's Faculty Excellence Program cluster hire in Data-Driven Science. He is a recipient of the DOE Early Career Award, the NSF CAREER Award, the Google Faculty Research Award, and the IBM CAS Faculty Fellow Award. He is an ACM Distinguished Member, an ACM Distinguished Speaker, and a senior member of IEEE. He was honored with the University Faculty Scholars Award "as an emerging academic leader who turns research into solutions to society's most pressing issues".
Seung-Hwan Lim (Oak Ridge National Laboratory)
Robert Patton (Rensselaer Polytechnic Institute, Oak Ridge National Laboratory)
Related Events
2020 Oral: FLEET: Flexible Efficient Ensemble Training for Heterogeneous Deep Neural Networks
Mon, Mar 2, 11:20 -- 11:45 AM, Room: Ballroom A