Keywords: [ efficient training ] [ systems for ml ] [ distributed and parallel learning ]
Deep Neural Networks (DNNs) are often trained in parallel on a cluster of virtual machines (VMs) to reduce training time. However, this requires explicit cluster management, which is cumbersome and often results in costly overprovisioning of resources. Training DNNs on serverless compute is an attractive alternative that is receiving growing interest. In a serverless environment, users do not need to handle cluster management and can scale compute resources at a fine-grained level while paying for resources only when actively used. Despite these potential benefits, existing serverless systems for DNN training are ineffective because they are limited to CPU-based training and bottlenecked by expensive distributed communication. We present Hydrozoa, a system that trains DNNs on serverless containers with a hybrid-parallel architecture that flexibly combines data- and model-parallelism. Hydrozoa supports GPU-based training and leverages hybrid-parallelism and serverless resource scaling to achieve up to 155.5x and 5.4x higher throughput-per-dollar than existing serverless and VM-based training systems, respectively. Hydrozoa also allows users to implement dynamic worker-scaling policies during training. We show that dynamic worker scaling improves statistical training efficiency and reduces training costs.