Session

Efficient Training

Exhibit Hall A

Moderator: Bilge Acun



Wed 31 Aug 8:45 a.m. PDT — 10:15 a.m. PDT

Wed 31 Aug. 8:45 - 9:03 PDT

(Oral)
REX: Revisiting Budgeted Training with an Improved Schedule

John Chen · Cameron Wolfe · Tasos Kyrillidis

Deep learning practitioners often operate on a computational and monetary budget. Thus, it is critical to design optimization algorithms that perform well under any budget. The linear learning rate schedule is considered the best budget-aware schedule, as it outperforms most other schedules in the low budget regime. On the other hand, learning rate schedules such as the 30-60-90 step schedule are known to achieve high performance when the model can be trained for many epochs. Yet, it is often not known a priori whether one's budget will be large or small; thus, the optimal choice of learning rate schedule is made on a case-by-case basis. In this paper, we frame the learning rate schedule selection problem as a combination of (i) selecting a profile (i.e., the continuous function that models the learning rate schedule), and (ii) choosing a sampling rate (i.e., how frequently the learning rate is updated/sampled from this profile). We propose a novel profile and sampling rate combination called the Reflected Exponential (REX) schedule, which we evaluate across seven different experimental settings with both SGD and Adam optimizers. REX outperforms the linear schedule in the low budget regime, while matching or exceeding the performance of several state-of-the-art learning rate schedules (linear, step, exponential, cosine, step decay on plateau, and OneCycle) in both high and low budget regimes. Furthermore, REX requires no added computation, storage, or hyperparameters.
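The profile and sampling-rate framing in this abstract can be illustrated with a minimal Python sketch. It uses the linear profile named in the abstract; the REX profile itself is not given here, so it is not reproduced. The function names (`linear_profile`, `sampled_schedule`) and the parameter `sample_every` are hypothetical, chosen only for this illustration.

```python
from typing import Callable, List


def linear_profile(z: float) -> float:
    """Linear decay profile: multiplier 1.0 at z=0, falling to 0.0 at z=1."""
    return 1.0 - z


def sampled_schedule(
    profile: Callable[[float], float],
    base_lr: float,
    total_steps: int,
    sample_every: int,
) -> List[float]:
    """Return the learning rate used at each step.

    The continuous profile is re-sampled only every `sample_every` steps
    (hypothetical parameter); between samples the rate is held constant.
    """
    lrs = []
    current = base_lr * profile(0.0)
    for step in range(total_steps):
        if step % sample_every == 0:
            # z is the fraction of the training budget already consumed.
            current = base_lr * profile(step / total_steps)
        lrs.append(current)
    return lrs


# Same profile, two sampling rates: every step versus every 5 steps.
per_step = sampled_schedule(linear_profile, base_lr=0.1, total_steps=10, sample_every=1)
coarse = sampled_schedule(linear_profile, base_lr=0.1, total_steps=10, sample_every=5)
```

With `sample_every=1` the rate follows the profile at every step; with a larger value the same profile turns into a step-like schedule, which is the sense in which a schedule is a profile plus a sampling rate.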

Wed 31 Aug. 9:03 - 9:21 PDT

(Oral)
SRIFTY: Swift and Thrifty Distributed Neural Network Training on the Cloud

Liang Luo · Peter West · Pratyush Patel · Arvind Krishnamurthy · Luis Ceze

Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune the search space for the optimal VM. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and its VM instance recommendations offer significant throughput gain and cost reduction while satisfying user constraints compared to existing solutions in complex, real-world scenarios.

Wed 31 Aug. 9:21 - 9:39 PDT

(Oral)
Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining

Tim Kaler · Nickolas Stathas · Anne Ouyang · Alexandros-Stavros Iliopoulos · Tao Schardl · Charles E. Leiserson · Jie Chen

Improving the training and inference performance of graph neural networks (GNNs) is faced with a challenge uncommon in general neural networks: creating mini-batches requires a lot of computation and data movement due to the exponential growth of multi-hop graph neighborhoods along network layers. Such a unique challenge gives rise to a diverse set of system design choices. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment, under which we identify major performance bottlenecks hitherto under-explored by developers: mini-batch preparation and transfer. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler, a shared-memory parallelization strategy, and the pipelining of batch transfer with GPU computation. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised. Such an observation unifies training and inference, simplifying model implementation. We report comprehensive experimental results with several benchmark data sets and GNN architectures, including a demonstration that, for the ogbn-papers100M data set, our system SALIENT achieves a speedup of 3x over a standard PyTorch-Geometric implementation with a single GPU and a further 8x parallel speedup with 16 GPUs. Therein, training a 3-layer GraphSAGE model with sampling fanout (15, 10, 5) takes 2.0 seconds per epoch and inference with fanout (20, 20, 20) takes 2.4 seconds, attaining test accuracy 64.58%.
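To make the mini-batch preparation cost concrete, here is a minimal Python sketch of multi-hop neighborhood sampling with per-hop fanouts. It is not SALIENT's performance-engineered sampler; the function name `sample_neighborhood`, the adjacency-dictionary graph format, and the convention that fanouts are applied hop by hop outward from the seed nodes are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Set


def sample_neighborhood(
    adj: Dict[int, List[int]],
    seeds: Sequence[int],
    fanouts: Sequence[int],
    rng: random.Random,
) -> List[Set[int]]:
    """Sample a multi-hop neighborhood around `seeds`.

    layers[0] holds the seed nodes; layers[k] holds the nodes sampled at hop k.
    At each hop, at most `fanout` neighbors are drawn per frontier node.
    """
    layers = [set(seeds)]
    frontier = set(seeds)
    for fanout in fanouts:
        next_frontier: Set[int] = set()
        for node in frontier:
            neighbors = adj.get(node, [])
            k = min(fanout, len(neighbors))
            # Sample without replacement from this node's neighbor list.
            next_frontier.update(rng.sample(neighbors, k))
        layers.append(next_frontier)
        frontier = next_frontier
    return layers


# Toy graph (hypothetical): node i links to the next three nodes on a ring of 20.
n = 20
adj = {i: [(i + 1) % n, (i + 2) % n, (i + 3) % n] for i in range(n)}
layers = sample_neighborhood(adj, seeds=[0], fanouts=(2, 2), rng=random.Random(0))
```

Each hop can multiply the frontier by up to the fanout, so without sampling the neighborhood of a mini-batch grows exponentially with the number of layers. Capping the fanout bounds that growth, but the sampling and the resulting data transfer still have to run for every batch.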

Wed 31 Aug. 9:39 - 9:57 PDT

(Oral)
Improving Model Training with Multi-fidelity Hyperparameter Evaluation

Yimin Huang · Yujun Li · Hanrong Ye · Zhenguo Li · Zhihua Zhang

The evaluation of hyperparameters, neural architectures, or data augmentation policies becomes a critical problem in advanced deep model training with a large hyperparameter search space. In this paper, we propose an efficient and robust bandit-based algorithm called Sub-Sampling (SS) in the scenario of hyperparameter search evaluation and its modified version for high GPU usage. It evaluates the potential of hyperparameters by the sub-samples of observations and is theoretically proved to be optimal under the criterion of cumulative regret. We further combine SS with Bayesian Optimization and develop a novel hyperparameter optimization algorithm called BOSS. Empirical studies validate our theoretical arguments of SS and demonstrate the superior performance of BOSS on a number of applications, including Neural Architecture Search (NAS), Data Augmentation (DA), Object Detection (OD), and Reinforcement Learning (RL).

Wed 31 Aug. 9:57 - 10:15 PDT

(Oral)
Gyro Dropout: Maximizing Ensemble Effect in Neural Network Training

Jiwon Seo

This paper proposes gyro dropout, a variant of dropout that improves the efficiency of training neural networks. Instead of randomly dropping out neurons in every training iteration, gyro dropout pre-selects and trains a fixed number of subnetworks. Because the subnetworks are more stably trained, they are more diversified and thus their ensemble achieves good generalization. We further propose block-wise gyro dropout, or simply block-wise dropout, which is a GPU-friendly variant of gyro dropout. Block-wise dropout partitions hidden neurons into a number of groups that should be dropped out together throughout learning; this makes it efficient to prune the corresponding warp executions on GPUs. We evaluate the two dropout methods with seven neural networks and ten public datasets. In our evaluation, gyro dropout improves the accuracy of trained models by up to 1.93%; gyro dropout consistently achieves higher accuracy than conventional dropout in all experiments. Moreover, block-wise dropout speeds up the training of neural networks by up to 29.8% with little to no accuracy loss. Our implementation of gyro dropout is publicly available at https://github.com/mlsys-seo/gyro-dropout.
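The two ideas in this abstract, a fixed pool of pre-selected subnetworks and masks that drop neurons in groups, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that). The helper names, the round-robin reuse of masks, and the inverted-dropout scaling are hypothetical choices not taken from the abstract.

```python
import random
from typing import List


def make_mask_pool(num_neurons: int, num_subnetworks: int,
                   drop_prob: float, rng: random.Random) -> List[List[int]]:
    """Pre-select a fixed pool of dropout masks, one per subnetwork (1 = keep)."""
    return [
        [0 if rng.random() < drop_prob else 1 for _ in range(num_neurons)]
        for _ in range(num_subnetworks)
    ]


def make_blockwise_mask(num_neurons: int, block_size: int,
                        drop_prob: float, rng: random.Random) -> List[int]:
    """Build a mask in which each contiguous block of neurons is kept or dropped together."""
    mask: List[int] = []
    for start in range(0, num_neurons, block_size):
        keep = 0 if rng.random() < drop_prob else 1
        mask.extend([keep] * min(block_size, num_neurons - start))
    return mask


def apply_mask(activations: List[float], mask: List[int], drop_prob: float) -> List[float]:
    """Zero the dropped activations and rescale the kept ones (assumed inverted-dropout scaling)."""
    scale = 1.0 / (1.0 - drop_prob)
    return [a * m * scale for a, m in zip(activations, mask)]


rng = random.Random(0)
pool = make_mask_pool(num_neurons=8, num_subnetworks=4, drop_prob=0.5, rng=rng)
block_mask = make_blockwise_mask(num_neurons=8, block_size=4, drop_prob=0.5, rng=rng)

outputs = []
for iteration in range(6):
    # Reuse a mask from the fixed pool instead of drawing a fresh one each iteration.
    # Round-robin order is an assumption; the abstract does not specify the schedule.
    mask = pool[iteration % len(pool)]
    outputs.append(apply_mask([1.0] * 8, mask, drop_prob=0.5))
```

Conventional dropout would call the mask generator inside the loop. Here the pool is created once, so only `num_subnetworks` distinct subnetworks are ever trained, and a block-wise mask keeps whole groups of neurons aligned so that their computation could be skipped together.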