Session 4: Training (I)
Moderator: Hai Li
TensorFlow Lite Micro: Embedded Machine Learning for TinyML Systems
Robert David · Jared Duke · Advait Jain · Vijay Janapa Reddi · Nat Jeffries · Jian Li · Nick Kreeger · Ian Nappier · Meghna Natraj · Tiezhen Wang · Pete Warden · Rocky Rhodes
We introduce TensorFlow Lite Micro (TF Micro), an open-source machine learning inference framework for running deep-learning models on embedded systems. TF Micro tackles the efficiency requirements imposed by embedded-system resource constraints and the fragmentation challenges that make cross-platform interoperability nearly impossible. The framework adopts a unique interpreter-based approach that provides flexibility while overcoming these challenges. This paper explains the design decisions behind TF Micro and describes its implementation. We present an evaluation to demonstrate its low resource requirements and minimal run-time performance overhead.
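The TF Micro runtime itself is C++, but as a rough illustration of the workflow it targets, the Python sketch below shows how a trained model might be converted into the quantized flatbuffer that an embedded interpreter consumes. This uses the standard TensorFlow Lite converter and is not part of the paper's contribution; the saved-model path and calibration data are placeholders.

```python
# Hypothetical sketch: convert a trained SavedModel into an int8 TFLite
# flatbuffer suitable for deployment on a microcontroller. Paths, shapes,
# and calibration data are placeholders.
import numpy as np
import tensorflow as tf

def representative_data():
    # Calibration samples for full-integer quantization (placeholder values).
    for _ in range(100):
        yield [np.random.rand(1, 96, 96, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
# The resulting flatbuffer is typically embedded in firmware as a C array
# (e.g., via `xxd -i model.tflite`) and executed by the TF Micro interpreter.
```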
Scaling Distributed Training with Adaptive Summation
Saeed Maleki · Madan Musuvathi · Todd Mytkowicz · Olli Saarikivi · Tianju Xu · Vadim Eksarevskiy · Jaliya Ekanayake · Emad Barsoum
Data parallelism is a common way to parallelize stochastic gradient descent (SGD). However, the loss of convergence at large minibatch sizes limits the scalability of data parallelism. This paper introduces a novel method to combine gradients called Adasum that significantly improves the convergence when using large minibatches. This paper provides the intuition and formal justification of Adasum along with a convergence proof. Additionally, the paper describes an efficient implementation of Adasum and its integration into the open-source toolkit Horovod for use in both TensorFlow and PyTorch.
The paper empirically shows that Adasum improves convergence when using large minibatch sizes for multiple optimizers (Momentum-SGD, Adam, and LAMB). For BERT-Large training with a minibatch size of 64K, using both Adasum and LAMB training converges in 20% fewer epochs than with LAMB alone. This combination also allows BERT-Large training to scale to a 128K minibatch size. While one of the motivations for LAMB was the inability of the Adam optimizer to scale beyond a minibatch size of 16K, we show that Adasum helps Adam scale BERT-Large training to a 64K minibatch size. Our implementation of Adasum in Horovod has already been adopted in several production environments.
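As a rough illustration of the pairwise combination rule, the NumPy sketch below implements one commonly cited form of the Adasum operator: each gradient is scaled down by its projection onto the other before summing, so nearly parallel gradients are effectively averaged while orthogonal gradients add. This is a reading of the paper's formula, not the Horovod implementation.

```python
# Hedged sketch of the pairwise Adasum combination rule (NumPy); the
# production implementation lives in Horovod.
import numpy as np

def adasum_pair(g1: np.ndarray, g2: np.ndarray) -> np.ndarray:
    """Combine two gradients: average when parallel, add when orthogonal."""
    dot = np.dot(g1, g2)
    n1 = np.dot(g1, g1)
    n2 = np.dot(g2, g2)
    if n1 == 0.0 or n2 == 0.0:
        return g1 + g2
    return (1.0 - dot / (2.0 * n1)) * g1 + (1.0 - dot / (2.0 * n2)) * g2

def adasum_reduce(grads):
    """Recursively combine a power-of-two list of worker gradients pairwise."""
    if len(grads) == 1:
        return grads[0]
    half = len(grads) // 2
    return adasum_reduce([adasum_pair(a, b)
                          for a, b in zip(grads[:half], grads[half:])])

# Identical gradients are averaged; orthogonal gradients are summed.
g = np.array([1.0, 0.0])
h = np.array([0.0, 1.0])
assert np.allclose(adasum_pair(g, g), g)
assert np.allclose(adasum_pair(g, h), g + h)
```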
PipeMare: Asynchronous Pipeline Parallel DNN Training
Bowen Yang · Jian Zhang · Jonathan Li · Christopher Re · Christopher Aberger · Christopher De Sa
Pipeline parallelism when training neural networks enables models to be partitioned spatially, which can lead to overall higher hardware utilization. Unfortunately, to preserve the statistical efficiency of sequential training, existing pipeline parallel training techniques sacrifice hardware efficiency by decreasing pipeline utilization or incurring extra memory costs. In this paper, we investigate to what extent these sacrifices will be necessary on the emerging class of new dataflow hardware accelerators. We devise PipeMare, a simple yet robust training method that tolerates asynchronous updates during pipeline parallel execution without sacrificing utilization or memory, which allows efficient use of fine-grained pipeline parallelism. Concretely, when tested on ResNet and Transformer networks, asynchrony enables PipeMare to use up to 2.7x less memory or get 14.3x higher pipeline utilization, with similar model quality, when compared to state-of-the-art synchronous pipeline parallel training techniques.
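The toy NumPy simulation below illustrates only the core issue the paper addresses: in asynchronous pipeline-parallel execution, an update is applied with a gradient that was computed against weights several pipeline steps stale. It does not reproduce PipeMare's actual techniques; the delay, loss, and step size are arbitrary placeholders.

```python
# Toy simulation of delayed (asynchronous) gradient updates, illustrating the
# weight staleness that arises in asynchronous pipeline-parallel training.
# Illustrative only; this is not PipeMare itself.
import numpy as np

def grad(w):
    # Gradient of a simple quadratic loss 0.5 * ||w||^2 (placeholder model).
    return w

def delayed_sgd(w0, lr=0.1, delay=4, steps=50):
    """SGD where each applied gradient was computed `delay` steps earlier."""
    history = [w0.copy()] * delay   # the last `delay` weight versions
    w = w0.copy()
    for _ in range(steps):
        stale_w = history.pop(0)        # weights as seen by an upstream stage
        w = w - lr * grad(stale_w)      # update applies the stale gradient
        history.append(w.copy())
    return w

print("final weights with delay-4 updates:", delayed_sgd(np.array([1.0, -2.0])))
```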
Exploring the Limits of Concurrency in ML Training on Google TPUs
Sameer Kumar · Yu Wang · Cliff Young · James Bradbury · Naveen Kumar · Dehao Chen · Andy Swing
Recent results in language understanding using neural networks have required training hardware of unprecedented scale, with thousands of chips cooperating on a single training run. This paper presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips. We discuss model parallelism to overcome scaling limitations from the fixed batch size in data parallelism, communication/collective optimizations, distributed evaluation of training metrics, and host input processing scaling optimizations. These techniques are demonstrated in both the TensorFlow and JAX programming frameworks. We also present performance results from the recent Google submission to the MLPerf-v0.7 benchmark contest, achieving record training times from 16 to 28 seconds in four MLPerf models on the Google TPU-v3 Multipod machine.
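As a generic illustration of how data parallelism across TPU cores is typically expressed in JAX (this is not the MLPerf submission code, and the paper's model parallelism and input-pipeline optimizations are not shown), the sketch below uses jax.pmap with a cross-device gradient average; the toy model and batch shapes are placeholders.

```python
# Minimal JAX data-parallel training step: each device computes gradients on
# its shard of the batch and gradients are averaged with a cross-device pmean.
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]      # toy linear model
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    grads = jax.lax.pmean(grads, axis_name="batch")   # all-reduce over devices
    return jax.tree_util.tree_map(lambda p, g: p - 0.01 * g, params, grads)

# Usage: replicate params across local devices and shard the batch along the
# leading axis, e.g. x of shape (n_devices, per_device_batch, features).
n = jax.local_device_count()
params = {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}
params = jax.device_put_replicated(params, jax.local_devices())
x = jnp.ones((n, 16, 8))
y = jnp.ones((n, 16, 1))
params = train_step(params, x, y)
```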
TT-Rec: Tensor Train Compression for Deep Learning Recommendation Models
Chunxing Yin · Bilge Acun · Carole-Jean Wu · Xing Liu
The memory capacity of embedding tables in deep learning recommendation models (DLRMs) is increasing dramatically from tens of GBs to TBs across the industry. Given the fast growth of DLRMs, novel solutions are urgently needed to enable further DLRM innovation, and this must be done efficiently, without exponentially increasing infrastructure capacity demands. In this paper, we demonstrate the promising potential of Tensor Train decomposition for DLRMs (TT-Rec), an important yet under-investigated context. We design and implement optimized kernels (TT-EmbeddingBag) to evaluate the proposed TT-Rec design. TT-EmbeddingBag is 3x faster than the state-of-the-art TT implementation. The performance of TT-Rec is further optimized with batched matrix multiplication and caching strategies for embedding vector lookup operations. In addition, we analyze, both mathematically and empirically, the effect of the weight-initialization distribution on DLRM accuracy and propose to initialize the tensor cores of TT-Rec following the sampled Gaussian distribution. We evaluate TT-Rec across three important design space dimensions---memory capacity, accuracy, and timing performance---by training MLPerf-DLRM with Criteo's Kaggle and Terabyte data sets. TT-Rec compresses the model size by 4x to 221x for Kaggle, with a corresponding 0.03% to 0.3% loss of accuracy. For Terabyte, our approach achieves a 112x model size reduction with no accuracy loss or training-time overhead compared to the uncompressed baseline.
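As a rough sketch of the underlying idea (not the TT-EmbeddingBag kernels, which are optimized GPU code), the NumPy example below factors a vocabulary-by-dimension embedding table into three tensor-train cores and reconstructs a single embedding row on the fly; the vocabulary and embedding-dimension factorizations and the TT ranks are illustrative choices.

```python
# Illustrative NumPy sketch of a tensor-train (TT) factored embedding table.
# Shapes and ranks are arbitrary examples; the full table is never materialized.
import numpy as np

I = (200, 200, 250)          # vocabulary factors: V = 200*200*250 = 10,000,000
J = (4, 8, 4)                # embedding-dimension factors: D = 4*8*4 = 128
R = (1, 16, 16, 1)           # TT ranks

rng = np.random.default_rng(0)
# Core k has shape (R[k], I[k], J[k], R[k+1]); initialized from a Gaussian,
# echoing the paper's sampled-Gaussian initialization of the TT cores.
cores = [rng.normal(scale=0.01, size=(R[k], I[k], J[k], R[k + 1]))
         for k in range(3)]

def tt_embedding_row(idx):
    """Reconstruct embedding row `idx` from the TT cores."""
    # Mixed-radix decomposition of the flat vocabulary index.
    i3 = idx % I[2]; idx //= I[2]
    i2 = idx % I[1]; idx //= I[1]
    i1 = idx
    # Contract the selected core slices: result reshapes to a length-D vector.
    m = cores[0][:, i1, :, :]                                   # (1, J1, R1)
    m = np.einsum('ajr,rks->ajks', m, cores[1][:, i2, :, :])    # (1, J1, J2, R2)
    m = np.einsum('ajks,slt->ajklt', m, cores[2][:, i3, :, :])  # (1, J1, J2, J3, 1)
    return m.reshape(-1)

print(tt_embedding_row(123456).shape)   # (128,)
```

The compression comes from storing only the three cores (a few hundred thousand parameters in this example) instead of the full 10M x 128 table.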