Moderators: Carole-Jean Wu · Yuejie Chi
Sudip Roy · Jeff Dean · Sanjay Ghemawat · Ryan Sepassi · Hyeontaek Lim · Michael Isard · Paul Barham · Yonghui Wu · Laurent Shafey · Aakanksha Chowdhery · Chandu Thekkath · Brennan Saeta · Parker Schuh · Daniel Hurt · Ruoming Pang · Steven Hand
We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures, and efficiently gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects. Pathways makes use of a novel asynchronous distributed dataflow design that lets the control plane execute in parallel despite dependencies in the data plane. This design, with careful engineering, allows Pathways to adopt a single-controller model that makes it easier to express complex new parallelism patterns. We demonstrate that Pathways can achieve performance parity (~100% accelerator utilization) with state-of-the-art systems when running SPMD computations over 2048 TPUs, while also delivering throughput comparable to the SPMD case for Transformer models that are pipelined across 16 stages, or sharded across two islands of accelerators connected over a data center network.
Zirui Xu · Zirui Xu · Fuxun Yu · Jinjun Xiong · Jinjun Xiong · Xiang Chen
The significant success of Deep Neural Networks (DNNs) is highly promoted by the multiple sophisticated DNN libraries. On the contrary, although some work have proved that Quadratic Deep Neuron Networks (QDNNs) show better non-linearity and learning capability than the traditional first-order DNNs, their neuron design suffers certain drawbacks from theoretical performance to practical deployment. In this paper, we first proposed a new QDNN neuron architecture design, and further developed QuadraLib, a QDNN library to provide architecture optimization and design exploration for QDNNs. Extensive experiments show that our design has better performance regarding prediction accuracy and computation consumption on multiple learning tasks.
Aditya Desai · Li Chou · Anshumali Shrivastava
Deep learning for recommendation data is one of the most pervasive and challenging AI workload in recent times. State-of-the-art recommendation models are one of the largest models matching the likes of GPT-3 and Switch Transformer. Challenges in deep learning recommendation models (DLRM) stem from learning dense embeddings for each of the categorical tokens. These embedding tables in industrial scale models can be as large as hundreds of terabytes. Such large models lead to a plethora of engineering challenges, not to mention prohibitive communication overheads, and slower training and inference times. Of these, slower inference time directly impacts user experience. Model compression for DLRM is gaining traction and the community has recently shown impressive compression results. In this paper, we present Random Offset Block Embedding Array (ROBE) as a low memory alternative to embedding tables which provide orders of magnitude reduction in memory usage while maintaining accuracy and boosting execution speed. ROBE is a simple fundamental approach in improving both cache performance and the variance of randomized hashing, which could be of independent interest in itself. We demonstrate that we can successfully train DLRM models with same accuracy while using $1000 \times$ less memory. A $1000\times$ compressed model directly results in faster inference without any engineering effort. In particular, we show that we can train DLRM model using ROBE array of size 100MB on a single GPU to achieve AUC of 0.8025 or higher as required by official MLPerf CriteoTB benchmark DLRM model of 100GB while achieving about $3.1\times$ (209\%) improvement in inference throughput.
Hang Qiu · Ioanna Vavelidou · Jian Li · Evgenya Pergament · Pete Warden · Sandeep Chinchali · Zain Asgar · Sachin Katti
Benefited from expanding cloud infrastructure, today's neural networks have increasingly high performance trained on the cloud. Model researchers spent months of sweat competing for an extra few percent of model accuracy. However, when these models are actually deployed on edge devices in practice, very often, the performance is dropping over 10% all of a sudden without obvious reasons. The key challenge is that there is not much visibility to ML inference execution on edge devices, and very little awareness of potential issues during the edge deployment process. ML-EXray provides visibility into layer-level details of the ML execution, helps developers analyze and debug cloud-to-edge deployment issues. More often than not, the reason does not only lie in the model itself, but every operation throughout the data flow and the deployment process. Evaluations show that ML-EXray can effectively catch deployment issues, such as pre-processing bugs, quantization issues, suboptimal kernels; using ML-EXray, users need to write less than 15 line of code to fully examine the edge deployment pipeline; eradicating these issues, ML-EXray can correct model performance by up to 30%, pinpoint error-prune layers, guide users to optimize kernel execution latency by two orders of magnitude. Code and APIs will be released as a multi-lingual instrumentation library and a Python deployment validation library.
Corey Nolet · Divye Gala · Edward Raff · Joe Eaton · Brad Rees · Tim Oates
High-performance primitives for mathematical operations on sparse vectors must deal with the challenges of skewed degree distributions and limits on memory consumption that are typically not issues in dense operations. We demonstrate that a sparse semiring primitive can be flexible enough to support a wide range of critical distance measures while maintaining performance and memory efficiency on the GPU. We further show that this primitive is a foundational component for enabling many neighborhood-based information retrieval and machine learning algorithms to accept sparse input. To our knowledge, this is the first work aiming to unify the computation of several critical distance measures on the GPU under a single flexible design paradigm and we hope that it provides a good baseline for future research in this area. Our implementation is fully open source and publicly available as part of the RAFT library of GPU-accelerated machine learning primitives (https://github.com/rapidsai/raft).