Moderator: Gennady Pekhimenko
James Gleeson · Srivatsan Krishnan · Moshe Gabel · Vijay Janapa Reddi · Eyal de Lara · Gennady Pekhimenko
Deep reinforcement learning (RL) has made groundbreaking advancements in robotics, data center management and other applications. Unfortunately, system-level bottlenecks in RL workloads are poorly understood; we observe fundamental structural differences in RL workloads that make them inherently less GPU-bound than supervised learning (SL). To explain where training time is spent in RL workloads, we propose RL-Scope, a cross-stack profiler that scopes low-level CPU/GPU resource usage to high-level algorithmic operations, and provides accurate insights by correcting for profiling overhead. Using RL-Scope, we survey RL workloads across its major dimensions including ML backend, RL algorithm, and simulator. For ML backends, we explain a 2.3× difference in runtime between equivalent PyTorch and TensorFlow algorithm implementations, and identify a bottleneck rooted in overly abstracted algorithm implementations. For RL algorithms and simulators, we show that on-policy algorithms are at least 3.5× more simulation-bound than off-policy algorithms. Finally, we profile a scale-up workload and demonstrate that GPU utilization metrics reported by commonly used tools dramatically inflate GPU usage, whereas RL-Scope reports true GPU-bound time. RL-Scope is an open-source tool available at https://github.com/UofT-EcoSystem/rlscope.
Sam Kaufman · Phitchaya Phothilimthana · Yanqi Zhou · Charith Mendis · Sudip Roy · Amit Sabne · Mike Burrows
Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks—tile-size selection and operator fusion—and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
Xavier Bouthillier · Pierre Delaunay · Mirko Bronzi · Assya Trofimov · Brennan Nichyporuk · Justin Szeto · Nazanin Mohammadi Sepahvand · Edward Raff · Kanika Madan · Vikram Voleti · Samira Ebrahimi Kahou · Vincent Michalski · Tal Arbel · Chris Pal · Gael Varoquaux · Pascal Vincent
Strong empirical evidence that one machine-learning algorithm A outperforms another one B, ideally calls for multiple trials optimizing the learning pipeline over sources of variation such as data sampling, augmentation, parameter initialization, and hyperparameters choices. This is prohibitively expensive, and corners are cut to reach conclusions. We model the whole benchmarking process and all sources of variation, revealing that variance due to data sampling, parameter initialization and hyperparameter choice impact markedly machine learning benchmark. We analyze the predominant comparison methods used today in the light of this variance. We show a counter-intuitive result that a biased estimator with more source of variation will give better results, closer to the ideal estimator at a 51× reduction in compute cost. Using this we perform a detailed study on the error rate of detecting improvements, on five different deep-learning tasks/architectures. This study leads us to propose recommendations for future performance comparisons.
Tom Bannink · Adam Hillier · Lukas Geiger · Tim de Bruin · Leon Overweel · Jelmer Neeven · Koen Helwegen
We introduce Larq Compute Engine (LCE), a state-of-the-art Binarized Neural Network (BNN) inference engine, and use this framework to investigate several important questions about the efficiency of BNNs and to design a new leading BNN architecture. LCE provides highly optimized implementations of binary operations and accelerates binary convolutions by 8.5 - 18.5x compared to their full-precision counterparts on Pixel 1 phones. LCE's integration with Larq and a sophisticated MLIR-based converter allow users to move smoothly from training to deployment. By extending TensorFlow and TensorFlow Lite, LCE supports models which combine binary and full-precision layers, and can be easily integrated into existing applications. Using LCE, we analyze the performance of existing BNN computer vision architectures and develop QuickNet, a simple, easy-to-reproduce BNN that outperforms existing binary networks in terms of latency and accuracy on ImageNet. Furthermore, we investigate the impact of full-precision shortcuts and the relationship between number of multiply-accumulate operations and model latency. We are convinced that empirical performance should drive BNN architecture design and hope this work will facilitate others to design, benchmark and deploy binary models.