Moderator: Martin Maas
Vijay Janapa Reddi · David Kanter · Peter Mattson · Jared Duke · Thai Nguyen · Ramesh Chukka · Ken Shiring · Koan-Sin Tan · Mark Charlebois · William Chou · Mostafa El-Khamy · Jungwook Hong · Tom St John · Cindy Trinh · Michael Buch · Mark Mazumder · Relja Markovic · Thomas Atta · Fatih Cakir · Masoud Charkhabi · Xiaodong Chen · Cheng-Ming Chiang · Dave Dexter · Terry Heo · Guenther Schmuelling · Maryam Shabani · Dylan Zika
This paper presents the first industry-standard open-source machine learning (ML) benchmark to allow performance and accuracy evaluation of mobile devices with different AI chips and software stacks. The benchmark draws from the expertise of leading mobile-SoC vendors, ML-framework providers, and model producers. It comprises a suite of models that operate with standard data sets, quality metrics and run rules. We describe the design and implementation of this domain-specific ML benchmark. The current benchmark version comes as a mobile app for different computer vision and natural language processing tasks. The benchmark also supports non-smartphone devices, such as laptops and mobile PCs. Benchmark results from the first two rounds reveal the overwhelming complexity of the underlying mobile ML system stack, emphasizing the need for transparency in mobile ML performance analysis. The results also show that advances throughout the ML stack are improving performance. Within six months, offline throughput improved by 3x, while latency was reduced by as much as 12x. ML is an evolving field with changing use cases, models, data sets and quality targets. MLPerf Mobile will evolve and serve as an open-source community framework to guide research and innovation for mobile AI.
Haotian Tang · Zhijian Liu · Xiuyu Li · Yujun Lin · Song Han
Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on general-purpose hardware, and existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: irregular computation and data movement. It adopts adaptive MM grouping to trade computation for better regularity, achieving 1.4-1.5x speedup for matrix multiplication. It also optimizes the data movement by adopting vectorized, quantized and fused locality-aware memory access, reducing the memory movement cost by 2.7x. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6x and 1.5x measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
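To make the idea of trading computation for regularity concrete, here is a minimal sketch, not TorchSparse's actual grouping algorithm. It assumes that sparse convolution yields one matrix multiplication per kernel offset, each with a different number of rows, and that consecutive multiplications can be padded to a common size and run as one batched call. The function names, the greedy strategy and the `max_overhead` threshold are all hypothetical.

```python
def group_workloads(sizes, max_overhead=0.2):
    """Greedily group consecutive workloads while padding stays under max_overhead.

    sizes: number of rows in each per-offset matrix multiplication.
    Returns a list of groups; each group is a list of indices into sizes.
    """
    groups = []
    current = []
    for i in range(len(sizes)):
        candidate = current + [i]
        # Every member of a group is padded to the largest member's size.
        padded = max(sizes[j] for j in candidate) * len(candidate)
        useful = sum(sizes[j] for j in candidate)
        too_wasteful = padded > 0 and (padded - useful) / padded > max_overhead
        if current and too_wasteful:
            groups.append(current)  # close the current group
            current = [i]           # start a new group with this workload
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


def padding_overhead(sizes, groups):
    """Fraction of computed rows that are padding under the given grouping."""
    padded = sum(max(sizes[j] for j in g) * len(g) for g in groups)
    if padded == 0:
        return 0.0
    return (padded - sum(sizes)) / padded


# Hypothetical row counts for six kernel offsets.
sizes = [100, 96, 90, 20, 18, 5]
groups = group_workloads(sizes)
overhead = padding_overhead(sizes, groups)
print(groups)              # [[0, 1, 2], [3, 4], [5]]
print(round(overhead, 3))  # 0.046
```

In this toy case six irregular multiplications become three batched ones at the cost of under 5% extra rows. A real engine would choose the grouping from measured kernel performance rather than a fixed threshold.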
Runsheng Guo · Victor Guo · Antonio Kim · Josh Hildred · Khuzaima Daudjee
Deep Neural Networks (DNNs) are often trained in parallel on a cluster of virtual machines (VMs) so as to reduce training time. However, this requires explicit cluster management, which is cumbersome and often results in costly overprovisioning of resources. Training DNNs on serverless compute is an attractive alternative that is receiving growing interest. In a serverless environment, users do not need to handle cluster management and can scale compute resources at a fine-grained level while paying for resources only when actively used. Despite these potential benefits, existing serverless systems for DNN training are ineffective because they are limited to CPU-based training and bottlenecked by expensive distributed communication. We present Hydrozoa, a system that trains DNNs on serverless containers with a hybrid-parallel architecture that flexibly combines data- and model-parallelism. Hydrozoa supports GPU-based training and leverages hybrid-parallelism and serverless resource scaling to achieve up to 155.5x and 5.4x higher throughput-per-dollar compared to existing serverless and VM-based training systems. Hydrozoa also allows users to implement dynamic worker-scaling policies during training. We show that dynamic worker scaling improves statistical training efficiency and reduces training costs.
Michael Kuchnik · Ana Klimovic · Jiri Simsa · Virginia Smith · George Amvrosiadis
Input pipelines, which ingest and transform input data, are an essential part of training Machine Learning (ML) models. However, it is challenging to implement efficient input pipelines, as it requires reasoning about parallelism, asynchrony, and variability in fine-grained profiling information. Our analysis of over two million ML jobs in Google datacenters reveals that a significant fraction of model training jobs could benefit from faster input data pipelines. At the same time, our analysis indicates that most jobs do not saturate host hardware, pointing in the direction of software-based bottlenecks. Motivated by these findings, we propose Plumber, a tool for finding bottlenecks in ML input pipelines. Plumber uses an extensible and interpretable analytical model based on operational analysis to automatically tune parallelism, prefetching, and caching under host resource constraints. Across five representative ML pipelines, Plumber obtains speedups of up to 47x for misconfigured pipelines. By automating caching, Plumber obtains end-to-end speedups of over 50% compared to state-of-the-art tuners.
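As a rough illustration of bottleneck-driven tuning under a resource constraint, the sketch below models a pipeline as a chain of stages, each with a per-core processing rate. It treats pipeline throughput as the capacity of the slowest stage and hands each spare core to the current bottleneck. This is a simplified toy model, not Plumber's analytical model or API. The stage names, rates and function names are hypothetical, and linear scaling with cores is an assumption.

```python
def capacity(rates, cores, stage):
    """Elements per second a stage can sustain, assuming linear scaling with cores."""
    return rates[stage] * cores[stage]


def bottleneck(rates, cores):
    """Stage with the lowest capacity; it bounds end-to-end throughput."""
    return min(rates, key=lambda s: capacity(rates, cores, s))


def allocate(rates, total_cores):
    """Give every stage one core, then hand each spare core to the bottleneck."""
    if total_cores < len(rates):
        raise ValueError("need at least one core per stage")
    cores = {stage: 1 for stage in rates}
    for _ in range(total_cores - len(rates)):
        cores[bottleneck(rates, cores)] += 1
    return cores


# Hypothetical per-core rates in elements per second.
rates = {"read": 400.0, "decode": 50.0, "augment": 100.0}
cores = allocate(rates, total_cores=8)
throughput = capacity(rates, cores, bottleneck(rates, cores))
print(cores)       # {'read': 1, 'decode': 5, 'augment': 2}
print(throughput)  # 200.0
```

With one core per stage the toy pipeline is capped at 50 elements per second by decoding. Reassigning the five spare cores raises the cap to 200.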
Zhiqiang Xie · Minjie Wang · Zihao Ye · Zheng Zhang · Rui Fan
Graph neural networks (GNNs) are a new class of powerful machine learning models, but easy programming and efficient computing are often at odds. Current GNN frameworks are based on a message passing paradigm, and allow the concise expression of GNN models using built-in primitives and user-defined functions (UDFs). While built-in primitives offer high performance, they are limited in expressiveness; UDFs are flexible, but often have low performance and use excessive memory. In this paper, we propose Graphiler, a compiler stack for GNNs which achieves high performance while offering the flexibility of the UDF programming interface. At the core of Graphiler is a novel abstraction called Message Passing Data Flow Graph (MP-DFG), which enables optimizations that substantially reduce computational redundancy and memory footprint, and optimizes both homogeneous and heterogeneous GNNs under a unified framework. Experiments show that Graphiler can accelerate UDF GNNs by up to two orders of magnitude and achieve performance close to or superior to expert implementations, with substantial memory savings.
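The contrast between flexible UDFs and fast built-in primitives can be sketched in plain Python. The first function below follows the message passing paradigm with user-supplied message and reduce functions and materializes one message per edge. The second is a specialized "copy source feature, then sum" routine that accumulates in place without storing per-edge messages. Neither is Graphiler's API or its MP-DFG representation; all names and the tiny graph are hypothetical.

```python
def message_passing(edges, feats, message_fn, reduce_fn):
    """Generic UDF-style message passing over a list of (src, dst) edges."""
    mailbox = {}
    # Stage 1: compute and store one message per edge (memory grows with edge count).
    for src, dst in edges:
        mailbox.setdefault(dst, []).append(message_fn(feats[src], feats[dst]))
    # Stage 2: reduce each destination node's mailbox to a single value.
    return {dst: reduce_fn(msgs) for dst, msgs in mailbox.items()}


def fused_copy_sum(edges, feats):
    """Specialized equivalent of 'copy source feature, then sum', with no mailbox."""
    out = {}
    for src, dst in edges:
        out[dst] = out.get(dst, 0.0) + feats[src]
    return out


# Tiny directed graph: 0 -> 2, 1 -> 2, 2 -> 0, with scalar node features.
edges = [(0, 2), (1, 2), (2, 0)]
feats = {0: 1.0, 1: 2.0, 2: 4.0}

udf_result = message_passing(
    edges, feats,
    message_fn=lambda src_feat, dst_feat: src_feat,  # user-defined message
    reduce_fn=sum,                                   # user-defined reduction
)
fused_result = fused_copy_sum(edges, feats)
print(udf_result)    # {2: 3.0, 0: 4.0}
print(fused_result)  # {2: 3.0, 0: 4.0}
```

Both versions produce the same output, but only the generic one accepts arbitrary message functions. A compiler for UDF-based GNNs aims to recover the efficiency of the specialized form automatically.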