

Timezone: US/Pacific

Registration Desk: Registration Check-in Desk Tue 30 Aug 06:30 a.m.  


Tutorial: Burak Aksar

ML-based Computer System Telemetry Analytics

This tutorial introduces the audience to ML-based telemetry analytics for large-scale computing systems, with the goal of improving system performance, resilience, and power efficiency. Modern large-scale computing systems (e.g., data centers and High-Performance Computing clusters) are highly parallel systems that perform numerous complex operations concurrently, and they are critical for many societal and scientific applications. The high degrees of parallelism these systems support often lead to significant resource contention and, eventually, to performance variability and loss of efficiency. One way to assess system performance and identify the root causes of problems is to gather and inspect telemetry data. Such telemetry (from hundreds or thousands of hardware and software sensors) and log data are readily available on any computer system today. Because this system data comprises billions of data points per day, manual analysis is impractical and of limited benefit. Given these limitations, ML is emerging as a promising approach to automating performance analytics. Computer system telemetry analytics is also a challenging application area with many open problems, since labeled data is scarce while unlabeled data can reach terabytes per day.

The goal of this tutorial is twofold. First, it provides an overview of telemetry-based analytics and shows why ML-based approaches are more promising than existing methods at identifying which applications are running on compute nodes, detecting performance and other anomalies, and diagnosing their root causes. Second, participants will learn and practice these techniques directly in hands-on activities using open-source analytics frameworks designed by the speakers' teams at Boston University and the University of Bologna. By the end of this tutorial, participants will have a better understanding of the challenges and opportunities in this area and will have gained the skills needed to employ ML-based frameworks for solving complex problems in computer systems.
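
The hands-on sessions use the speakers' open-source frameworks from Boston University and the University of Bologna. As a rough, self-contained illustration of the kind of unsupervised telemetry analysis involved (not the tutorial's actual code; the feature matrix and the Isolation Forest choice are assumptions), the sketch below flags anomalous node telemetry samples:

```python
# Minimal sketch: unsupervised anomaly detection over node telemetry features.
# Assumes `X` is a (num_samples, num_sensors) matrix of per-node statistics;
# not taken from the tutorial's frameworks.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))   # healthy nodes
anomalous = rng.normal(loc=4.0, scale=1.0, size=(10, 16))  # e.g., contended nodes
X = np.vstack([normal, anomalous])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = detector.predict(X)            # +1 = normal, -1 = anomaly
print("flagged samples:", np.where(labels == -1)[0])
```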

Burak Aksar is a Ph.D. student in the Department of Electrical and Computer Engineering at Boston University. He received his B.S. degree in Electronics Engineering from Sabanci University, Istanbul, Turkey. His research interests include applying machine learning and explainable AI techniques to improve the performance of large-scale computing systems. He has completed internships at IBM AI Research and Sandia National Labs.



Oral: Systems for ML 2 Tue 30 Aug 08:45 a.m.  

Samuel A. Stein · Betis Baheri · Daniel Chen · Ying Mao · Qiang Guan · Ang Li · Shuai Xu · Caiwen Ding

[ Exhibit Hall A ]

In the past decade, remarkable progress has been achieved in deep learning-related systems and applications. In the post-Moore's Law era, however, the limit of semiconductor fabrication technology along with the increasing data size has slowed down the development of learning algorithms. In parallel, the rapid development of quantum computing has pushed it into a new era. Google illustrated quantum supremacy by completing a specific task (a random sampling problem) in 200 seconds, which remains impracticable for the largest classical computers. Due to the exponential potential of quantum computing, quantum-based learning is an area of interest, in hopes that certain systems might offer a quantum speedup. In this work, we propose a novel architecture, QuClassi, a quantum neural network for both binary and multi-class classification. Powered by a quantum differentiation function along with a hybrid quantum-classical design, QuClassi encodes the data with a reduced number of qubits and generates the quantum circuit, iteratively pushing it to the quantum platform to search for the best states. We conduct intensive experiments both on quantum simulators and on IBM-Q's quantum platform, and also evaluate performance on IonQ. The evaluation results demonstrate that QuClassi is able to outperform the state-of-the-art quantum-based solutions, Tensorflow-Quantum and …
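
QuClassi's circuit design and quantum differentiation function are specific to the paper; the toy sketch below only illustrates the hybrid quantum-classical training loop such systems build on. It simulates a single-qubit variational classifier in NumPy and uses the standard parameter-shift rule for gradients; the circuit, data, and loss are illustrative assumptions, not QuClassi itself.

```python
# Toy hybrid quantum-classical loop (NOT QuClassi): a single-qubit variational
# classifier simulated classically. The state RY(theta) RY(x) |0> gives
# P(|1>) = sin^2((x + theta) / 2), which we use as the class-1 score.
import numpy as np

def p_one(x, theta):
    return np.sin((x + theta) / 2.0) ** 2

def parameter_shift_grad(x, theta):
    # Standard parameter-shift rule: the exact dP/dtheta from two extra "circuit" runs.
    return 0.5 * (p_one(x, theta + np.pi / 2) - p_one(x, theta - np.pi / 2))

rng = np.random.default_rng(0)
xs = np.concatenate([rng.normal(0.5, 0.2, 50), rng.normal(2.5, 0.2, 50)])
ys = np.concatenate([np.zeros(50), np.ones(50)])

theta = 0.1
for _ in range(200):                         # classical optimizer drives the circuit
    preds = p_one(xs, theta)
    grad = np.mean(2.0 * (preds - ys) * parameter_shift_grad(xs, theta))
    theta -= 0.5 * grad                      # gradient descent on squared error
print("training accuracy:", np.mean((p_one(xs, theta) > 0.5) == ys))
```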

Andrew Or · Haoyu Zhang · Michael Freedman

[ Exhibit Hall A ]

We propose VirtualFlow, a system leveraging a novel abstraction called virtual node processing to decouple the model from the hardware. In each step of training or inference, the batch of input data is split across virtual nodes instead of hardware accelerators (e.g., GPUs and TPUs). Mapping multiple virtual nodes to each accelerator and processing them sequentially effectively time slices the batch, thereby allowing users to reduce the memory requirements of their workloads and mimic large batch sizes on small clusters. Using this technique, VirtualFlow enables many new use cases, such as reproducing training results across different hardware, resource elasticity, and heterogeneous training. In our evaluation, our implementation of VirtualFlow for TensorFlow achieved strong convergence guarantees across different hardware with out-of-the-box hyperparameters, up to 48% lower job completion times with resource elasticity, and up to 42% higher throughput with heterogeneous training.
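
VirtualFlow is implemented inside TensorFlow; the PyTorch-style sketch below (the function name and setup are mine, not the paper's API) illustrates only the core time-slicing idea: map several virtual nodes to one accelerator by splitting the batch and processing the slices sequentially before a single weight update.

```python
# Minimal sketch of virtual-node time slicing on one accelerator
# (illustrative; not VirtualFlow's actual implementation or API).
import torch

def train_step_with_virtual_nodes(model, loss_fn, optimizer, batch, targets,
                                  virtual_nodes_per_device=4):
    """One training step that time-slices a large batch across virtual nodes."""
    optimizer.zero_grad()
    micro_batches = torch.chunk(batch, virtual_nodes_per_device)
    micro_targets = torch.chunk(targets, virtual_nodes_per_device)
    for xb, yb in zip(micro_batches, micro_targets):
        loss = loss_fn(model(xb), yb) / virtual_nodes_per_device  # average over slices
        loss.backward()           # gradients accumulate across virtual nodes
    optimizer.step()              # single update, as if the full batch fit in memory

model = torch.nn.Linear(64, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
train_step_with_virtual_nodes(model, torch.nn.functional.cross_entropy, opt,
                              torch.randn(128, 64), torch.randint(0, 10, (128,)))
```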

Wasu Piriyakulkij · Cristina Menghini · Ross Briden · Nihal Vivekanand Nayak · Jeffrey Zhu · Elaheh Raisi · Stephen Bach

[ Exhibit Hall A ]

Machine learning practitioners often have access to a spectrum of data: labeled data for the target task (which is often limited), unlabeled data, and auxiliary data (the many labeled datasets available for other tasks). We describe TAGLETS, a system built to study techniques for automatically exploiting all three types of data and creating high-quality, servable classifiers. The key components of TAGLETS are: (1) auxiliary data organized according to a knowledge graph, (2) modules encapsulating different methods for exploiting auxiliary and unlabeled data, and (3) a distillation stage in which the ensembled modules are combined into a servable model. We compare TAGLETS with state-of-the-art transfer learning and semi-supervised learning methods on four image classification tasks. Our study covers a range of settings, varying the amount of labeled data and the semantic relatedness of the auxiliary data to the target task. We find that the intelligent incorporation of auxiliary and unlabeled data into multiple learning techniques enables TAGLETS to match, and most often significantly surpass, these alternatives. TAGLETS is available as an open-source system at github.com/anonymous.
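
As a rough sketch of the distillation stage described above (the module and student models here are stand-in linear layers, and the temperature and loss are assumptions rather than TAGLETS' actual recipe), the ensembled modules' soft predictions can be distilled into a single servable student like this:

```python
# Illustrative distillation of ensembled "modules" into one servable student.
import torch
import torch.nn.functional as F

def distill_step(student, teacher_modules, optimizer, x, temperature=2.0):
    """One distillation step: ensemble frozen teacher modules and train a single
    servable student on their averaged soft predictions."""
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(m(x) / temperature, dim=-1) for m in teacher_modules]
        ).mean(dim=0)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny usage with stand-in models (real modules would exploit auxiliary/unlabeled data).
student = torch.nn.Linear(32, 10)
teachers = [torch.nn.Linear(32, 10) for _ in range(3)]
opt = torch.optim.SGD(student.parameters(), lr=0.1)
print(distill_step(student, teachers, opt, torch.randn(8, 32)))
```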

Zhiming Hu · Ning Ye · Iqbal Mohomed

[ Exhibit Hall A ]

We study the problem of natural language-based video retrieval, the task of finding relevant videos given natural language search queries. Most recent state-of-the-art (SOTA) approaches embed the video and query separately and map the video and query embeddings into a joint latent space to calculate a similarity score between them. To learn a video representation, existing solutions generally use all the frames or sample a subset of frames from the video using uniform sampling. The former can be computationally prohibitive, while the latter may inject noise from uninformative frames into the final video representation. To address this, we propose mmSampler, a learning-based sampler, to adaptively select salient frames to represent the videos for multimodal video retrieval. mmSampler can greatly reduce the computational overhead for video representation without affecting the retrieval performance. We learn a lightweight policy network to decide whether to further process or discard a frame. By adopting the Gumbel-Softmax trick, we train the sampler jointly with the video retrieval model end-to-end in an efficient manner. Experimental results on benchmark datasets such as ActivityNet, DiDeMo and MSRVTT demonstrate that mmSampler achieves improved retrieval performance while saving as much as 43% GFLOPs per video.
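
The sketch below illustrates the Gumbel-Softmax frame-gating idea in isolation (the policy network, feature shapes, and temperature are assumptions for illustration, not mmSampler's actual architecture): a lightweight policy scores each frame, and a hard Gumbel-Softmax sample makes a discrete keep/discard decision while remaining differentiable for end-to-end training.

```python
# Minimal Gumbel-Softmax frame gating sketch (illustrative, not mmSampler).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGate(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.policy = nn.Linear(feat_dim, 2)  # per-frame logits: [discard, keep]

    def forward(self, frame_feats):  # frame_feats: (batch, num_frames, feat_dim)
        logits = self.policy(frame_feats)
        # hard=True gives a discrete 0/1 decision in the forward pass while the
        # backward pass uses the soft relaxation (straight-through estimator).
        decisions = F.gumbel_softmax(logits, tau=1.0, hard=True)
        keep_mask = decisions[..., 1:]              # (batch, num_frames, 1)
        return frame_feats * keep_mask, keep_mask

gate = FrameGate()
feats = torch.randn(2, 16, 512)                     # 16 candidate frames per video
gated_feats, mask = gate(feats)
print("kept frames per video:", mask.squeeze(-1).sum(dim=1))
```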

Carole-Jean Wu · Ramya Raghavendra · Udit Gupta · Bilge Acun · Newsha Ardalani · Kiwan Maeng · Gloria Chang · Fiona Aga · Jinshi Huang · Charles Bai · Michael Gschwind · Anurag Gupta · Myle Ott · Anastasia Melnikov · Salvatore Candido · David Brooks · Geeta Chauhan · Benjamin Lee · Hsien-Hsin Lee · Bugra Akyildiz · Maximilian Balandat · Joe Spisak · Ravi Jain · Mike Rabbat · Kim Hazelwood

[ Exhibit Hall A ]

This paper explores the environmental impact of the super-linear growth trends for AI from a holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the carbon footprint of AI computing by examining the model development cycle across industry-scale machine learning use cases and, at the same time, considering the life cycle of system hardware. Taking a step further, we capture the operational and manufacturing carbon footprint of AI computing and present an end-to-end analysis for what and how hardware-software design and at-scale optimization can help reduce the overall carbon footprint of AI. Based on the industry experience and lessons learned, we share the key challenges and chart out important development directions across the many dimensions of AI. We hope the key messages and insights presented in this paper can inspire the community to advance the field of AI in an environmentally-responsible manner.


Invited Talk: Dawn Song

Towards Building a Responsible Data Economy

Data is a key driver of the modern economy and of AI/machine learning. However, much of this data is sensitive, and handling it has created unprecedented challenges for both individuals and businesses, challenges that will only become more severe as we move forward in the digital era. In this talk, I will discuss the technologies needed for responsible data use, including secure computing, differential privacy, federated learning, and blockchain technologies for data rights, and how privacy-preserving computing technologies and blockchain can be combined to build a platform for a responsible data economy. Such a platform enables the creation of a new type of asset, data assets, along with more responsible use of data and a fair distribution of the value created from data.

Dawn Song

 

Dawn Song is a Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley. Her research interests lie in AI and deep learning, security, and privacy. She is the recipient of various awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, the ACM SIGSAC Outstanding Innovation Award, and Test-of-Time and Best Paper Awards from top conferences in computer security and deep learning. She is an ACM Fellow and an IEEE Fellow. She is ranked the most cited scholar in computer security (AMiner Award). She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur; she is the Founder of Oasis Labs and has been named to Inc.'s Female Founder 100 list and the Wired25 list of innovators.



Round Table Discussion Tue 30 Aug 11:30 a.m.  

We plan roundtable discussions on Tuesday to connect early-career professionals attending MLSys with senior MLSys conference attendees. When you sign up, we will seat early-career professionals at a reserved lunch table, where you can meet new people alongside a senior mentor assigned to your table. This is an informal group mentee-mentor event that lowers the barrier for young professionals starting their careers in the MLSys community.

Sign up here


Oral: ML for Systems Tue 30 Aug 01:00 p.m.  

Ankur Mallick · Kevin Hsieh · Behnaz Arzani · Gauri Joshi

[ Exhibit Hall A ]

Today's data centers rely increasingly on machine learning (ML) in their deployed systems. However, these systems are vulnerable to the data drift problem, that is, a mismatch between training and test data, which can lead to significant performance degradation and system inefficiencies. In this paper, we demonstrate the impact of data drift in production by studying two real-world deployments in a leading cloud provider. Our study shows that, despite frequent model retraining, these deployed models experience major accuracy drops (up to 40%) and high accuracy variation, which lead to a drastic increase in operational costs. None of the current solutions to the data drift problem are designed for large-scale deployments, which need to address real-world issues such as scale, ground truth latency, and mixed types of data drift. We propose Matchmaker, the first scalable, adaptive, and flexible solution to the data drift problem in large-scale production systems. Matchmaker finds the most similar training data batch and uses the corresponding ML model for inference on each test point. As part of Matchmaker, we introduce a novel similarity metric to address multiple types of data drift while only incurring limited overhead. Experiments on our two real-world ML deployments show that Matchmaker significantly improves …
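
Matchmaker's similarity metric and model management are described in the paper; the sketch below is only a rough illustration of the routing idea, with all details assumed: it keeps one model per training batch, summarizes each batch by its feature centroid, and serves each test point with the model whose batch centroid is nearest.

```python
# Illustrative sketch of similarity-based routing (not Matchmaker's metric):
# one model per training data batch; each test point is served by the model
# whose batch looks most similar to it (here: nearest feature centroid).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
batches = []
for shift in (0.0, 2.0, 4.0):               # three batches with drifted features
    X = rng.normal(shift, 1.0, size=(200, 5))
    y = (X.sum(axis=1) > 5 * shift).astype(int)
    batches.append((X, y))

models = [LogisticRegression(max_iter=1000).fit(X, y) for X, y in batches]
centroids = np.stack([X.mean(axis=0) for X, _ in batches])

def predict(x):
    # Route to the model trained on the most similar batch.
    batch_id = int(np.argmin(np.linalg.norm(centroids - x, axis=1)))
    return models[batch_id].predict(x.reshape(1, -1))[0]

x_test = rng.normal(2.0, 1.0, size=5)        # resembles the second batch
print("routed prediction:", predict(x_test))
```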

Xinfeng Xie · Prakash Prabhu · Ulysse Beaugnon · Mangpo Phothilimthana · Sudip Roy · Azalia Mirhoseini · Eugene Brevdo · James Laudon · Yanqi Zhou

[ Exhibit Hall A ]

Multi-Chip-Modules (MCMs) reduce the design and fabrication cost of machine learning (ML) accelerators while delivering performance and energy efficiency on par with a monolithic large chip. However, ML compilers targeting MCMs need to solve complex optimization problems optimally and efficiently to achieve this high performance. One such problem is the multi-chip partitioning problem, where compilers determine the optimal partitioning and placement of operations in tensor computation graphs on chiplets in MCMs. Partitioning ML graphs for MCMs is particularly hard as the search space grows exponentially with the number of chiplets available and the number of nodes in the neural network. Furthermore, the constraints imposed by the underlying hardware produce a search space where valid solutions are extremely sparse. In this paper, we present a strategy using a deep reinforcement learning (RL) framework to emit a possibly invalid candidate partition that is then corrected by a constraint solver. Using the constraint solver ensures that RL encounters valid solutions in the sparse space frequently enough to converge with fewer samples as compared to non-learned strategies. The graph neural network and sequential attention mechanism in our RL framework enable generalization across different ML graphs. Our evaluation of a production-scale model, BERT, on …

Junguk Cho · Diman Zad Tootaghaj · Lianjie Cao · Puneet Sharma

[ Exhibit Hall A ]

The current design of serverless computing frameworks assumes that all requests and the underlying compute hardware are homogeneous. This homogeneity assumption causes two challenges for running ML workloads, such as Deep Neural Network (DNN) inference services, on these frameworks: such workloads can have various request types and might require heterogeneous accelerators. First, existing serverless frameworks are threshold-based and use simple queries-per-second or CPU-utilization autoscaling rules, thus ignoring heterogeneous requests and accelerators and resulting in sub-optimal performance. Second, ignoring infrastructure heterogeneity in workload scheduling and inference request distribution can lead to further performance inefficiencies. To address these challenges, we propose the SLA-aware ML Inference Framework, a novel application- and hardware-aware serverless computing framework to manage ML (e.g., DNN) inference applications on heterogeneous infrastructure. Our framework designs an intelligent autoscaling strategy by leveraging rich, precise workload-specific metrics and heterogeneous GPU compute capability. We schedule functions on suitable GPU accelerators and proportionally distribute inference requests to the deployed functions based on the autoscaling decision. In addition, our framework enables multiple functions to efficiently share GPU accelerators, increasing resource efficiency with minimal overhead. Unlike prior work, we use application-specific SLA metrics to make scheduling and autoscaling decisions. We implement …

Yi Ding · Avinash Rao · Hyebin Song · Rebecca Willett · Henry (Hank) Hoffmann

[ Exhibit Hall A ]

Datacenters execute large computational jobs, which are composed of smaller tasks. A job completes when all its tasks finish, so stragglers (rare yet extremely slow tasks) are a major impediment to datacenter performance. Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate stragglers before they delay a job. While much prior work applies machine learning to predict computer system performance, these approaches rely on complete labels (i.e., sufficient examples of all possible behaviors, including straggling and non-straggling) or on strong assumptions about the underlying latency distributions (e.g., whether they are Gaussian or not). Within a running job, however, none of this information is available until stragglers have revealed themselves, by which point they have already delayed the job. To predict stragglers accurately and early without labeled positive examples or assumptions on latency distributions, this paper presents NURD, a novel Negative-Unlabeled learning approach with Reweighting and Distribution-compensation that only trains on negative and unlabeled streaming data. The key idea is to train a predictor using finished tasks of non-stragglers to predict latency for unlabeled running tasks, and then reweight each unlabeled task's prediction based on a weighting function of its feature space. We evaluate NURD on two production traces from Google and Alibaba, and find that compared to …
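
NURD's reweighting function is defined in the paper; the sketch below shows only the basic negative-unlabeled setup under assumed modeling choices: fit a latency predictor on finished (non-straggler) tasks, score still-running tasks, and apply a placeholder feature-space weighting for tasks that look unlike anything seen so far.

```python
# Sketch of the negative-unlabeled idea behind early straggler prediction
# (illustrative only; NURD's actual reweighting function differs).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Finished, non-straggler tasks (the "negatives"): features -> observed latency.
X_finished = rng.normal(0.0, 1.0, size=(500, 8))
latency = 10.0 + 2.0 * X_finished[:, 0] + rng.normal(0.0, 0.5, size=500)

predictor = GradientBoostingRegressor().fit(X_finished, latency)
density = KernelDensity().fit(X_finished)          # how "seen before" a task looks
ref_log_density = np.median(density.score_samples(X_finished))

def straggler_score(x_running):
    """Higher means more straggler-like. Combines the predicted latency with a
    placeholder penalty for tasks whose features lie far from any finished task."""
    x = x_running.reshape(1, -1)
    predicted_latency = predictor.predict(x)[0]
    novelty = max(0.0, ref_log_density - density.score_samples(x)[0])
    return predicted_latency * (1.0 + 0.1 * novelty)

x_running = rng.normal(3.0, 1.0, size=8)           # a running task with unusual features
print("straggler score:", straggler_score(x_running))
```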


Tutorial: Mert Hidayetoglu · Jinjun Xiong · Wen-Mei Hwu · Rakesh Nagi · Vikram Sharma Mailthody · Jeff Pool · Sitao Huang

Sparsity in ML: Understanding and Optimizing Sparsity in Neural Networks Running on Heterogeneous Systems

A plethora of ML models are either sparsified to save memory footprint and FLOPs (such as deep neural networks) or are inherently sparse due to their unstructured nature (such as graph neural networks). Nevertheless, even though sparsity is desirable in theory, it often hampers performance in practice because existing heterogeneous systems (such as GPUs and FPGAs) fall short on irregular computations. For example, because GPU architectures are optimized for regular, dense computations, only a tiny portion of the theoretical GPU performance is realized when performing sparse computation. In this tutorial, we discuss the sources of sparsity in deep neural networks as well as key techniques for mapping sparse computation onto heterogeneous systems to support high-performance inference and training. We will conclude the tutorial with a discussion of future work on model parallelism for optimizing sparse communication in large-scale sparse ML models.
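
As one concrete example of the structured sparsity discussed in the tutorial (the 2:4 pattern below is a common GPU-friendly scheme; the code is an illustration, not tutorial material), a weight matrix can be pruned so that every group of four weights keeps only its two largest-magnitude entries:

```python
# Sketch: prune a weight matrix to a 2:4 structured-sparsity pattern
# (keep the 2 largest-magnitude weights in every group of 4 along each row).
import torch

def prune_2_to_4(weight):
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "illustration assumes in_features divisible by 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the two smallest-magnitude weights in each group of four.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop_idx, 0.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
print("kept fraction:", (w_sparse != 0).float().mean().item())  # ~0.5
```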

Mert Hidayetoglu

 

Mert recently obtained his Ph.D. in electrical and computer engineering from the University of Illinois at Urbana-Champaign. His research is at the intersection of the theory and applications of large-scale high-performance computing and software systems for exascale computing. His work focuses on sparse and unstructured data accesses, communications, and computations on heterogeneous system architectures involving multi-GPU nodes. He was a Givens Fellow at Argonne National Laboratory and a member of the IBM-Illinois Center for Cognitive Computing Research. He is a recipient of the SC20 Best Paper Award, HPEC'20 Graph Challenge Champion awards, and the 2021 ACM-IEEE CS George Michael Memorial HPC Fellowship. He will join Stanford as a postdoctoral scholar. More about Mert's research is available at https://merthidayetoglu.github.io.
Jinjun Xiong

 

Dr. Jinjun Xiong is currently Empire Innovation Professor in the Department of Computer Science and Engineering at the University at Buffalo (UB). Prior to that, he was a Senior Researcher and Program Director for AI and Hybrid Cloud Systems at the IBM Thomas J. Watson Research Center. He co-founded and co-directed the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) from 2016 to 2021, the success of which led to a $200M, 10-year investment to establish the IBM-Illinois Discovery Accelerator Institute in 2021. His research interests span across-stack AI systems research, including AI applications, algorithms, tooling, and computer architectures.
Wen-Mei Hwu

 

Wen-mei Hwu is currently a Senior Distinguished Research Scientist with NVIDIA. Prior to NVIDIA, he was a Professor, Sanders-AMD Endowed Chair, Acting Department Head, and Chief Scientist of the Parallel Computing Institute at the University of Illinois at Urbana-Champaign. He and his Illinois team developed the superblock compiler scheduling and optimization framework that has been adopted by virtually all modern vendor and open-source compilers today. For his research contributions, he received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the IEEE Computer Society Charles Babbage Award, the ISCA Influential Paper Award, the MICRO Test-of-Time Award, the IEEE Computer Society B. R. Rau Award, the CGO Test-of-Time Award, and the Distinguished Alumni Award in Computer Science from the University of California, Berkeley. He has also won numerous best paper awards at major conferences. He is a Fellow of the ACM and the IEEE.
Rakesh Nagi

 

Rakesh Nagi is the Donald Biggar Willett Professor of Engineering at the University of Illinois, Urbana-Champaign. He served as Department Head of Industrial and Enterprise Systems Engineering (2013-2019) and as Interim Director of the Illinois Applied Research Institute (2016-2018). He is an affiliate faculty member in Computer Science, Electrical and Computer Engineering, the Coordinated Science Laboratory, and Computational Science and Engineering. Previously, he served as Chair (2006-2012) and Professor of Industrial and Systems Engineering at the University at Buffalo (SUNY) (1993-2013). He is a recipient of the IISE David F. Baker Distinguished Research Award (2022); the INFORMS Koopman Award from the Military Applications Society (2021, 2018); DARPA Graph Challenge Champion (2020), Honorable Mention (2017, 2019), Finalist (2018), and Innovation Award (2018, 2019); the IIE Transactions on Design and Manufacturing best paper award for journal issues from July 2011 through June 2012 (2013); the IIE Fellow Award (2010); UB's “Sustained Achievement Award” in recognition of outstanding achievements in scholarly activity (2009); Business First of Buffalo's “40 under Forty” award (2004); SME's Milton C. Shaw Outstanding Young Manufacturing Engineer Award (1999); IIE's Outstanding Young Industrial Engineer Award in Academia (1999); and the National Science Foundation's CAREER Award (1996). Dr. Nagi's recent research interests are in GPU-accelerated algorithms for discrete optimization and graph analytics.
Vikram Sharma Mailthody

 

Vikram is an incoming Research Scientist at NVIDIA Research starting in Fall 2022. Vikram completed his Ph.D. at the Electrical and Computer Engineering Department of the University of Illinois, Urbana Champaign. Vikram is interested in solving fundamental systems-level problems and proposing optimizations for emerging applications. He has extensive experience in memory and storage system design, GPUs, and performance optimizations for emerging applications like NLP, recommender systems, GNNs, and graph and data analytics. Vikram is a recipient of the Bahl Fellowship for 2019–21, the Dan Vivoli Endowed Fellowship for 2021–22, and has won several competitions, including the Championship award for the HPEC’20 Graph Challenge and multiple student innovation awards in HPEC Graph Challenge competitions.
Jeff Pool

Jeff Pool is a Senior Architect at NVIDIA. Prior to joining NVIDIA, he completed his Ph.D. and M.S. in Computer Science at the University of North Carolina at Chapel Hill in 2012 and 2009, respectively, focusing on power-efficient graphics hardware. Since then, he has been improving the power and performance of GPUs, researching neural network sparsity, and designing and building accelerators that exploit network sparsity to achieve practical speedups.
Sitao Huang

 

Sitao Huang is an assistant professor in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. He received his Ph.D. and M.S. degrees in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign in 2021 and 2017, respectively, and his B.Eng. degree in Electronics Engineering from Tsinghua University in 2014. Sitao's research interests include highly efficient hardware acceleration, programming languages and synthesis flows for hardware systems, and the optimization of heterogeneous systems. He is a recipient of the 2019 Sundaram Seshu International Student Fellowship and the 2018 Rambus Computer Engineering Fellowship. His research has won several awards, including the Best Paper Award at IDEAL 2021, a Best Paper Nomination at ASP-DAC 2021, the Student Innovation Award at the 2018 IEEE HPEC Graph Challenge, and first place in the DAC 2019 System Design Contest.



Oral: Hardware Efficient ML Tue 30 Aug 02:15 p.m.  

Kartikeya Bhardwaj · Milos Milosavljevic · Liam O'Neil · Dibakar Gope · Ramon Matas · Alex Chalfin · Naveen Suda · Lingchuan Meng · Danny Loh

[ Exhibit Hall A ]

With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose Super-Efficient Super Resolution (SESR) networks that establish a new state-of-the-art for efficient super resolution. Our approach is based on linear overparameterization of CNNs and creates an efficient model architecture for SISR. With theoretical analysis, we uncover the limitations of existing overparameterization methods and show how the proposed method alleviates them. Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art models while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. As a result, SESR can be used on constrained hardware to perform x2 (1080p to 4K) and x4 (1080p to 8K) SISR. Towards this, we estimate hardware performance numbers for a commercial Arm mobile-Neural Processing Unit (NPU) for 1080p to 4K (x2) and 1080p to 8K (x4) SISR. Our results highlight the challenges faced by super resolution on AI accelerators and demonstrate that SESR is significantly faster (e.g., 6x-8x higher FPS) than existing models on mobile-NPU. Finally, SESR outperforms prior models by 1.5x-2x in …
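
SESR's specific collapsible blocks are defined in the paper; the sketch below (shapes and names are assumptions) demonstrates only the general principle of linear overparameterization: two consecutive convolutions with no nonlinearity between them collapse exactly into a single convolution at inference time.

```python
# Sketch: collapsing a 1x1 conv followed by a 3x3 conv (no nonlinearity, no bias)
# into a single 3x3 conv. Illustrates linear overparameterization in general,
# not SESR's exact block structure.
import torch
import torch.nn.functional as F

in_ch, hidden_ch, out_ch = 3, 16, 3
w1 = torch.randn(hidden_ch, in_ch, 1, 1)    # expansion conv (train-time only)
w2 = torch.randn(out_ch, hidden_ch, 3, 3)   # 3x3 conv

x = torch.randn(1, in_ch, 32, 32)
y_expanded = F.conv2d(F.conv2d(x, w1), w2, padding=1)

# Collapse: K[o, i, :, :] = sum_c w2[o, c, :, :] * w1[c, i, 0, 0]
w_collapsed = torch.einsum("ochw,ci->oihw", w2, w1[:, :, 0, 0])
y_collapsed = F.conv2d(x, w_collapsed, padding=1)

# Difference should be float32 round-off only.
print("max abs difference:", (y_expanded - y_collapsed).abs().max().item())
```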

Yanqi Zhou · Xuanyi Dong · Tianjian Meng · Mingxing Tan · Berkin Akin · Daiyi Peng · Amir Yazdanbakhsh · Da Huang · Ravi Narayanaswami · James Laudon

[ Exhibit Hall A ]

Better neural architectures and new hardware accelerators are two driving forces for the progress in deep learning. Previous works typically focus on one aspect: they either design new neural architectures for fixed hardware like GPUs or customize hardware (often on FPGAs) for a fixed set of neural models like ResNets or Transformers. In this work, we aim to jointly optimize neural architecture and hardware configurations for Google's Edge TPUs. Through extensive studies, we observe that: 1) the neural architecture search space has to be customized to fully leverage the targeted hardware, 2) neural architecture and hardware accelerator should be jointly searched to achieve the best of both worlds, and 3) conventional metrics such as FLOPs and parameter size often do not well represent model efficiency in real accelerators. Our experiments show that our joint search approach, named NaaS, consistently outperforms previous state-of-the-art results, such as EfficientNet, on both image classification and segmentation tasks. Furthermore, our approach reduces energy consumption by up to 2x under the same accuracy on Edge TPUs.

Saurabh Agarwal · Hongyi Wang · Shivaram Venkataraman · Dimitris Papailiopoulos

[ Exhibit Hall A ]

A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent research proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 realistic distributed setups. Surprisingly, we observe that in only 6 out of more than 200 cases do gradient compression methods provide a speedup over optimized synchronous data-parallel training in the typical data-center setting. We conduct an extensive investigation to identify the root causes of this phenomenon, and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy in order to provide meaningful utility.
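
Many of the evaluated methods are sparsification schemes; as a concrete, simplified example of the kind of method studied (this is generic top-k sparsification, not any specific system from the paper), only the k largest-magnitude gradient entries would be communicated:

```python
# Sketch: top-k gradient sparsification (simplified; no error feedback).
import torch

def topk_compress(grad, ratio=0.01):
    flat = grad.reshape(-1)
    k = max(1, int(flat.numel() * ratio))
    _, indices = flat.abs().topk(k)
    # Only (indices, values) would be communicated instead of the dense tensor.
    return indices, flat[indices], grad.shape

def topk_decompress(indices, values, shape):
    flat = torch.zeros(shape, dtype=values.dtype).reshape(-1)
    flat[indices] = values
    return flat.reshape(shape)

grad = torch.randn(256, 1024)
idx, vals, shape = topk_compress(grad, ratio=0.01)
approx = topk_decompress(idx, vals, shape)
print("kept entries:", idx.numel(), "of", grad.numel())
```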

Seo Jin Park · Joshua Fried · Sunghyun Kim · Mohammad Alizadeh · Adam Belay

[ Exhibit Hall A ]

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2-2.3x improvement in total cluster throughput over standard data …


Oral: Testing, Debugging and Monitoring & Security Tue 30 Aug 04:00 p.m.  

Pradeep Dogga · Karthik Narasimhan · Anirudh Sivaraman · Shiv Saini · George Varghese · Ravi Netravali

[ Exhibit Hall A ]

A major difficulty in debugging distributed systems lies in manually determining which of the many available debugging tools to use and how to query that tool's logs. Our own study of a production debugging workflow confirms the magnitude of this burden. This paper explores whether a deep neural network trained on past bug reports and debugging logs can assist developers in distributed systems debugging. We present Revelio, a debugging assistant which takes user reports and system logs as input, and outputs debugging queries that developers can use to find a bug's root cause. The key challenges lie in (1) combining inputs of different types (e.g., natural language reports and quantitative logs) and (2) generalizing to unseen faults. Revelio addresses these by employing deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space. In addition, it exploits observations from production systems to factorize query generation into two computationally and statistically simpler learning tasks. To evaluate Revelio, we built a testbed with multiple distributed applications and debugging tools. By injecting faults and training on logs and reports from 800 Mechanical Turkers, we show that Revelio includes the most helpful query in its predicted list of …

Donglin Zhuang · Xingyao Zhang · Shuaiwen Song · Sara Hooker

[ Exhibit Hall A ]

The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise introduced by algorithmic design choices. In this work, we address a less well understood and studied question: how our choice of tooling introduces randomness into deep neural network training. We conduct large-scale experiments across different types of hardware, accelerators, state-of-the-art networks, and open-source datasets to characterize how tooling choices contribute to the level of non-determinism in a system, the impact of said non-determinism, and the cost of eliminating different sources of noise. Our findings suggest that the impact of non-determinism is nuanced. While top-line metrics such as top-1 accuracy are not noticeably impacted, model performance on certain parts of the data distribution is far more sensitive to the introduction of randomness. Our results suggest that deterministic tooling is critical for AI safety. However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to 746% on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training.
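
The paper quantifies the cost of eliminating tooling noise; for reference, the sketch below shows the standard PyTorch switches that "deterministic tooling" typically involves in practice (generic PyTorch usage, not the paper's experimental setup).

```python
# Typical switches for reducing tooling non-determinism in PyTorch training
# (generic usage; not the paper's exact configuration).
import os
import random
import numpy as np
import torch

def make_deterministic(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)                      # also seeds CUDA RNGs
    torch.backends.cudnn.benchmark = False       # disable autotuned kernel choice
    torch.backends.cudnn.deterministic = True    # force deterministic cuDNN kernels
    # Error out on any op that has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    # Required by some cuBLAS routines when determinism is requested.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

make_deterministic(seed=42)
```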

Ningning Xie · Tamara Norman · Dominik Grewe · Dimitrios Vytiniotis

[ Exhibit Hall A ]

We present a novel characterization of the mapping of multiple parallelism forms (e.g., data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mapping. We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x). We offer a novel syntax-guided program synthesis framework that is able to decompose reductions over one or more parallelism axes to sequences of collectives in a hierarchy- and mapping-aware way. For 69% of parallelism placements and user requested reductions, our framework synthesizes programs that outperform the default all-reduce implementation when evaluated on different GPU hierarchies (max 2.04x, average 1.27x). We complement our synthesis tool with a simulator exceeding 90% top-10 accuracy, which therefore reduces the need for massive evaluations of synthesis results to determine a small set of optimal programs and mappings.
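
The framework synthesizes such decompositions automatically; the hand-written NumPy example below (grid sizes and layout are assumed) merely illustrates the kind of hierarchy-aware decomposition being searched over: an all-reduce over a node-by-GPU grid computed as a within-node reduction followed by an across-node reduction.

```python
# Hand-written example of a hierarchy-aware all-reduce decomposition
# (illustration only; the paper synthesizes such programs automatically).
import numpy as np

nodes, gpus_per_node, vec_len = 4, 8, 1024
rng = np.random.default_rng(0)
# One gradient shard per device, laid out as [node, gpu, vector].
shards = rng.normal(size=(nodes, gpus_per_node, vec_len))

# Flat all-reduce: every device ends up with the sum over all 32 devices.
flat_result = shards.sum(axis=(0, 1))

# Hierarchical decomposition: reduce within each node first (fast local links),
# then reduce the per-node partial sums across nodes (slower network).
per_node = shards.sum(axis=1)          # within-node reduction -> [node, vector]
hierarchical_result = per_node.sum(axis=0)

print("decompositions agree:", np.allclose(flat_result, hierarchical_result))
```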

Hanpeng Hu · Chenyu Jiang · Yuchen Zhong · Yanghua Peng · Chuan Wu · Yibo Zhu · Haibin Lin · Chuanxiong Guo

[ Exhibit Hall A ]

Distributed training using multiple devices (i.e., GPU servers) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and exercise effective performance optimizations when unexpectedly low training speed occurs. To date, there exists no software tool that diagnoses performance issues and helps expedite distributed DNN training across different machine learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication, and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and Parameter Server architecture). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with <5% error in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.

Wei Hao · Aahil Awatramani · Jiayang Hu · Chengzhi Mao · Pin-Chun Chen · Eyal Cidon · Asaf Cidon · Junfeng Yang

[ Exhibit Hall A ]

Full-precision deep learning models are typically too large or costly to deploy on edge devices. To accommodate the limited hardware resources, models are adapted to the edge using various edge-adaptation techniques, such as quantization and pruning. While such techniques may have a negligible impact on top-line accuracy, the adapted models exhibit subtle differences in output compared to the original model from which they are derived. In this paper, we introduce a new evasive attack, DIVA, that exploits these differences in edge adaptation by adding adversarial noise to input data that maximizes the output difference between the original and adapted model. Such an attack is particularly dangerous because the malicious input will trick the adapted model running on the edge but will be virtually undetectable by the original model, which typically serves as the authoritative model version used for validation, debugging, and retraining. We compare DIVA to a state-of-the-art attack, PGD, and show that DIVA is only 1.7-3.6% worse at attacking the adapted model but 1.9-4.2 times more likely to go undetected by the original model under whitebox and semi-blackbox settings, compared to PGD.
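
DIVA's precise objective and threat model are specified in the paper; the PGD-style sketch below (the models, loss, and bound are illustrative assumptions) conveys only the basic idea of searching for a bounded perturbation that drives the adapted model's output away from the original model's output.

```python
# Illustrative PGD-style search for a perturbation that maximizes the output
# divergence between an original model and its edge-adapted copy
# (simplified; not DIVA's exact objective or threat model).
import torch
import torch.nn.functional as F

def divergence_attack(original, adapted, x, eps=8 / 255, alpha=2 / 255, steps=10):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        out_orig = original(x + delta)
        out_adap = adapted(x + delta)
        # Maximize disagreement between adapted and original predictions.
        loss = F.kl_div(F.log_softmax(out_adap, dim=-1),
                        F.softmax(out_orig, dim=-1), reduction="batchmean")
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # gradient ascent step
            delta.clamp_(-eps, eps)                  # stay within the L-inf ball
            delta.grad.zero_()
    return (x + delta).detach()

original = torch.nn.Linear(16, 4)                    # stand-ins for real models
adapted = torch.nn.Linear(16, 4)                     # e.g., a quantized/pruned copy
x_adv = divergence_attack(original, adapted, torch.rand(2, 16))
```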