

Session

Testing, Debugging and Monitoring & Security

Exhibit Hall A

Moderator: Peipei Zhou


Tue 30 Aug. 16:00 - 16:18 PDT

Revelio: ML-Generated Debugging Queries for Finding Root Causes in Distributed Systems

Pradeep Dogga · Karthik Narasimhan · Anirudh Sivaraman · Shiv Saini · George Varghese · Ravi Netravali

A major difficulty in debugging distributed systems lies in manually determining which of the many available debugging tools to use and how to query that tool’s logs. Our own study of a production debugging workflow confirms the magnitude of this burden. This paper explores whether a deep neural network trained on past bug reports and debugging logs can assist developers in distributed systems debugging. We present Revelio, a debugging assistant that takes user reports and system logs as input and outputs debugging queries that developers can use to find a bug’s root cause. The key challenges lie in (1) combining inputs of different types (e.g., natural language reports and quantitative logs) and (2) generalizing to unseen faults. Revelio addresses these by employing deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space. In addition, it exploits observations from production systems to factorize query generation into two computationally and statistically simpler learning tasks. To evaluate Revelio, we built a testbed with multiple distributed applications and debugging tools. By injecting faults and training on logs and reports from 800 Mechanical Turkers, we show that Revelio includes the most helpful query in its predicted list of top-3 relevant queries 96% of the time. Our developer study confirms the utility of Revelio.
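
As a rough illustration of the embedding-and-ranking idea described above, the sketch below embeds a (report, log) context and a set of candidate queries with two small PyTorch encoders and ranks queries by similarity in a shared vector space. The encoders, feature dimensions, and toy inputs are hypothetical stand-ins, not Revelio's actual architecture or training setup.

    # Minimal dual-encoder sketch (hypothetical; not Revelio's actual model).
    # One encoder embeds bug-report + log features, another embeds candidate
    # debugging queries; queries are ranked by cosine similarity.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    EMB_DIM = 64

    class ContextEncoder(nn.Module):
        """Embeds a concatenation of report-text features and log features."""
        def __init__(self, in_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, EMB_DIM))
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    class QueryEncoder(nn.Module):
        """Embeds a featurized candidate debugging query."""
        def __init__(self, in_dim):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, EMB_DIM))
        def forward(self, x):
            return F.normalize(self.net(x), dim=-1)

    ctx_enc, qry_enc = ContextEncoder(in_dim=32), QueryEncoder(in_dim=16)
    context = torch.randn(1, 32)       # stand-in for report + log features
    candidates = torch.randn(20, 16)   # stand-in for 20 featurized queries

    scores = (ctx_enc(context) @ qry_enc(candidates).T).squeeze(0)
    top3 = torch.topk(scores, k=3).indices  # top-3 ranked queries, Revelio-style
    print(top3.tolist())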

Tue 30 Aug. 16:18 - 16:36 PDT

Randomness in Neural Network Training: Characterizing the Impact of Tooling

Donglin Zhuang · Xingyao Zhang · Shuaiwen Song · Sara Hooker

The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise introduced by algorithmic design choices. In this work, we address a less well understood and studied question: how does our choice of tooling introduce randomness into deep neural network training? We conduct large-scale experiments across different types of hardware, accelerators, state-of-the-art networks, and open-source datasets to characterize how tooling choices contribute to the level of non-determinism in a system, the impact of said non-determinism, and the cost of eliminating different sources of noise. Our findings suggest that the impact of non-determinism is nuanced. While top-line metrics such as top-1 accuracy are not noticeably impacted, model performance on certain parts of the data distribution is far more sensitive to the introduction of randomness. Our results suggest that deterministic tooling is critical for AI safety. However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to 746% on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training.
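
For readers who want to probe the tooling-induced noise the abstract refers to, the sketch below shows the standard PyTorch knobs that trade speed for determinism (torch.use_deterministic_algorithms and cuDNN benchmarking) and measures run-to-run weight divergence on a toy model. The model, data, and metric are assumptions for illustration, not the paper's experimental setup.

    # Train the same small model twice under identical seeding and compare
    # final weights; the deterministic flags below are the usual PyTorch
    # switches whose cost the paper characterizes at much larger scale.
    import torch
    import torch.nn as nn

    def train_once(seed, deterministic):
        torch.manual_seed(seed)
        torch.use_deterministic_algorithms(deterministic)
        torch.backends.cudnn.benchmark = not deterministic  # benchmark adds noise
        model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        x, y = torch.randn(256, 10), torch.randint(0, 2, (256,))
        for _ in range(50):
            opt.zero_grad()
            nn.functional.cross_entropy(model(x), y).backward()
            opt.step()
        return torch.cat([p.detach().flatten() for p in model.parameters()])

    # With the same seed and deterministic algorithms enabled, the two runs
    # should match exactly; the max difference quantifies any residual noise.
    w1 = train_once(seed=0, deterministic=True)
    w2 = train_once(seed=0, deterministic=True)
    print("max weight difference across runs:", (w1 - w2).abs().max().item())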

Tue 30 Aug. 16:36 - 16:54 PDT

Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning

Ningning Xie · Tamara Norman · Dominik Grewe · Dimitrios Vytiniotis

We present a novel characterization of the mapping of multiple parallelism forms (e.g., data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mappings. We experimentally verify the substantial effect of these mappings on all-reduce performance (up to 448x). We offer a novel syntax-guided program synthesis framework that is able to decompose reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way. For 69% of parallelism placements and user-requested reductions, our framework synthesizes programs that outperform the default all-reduce implementation when evaluated on different GPU hierarchies (max 2.04x, average 1.27x). We complement our synthesis tool with a simulator exceeding 90% top-10 accuracy, which therefore reduces the need for massive evaluations of synthesis results to determine a small set of optimal programs and mappings.
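
To make the decomposition idea concrete, the toy sketch below expresses an all-reduce over a two-level hierarchy (nodes x GPUs per node) as a sequence of collectives: reduce-scatter within each node, all-reduce across nodes, then all-gather within each node, and checks that it matches a flat all-reduce. The sizes and the NumPy "simulation" are illustrative assumptions; the paper synthesizes such programs automatically and ranks mappings by cost.

    # Hierarchical decomposition of an all-reduce, verified against the flat sum.
    import numpy as np

    NODES, GPUS_PER_NODE, CHUNK = 2, 4, 3
    # data[n, g] is the gradient shard held by GPU g on node n
    data = np.random.rand(NODES, GPUS_PER_NODE, GPUS_PER_NODE * CHUNK)

    expected = data.sum(axis=(0, 1))          # result of a flat all-reduce

    # Step 1: reduce-scatter inside each node -> each GPU owns one chunk
    # of the per-node partial sum.
    node_sum = data.sum(axis=1)               # shape: [NODES, GPUS_PER_NODE*CHUNK]
    owned = node_sum.reshape(NODES, GPUS_PER_NODE, CHUNK)

    # Step 2: all-reduce the owned chunks across nodes (same GPU index on each node).
    owned = owned.sum(axis=0, keepdims=True).repeat(NODES, axis=0)

    # Step 3: all-gather inside each node to reassemble the full reduced vector.
    result = owned.reshape(NODES, -1)

    assert np.allclose(result[0], expected) and np.allclose(result[1], expected)
    print("hierarchical decomposition matches flat all-reduce")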

Tue 30 Aug. 16:54 - 17:12 PDT

dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training

Hanpeng Hu · Chenyu Jiang · Yuchen Zhong · Yanghua Peng · Chuan Wu · Yibo Zhu · Haibin Lin · Chuanxiong Guo

Distributed training using multiple devices (i.e., GPU servers) has been widely adopted for learning DNN models over large datasets. However, the performance of large-scale distributed training tends to be far from linear speed-up in practice. Given the complexity of distributed systems, it is challenging to identify the root cause(s) of inefficiency and exercise effective performance optimizations when unexpectedly low training speed occurs. To date, there exists no software tool that diagnoses performance issues and helps expedite distributed DNN training across different machine learning frameworks. This paper proposes dPRO, a toolkit that includes: (1) an efficient profiler that collects runtime traces of distributed DNN training across multiple frameworks, especially fine-grained communication traces, and constructs global data-flow graphs including detailed communication operations for accurate replay; (2) an optimizer that effectively identifies performance bottlenecks and explores optimization strategies (from computation, communication, and memory aspects) for training acceleration. We implement dPRO on multiple deep learning frameworks (PyTorch, TensorFlow, MXNet) and representative communication schemes (AllReduce and the Parameter Server architecture). Extensive experiments show that dPRO predicts the performance of distributed training in various settings with <5% error in most cases and finds optimization strategies with up to 87.1% speed-up over the baselines.
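
The replay idea (simulating iteration time from a global data-flow graph of profiled operations) can be sketched as a simple critical-path computation over op durations and dependencies. The ops, durations, and dependency graph below are made up for illustration and are far coarser than dPRO's per-framework compute and fine-grained communication traces.

    # Critical-path replay over a toy data-flow graph of profiled ops.
    import graphlib  # standard library, Python 3.9+

    durations = {                       # op -> measured duration (ms)
        "fwd": 5.0, "bwd": 9.0, "allreduce_grad": 6.0, "update": 1.0,
    }
    deps = {                            # op -> set of ops it depends on
        "fwd": set(), "bwd": {"fwd"},
        "allreduce_grad": {"bwd"}, "update": {"allreduce_grad"},
    }

    # Process ops in topological order; each op starts once all of its
    # dependencies have finished, so the max finish time is the iteration time.
    finish = {}
    for op in graphlib.TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps[op]), default=0.0)
        finish[op] = start + durations[op]

    print("simulated iteration time (ms):", max(finish.values()))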

Tue 30 Aug. 17:12 - 17:30 PDT

A Tale of Two Models: Constructing Evasive Attacks on Edge Models

Wei Hao · Aahil Awatramani · Jiayang Hu · Chengzhi Mao · Pin-Chun Chen · Eyal Cidon · Asaf Cidon · Junfeng Yang

Full-precision deep learning models are typically too large or costly to deploy on edge devices. To accommodate the limited hardware resources, models are adapted to the edge using various edge-adaptation techniques, such as quantization and pruning. While such techniques may have a negligible impact on top-line accuracy, the adapted models exhibit subtle differences in output compared to the original model from which they are derived. In this paper, we introduce a new evasive attack, DIVA, that exploits these differences in edge adaptation by adding adversarial noise to input data that maximizes the output difference between the original and adapted models. Such an attack is particularly dangerous because the malicious input will trick the adapted model running on the edge, but will be virtually undetectable by the original model, which typically serves as the authoritative model version used for validation, debugging, and retraining. We compare DIVA to a state-of-the-art attack, PGD, and show that DIVA is only 1.7-3.6% worse at attacking the adapted model but 1.9-4.2 times more likely to go undetected by the original model under whitebox and semi-blackbox settings.
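
A PGD-style sketch of the attack idea is shown below: perturb the input to widen the gap between the adapted model's loss and the original model's loss, so the edge model misbehaves while the full-precision model still looks correct. The models, the naive weight-rounding used as a stand-in for edge adaptation, and the loss formulation are illustrative assumptions, not DIVA's exact method.

    # Illustrative divergence-maximizing perturbation (not DIVA's exact attack).
    import copy
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    original = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))

    adapted = copy.deepcopy(original)    # crude stand-in for edge adaptation:
    with torch.no_grad():                # round weights to a coarse grid
        for p in adapted.parameters():
            p.copy_(torch.round(p * 8) / 8)

    x = torch.randn(1, 20)
    y = torch.tensor([2])
    x_adv, eps, step = x.clone(), 0.3, 0.05

    for _ in range(20):
        x_adv.requires_grad_(True)
        # Maximize adapted-model loss while minimizing original-model loss.
        gap = (F.cross_entropy(adapted(x_adv), y)
               - F.cross_entropy(original(x_adv), y))
        grad, = torch.autograd.grad(gap, x_adv)
        with torch.no_grad():
            # Signed gradient ascent, projected into an L-infinity ball around x.
            x_adv = (x_adv + step * grad.sign()).clamp(x - eps, x + eps)

    print("original prediction:", original(x_adv).argmax().item(),
          "adapted prediction:", adapted(x_adv).argmax().item())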