Registration Desk: Registration Check-in Desk Wed 31 Aug 06:30 a.m.
Tutorial: Mert TOSLALI
Online Experimentation for Cloud Applications
The need to deliver code changes to production systems to satisfy new requirements has fueled the adoption of an agile software development practice called online experimentation. Online experimentation provides insight into the value delivered by new application versions as they are exposed to users.
To solve the online experimentation problem for web and mobile applications, practitioners use A/B tests or more advanced methods such as multi-armed bandit algorithms. These approaches entail comparing and assessing application versions online to determine the best version based on business requirements such as user engagement. However, existing techniques and their formulations do not capture the unique complexities of cloud systems.
When assessing the outcomes of releases of microservices or machine learning (ML) models in the cloud, practitioners must simultaneously consider application performance as well as business metrics. This difference arises because a cloud application’s behavior is inherently volatile due to an increased likelihood of performance bugs or variability, which can degrade desired business results. For example, Amazon reported that every 100ms of latency costs it 1% in sales. As a result of these complexities, the deployment of cloud applications is more art than science when contrasted with the approaches adopted in the web and mobile domains: practitioners lack rigorous solutions for code releases, making it difficult to automatically learn and optimize for both business metrics and application performance with statistical guarantees.
This tutorial aims to provide a new perspective to rethink online experimentation in the cloud era. The tutorial will study the field of online experimentation and popular existing approaches, address their shortcomings in the cloud, and discuss key challenges and requirements for real-world solutions. Participants will get a chance to craft and run an online experiment on an open-source system designed for online experimentation of microservices and ML models deployed on the cloud.
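A minimal sketch of the kind of experiment the tutorial targets, assuming a simple epsilon-greedy bandit over two application versions and a reward that combines a business metric with a latency penalty. The version names, simulated metrics, and reward weighting below are illustrative choices, not part of the tutorial's system:

```python
import random

# Hypothetical per-request metrics for two versions (simulated, illustrative only).
def observe(version):
    if version == "v1":  # current release
        return {"conversion": random.random() < 0.10, "latency_ms": random.gauss(120, 15)}
    return {"conversion": random.random() < 0.12, "latency_ms": random.gauss(180, 25)}  # candidate

def reward(m, latency_budget_ms=150.0):
    # Combine a business metric with a performance penalty; the 0.5 weight is an assumption.
    return float(m["conversion"]) - 0.5 * max(0.0, m["latency_ms"] - latency_budget_ms) / latency_budget_ms

versions = ["v1", "v2"]
totals = {v: 0.0 for v in versions}
counts = {v: 0 for v in versions}
epsilon = 0.1

for _ in range(10_000):
    if random.random() < epsilon or 0 in counts.values():
        v = random.choice(versions)                               # explore
    else:
        v = max(versions, key=lambda x: totals[x] / counts[x])    # exploit the best version so far
    totals[v] += reward(observe(v))
    counts[v] += 1

print({v: round(totals[v] / counts[v], 4) for v in versions}, counts)
```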

Oral: Efficient Training Wed 31 Aug 08:45 a.m.
[ Exhibit Hall A ]
Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune the optimal VM search space. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and its VM instance recommendations offer significant throughput gains and cost reductions while satisfying user constraints compared to existing solutions in complex, real-world scenarios.
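Srifty's learned performance models are not reproduced here, but the selection step the abstract describes can be sketched as constrained optimization over candidate VM configurations. The instance names, prices, and predicted throughputs below are placeholders standing in for the model's outputs:

```python
# Pick the most cost-efficient VM configuration whose predicted throughput meets a user constraint.
# The pred_samples_per_s values stand in for a learned performance model; they are made up.
candidates = [
    {"instance": "typeA.2xlarge", "count": 4, "usd_per_hr": 3.2, "pred_samples_per_s": 900},
    {"instance": "typeB.8xlarge", "count": 1, "usd_per_hr": 4.1, "pred_samples_per_s": 1100},
    {"instance": "typeA.spot",    "count": 8, "usd_per_hr": 2.0, "pred_samples_per_s": 1250},
]

def best_under_constraint(candidates, min_samples_per_s):
    feasible = [c for c in candidates if c["pred_samples_per_s"] >= min_samples_per_s]
    # Cost per sample is a natural objective when both cost and throughput matter.
    return min(feasible, key=lambda c: c["usd_per_hr"] / c["pred_samples_per_s"], default=None)

print(best_under_constraint(candidates, min_samples_per_s=1000))
```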
[ Exhibit Hall A ]
Improving the training and inference performance of graph neural networks (GNNs) is faced with a challenge uncommon in general neural networks: creating mini-batches requires a lot of computation and data movement due to the exponential growth of multi-hop graph neighborhoods along network layers. Such a unique challenge gives rise to a diverse set of system design choices. We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment, under which we identify major performance bottlenecks hitherto under-explored by developers: mini-batch preparation and transfer. We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler, a shared-memory parallelization strategy, and the pipelining of batch transfer with GPU computation. We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised. Such an observation unifies training and inference, simplifying model implementation. We report comprehensive experimental results with several benchmark data sets and GNN architectures, including a demonstration that, for the ogbn-papers100M data set, our system SALIENT achieves a speedup of 3x over a standard PyTorch-Geometric implementation with a single GPU and a further 8x parallel speedup with 16 GPUs. Therein, training a 3-layer …
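The batch-transfer pipelining described above can be approximated in plain PyTorch by prefetching the next mini-batch on a separate CUDA stream while the current one is being computed. This is a generic sketch under the assumption of pinned-memory CPU batches keyed by "x", not SALIENT's implementation:

```python
import torch

def pipelined_loop(batches, model, device="cuda"):
    """Overlap the host-to-GPU copy of batch i+1 with compute on batch i (requires CUDA)."""
    copy_stream = torch.cuda.Stream(device)
    it = iter(batches)

    def prefetch(batch):
        with torch.cuda.stream(copy_stream):
            # Assumes tensors live in pinned memory so the copy is truly asynchronous.
            return {k: v.to(device, non_blocking=True) for k, v in batch.items()}

    nxt = prefetch(next(it))
    for cpu_batch in it:
        torch.cuda.current_stream(device).wait_stream(copy_stream)  # previous copy finished
        cur, nxt = nxt, prefetch(cpu_batch)                          # launch the next copy
        out = model(cur["x"])                                        # compute overlaps the copy
    torch.cuda.current_stream(device).wait_stream(copy_stream)
    return model(nxt["x"])
```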
[ Exhibit Hall A ]
The evaluation of hyperparameters, neural architectures, or data augmentation policies becomes a critical problem in advanced deep model training with a large hyperparameter search space. In this paper, we propose Sub-Sampling (SS), an efficient and robust bandit-based algorithm for hyperparameter search evaluation, together with a modified version for high GPU usage. It evaluates the potential of hyperparameters on sub-samples of observations and is theoretically proven to be optimal under the criterion of cumulative regret. We further combine SS with Bayesian Optimization and develop a novel hyperparameter optimization algorithm called BOSS. Empirical studies validate our theoretical arguments for SS and demonstrate the superior performance of BOSS on a number of applications, including Neural Architecture Search (NAS), Data Augmentation (DA), Object Detection (OD), and Reinforcement Learning (RL).
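The SS algorithm itself is not reproduced here; the sketch below only illustrates the general flavor of ranking configurations on growing sub-samples of observations and discarding the weaker half each round. It is a successive-halving-style stand-in with a made-up scoring function, not the authors' method:

```python
import random

def cheap_eval(config, n_obs):
    """Placeholder: score a configuration on a sub-sample of n_obs observations."""
    noise = random.gauss(0, 1.0 / n_obs ** 0.5)   # more observations -> less noise
    return config["quality"] + noise              # "quality" stands in for true performance

def subsample_search(configs, start_obs=16, rounds=4):
    survivors = list(configs)
    n_obs = start_obs
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: cheap_eval(c, n_obs), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]   # keep the better half
        n_obs *= 4                                       # spend more budget on survivors
    return survivors[0]

configs = [{"id": i, "quality": random.random()} for i in range(32)]
print(subsample_search(configs))
```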
[ Exhibit Hall A ]
This paper proposes gyro dropout, a variant of dropout that improves the efficiency of training neural networks. Instead of randomly dropping out neurons in every training iteration, gyro dropout pre-selects and trains a fixed number of subnetworks. Because each subnetwork is trained more stably, the subnetworks are more diversified and thus their ensemble achieves good generalization. We further propose block-wise gyro dropout, or simply block-wise dropout, which is a GPU-friendly variant of gyro dropout. Block-wise dropout partitions hidden neurons into a number of groups that should be dropped out together throughout learning; this makes it efficient to prune the corresponding warp executions on GPUs. We evaluate the two dropout methods with seven neural networks and ten public datasets. In our evaluation, gyro dropout improves the accuracy of trained models by up to 1.93%; gyro dropout consistently achieves higher accuracy than conventional dropout in all experiments. Moreover, block-wise dropout speeds up the training of neural networks by up to 29.8% with little to no accuracy loss. Our implementation of gyro dropout is publicly available at https://github.com/mlsys-seo/gyro-dropout.
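A minimal PyTorch sketch of the mask pre-selection idea, based only on the description above: a fixed pool of dropout masks is generated once and reused, with each mask (subnetwork) trained for a stretch of iterations before switching. The pool size, switch period, and layer shape are illustrative, not the paper's settings; the reference implementation is at the URL above.

```python
import torch
import torch.nn as nn

class GyroDropoutSketch(nn.Module):
    """Dropout that reuses a fixed pool of pre-sampled masks (sketch of the idea, not the paper's code)."""
    def __init__(self, num_features, p=0.5, num_masks=8, iters_per_mask=100):
        super().__init__()
        masks = (torch.rand(num_masks, num_features) > p).float() / (1.0 - p)  # inverted-dropout scaling
        self.register_buffer("masks", masks)   # fixed subnetwork selections
        self.iters_per_mask = iters_per_mask
        self.step = 0

    def forward(self, x):
        if not self.training:
            return x
        idx = (self.step // self.iters_per_mask) % self.masks.shape[0]
        self.step += 1
        return x * self.masks[idx]              # train one pre-selected subnetwork at a time

layer = GyroDropoutSketch(num_features=256)
print(layer(torch.randn(32, 256)).shape)
```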
Invited Talk: Ryan Adams
Accelerating Engineering with Machine Learning
Advances in machine learning are having exciting impacts in a variety of domains, from neuroscience to chemistry to astronomy. Machine learning has been somewhat slow to interface with more traditional engineering domains such as mechanical, civil, and chemical engineering. Nevertheless, there is huge potential for ML to accelerate everything in the engineering design cycle: CAD, simulation, fabrication, and control. In this talk I will discuss some of the exciting work starting to happen at this interface, from deep generative modeling to differentiable physical simulation, and take a look forward at what might be possible.
Ryan Adams is a machine learning researcher and Professor of Computer Science at Princeton University. Ryan completed his Ph.D. in physics under David MacKay at the University of Cambridge, where he was a Gates Cambridge Scholar and a member of St. John's College. Following his Ph.D., Ryan spent two years as a Junior Research Fellow at the University of Toronto as a part of the Canadian Institute for Advanced Research. From 2011 to 2016, he was an Assistant Professor at Harvard University in the School of Engineering and Applied Sciences. In 2015, Ryan sold the company he co-founded, Whetlab, to Twitter, and he spent three years in industry at Twitter and Google before joining the faculty at Princeton in 2018. Ryan has won paper awards at ICML, UAI, and AISTATS, and has received the DARPA Young Faculty Award and the Alfred P. Sloan Fellowship. He also co-hosted the popular Talking Machines podcast.
Tutorial: Tushar Krishna
ASTRA-sim: Enabling SW/HW Co-Design Exploration for Distributed Deep Learning Training Platforms
Modern Deep Learning systems heavily rely on distributed training over customized high-performance accelerator (e.g., TPU, GPU)-based hardware platforms connected via high-performance interconnects (e.g., NVLink). Examples today include NVIDIA’s DGX-2, Google’s Cloud TPU, and Facebook’s Zion. Deep Neural Network (DNN) training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the accelerator endpoint.
Collective communications (e.g., all-reduce, all-to-all, reduce-scatter, all-gather) are initiated at different phases for different parallelism approaches and play a crucial role in overall runtime if they are not hidden efficiently behind compute. This problem becomes paramount as recent models for NLP (such as GPT-3) and recommendation (such as DLRM) have billions to trillions of parameters and need to be scaled across tens, hundreds, or even thousands of accelerator nodes. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex design space to (i) architect future platforms and (ii) develop novel parallelism schemes to support efficient training of future DNN models.
As an ongoing collaboration between Intel, Facebook, and Georgia Tech, we have been jointly developing a detailed cycle-accurate distributed training simulator called ASTRA-sim. ASTRA-sim models the co-design space described above and schedules the compute-communication interactions from distributed training over plug-and-play compute and network simulators. It enables a systematic study of bottlenecks at the software and hardware level for scaling training. It also enables end-to-end design-space exploration for running large DNN models over future training platforms. Papers detailing ASTRA-sim were presented at ISPASS 2020 and Hot Interconnects 2020. Currently, ASTRA-sim uses SCALE-sim (a Google TPU-like simulator) as its compute model and provides a suite of network models (analytical network, Garnet from gem5, and NS3) to go from simple analytical to detailed cycle-accurate simulation of large-scale training platforms. To the best of our knowledge, ASTRA-sim is the first open-source simulator for modeling future distributed training platforms.
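ASTRA-sim's analytical network mode is not reproduced here, but the flavor of such a model can be conveyed with the standard alpha-beta estimate for a ring all-reduce (per-hop latency alpha, per-byte time beta, payload of N bytes over p nodes). The numbers plugged in below are purely illustrative:

```python
def ring_allreduce_time(n_bytes, p, alpha_s, beta_s_per_byte):
    """Alpha-beta estimate for a ring all-reduce: 2(p-1) steps, each moving n_bytes/p."""
    steps = 2 * (p - 1)                                  # reduce-scatter + all-gather phases
    return steps * (alpha_s + (n_bytes / p) * beta_s_per_byte)

# Example: 1 GB of gradients over 16 nodes, 5 us latency, 100 GB/s links (illustrative values).
t = ring_allreduce_time(1e9, 16, alpha_s=5e-6, beta_s_per_byte=1 / 100e9)
print(f"estimated all-reduce time: {t * 1e3:.2f} ms")
```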
In this tutorial, we will educate the research community about the challenges in the emerging domain of distributed training, demonstrate the capabilities of ASTRA-sim with examples and discuss ongoing development efforts.
Oral: ML Programming Models and Abstractions & Interpretability and Explainability of ML Wed 31 Aug 01:00 p.m.
[ Exhibit Hall A ]
Graph Neural Networks (GNNs) have been widely used in various domains, but GNNs with sophisticated computational graphs lead to higher latency and larger memory consumption. Optimizing the GNN computational graph suffers from: (1) Redundant neural operator computation. The same data are propagated through the graph structure to perform the same neural operation multiple times in GNNs, leading to redundant computation which accounts for 92.4% of total operators. (2) Inconsistent thread mapping. Efficient thread mapping schemes for vertex-centric and edge-centric operators are different. This inconsistency prohibits operator fusion to reduce memory IO. (3) Excessive intermediate data. For GNN training, which is usually performed concurrently with inference, intermediate data must be stored for the backward pass, consuming 91.9% of the total memory requirement. To tackle these challenges, we propose the following designs to optimize the GNN computational graph from a novel coordinated computation, IO, and memory perspective: (1) Propagation-postponed operator reorganization. We reorganize operators to perform neural operations before the propagation, thus eliminating the redundant computation. (2) Unified thread mapping for fusion. We propose a unified thread mapping scheme for both vertex- and edge-centric operators to enable fusion and reduce IO. (3) Intermediate data recomputation. Intermediate data are recomputed during the backward pass …
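A hedged sketch of the first idea (propagation-postponed operator reorganization) on a toy message-passing layer: when the per-edge neural operation is linear, applying it once per vertex before propagation gives the same result as applying it on every edge, eliminating the redundant per-edge computation. The layer below is a generic illustration, not the paper's code:

```python
import torch

def naive_layer(adj, x, weight):
    # Per-edge transform then aggregate: W is re-applied to x[j] for every edge (j -> i).
    src, dst = adj                                  # edge list: messages flow src -> dst
    messages = x[src] @ weight                      # one transformed row per edge (redundant)
    out = x.new_zeros(x.shape[0], weight.shape[1])
    out.index_add_(0, dst, messages)
    return out

def reorganized_layer(adj, x, weight):
    # Propagation-postponed: transform each vertex once, then propagate the results.
    src, dst = adj
    h = x @ weight                                  # one transformed row per vertex
    out = h.new_zeros(x.shape[0], weight.shape[1])
    out.index_add_(0, dst, h[src])
    return out

x = torch.randn(5, 8)
w = torch.randn(8, 8)
adj = (torch.tensor([0, 1, 1, 2, 4]), torch.tensor([1, 0, 2, 3, 3]))
assert torch.allclose(naive_layer(adj, x, w), reorganized_layer(adj, x, w), atol=1e-5)
```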
[ Exhibit Hall A ]
Modern deep learning frameworks provide imperative, eager execution programming interfaces embedded in Python to provide a productive development experience. However, deep learning practitioners sometimes need to capture and transform program structure for performance optimization, visualization, analysis, and hardware integration. We study the different designs for program capture and transformation used in deep learning. By designing for typical deep learning use cases rather than long tail ones, it is possible to create a simpler framework for program capture and transformation. We apply this principle in torch.fx, a program capture and transformation library for PyTorch written entirely in Python and optimized for high developer productivity by ML practitioners. We present case studies showing how torch.fx enables workflows previously inaccessible in the PyTorch ecosystem.
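torch.fx's public entry point can illustrate the capture-and-transform workflow described above. The toy module and the relu-to-gelu swap below are just an example of a graph transformation, not a case study from the paper:

```python
import torch
import torch.fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.relu(self.linear(x)) + 1.0

gm = torch.fx.symbolic_trace(M())   # capture the program as a GraphModule
print(gm.graph)                     # inspect nodes (placeholder, call_module, call_function, ...)

# A simple transformation pass: swap relu for gelu, then regenerate the Python code.
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target is torch.relu:
        node.target = torch.nn.functional.gelu
gm.recompile()
print(gm.code)
```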
[ Exhibit Hall A ]
Machine learning (ML) models may involve decision boundaries that change over time due to updates to rules and regulations, such as in loan approvals or claims management. However, in such scenarios, it may take time for sufficient training data to accumulate in order to retrain the model to reflect the new decision boundaries. While work has been done to reinforce existing decision boundaries, very little has been done to cover the scenarios where the decision boundaries of ML models should change in order to reflect new rules. In this paper, we focus on user-provided feedback rules as a way to expedite the ML model update process, and we formally introduce the problem of pre-processing training data to edit an ML model in response to feedback rules, such that once the model is retrained on the pre-processed data, its decision boundaries align more closely with the rules. To solve this problem, we propose a novel data augmentation method, the Feedback Rule-Based Oversampling Technique (FROTE). Extensive experiments using different ML models and real-world datasets demonstrate the effectiveness of the method, in particular the benefit of augmentation and the ability to handle many feedback rules.
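FROTE itself is not reproduced here; the sketch below only shows the much simpler baseline of oversampling training rows that satisfy a feedback rule and relabeling them according to the rule, so that a retrained model shifts its boundary toward the rule. The data, rule, and copy count are hypothetical:

```python
import random

def apply_feedback_rule(rows, rule, target_label, copies=5):
    """Naive rule-based oversampling (a simplified stand-in, not FROTE).

    rows: list of dicts with feature keys and a 'label' key.
    rule: predicate over a row's features.
    """
    augmented = list(rows)
    for row in rows:
        if rule(row):
            for _ in range(copies):
                new_row = dict(row)
                new_row["label"] = target_label   # relabel per the feedback rule
                augmented.append(new_row)
    random.shuffle(augmented)
    return augmented

data = [{"income": 30_000, "label": 0}, {"income": 90_000, "label": 0}]
rule = lambda r: r["income"] > 80_000             # e.g. a new approval rule
print(len(apply_feedback_rule(data, rule, target_label=1)))
```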
[ Exhibit Hall A ]
We introduce TyXe, a Bayesian neural network library built on top of Pytorch and Pyro. Our leading design principle is to cleanly separate architecture, prior, inference and likelihood specification, allowing for a flexible workflow where users can quickly iterate over combinations of these components. In contrast to existing packages TyXe does not implement any layer classes, and instead relies on architectures defined in generic Pytorch code. TyXe then provides modular choices for canonical priors, variational guides, inference techniques, and layer selections for a Bayesian treatment of the specified architecture. Sampling tricks for variance reduction, such as local reparameterization or flipout, are implemented as effect handlers, which can be applied independently of other specifications. We showcase the ease of use of TyXe to explore Bayesian versions of popular models from various libraries: toy regression with a pure Pytorch neural network; large-scale image classification with torchvision ResNets; graph neural networks based on DGL; and Neural Radiance Fields built on top of Pytorch3D. Finally, we provide convenient abstractions for variational continual learning. In all cases the change from a deterministic to a Bayesian neural network comes with minimal modifications to existing code, offering a broad range of researchers and practitioners alike practical access …
Tutorial: Radu Marculescu · Ming Lin · Atlas Wang · Kartikeya Bhardwaj
Training-Free Approaches for Edge AI: Challenges, Opportunities and Progress
With the explosion in Big Data, it is often forgotten that much of the data nowadays is generated at the edge. Specifically, a major source of data is users’ endpoint devices like phones, smart watches, etc., that are connected to the internet, also known as the Internet-of-Things (IoT). Despite the huge success of deep learning (DL) in many areas (e.g., computer vision, natural language processing, etc.), the size and the computational complexity of the existing state-of-the-art deep models limit the deployment of DL on resource-constrained devices and its large-scale adoption in EdgeAI. Neural architecture search (NAS) (also called AutoML) techniques have been proposed to automatically design neural architectures with reduced model sizes. The networks obtained via NAS have higher prediction accuracy and significantly fewer parameters than hand-crafted networks. However, adapting existing NAS approaches to different hardware architectures is challenging due to their intensive computation and execution time requirements.
To address such issues, in this tutorial we focus on the newest and perhaps most promising breed of NAS for EdgeAI, namely approaches that are training-free and thus eminently suited for large-scale development. In particular, we plan to address a few relevant questions: What kind of system architectures can meet the AI algorithm requirements while maximizing prediction accuracy, inference speed, and energy efficiency? Can we use network science or deep learning theory to understand what kind of network architectures can achieve good performance without training individual models? Can we develop efficient approaches that enable co-optimization of deep neural network accuracy and performance on real hardware platforms?
Starting from these overarching ideas, in this tutorial we will cover both algorithmic and hardware-aware aspects of training-free model design for EdgeAI, show state-of-the-art results for relevant edge applications, and illustrate potential implications on real edge devices.
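One concrete example of a training-free signal of the kind this tutorial covers is a gradient-based saliency score computed on a randomly initialized network, in the spirit of zero-cost proxies such as SynFlow or SNIP; the exact proxies covered in the tutorial may differ, and the candidate network below is illustrative:

```python
import torch
import torch.nn as nn

def zero_cost_score(model, input_shape=(1, 3, 32, 32)):
    """Sum of |weight * grad| at initialization: a cheap, training-free ranking signal."""
    model.zero_grad()
    x = torch.randn(input_shape)
    model(x).sum().backward()
    score = 0.0
    for p in model.parameters():
        if p.grad is not None:
            score += (p * p.grad).abs().sum().item()
    return score

candidate = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
print(zero_cost_score(candidate))   # rank candidate architectures by this score, no training
```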
PRESENTERS: Radu Marculescu (The University of Texas at Austin), Ming Lin (Amazon), Atlas Wang (The University of Texas at Austin), Kartikeya Bhardwaj (ARM).

Oral: Efficient Inference and Model Serving Wed 31 Aug 02:15 p.m.
[ Exhibit Hall A ]
Recent progress in quantization techniques has demonstrated the feasibility of sub-8-bit quantization with a negligible end-to-end accuracy drop. However, today’s commodity hardware such as CPUs and GPUs is still suboptimal in executing these sub-8-bit quantized networks as its SIMD instructions only support the granularity of 8 bits or wider. This paper presents ULPPACK, a software technique to accelerate those ultra low-precision networks via effective operand packing. The key idea of ULPPACK is to pack multiple low-precision (<8 bits) operands densely into a single wide (16 bits) register and perform multiple narrow multiply-accumulate (MAC) operations with a single wide multiply. We introduce two effective packing schemes with different tradeoffs as well as optimizations to amortize the overhead of shifting and masking the output partial sum. Our evaluation of ULPPACK with a 512x512x512 GEMM kernel demonstrates substantial performance gains over state-of-the-art low-precision linear algebra libraries with a speedup of 2.1x, 1.8x, and 2.7x for 3-bit weights/activations (W3A3) over Google’s GEMMLOWP, Facebook’s QNNPACK, and an optimized bit-serial implementation, respectively. For end-to-end evaluation on PyTorch with seven 3-bit quantized convolutional neural networks (CNNs), ULPPACK achieves geomean speedups of 3.9x and 1.5x over the baseline 32-bit floating-point (FP32) and QNNPACK, respectively.
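The packing arithmetic can be illustrated with plain integers: two small activations placed in separate fields of one wide word are both multiplied by a weight in a single multiply, and the two partial products land in their own fields as long as they do not overflow into each other. This toy example uses an 8-bit field and ignores the shift/mask amortization the paper optimizes:

```python
FIELD = 8  # bits reserved per packed operand (an illustrative choice)

def pack2(a0, a1):
    """Place two low-precision activations in separate fields of one word."""
    return a0 | (a1 << FIELD)

def mac2(packed_acts, weight, acc=0):
    # One multiply produces weight*a0 in the low field and weight*a1 in the high field.
    return acc + packed_acts * weight

a0, a1, w = 5, 3, 6                     # 3-bit activations, small weight
acc = mac2(pack2(a0, a1), w)
low = acc & ((1 << FIELD) - 1)          # unpack the two partial sums
high = (acc >> FIELD) & ((1 << FIELD) - 1)
assert (low, high) == (w * a0, w * a1)
print(low, high)
```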
[ Exhibit Hall A ]
With more videos being recorded by edge sensors (cameras) and analyzed by computer-vision deep neural nets (DNNs), a new breed of video streaming systems has emerged, with the goal to compress and stream videos to remote servers in real time while preserving enough information to allow highly accurate inference by the server-side DNNs. An ideal design of the video streaming system should simultaneously meet three key requirements: (1) low latency of encoding and streaming, (2) high accuracy of server-side DNNs, and (3) low compute overheads on the camera. Unfortunately, despite many recent efforts, such a video streaming system has hitherto been elusive, especially when serving advanced vision tasks such as object detection or semantic segmentation. This paper presents AccMPEG, a new video encoding and streaming system that meets the three objectives. The key is to learn how much the encoding quality at each (16x16) macroblock can influence the server-side DNN accuracy, which we call accuracy gradients. Our insight is that these macroblock-level accuracy gradients can be inferred with sufficient precision by feeding the video frames through a cheap model. AccMPEG provides a suite of techniques that, given a new server-side DNN, can quickly create a cheap model to infer the accuracy gradients …
[ Exhibit Hall A ]
Efficient inference in a large output space is an essential yet challenging task in large-scale machine learning. Previous approaches reduce this problem to Approximate Maximum Inner Product Search (AMIPS), based on the observation that the prediction of a given model corresponds to the logit with the largest value. However, models are not perfectly accurate, and successfully retrieving the largest logit may not lead to the correct prediction. We argue that AMIPS approaches are sub-optimal because they are tailored to retrieve the class with the largest inner product rather than the correct class. Moreover, the logits generated by neural networks with large output spaces pose extra challenges for AMIPS methods in achieving a high recall rate within the computation budget of efficient inference. In this paper, we propose HALOS, which reduces inference to sub-linear computation by selectively activating a small set of output-layer neurons that are likely to correspond to the correct classes rather than to yield the largest logits. Our extensive evaluations show that HALOS matches or even outperforms the accuracy of the given models with a 21x speedup and 87% energy reduction.
[ Exhibit Hall A ]
In deep learning, embeddings are widely used to represent categorical entities such as words, apps, and movies. An embedding layer maps each entity to a unique vector, causing the layer’s memory requirement to be proportional to the number of entities. In the recommendation domain, a given category can have hundreds of thousands of entities, and its embedding layer can take gigabytes of memory. The scale of these networks makes them difficult to deploy in resource-constrained environments, such as smartphones. In this paper, we propose a novel approach for reducing the size of an embedding table while still mapping each entity to its own unique embedding. Rather than maintaining the full embedding table, we construct each entity’s embedding “on the fly” using two separate embedding tables. The first table employs hashing to force multiple entities to share an embedding. The second table contains one trainable weight per entity, allowing the model to distinguish between entities sharing the same embedding. Since these two tables are trained jointly, the network is able to learn a unique embedding per entity, helping it maintain a discriminative capability similar to that of a model with an uncompressed embedding table. We call this approach MEmCom (Multi-Embedding Compression). We …
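A minimal PyTorch sketch of the two-table construction as described above: a hashed, shared embedding scaled by a per-entity trainable weight. The hash (a simple modulus), table sizes, and dimensions are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TwoTableEmbedding(nn.Module):
    """Compressed embedding: shared hashed table times per-entity scalar (sketch of the idea)."""
    def __init__(self, num_entities, num_buckets, dim):
        super().__init__()
        self.shared = nn.Embedding(num_buckets, dim)      # many entities share each row
        self.per_entity = nn.Embedding(num_entities, 1)   # one trainable weight per entity
        self.num_buckets = num_buckets

    def forward(self, ids):
        bucket = ids % self.num_buckets                   # stand-in hash; collisions are expected
        return self.shared(bucket) * self.per_entity(ids) # scalar weight disambiguates collisions

emb = TwoTableEmbedding(num_entities=500_000, num_buckets=10_000, dim=32)
print(emb(torch.tensor([7, 123_456])).shape)              # torch.Size([2, 32])
```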
[ Exhibit Hall A ]
Today’s auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so while treating hardware details as opaque. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these vendor libraries have a fixed set of supported functions and lack the customization and automation support afforded by auto-tuners. Bolt bridges this gap and achieves the best of both worlds by using hardware-native templated search, which is enabled by the recent trend that vendor libraries (e.g., CUTLASS) are increasingly modularized and reconfigurable. Bolt provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and model levels. We demonstrate this concept by prototyping in TVM on NVIDIA GPUs, including a large deployment in our production environment. Our experiments show that Bolt can improve the inference speed of common convolutional neural networks by 2.5x on average over the state of the art, and it auto-tunes these models within 20 minutes.
[ Exhibit Hall A ]
While deep learning methods continue to improve in predictive accuracy on a wide range of application domains, significant issues remain with other aspects of their performance, including their ability to quantify uncertainty and their robustness. Recent advances in approximate Bayesian inference hold significant promise for addressing these concerns, but the computational scalability of these methods can be problematic when applied to large-scale models. In this paper, we present URSABench (the Uncertainty, Robustness, Scalability, and Accuracy Benchmark), an open-source suite of models, inference methods, tasks and benchmarking tools. URSABench supports comprehensive assessment of Bayesian deep learning models and approximate Bayesian inference methods, with a focus on classification tasks performed both on server and edge GPUs.