Hasan Genc

Please register online at this link if you would like to attend: https://forms.gle/Pd9tviuBHBno7G7F7

We present a tutorial that teaches users how to perform full-system, full-stack DNN accelerator evaluation using the Gemmini platform. Gemmini allows users to evaluate how a DNN hardware accelerator interacts with external components, like the cache hierarchy or virtual address translation scheme, to affect performance across the hardware-software-system stack.

With Gemmini, users can generate a variety of different DNN hardware accelerators, with different underlying system, SoC, and programming stack components. Users can evaluate the performance of their hardware accelerators on end-to-end workloads in a real-world system context, exposing how different system components, like the cache hierarchy, virtual address translation scheme, or operating system, impact performance in subtle but noticeable ways. Gemmini also allows users to program their applications at different “levels” of the programming stack, from high-level model compilation to low-level direct machine configuration. Overall, Gemmini enables users to explore and evaluate a variety of different DNN accelerator and system configurations, exposing how these different parameters interact to impact end-to-end performance and efficiency.

Gemmini has been presented previously at DAC 2021, where it won the Best Paper award, as well as at an IISWC 2021 tutorial.

Patel Dhaval

This tutorial presents the design and implementation of a scikit-compatible system for detecting anomalies in time series data. The system offers a broad range of algorithms to the end user, with a special focus on unsupervised and semi-supervised learning. Given an input time series, we discuss how data scientists can construct four categories of anomaly pipelines, followed by an enrichment module that helps to label anomalies. The tutorial provides hands-on experience with a system deployed on IBM API Hub for developer communities. The system aims to support a wide range of execution engines to meet the diverse needs of anomaly workloads, such as Serverless for CPU-intensive work, GPU for deep-learning model training, etc.
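To make the "scikit-compatible" idea concrete, here is a minimal sketch of an unsupervised detector that follows the familiar fit/predict convention. The class name `RollingZScoreDetector`, its parameters, and the rolling z-score rule are all hypothetical stand-ins chosen for illustration; they are not one of the tutorial's four pipeline categories or part of the deployed system.

```python
from statistics import mean, stdev


class RollingZScoreDetector:
    """Hypothetical scikit-style detector (illustration only).

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the preceding `window` points.
    """

    def __init__(self, window=5, threshold=3.0):
        self.window = window
        self.threshold = threshold

    def fit(self, series):
        # Unsupervised: nothing is learned from labels. The method exists
        # only so the object can be used like a scikit-style estimator.
        return self

    def predict(self, series):
        # Returns 1 for normal points and -1 for anomalies, mirroring the
        # label convention of scikit-learn outlier detectors.
        labels = []
        for i, value in enumerate(series):
            history = series[max(0, i - self.window):i]
            if len(history) < 2:
                labels.append(1)  # not enough context to judge
                continue
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                is_anomaly = value != mu
            else:
                is_anomaly = abs(value - mu) / sigma > self.threshold
            labels.append(-1 if is_anomaly else 1)
        return labels


series = [10, 11, 10, 11, 10, 50, 10, 11]
labels = RollingZScoreDetector(window=5, threshold=3.0).fit(series).predict(series)
```

Because the detector exposes only `fit` and `predict`, it could in principle be swapped for any other estimator with the same interface, which is the main appeal of a scikit-style design.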

Burak Aksar

This tutorial aims to introduce the audience to ML-based telemetry analytics for large-scale computing systems to improve system performance, resilience, and power efficiency. Modern large-scale computing systems (e.g., data centers and High-Performance Computing clusters) are highly parallel systems that perform numerous complex operations concurrently, and they are critical for many societal and scientific applications. These complex systems support higher degrees of parallelism, which often leads to significant resource contention and eventually to performance variability and loss of efficiency. One way to assess system performance and identify the root causes of problems is by gathering and inspecting telemetry data. Such telemetry (from hundreds or thousands of hardware and software sensors) and log data are readily available on any computer system today. As this system data contains billions of data points per day, manual analysis is impractical and has limited benefits. Given these limitations, ML is emerging as a promising approach to automate performance analytics. Computer system telemetry analytics is also a challenging application area with many open problems, since labeled data is scarce, whereas unlabeled data can reach the scale of terabytes per day.

The goal of this tutorial is twofold. First, the tutorial provides …

Mert Hidayetoglu · Jinjun Xiong · Wen-Mei Hwu · Rakesh Nagi · Vikram Sharma Mailthody · Jeff Pool · Sitao Huang

A plethora of ML models are either sparsified (such as deep neural networks) to save memory footprint and FLOPs or inherently sparse due to their unstructured nature (such as graph neural networks). Nevertheless, even though sparsity is desirable in theory, it often hampers performance in practice because existing heterogeneous systems (such as GPUs and FPGAs) fall short on irregular computations. For example, as GPU architectures are optimized for regular, dense computations, only a tiny portion of the theoretical GPU performance is realized when performing sparse computation. In this tutorial, we discuss the sources of sparsity in deep neural networks as well as key techniques for mapping sparse computation onto heterogeneous systems to support high-performance inference and training. We will conclude this tutorial with a discussion of future work on model parallelism for optimizing sparse communications for large-scale sparse ML models.
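As a small illustration of why sparse computation is irregular, the sketch below multiplies a matrix stored in compressed sparse row (CSR) form by a dense vector. CSR is one common sparse layout and is used here as an assumption; the abstract does not say which formats or kernels the tutorial covers, and the function names are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_indices.append(j)
        # row_ptr[r + 1] marks where row r ends inside `values`.
        row_ptr.append(len(values))
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, vector):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    result = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # Data-dependent gather: which element of `vector` is read
            # depends on the sparsity pattern, not on the loop index.
            acc += values[k] * vector[col_indices[k]]
        result.append(acc)
    return result


A = [[0, 2, 0],
     [1, 0, 3],
     [0, 0, 0]]
values, col_indices, row_ptr = dense_to_csr(A)
y = csr_matvec(values, col_indices, row_ptr, [1, 2, 3])
```

The rows have different numbers of nonzeros and the reads from `vector` follow `col_indices`, so both the work per row and the memory access pattern vary with the data. That variation is what makes such kernels hard to map onto hardware tuned for regular, dense computation.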


The need to deliver code changes to production systems to satisfy new requirements has fueled the adoption of an agile software development practice called online experimentation. Online experimentation provides insight into the value delivered by new application versions as they are exposed to users.

To solve the online experimentation problem for web and mobile applications, practitioners use A/B tests or more advanced methods such as multi-armed bandit algorithms. These approaches entail comparing and assessing application versions online to determine the best version based on business requirements such as user engagement. However, existing techniques and their formulations do not capture the unique complexities of cloud systems.
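For readers unfamiliar with bandit-based experimentation, the following is a minimal sketch of epsilon-greedy, one of the simplest multi-armed bandit strategies. It is chosen here purely as an illustration; the abstract does not say which bandit algorithms the tutorial uses. The class name, the `reward` signal (for example, 1.0 for an engaged user), and all parameters are hypothetical.

```python
import random


class EpsilonGreedy:
    """Hypothetical epsilon-greedy selector over application versions."""

    def __init__(self, n_versions, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_versions   # times each version was served
        self.means = [0.0] * n_versions  # running mean reward per version
        self.rng = random.Random(seed)

    def select(self):
        # Explore a random version with probability epsilon; otherwise
        # exploit the version with the best observed mean reward.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.means)), key=self.means.__getitem__)

    def update(self, version, reward):
        # Incremental update of the running mean for the served version.
        self.counts[version] += 1
        n = self.counts[version]
        self.means[version] += (reward - self.means[version]) / n


bandit = EpsilonGreedy(n_versions=2, epsilon=0.0)
bandit.update(1, 1.0)  # version 1 engaged a user
bandit.update(1, 0.0)  # version 1 did not engage the next user
best = bandit.select()
```

Unlike a fixed-split A/B test, the selector shifts traffic toward the better-performing version as evidence accumulates. Note that this sketch optimizes a single reward and so does not address the combined performance and business criteria discussed for cloud systems.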

When assessing the outcomes of releases of microservices or machine learning (ML) models in the cloud, practitioners must simultaneously consider application performance as well as business metrics. This difference arises because a cloud application’s behavior is inherently volatile due to an increased likelihood of performance bugs or variability, which can degrade desired business results. For example, Amazon reported that every 100ms of latency costs them 1% in sales. As a result of these complexities, the deployment of cloud applications is more art than science when contrasted with the approaches adopted in the web and mobile domains. However, …

Tushar Krishna

Modern Deep Learning systems heavily rely on distributed training over customized high-performance accelerator (e.g., TPU, GPU)-based hardware platforms connected via high-performance interconnects (e.g., NVLink). Examples today include NVIDIA’s DGX-2, Google’s Cloud TPU and Facebook’s Zion. Deep Neural Network (DNN) training involves a complex interplay between the DNN model architecture, parallelization strategy, scheduling strategy, collective communication algorithm, network topology, and the accelerator endpoint, as shown in the figure above.

Collective communications (e.g., all-reduce, all-to-all, reduce-scatter, all-gather) are initiated at different phases for different parallelism approaches and play a crucial role in overall runtime if not hidden efficiently behind compute. This problem becomes paramount as recent models for NLP, such as GPT-3, and for recommendation, such as DLRM, have billions to trillions of parameters and need to be scaled across tens to thousands of accelerator nodes. As innovation in AI/ML models continues to grow at an accelerated rate, there is a need for a comprehensive methodology to understand and navigate this complex design space to (i) architect future platforms and (ii) develop novel parallelism schemes to support efficient training of future DNN models.
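To show what an all-reduce collective computes and how it decomposes into the reduce-scatter and all-gather phases named above, here is a single-process simulation of the well-known ring algorithm. The ring schedule is an assumption chosen for illustration; real systems select among several algorithms depending on topology. The function name is hypothetical, and each worker is assumed to hold exactly one chunk per worker.

```python
def ring_all_reduce(worker_data):
    """Simulate a ring all-reduce (sum) over n workers.

    worker_data[i] is worker i's list of n chunks (one number per chunk
    for simplicity). Returns the per-worker data after the collective.
    """
    n = len(worker_data)
    data = [list(chunks) for chunks in worker_data]  # do not mutate input

    # Phase 1: reduce-scatter. In each step every worker sends one chunk
    # to its right neighbor, which adds it to its own copy. After n - 1
    # steps worker i holds the fully reduced chunk (i + 1) % n.
    for step in range(n - 1):
        messages = [((i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(messages):
            data[(i + 1) % n][chunk] += value

    # Phase 2: all-gather. Fully reduced chunks circulate around the ring,
    # overwriting the partial sums held by each receiver.
    for step in range(n - 1):
        messages = [((i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for i, (chunk, value) in enumerate(messages):
            data[(i + 1) % n][chunk] = value

    return data


result = ring_all_reduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each worker sends 2(n - 1) chunks in total, one per step, and only ever talks to its ring neighbor. The sends within a step are gathered into `messages` before being applied so that the simulation mimics simultaneous communication.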

As an ongoing collaboration between Intel, Facebook and Georgia Tech, we have been jointly developing a detailed cycle- …

Radu Marculescu · Ming Lin · Atlas Wang · Kartikeya Bhardwaj

With the explosion in Big Data, it is often forgotten that much of the data nowadays is generated at the edge. Specifically, a major source of data is users’ endpoint devices, such as phones and smart watches, that are connected to the internet, also known as the Internet-of-Things (IoT). Despite the huge success of deep learning (DL) in many areas (e.g., computer vision and natural language processing), the size and computational complexity of existing state-of-the-art deep models limit the deployment of DL on resource-constrained devices and its large-scale adoption in EdgeAI. Neural architecture search (NAS) (also called AutoML) techniques have been proposed to automatically design neural architectures with reduced model sizes. The networks obtained via NAS have higher prediction accuracy and significantly fewer parameters than hand-crafted networks. However, adapting existing NAS approaches to different hardware architectures is challenging due to their intensive computation and execution time requirements.

To address such issues, in this tutorial, we focus on the newest and perhaps the most promising breed of NAS for EdgeAI, namely approaches that are training-free and thus eminently suited for large-scale development. In particular, we plan to address a few relevant questions: What kind of system architectures can …