Registration Desk Wed 7 Jun 07:30 a.m.
Poster: Compilers Wed 7 Jun 09:00 a.m.

Abstract
Pipelining between data loading and computation is a critical tensor program optimization for GPUs. In order to unleash the high performance of latest GPUs, we must perform a synergetic optimization of multi-stage pipelining across the multi-level buffer hierarchy of GPU. Existing frameworks rely on hand-written libraries such as cuBLAS to perform pipelining optimization, which is inextensible to new operators and un-composable with prior tensor compiler optimizations. This paper presents ALCOP, the first framework that is compiler-native and fully supports multi-stage multi-level pipelining. ALCOP overcomes three critical obstacles in generating code for pipelining: detection of pipelining-applicable buffers, program transformation for multi-level multi-stage pipelining, and efficient schedule parameter search by incorporating static analysis. Experiments show that ALCOP can generate programs with 1.23× speedup on average (up to 1.73×) over vanilla TVM. On end-to-end models, ALCOP can improve upon TVM by up to 1.18×, and XLA by up to 1.64×. Besides, our performance model significantly improves the efficiency of the schedule tuning process and can find schedules with 99% of the performance given by exhaustive search while costing 40× fewer trials.

Abstract
As emerging applications are rapidly moving to accelerators, a greatdeal of research has been proposed to improve the performance of the accelerators. For the AI applications, fruitful software-driven research has been focused on proposing new programming languages, new kernel fusion heuristics,new optimization tuning approaches, and new software execution engines. However, how to leverage classical compiler optimizations to generate efficient code is an overlooked aspect of performance. In this paper, we propose a whole-program analysis and optimization compiler framework, SIRIUS, to uniformly model the host and kernel computations in a unified polyhedral representation and,further, seek maximal fusion opportunities from the global view so that the fused kernel can benefit from classical optimizations. Evaluations over representative DNN models demonstrate that SIRIUS can achieve up to 11.98x speedup over TensorRT, and 154.84x speedup over TensorFlow. In particular, for BERT, SIRIUS can achieve 1.46x speedup over TensorRT.

Abstract
Tensor graph superoptimisation systems perform a sequence of subgraph substitution to neural networks, to find the optimal computation graph structure. Such a graph transformation process naturally falls into the framework of sequential decision-making, and existing systems typically employ a greedy search approach, which cannot explore the whole search space as it cannot tolerate a temporary loss of performance. In this paper, we address the tensor graph superoptimisation problem by exploring an alternative search approach, reinforcement learning (RL). Our proposed approach, X-RLflow, can learn to perform neural network dataflow graph rewriting, which substitutes a subgraph one at a time. X-RLflow is based on a model-free RL agent that uses a graph neural network (GNN) to encode the target computation graph and outputs a transformed computation graph iteratively. We show that our approach can outperform state-of-the-art superoptimisation systems over a range of deep learning models and achieve by up to 40% on those that are based on transformer-style architectures. We show that our approach can outperform state-of-the-art superoptimisation systems over a range of deep learning models and achieve by up to 40% on those that are based on transformer-style architectures.
Poster: Emerging Models and Domains Wed 7 Jun 10:30 a.m.

Abstract
Real-time multi-task multi-model (MTMM) workloads, a new form of deep learning inference workloads, are emerging for applications areas like extended reality (XR) to support metaverse use cases. These workloads combine user interactivity with computationally complex machine learning (ML) activities. Compared to standard ML applications, these ML workloads present unique difficulties and constraints. Real-time MTMM workloads impose heterogeneity and concurrency requirements on future ML systems and devices, necessitating the development of new capabilities. This paper begins with a discussion of the various characteristics of these real-time MTMM ML workloads and presents an ontology for evaluating the performance of future ML hardware for XR systems. Next, we present XRBENCH, a collection of MTMM ML tasks, models, and usage scenarios that execute these models in three representative ways: cascaded, concurrent, and cascaded-concurrency for XR use cases. Finally, we emphasize the need for new metrics that capture the requirements properly. We hope that our work will stimulate research and lead to the development of a new generation of ML systems for XR use cases. XRBench is available as an open-source project: https://github.com/XRBench

Abstract
The goal of Extreme Multi-label Classification (XC) is to learn representations that enable mapping input texts to the most relevant subset of labels selected from an extremely large label set, potentially in hundreds of millions. Given the extreme scale, conventional wisdom believes it is infeasible to train an XC model in an end-to-end manner. Thus, for training efficiency, several modular and sampling-based approaches to XC training have been proposed in the literature. In this paper, we identify challenges in the end-to-end training of XC models and devise novel optimizations that improve training speed over an order of magnitude, making end-to-end XC model training practical. Furthermore, we show that our end-to-end trained model, Renee, delivers state-of-the-art accuracy in a wide variety of XC benchmark datasets. Renee code will be released publicly.

Abstract
Hypergraph Neural Network (HyperGNN) is an emerging type of Graph Neural Networks (GNNs) that can utilize hyperedges to model high-order relationships among vertices. Current GNN frameworks fail to fuse two message-passing steps from vertices to hyperedges and hyperedges to vertices, leading to high latency and redundant memory consumption. The following challenges need to be solved for efficient fusion in HyperGNNs: (1) Inefficient partition: hardware-efficient and workload-balanced partitions are required for parallel workers to process two consecutive message passing steps after fusion. (2) Workload-Agnostic Format: current data formats like Compressed Sparse Row (CSR) fail to represent a two-step computation workload. (3) Heavy writing conflicts: partitioning leads to heavy writing conflicts when updating the same vertex.To enable efficient fusion for HyperGNNs, we present HyperGef. HyperGef proposes an edge-split workload balance partition scheme to achieve higher efficiency and better workload balancing. To represent the workload after fusion and partition, HyperGef introduces a novel fusion workload-aware format. HyperGef also introduces a shared memory-aware grouping scheme to reduce writing conflicts. Extensive experiments demonstrate that our fused kernel outperforms the NVIDIA cuSPARSE kernel by 3.31x. By enabling efficient fusion for HyperGNNs, HyperGef achieves 2.25x to 3.99x end-to-end speedup on various HyperGNN models compared with state-of-the-art frameworks …
Poster: Sparsity 2: Systems Wed 7 Jun 01:30 p.m.

Abstract
N:M sparsity is becoming increasingly popular due to its promise to achieve both high model accuracy and computational efficiency for deep learning. However, the real-world benefit of N:M sparsity is limited as there is a lack of dedicated GPU kernel implementations for general N:M sparsity with various sparsity ratios. In this work, we present nmSPARSE, a library of efficient GPU kernels for two fundamental operations in neural networks with N:M sparse weights: sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpMM). By leveraging the intrinsic balance characteristic of N:M sparsity, nmSPARSE kernels rearrange irregular computation and scattered memory accesses in sparse matrix multiplication into hardware-aligned regular computation and conflict-free memory accesses at runtime. Evaluated on NVIDIA A100 GPU, nmSPARSE kernels achieve up to 5.2× speedup on SpMV and 6.0× speedup on SpMM over the fastest baseline. End-to-end studies on transformer models demonstrate that using nmSPARSE outperforms other baselines.
Abstract
This paper introduces a Unified Convolution Framework (UCF) that incorporates various existing sparse convolutions in a unified abstraction. This work is in contrast to the common library-based approach that requires much engineering effort because each different sparse convolution must be implemented separately. Instead, it employs a tensor compiler approach that can flexibly explore convolutions with various program transformations; however, no compiler can currently support various sparse convolutions flexibly to our knowledge. In particular, the Tensor Algebra Compiler (TACO) can support a variety of sparse formats but cannot declare convolutions because a tensor cannot be accessed by a linear combination of index variables. We extend TACO's Einsum language to support an affine index expression to declare a convolution. Our method is also compatible with TACO's format and scheduling language, enabling various sparse convolution implementations to be explored. Our experimental results demonstrate that TACO-UCF achieves 1.32× and 8.3× average speedups on a filter sparse convolution and a submanifold sparse convolution, respectively, over state-of-the-art libraries on CPU. TACO-UCF on GPU outperforms the state-of-the-art GPU library on filter sparse convolution of ResNet50 by an average of 1.47× at 80% sparsity. We also demonstrate TACO-UCF outperforms on a neighbor retrieval of a submanifold sparse convolution …

Abstract
Sparse convolution is the key operator in widely-used 3D point cloud networks. However, due to the high sparsity of voxelized input point cloud data, three main challenges need to be solved for efficient sparse convolution in current 3D point cloud engines: (1) Memory under-utilization: the mapping information from input data to weight parameters of 3D point cloud networks is sparse, leading to up to 79.97% redundant memory access and under-utilized memory space; (2) Computation under-utilization: previous FGMS (Fused Gather-Matrix-Multiplication-Scatter) operations in sparse convolution are executed sequentially, leading to a GPU computation utilization of only 22.84%; (3) Input dynamics: a single and static dataflow in the current point cloud engines cannot always achieve the best performance on different input point cloud data.To tackle these challenges, we propose PCEngine, an efficient sparse convolution engine for voxel-based 3D point cloud networks. PCEngine proposes a novel coded-CSR (Compress Sparse Row) format to represent the mapping information without redundancy. PCEngine also introduces the indicator-assisted segmented FGMS fusion scheme to fully utilize the computation resources on GPU hardware. PCEngine further deploys a heuristic adaptive dataflow for input dynamics. Extensive experimental results show that, PCEngine achieves 1.81× and 1.64× speedup on average for sparse convolution operation and …

Abstract
This paper presents a new algorithm-hardware co-optimization approach that maximizes memory bandwidth utilization even for the pruned deep neural network (DNN) models.Targeting the well-known model compression approaches, for the first time, we carefully investigate the memory interface overheads caused by the irregular data accessing patterns.Then, the sparsity-aware memory interface architecture is newly developed to regularly access all the data of pruned-DNN models stored with the state-of-the-art XORNet compression.Moreover, we introduce the novel stacked XORNet solution for minimizing the number of data imbalances, remarkably relaxing the interface costs without slowing the effective memory bandwidth.As a result, experimental results show that our co-optimized interface architecture can achieve almost the ideal model-accessing speed with reasonable hardware overheads, successfully allowing the high-speed pruned-DNN inference scenarios.
Poster: Storage, Scheduling, and Networking Wed 7 Jun 03:20 p.m.

Abstract
Sharding a large machine learning model across multiple devices to balance the costs is important in distributed training. This is challenging because partitioning is NP-hard, and estimating the costs accurately and efficiently is difficult. In this work, we explore a "pre-train, and search" paradigm for efficient sharding. The idea is to pre-train a universal and once-for-all neural network to predict the costs of all the possible shards, which serves as an efficient sharding simulator. Built upon this pre-trained cost model, we then perform an online search to identify the best sharding plans given any specific sharding task. We instantiate this idea in deep learning recommendation models (DLRMs) and propose NeuroShard for embedding table sharding. NeuroShard pre-trains neural cost models on augmented tables to cover various sharding scenarios. Then it identifies the best column-wise and table-wise sharding plans with beam search and greedy grid search, respectively. Experiments show that NeuroShard significantly and consistently outperforms the state-of-the-art on the benchmark sharding dataset, achieving up to 23.8% improvement. When deployed in an ultra-large production DLRM with multi-terabyte embedding tables, NeuroShard achieves 11.6% improvement in embedding costs over the state-of-the-art, which translates to 6.6% end-to-end training throughput improvement. To facilitate future research of the …

Abstract
In this paper, we identify that modern GPUs - the key platform for developing neural networks - are being severely underutilized, with ∼ 50% utilization, that further drops as GPUs get faster. We show that state-of-the-art training techniques that employ operator fusion and larger mini-batch size to improve GPU utilization are limited by memory and do not scale with the size and number of models. Additionally, we show that using state-of-the art data swapping techniques (between GPU and host memory) to address GPU memory limitations lead to massive computation stalls as network sizes grow.We introduce μ-two, a novel compiler that maximizes GPU utilization. At the core of μ-two is an approach that leverages selective data swapping from GPU to host memory only when absolutely necessary, and maximally overlaps data movement with independent computation operations such that GPUs never have to wait for data. By collecting accurate run-time statistics and data dependencies, μ-two automatically fuses operators across different models, and precisely schedules data movement and computation operations to enable concurrent training of multiple models with minimum stall time. We show how to generate μ-two schedules for diverse neural network and GPU architectures and integrate μ-two into the PyTorch framework. Our experiments …
Abstract
Distributed training technologies have advanced rapidly in the past few years and have unlocked unprecedented scalability with increasingly complex solutions. These technologies have made distributed training much more efficient and accessible, though they impose specific constraints on the training paradigm or the model structure. As a result, applications that fail to meet these constraints must rely on general-purpose distributed computing frameworks to scale out. However, without access to the internal components of deep learning frameworks, these distributed computing frameworks usually significantly fall short in terms of efficiency and usability. To address these problems, we propose PyTorch RPC as a generic and high-performance solution for distributed deep learning. Compared to generic distributed computing frameworks, PyTorch RPC natively provides essential features for implementing training applications in a distributed environment, including optimized tensor communications, remote memory management, and distributed autograd. Evaluations show that PyTorch RPC attains up to two orders of magnitude faster tensor communication compared to gRPC with one-tenth of the user code. Case studies further demonstrate that users can easily employ PyTorch RPC to build efficient reinforcement learning applications (video game solver), implement large language models (GPT3), train recommendation models (DLRM), and scale federated learning tasks (FedML).

Abstract
We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features’ values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48×, 1.79×, and 3.71×, respectively, in an industry-scale DLRM training system.
Poster: Edge Wed 7 Jun 04:40 p.m.
Abstract
In the domain of computer vision, transformer models have shown noteworthy success, prompting extensive research on optimizing their inference, particularly concerning their deployment on edge devices. While quantization has emerged as a viable solution for enabling energy efficiency in Convolutional Neural Networks (CNNs), achieving direct quantization of complex activation and normalization operators in transformer models proves to be a challenging task. Existing methods that rely on 64-bit integers often suffer from data truncation issues when deployed to energy-constrained edge devices, resulting in a significant loss of model accuracy. In this paper, we propose a range-constrained quantization technique for activation and normalization operators in transformers that addresses the dilemma between data range and precision. Our approach is the first 32-bit integer-based edge kernel implementation for vision transformers with post-training integer-only quantization, ensuring both efficiency and accuracy. Experimental results demonstrate a remarkable 5 times kernel speedup when deployed on two different ARM CPUs, with negligible accuracy loss in comparison to full-precision vision transformers. This innovative work is poised to significantly impact the deployment of transformer models on energy-efficient edge devices.
Abstract
Edge Impulse is a cloud-based machine learning operations (MLOps) platform for developing embedded and edge ML (TinyML) systems that can be deployed to a wide range of hardware targets. Current TinyML workflows are plagued by fragmented software stacks and heterogeneous deployment hardware, making ML model optimizations difficult and unportable. We present Edge Impulse, a practical MLOps platform for developing TinyML systems at scale. Edge Impulse addresses these challenges and streamlines the TinyML design cycle by supporting various software and hardware optimizations to create an extensible and portable software stack for a multitude of embedded systems. As of Oct. 2022, Edge Impulse hosts 118,185 projects from 50,953 developers.

Abstract
A growing number of applications depend on Machine Learning (ML) functionality and benefits from both higher quality ML predictions and better timeliness (latency) at the same time. A growing body of research in computer architecture, ML, and systems software literature focuses on reaching better latency/accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy, while preserving delivered accuracy. All of them, however, yield improvements for a single static point in the latency/accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler SushiSched controlling which SubNets to serve and what to cache in real-time. Combined, they are vertically integrated into SUSHI—an inference serving stack. For the stream of …