Moderator: Gennady Pekhimenko
Pratik Fegade · Tianqi Chen · Phillip Gibbons · Todd Mowry
There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then offload the computations to optimized kernels for dense tensor algebra. Such techniques can, however, lead to a lot of wasted computation and therefore, a loss in performance. This paper presents CoRa, a tensor compiler that allows users to easily generate efficient code for ragged tensor operators targeting a wide range of CPUs and GPUs. Evaluating CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model, we find that CoRa (i) performs competitively with hand-optimized implementations of the operators and the transformer encoder and (ii) achieves a 1.6 geomean speedup over PyTorch for the encoder on an Nvidia GPU and a 1.37 geomean speedup over TensorFlow for the multi-head attention module used in transformers on a 64-core ARM CPU.
Jie Zhao · Xiong Gao · Ruijie Xia · Zhaochuang Zhang · Deshi Chen · Lei Chen · Renwei Zhang · Zhen Geng · Bin Cheng · Xuefeng Jin
We study fusion for deep neural networks (DNNs) in a just-in-time (JIT) compilation framework Apollo. It considers both memory- and compute-bound tensor operators for fusion, and integrates graph-level node grouping and operator-level loop fusion closely, widening the fusion search space. Apollo enables the upward feedback from the downstream loop optimizer, enforcing the graph engine to regenerate partition patterns amenable to the downstream pass and thus resolving the scalability issue. Besides data locality, Apollo also exploits the parallelism between independent tensor operators, further improving the performance of DNN workloads. Experimental results on training workloads show that Apollo outperforms TensorFlow and XLA by 1.86× and 1.37× on a single GPU, and 1.96× and 1.18× on multiple GPUs. Apollo also improves the performance of a vendor-provided DNN framework by 19.7% on a domain-specific accelerator. In addition, the results of inference workloads demonstrate the general applicability of our fusion framework.
Bojian Zheng · Ziheng Jiang · Cody Hao Yu · Haichen Shen · Joshua Fromm · Yizhi Liu · Yida Wang · Luis Ceze · Tianqi Chen · Gennady Pekhimenko
Achieving high performance for compute-intensive operators in machine learning (ML) workloads is a crucial but challenging task. Many ML and system practitioners rely on vendor libraries or auto-schedulers to do the job. While the former requires large engineering efforts, the latter only supports static-shape workloads in existing works. It is difficult, if not impractical, to apply existing auto-schedulers directly to dynamic-shape workloads, as this leads to extremely long auto-scheduling time.We observe that the key challenge faced by existing auto-schedulers when handling a dynamic-shape workload is that they cannot construct a unified search space for all the possible shapes of the workload, because their search space is shape-dependent. To address this, we propose DietCode, a new auto-scheduler framework that efficiently supports dynamic-shape workloads by constructing a shape-generic search space and cost model. Under this construction, all shapes jointly search within the same space and update the same cost model when auto-scheduling, which is therefore more efficient compared with existing auto-schedulers.We evaluate DietCode using state-of-the-art machine learning workloads on a modern GPU. Our evaluation shows that DietCode has the following key strengths when auto-scheduling an entire model end-to-end: (1) reduces the auto-scheduling time by up to 5.88x less than the state-of-the-art auto-scheduler on the uniformly sampled dynamic shapes (94.1x estimated if all possible shapes are included), (2) improves performance by up to 69.5% better than the auto-scheduler and 18.6% better than the vendor library. All these advantages make DietCode an efficient and practical solution for dynamic-shape workloads.
Shurui Li · Puneet Gupta
Applications of neural networks on edge systems have proliferated in recent years but the ever increasing model size makes neural networks not able to deploy on resource-constrained microcontrollers efficiently. We propose bit-serial weight pools, an end-to-end framework that includes network compression and acceleration of arbitrary sub-byte precision. The framework can achieve up to 8x compression compared to 8-bit networks by sharing a pool of weights across the entire network. We further propose a bit-serial lookup based software implementation that allows runtime-bitwidth tradeoff and is able to achieve more than 2.8x speedup and 7.5x storage compression compared to 8-bit networks, with less than 1% accuracy drop.