Efficient Inference and Model Serving

Exhibit Hall A

Moderator: Jinjun Xiong

Wed 31 Aug 2:15 p.m. PDT — 4:03 p.m. PDT


Wed 31 Aug. 14:15 - 14:33 PDT

ULPPACK: Fast Sub-8-bit Matrix Multiply on Commodity SIMD Hardware

Jaeyeon Won · Jeyeon Si · Sam Son · Tae Jun Ham · Jae W. Lee

Recent progress in quantization techniques has demonstrated the feasibility of sub-8-bit quantization with a negligible end-to-end accuracy drop. However, today’s commodity hardware such as CPUs and GPUs is still suboptimal in executing these sub-8-bit quantized networks as its SIMD instructions only support the granularity of 8 bits or wider. This paper presents ULPPACK, a software technique to accelerate those ultra low-precision networks via effective operand packing. The key idea of ULPPACK is to pack multiple low-precision (<8 bits) operands densely into a single wide (16 bits) register and perform multiple narrow multiply-accumulate (MAC) operations with a single wide multiply. We introduce two effective packing schemes with different tradeoffs as well as optimizations to amortize the overhead of shifting and masking the output partial sum. Our evaluation of ULPPACK with a 512x512x512 GEMM kernel demonstrates substantial performance gains over state-of-the-art low-precision linear algebra libraries with a speedup of 2.1x, 1.8x, and 2.7x for 3-bit weights/activations (W3A3) over Google’s GEMMLOWP, Facebook’s QNNPACK, and an optimized bit-serial implementation, respectively. For end-to-end evaluation on PyTorch with seven 3-bit quantized convolutional neural networks (CNNs), ULPPACK achieves geomean speedups of 3.9x and 1.5x over the baseline 32-bit floating-point (FP32) and QNNPACK, respectively.
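The operand-packing idea in the abstract above can be illustrated with a small sketch. It is not the ULPPACK implementation: it assumes unsigned 3-bit operands, two operands per 16-bit word, and an 8-bit field width, and the helper names (`pack_lhs`, `pack_rhs`, `packed_dot2`, `packed_dot`) are hypothetical. Packing one side in reversed order makes the matching products land in the upper byte of the 16-bit result, so one wide multiply yields a two-element dot product.

```python
MASK16 = 0xFFFF  # emulate a 16-bit register in plain Python integers
SHIFT = 8        # assumed field width; each operand sits in its own byte
BITS = 3         # assumed operand precision (unsigned, values 0..7)


def pack_lhs(a0, a1):
    """Pack two operands as a0 in the low byte and a1 in the high byte."""
    return (a0 | (a1 << SHIFT)) & MASK16


def pack_rhs(b0, b1):
    """Pack in reversed order so a0*b0 and a1*b1 meet in the same field."""
    return (b1 | (b0 << SHIFT)) & MASK16


def packed_dot2(a0, a1, b0, b1):
    """Compute a0*b0 + a1*b1 with a single 16-bit multiply.

    The product is a0*b1 + (a0*b0 + a1*b1) * 2**8 + a1*b0 * 2**16.
    The last term falls outside 16 bits and is truncated away; the first
    term is at most 49 for 3-bit inputs, so it never carries into the
    upper byte, which therefore holds the dot product.
    """
    prod = (pack_lhs(a0, a1) * pack_rhs(b0, b1)) & MASK16
    return prod >> SHIFT


def packed_dot(a, b):
    """Dot product of two equal-length vectors of unsigned 3-bit values."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if any(not 0 <= x < (1 << BITS) for x in list(a) + list(b)):
        raise ValueError("operands must be unsigned 3-bit values")
    total = 0
    for i in range(0, len(a), 2):
        # Pad an odd-length tail with zeros.
        a1 = a[i + 1] if i + 1 < len(a) else 0
        b1 = b[i + 1] if i + 1 < len(b) else 0
        total += packed_dot2(a[i], a1, b[i], b1)
    return total
```

In this sketch the partial sum is shifted out after every multiply. The abstract mentions optimizations that amortize that shifting and masking overhead, which this example does not attempt to reproduce.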

Wed 31 Aug. 14:33 - 14:51 PDT

AccMPEG: Optimizing Video Encoding for Accurate Video Analytics

Kuntai Du · Qizheng Zhang · Anton Arapin · Haodong Wang · Zhengxu Xia · Junchen Jiang

With more videos being recorded by edge sensors (cameras) and analyzed by computer-vision deep neural nets (DNNs), a new breed of video streaming systems has emerged, with the goal of compressing and streaming videos to remote servers in real time while preserving enough information to allow highly accurate inference by the server-side DNNs. An ideal design of the video streaming system should simultaneously meet three key requirements: (1) low latency of encoding and streaming, (2) high accuracy of server-side DNNs, and (3) low compute overheads on the camera. Unfortunately, despite many recent efforts, such a video streaming system has hitherto been elusive, especially when serving advanced vision tasks such as object detection or semantic segmentation. This paper presents AccMPEG, a new video encoding and streaming system that meets the three objectives. The key is to learn how much the encoding quality at each (16x16) macroblock can influence the server-side DNN accuracy, which we call accuracy gradients. Our insight is that these macroblock-level accuracy gradients can be inferred with sufficient precision by feeding the video frames through a cheap model. AccMPEG provides a suite of techniques that, given a new server-side DNN, can quickly create a cheap model to infer the accuracy gradients on any new frame in near real time. Our extensive evaluation of AccMPEG on two types of edge devices (an Intel Xeon Silver 4100 CPU and an NVIDIA Jetson Nano) and three vision tasks (six recent pre-trained DNNs) shows that compared to the state-of-the-art baselines, AccMPEG (with the same camera-side compute resources) can reduce the end-to-end inference delay by 10-43% without hurting accuracy.

Wed 31 Aug. 14:51 - 15:09 PDT

HALOS: Hashing Large Output Space for Cheap Inference

Zichang Liu · Zhaozhuo Xu · Alan Ji · Junyan Zhang · Jonathan Li · Beidi Chen · Anshumali Shrivastava

Efficient inference in a large output space is an essential yet challenging task in large-scale machine learning. Previous approaches reduce this problem to Approximate Maximum Inner Product Search (AMIPS), which is based on the observation that the prediction of a given model corresponds to the logit with the largest value. However, models are not perfect in accuracy, and successfully retrieving the largest logit may not lead to the correct prediction. We argue that approximate MIPS approaches are sub-optimal because they are tailored for retrieving the class with the largest inner product instead of retrieving the correct class. Moreover, the logits generated from neural networks with a large output space lead to extra challenges for the AMIPS method to achieve a high recall rate within the computation budget of efficient inference. In this paper, we propose HALOS, which reduces inference to sub-linear computation by selectively activating a small set of output layer neurons that are likely to correspond to the correct classes rather than to yield the largest logit. Our extensive evaluations show that HALOS matches or even outperforms the accuracy of given models with a 21x speedup and an 87% energy reduction.

Wed 31 Aug. 15:09 - 15:27 PDT

Learning Compressed Embeddings for On-Device Inference

Niketan Pansare · Jay Katukuri · Aditya Arora · Frank Cipollone · Riyaaz Shaik · Noyan Tokgozoglu · Chandru Venkataraman

In deep learning, embeddings are widely used to represent categorical entities such as words, apps, and movies. An embedding layer maps each entity to a unique vector, causing the layer’s memory requirement to be proportional to the number of entities. In the recommendation domain, a given category can have hundreds of thousands of entities, and its embedding layer can take gigabytes of memory. The scale of these networks makes them difficult to deploy in resource constrained environments, such as smartphones. In this paper, we propose a novel approach for reducing the size of an embedding table while still mapping each entity to its own unique embedding. Rather than maintaining the full embedding table, we construct each entity’s embedding “on the fly” using two separate embedding tables. The first table employs hashing to force multiple entities to share an embedding. The second table contains one trainable weight per entity, allowing the model to distinguish between entities sharing the same embedding. Since these two tables are trained jointly, the network is able to learn a unique embedding per entity, helping it maintain a discriminative capability similar to a model with an uncompressed embedding table. We call this approach MEmCom (Multi-Embedding Compression). We compare with state-of-the-art model compression techniques for multiple problem classes including classification and ranking using datasets from various domains. On four popular recommender system datasets, MEmCom had a 4% relative loss in nDCG while compressing the input embedding sizes of our recommendation models by 16x, 4x, 12x, and 40x. MEmCom outperforms the state-of-the-art model compression techniques, which achieved 16%, 6%, 10%, and 8% relative loss in nDCG at the respective compression ratios. Additionally, MEmCom is able to compress the RankNet ranking model by 32x on a dataset with millions of users’ interactions with games while incurring only a 1% relative loss in nDCG.
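The two-table construction described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' code: the class name `TwoTableEmbedding`, the modulo hash, and the choice to combine the tables by scaling the shared vector with the per-entity weight are all assumptions, since the abstract does not specify the hash function or the combining operation. Random values stand in for parameters that would be trained jointly.

```python
import random


class TwoTableEmbedding:
    """Hypothetical sketch of a hashed shared table plus per-entity weights."""

    def __init__(self, num_entities, num_buckets, dim, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # First table: num_buckets shared vectors; several entities hash
        # to the same row.
        self.shared = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_buckets)
        ]
        # Second table: one scalar per entity, so entities that share a
        # row can still be told apart.
        self.scale = [rng.uniform(0.5, 1.5) for _ in range(num_entities)]

    def lookup(self, entity_id):
        """Build the entity's embedding on the fly from both tables."""
        row = self.shared[entity_id % self.num_buckets]  # assumed hash: modulo
        weight = self.scale[entity_id]
        # Assumed combination: scale the shared vector by the entity weight.
        return [weight * x for x in row]

    def num_parameters(self):
        """Total number of stored values across both tables."""
        return self.num_buckets * self.dim + len(self.scale)
```

With 1,000 entities, 100 buckets, and 16 dimensions, this sketch stores 100 x 16 + 1,000 = 2,600 values instead of the 16,000 a full table would need, while entities 3 and 103 share a bucket but still receive different embeddings.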

Wed 31 Aug. 15:27 - 15:45 PDT

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

Jiarong Xing · Leyuan Wang · Shang Zhang · Jack Chen · Ang Chen · Yibo Zhu

Today’s auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor programs by navigating a large search space to identify effective implementations, but they do so with opaque hardware details. Thus, their performance could fall behind that of hardware-native libraries (e.g., cuBLAS, cuDNN), which are hand-optimized by device vendors to extract high performance. On the other hand, these vendor libraries have a fixed set of supported functions and lack the customization and automation support afforded by auto-tuners. Bolt bridges this gap and achieves the best of both worlds by using hardware-native templated search, which is enabled by the recent trend that vendor libraries (e.g., CUTLASS) are increasingly modularized and reconfigurable. Bolt provides new opportunities to rethink end-to-end tensor optimizations at the graph, operator, and model levels. We demonstrate this concept by prototyping Bolt in TVM on NVIDIA GPUs, both of which are in large deployment in our production environment. Our experiments show that Bolt can improve the inference speed of common convolutional neural networks by 2.5x on average over the state of the art, and it auto-tunes these models within 20 minutes.

Wed 31 Aug. 15:45 - 16:03 PDT

URSABench: A System for Comprehensive Benchmarking of Bayesian Deep Neural Network Models and Inference methods

Meet Vadera · Jinyang Li · Adam Cobb · Brian Jalaian · Tarek Abdelzaher · Benjamin Marlin

While deep learning methods continue to improve in predictive accuracy on a wide range of application domains, significant issues remain with other aspects of their performance, including their ability to quantify uncertainty and their robustness. Recent advances in approximate Bayesian inference hold significant promise for addressing these concerns, but the computational scalability of these methods can be problematic when applied to large-scale models. In this paper, we present URSABench (the Uncertainty, Robustness, Scalability, and Accuracy Benchmark), an open-source suite of models, inference methods, tasks and benchmarking tools. URSABench supports comprehensive assessment of Bayesian deep learning models and approximate Bayesian inference methods, with a focus on classification tasks performed both on server and edge GPUs.