Track: Session 9: Hardware

Thu 8 April 9:10 - 9:30 PDT

Boveda: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick

Isak Edo Vivancos · Sayeh Sharify · Daniel Ly-Ma · Ameer Abdelhadi · Ciaran Bannon · Milos Nikolic · Mostafa Mahmoud · Alberto Delmas Lascorz · Gennady Pekhimenko · Andreas Moshovos

Data access between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. On-chip memory compression can greatly reduce this energy cost as long as it balances the simplicity and low cost of the compression/decompression implementation and its effectiveness in data size reduction. We present Boveda, a simple and effective on-chip lossless memory compression technique for fixed-point precision networks. It reduces data widths by exploiting the value distribution deep learning applications naturally exhibit. Boveda can increase the effective on-chip capacity, reduce off-chip traffic, and/or achieve a desired performance/energy target while using smaller on-chip memories. Boveda can be placed after any memory block in the on-chip memory hierarchy and can work with \textul{any} data-parallel processing units such as the vector-like or the tensorcore units of modern graphics processors, systolic arrays such as that used in the Tensor Processing Unit, and units that process sparse tensors such as those used in the SCNN accelerator. To demonstrate the potential of Boveda, we implement it over (i) SCNN, a state-of-the-art accelerator for sparse networks, (ii) a Tensorcore-like architecture, and (iii) TPU. Boveda reduces memory footprint by 34\% for SCNN and sparse models on top of zero compression. For dense models, Boveda improves compression by 47\%. We also present a prototype FPGA implementation.

Thu 8 April 9:30 - 9:50 PDT

Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models

Shang Wang · Peiming Yang · Yuxuan Zheng · Xin Li · Gennady Pekhimenko

Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models increases staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute for more insights into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models among jobs often have the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to operators and then trains them simultaneously on a shared accelerator. To show the generality of our solution, we apply HFTA to six DL models training on state-of-the-art accelerators (GPUs and TPUs). Our results indicate that HFTA is highly effective in improving hardware utilization and achieves up to 15.1x higher training throughput vs. the standard practice of running each job on a separate accelerator.

Thu 8 April 9:50 - 10:10 PDT

A Distributed Graph-Theoretic Framework for Automatic Parallelization in Multi-core Systems

Guixiang Ma · Yao Xiao · Theodore Willke · Nesreen Ahmed · Shahin Nazarian · Paul Bogdan

The rapid demand for memory and computational resources by the emerging complex applications requires multi-core parallel systems capable to scale the execution of these applications. In this paper, we propose a distributed graph-theoretic framework for automatic parallelization in multi-core systems, where the goal is to minimize the data communication while accounting for intrinsic functional interdependence and balancing the workloads among cores to improve the overall performance. Specifically, we design a general and flexible greedy-based vertex cut framework for partitioning LLVM IR graphs into clusters while taking into consideration the data communication and workload balance among clusters. Then, we map the clusters generated by the vertex cut algorithms onto a non-uniform memory access multi-core platform. Experimental results demonstrate that our proposed WB-Libra algorithm provides performance improvements of 1.56x and 1.86x over existing state-of-the-art approaches for 8 and 1024 clusters running on a multi-core platform, respectively.

Thu 8 April 10:10 - 10:30 PDT

Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More

Shabnam Daghaghi · Nicholas Meisburger · Mengnan Zhao · Anshumali Shrivastava

Deep learning implementations on CPUs (Central Processing Units) are gaining more traction. Enhanced AI capabilities on commodity x86 architectures are commercially appealing due to the reuse of existing hardware and virtualization ease. A notable work in this direction is the SLIDE system. SLIDE is a C++ implementation of a sparse hash table based back-propagation, which was shown to be significantly faster than GPUs in training hundreds of million parameter neural models. In this paper, we argue that SLIDE's current implementation is sub-optimal and does not exploit several opportunities available in modern CPUs. In particular, we show how SLIDE's computations allow for a unique possibility of vectorization via AVX (Advanced Vector Extensions)-512. Furthermore, we highlight opportunities for different kinds of memory optimization and quantizations. Combining all of them, we obtain up to 7x speedup in the computations on the same hardware. Our experiments are focused on large (hundreds of millions of parameters) recommendation and NLP models. Our work highlights several novel perspectives and opportunities for implementing randomized algorithms for deep learning on modern CPUs.

Thu 8 April 10:30 - 10:50 PDT

Scaling Polyhedral Neural Network Verification on GPUs

Christoph Müller · François Serre · Gagandeep Singh · Markus Püschel · Martin Vechev

Certifying the robustness of neural networks against adversarial attacks is critical to their reliable adoption in real-world systems including autonomous driving and medical diagnosis. Unfortunately, state-of-the-art verifiers either do not scale to larger networks or are too imprecise to prove robustness, which limits their practical adoption. In this work, we introduce GPUPoly, a scalable verifier that can prove the robustness of significantly larger deep neural networks than possible with prior work. The key insight behind GPUPoly is the design of custom, sound polyhedra algorithms for neural network verification on a GPU. Our algorithms leverage the available GPU parallelism and the inherent sparsity of the underlying verification task. GPUPoly scales to very large networks: for example, it can prove the robustness of a 1M neuron, 34-layer deep residual network in $\approx$ 22 seconds. We believe GPUPoly is a promising step towards the practical verification of large real-world networks.