Skip to yearly menu bar Skip to main content


Hardware Efficient ML

Exhibit Hall A

Moderator: Yingyan Lin


Chat is not available.

Tue 30 Aug. 14:15 - 14:33 PDT

Collapsible Linear Blocks for Super-Efficient Super Resolution

Kartikeya Bhardwaj · Milos Milosavljevic · Liam O'Neil · Dibakar Gope · Ramon Matas · Alex Chalfin · Alex Chalfin · Naveen Suda · Naveen Suda · Lingchuan Meng · Lingchuan Meng · Danny Loh · Danny Loh

With the advent of smart devices that support 4K and 8K resolution, Single Image Super Resolution (SISR) has become an important computer vision problem. However, most super resolution deep networks are computationally very expensive. In this paper, we propose Super-Efficient Super Resolution (SESR) networks that establish a new state-of-the-art for efficient super resolution. Our approach is based on linear overparameterization of CNNs and creates an efficient model architecture for SISR. With theoretical analysis, we uncover the limitations of existing overparameterization methods and show how the proposed method alleviates them. Detailed experiments across six benchmark datasets demonstrate that SESR achieves similar or better image quality than state-of-the-art models while requiring 2x to 330x fewer Multiply-Accumulate (MAC) operations. As a result, SESR can be used on constrained hardware to perform x2 (1080p to 4K) and x4 (1080p to 8K) SISR. Towards this, we estimate hardware performance numbers for a commercial Arm mobile-Neural Processing Unit (NPU) for 1080p to 4K (x2) and 1080p to 8K (x4) SISR. Our results highlight the challenges faced by super resolution on AI accelerators and demonstrate that SESR is significantly faster (e.g., 6x-8x higher FPS) than existing models on mobile-NPU. Finally, SESR outperforms prior models by 1.5x-2x in latency on Arm CPU and GPU when deployed on a real mobile device. The code for this work is available at

Tue 30 Aug. 14:33 - 14:51 PDT

Towards the Co-design of Neural Networks and Accelerators

Yanqi Zhou · Xuanyi Dong · Tianjian Meng · Mingxing Tan · Berkin Akin · Daiyi Peng · Amir Yazdanbakhsh · Da Huang · Ravi Narayanaswami · James Laudon

Better neural architectures and new hardware accelerators are two driving forces for the progress in deep learning. Previous works typically focus on one aspect: they either design new neural architectures for fixed hardware like GPUs or customize hardware (often on FPGAs) for a fixed set of neural models like ResNets or Transformers. In this work, we aim to jointly optimize neural architecture and hardware configurations for Google's Edge TPUs. Through extensive studies, we observe that: 1) the neural architecture search space has to be customized to fully leverage the targeted hardware, 2) neural architecture and hardware accelerator should be jointly searched to achieve the best of both worlds, and 3) conventional metrics such as FLOPs and parameter size often do not well represent model efficiency in real accelerators. Our experiments show that our joint search approach, named NaaS, consistently outperforms previous state-of-the-art results, such as EfficientNet, on both image classification and segmentation tasks. Furthermore, our approach reduces energy consumption by up to 2x under the same accuracy on Edge TPUs.

Tue 30 Aug. 14:51 - 15:09 PDT

On the Utility of Gradient Compression in Distributed Training Systems

Saurabh Agarwal · Hongyi Wang · Shivaram Venkataraman · Dimitris Papailiopoulos

A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent research proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 realistic distributed setups. Surprisingly, we observe that only in 6 cases out of more than 200, gradient compression methods provide speedup over optimized synchronous data-parallel training in the typical data-center setting. We conduct an extensive investigation to identify the root causes of this phenomenon, and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy, in order for them to provide meaningful utility.

Tue 30 Aug. 15:09 - 15:27 PDT

Efficient Strong Scaling Through Burst Parallel Training

Seo Jin Park · Joshua Fried · Sunghyun Kim · Mohammad Alizadeh · Adam Belay

As emerging deep neural network (DNN) models continue to grow in size, using large GPU clusters to train DNNs is becoming an essential requirement to achieving acceptable training times. In this paper, we consider the case where future increases in cluster size will cause the global batch size that can be used to train models to reach a fundamental limit: beyond a certain point, larger global batch sizes cause sample efficiency to degrade, increasing overall time to accuracy. As a result, to achieve further improvements in training performance, we must instead consider "strong scaling" strategies that hold the global batch size constant and allocate smaller batches to each GPU. Unfortunately, this makes it significantly more difficult to use cluster resources efficiently. We present DeepPool, a system that addresses this efficiency challenge through two key ideas. First, burst parallelism allocates large numbers of GPUs to foreground jobs in bursts to exploit the unevenness in parallelism across layers. Second, GPU multiplexing prioritizes throughput for foreground training jobs, while packing in background training jobs to reclaim underutilized GPU resources, thereby improving cluster-wide utilization. Together, these two ideas enable DeepPool to deliver a 1.2 - 2.3x improvement in total cluster throughput over standard data parallelism with a single task when the cluster scale is large.